Commit Graph

217 Commits

Author SHA1 Message Date
Tristan Lee
b8ddc400f3 updated documentation, minor fixes like excluding very long cookiestring from docs 2023-08-03 01:59:30 -05:00
Tristan Lee
e2142966e7 refactored tests to reduce redundancy, got tests working for Telegram, Bitchute, Gettr, and Rumble 2023-08-03 00:53:38 -05:00
Tristan Lee
bd67806ed2 got Telegram scraper tests all working 2023-08-01 10:46:50 -05:00
Tristan Lee
249f411a1d fixed some issues with Telegram tests 2023-07-27 13:07:44 -05:00
Logan Williams
99cc4d80b2 Cache screenname ID lookup 2023-05-04 16:23:24 +02:00
Logan Williams
ca6e284cb3 Cache reply_to post IDs too 2023-05-04 16:14:03 +02:00
Logan Williams
91de6482e0 Add rather hacky bulk insert functionality 2023-05-04 15:26:52 +02:00
Logan Williams
f9bf2bc2ee Merge branch 'main' of github.com:bellingcat/cisticola 2023-05-04 14:06:59 +02:00
Logan Williams
ebbc6b69dd Add new function for insert post (faster/bulk) 2023-05-04 14:04:55 +02:00
Logan Williams
9dbf05fccb Streamline logging; fix markdown formatting in Telegram 2023-05-04 10:00:14 +00:00
Logan Williams
2320ea1efd Use telethon session CLI argument always; improvements to Telegram transformer (author id/username for chats, min_id via CLI argument, use the same session) 2023-03-04 09:51:15 +01:00
Logan Williams
7d55eace3d Update platform_id when it is empty 2023-03-03 15:28:30 +01:00
Logan Williams
eced79b278 Fix issue with insert_or_select 2023-03-03 10:47:21 +01:00
Logan Williams
793a783963 Revert to previous insert or select behavior 2023-03-02 23:04:34 +01:00
Logan Williams
d2db83ae93 Update other channel properties too for a linked channel 2023-03-02 17:03:24 +01:00
Logan Williams
3a6905e9c1 Adjust logic for changing source label 2023-03-02 16:48:15 +01:00
Logan Williams
7b2c597a24 Update channel source, only if non-researcher 2023-03-02 16:45:38 +01:00
Logan Williams
64aef7238c Merge branch 'main' of github.com:bellingcat/cisticola 2023-03-02 16:34:42 +01:00
Logan Williams
ffa8cdd8c6 Fix bitchute transformer 2023-03-02 16:33:48 +01:00
Logan Williams
7adf51b5d1 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2023-03-02 15:28:24 +00:00
Logan Williams
6226a9a76e Ignore lobs 2023-03-02 15:26:38 +00:00
Logan Williams
531059ca02 Support related Telegram chats (associated discussion groups) 2023-03-02 16:21:43 +01:00
Logan Williams
351e471ff4 Change log retention and hackily improve transform speed 2023-01-26 13:21:07 +00:00
Logan Williams
5c4dd51435 Fix issues with Gsheet sync 2023-01-11 14:44:17 +00:00
Logan Williams
3ec6f50213 Merge pull request #67 from bellingcat/sync-bug-fixes: fixed channel sync bugs 2022-10-27 09:27:02 +02:00
Tristan Lee
6dc61af7a5 fixed problem from gspread update where empty columns raised error, fixed problem where sync tried to process empty channel 2022-10-26 14:47:59 -05:00
Tristan Lee
10b33b3dbb Merge pull request #66 from bellingcat/country-language-searching: Updated ORM and sync to improve filtering by language and country 2022-10-26 12:25:59 -05:00
Tristan Lee
d9e2250c5a added country index 2022-10-26 08:42:35 -05:00
Tristan Lee
5a53ebacd0 removed special case 2022-10-26 08:22:13 -05:00
Tristan Lee
3bb5af11e6 changed ORM and Google Sheet sync to reflect converting channels.country to JSONB array, added index for detected_language 2022-10-26 08:16:49 -05:00
Logan Williams
90d1d0f29f Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-10-26 13:12:12 +00:00
Logan Williams
b023e8044c Scrape snowball_complete sampled channels 2022-10-26 13:11:20 +00:00
Logan Williams
c15022402d Add an option to scrape posts older than the database record as well as newer (Telegram only) 2022-09-05 13:48:01 +00:00
Logan Williams
f000c6246e Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-08-29 09:11:21 +00:00
Logan Williams
1a29c06062 Fix case where post is dummy (-1) 2022-08-29 09:11:06 +00:00
Logan Williams
86656f8ba3 Scrape snowball_it channels too 2022-08-26 15:56:46 +02:00
Logan Williams
a01d139bef Remove normalized_url column from channel creation 2022-08-24 15:35:08 +02:00
Logan Williams
4a17c3475d Add explicit source column to gsheet 2022-08-24 15:32:19 +02:00
Logan Williams
002e9458f5 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-08-01 10:00:07 +00:00
Logan Williams
f3997ff6ae Catch errors in Bitchute channel profile scraper; add multi index on posts forwarded from/channel 2022-08-01 09:58:52 +00:00
Logan Williams
7d72c0de05 Add index for network analysis 2022-07-29 12:16:17 +02:00
Logan Williams
3a04fb51d4 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-07-28 09:18:34 +00:00
Logan Williams
fee216386b Fix issue with chronological media archiving 2022-07-28 09:18:07 +00:00
Logan Williams
d05584a09f Minor bug fixes; helper tool for Telethon sessions 2022-07-28 08:42:59 +00:00
Logan Williams
ee24367caa Add features for running archive-media simultaneously 2022-07-20 09:26:47 +00:00
Logan Williams
fbb846b8d6 Fix two small bugs with media archiving 2022-07-05 13:30:39 +02:00
Logan Williams
b99958a894 Merge pull request #58 from bellingcat/media-etl: Media ETL 2022-07-05 11:51:08 +02:00
Logan Williams
51e5ca1f04 Use smaller batches for now 2022-07-05 09:48:57 +00:00
Logan Williams
6149c4279d Add some more fields to media DB, fix bugs in testing 2022-07-05 11:11:43 +02:00
Logan Williams
4ddd8d6b63 Only select untransformed media; simplify insert function 2022-07-05 10:03:38 +02:00