241 Commits

Author SHA1 Message Date
Tristan Lee
10b33b3dbb Merge pull request #66 from bellingcat/country-language-searching
Updated ORM and sync to improve filtering by language and country
2022-10-26 12:25:59 -05:00
Tristan Lee
d9e2250c5a added country index 2022-10-26 08:42:35 -05:00
Tristan Lee
5a53ebacd0 removed special case 2022-10-26 08:22:13 -05:00
Tristan Lee
3bb5af11e6 changed ORM and Google Sheet sync to reflect converting channels.country to JSONB array, added index for detected_language 2022-10-26 08:16:49 -05:00
Logan Williams
90d1d0f29f Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-10-26 13:12:12 +00:00
Logan Williams
b023e8044c Scrape snowball_complete sampled channels 2022-10-26 13:11:20 +00:00
Logan Williams
c15022402d Add an option to scrape posts older than the database record as well as newer (Telegram only) 2022-09-05 13:48:01 +00:00
Logan Williams
f000c6246e Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-08-29 09:11:21 +00:00
Logan Williams
1a29c06062 Fix case where post is dummy (-1) 2022-08-29 09:11:06 +00:00
Logan Williams
86656f8ba3 Scrape snowball_it channels too 2022-08-26 15:56:46 +02:00
Logan Williams
a01d139bef Remove normalized_url column from channel creation 2022-08-24 15:35:08 +02:00
Logan Williams
4a17c3475d Add explicit source column to gsheet 2022-08-24 15:32:19 +02:00
Logan Williams
002e9458f5 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-08-01 10:00:07 +00:00
Logan Williams
f3997ff6ae Catch errors in Bitchute channel profile scraper; add multi index on posts forwarded from/channel 2022-08-01 09:58:52 +00:00
Logan Williams
7d72c0de05 Add index for network analysis 2022-07-29 12:16:17 +02:00
Logan Williams
3a04fb51d4 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-07-28 09:18:34 +00:00
Logan Williams
fee216386b Fix issue with chronological media archiving 2022-07-28 09:18:07 +00:00
Logan Williams
d05584a09f Minor bug fixes; helper tool for Telethon sessions 2022-07-28 08:42:59 +00:00
Logan Williams
ee24367caa Add features for running archive-media simultaneously 2022-07-20 09:26:47 +00:00
Logan Williams
fbb846b8d6 Fix two small bugs with media archiving 2022-07-05 13:30:39 +02:00
Logan Williams
b99958a894 Merge pull request #58 from bellingcat/media-etl
Media ETL
2022-07-05 11:51:08 +02:00
Logan Williams
51e5ca1f04 Use smaller batches for now 2022-07-05 09:48:57 +00:00
Logan Williams
6149c4279d Add some more fields to media DB, fix bugs in testing 2022-07-05 11:11:43 +02:00
Logan Williams
4ddd8d6b63 Only select untransformed media; simplify insert function 2022-07-05 10:03:38 +02:00
Logan Williams
9948af2c4a Media archiving ETL working for Telegram 2022-07-05 10:03:36 +02:00
Logan Williams
c24babb081 Fix bugs in Gettr/Rumble transformers, avoid offset in batch requests 2022-07-04 14:30:40 +00:00
Logan Williams
ed4723ed1e Fix merge error 2022-06-30 11:04:41 +00:00
Logan Williams
589ac3ba5b Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-06-24 10:34:25 +00:00
Logan Williams
7215469a74 Correct logs for transforming posts 2022-06-24 10:32:27 +00:00
Logan Williams
fe0f4f9e2c Merge pull request #62 from bellingcat/other-transformer-fixes
Fixed broken channel_info transformers, added Telegram post transformer fields
2022-06-24 11:00:50 +02:00
Tristan Lee
289a47d7b1 tested telegram transformers and implemented vk transformers 2022-06-23 15:06:10 -05:00
Tristan Lee
bb2e2806e6 got post transformers and channel_info transformers working for Rumble, Bitchute, Gettr 2022-06-21 19:05:41 -05:00
Tristan Lee
619fe42a31 got transformers for Bitchute, Rumble, and Gettr working for all raw_posts. 2022-06-20 21:45:41 -05:00
Tristan Lee
a2a7882f1c fixed Gettr and Bitchute info transformers, added missing or incorrect TelegramTransformer fields, added Telegram mentions to the transformer. 2022-06-13 13:42:33 -05:00
Logan Williams
6e962de244 Don't scrape channel info unless specifically scraping channel info 2022-06-10 08:41:45 +00:00
Logan Williams
6183972b1a Merge branch 'more-channel-info-transformers' of https://github.com/bellingcat/cisticola into main 2022-06-10 08:07:02 +00:00
Logan Williams
d83a13e0cc Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-06-09 14:11:52 +00:00
Logan Williams
fba3a661c7 Sleep after every gsheet API call 2022-06-09 14:11:29 +00:00
Logan Williams
6294ea7ea7 Increment TelegramTelethonScraper version 2022-06-09 15:30:56 +02:00
Logan Williams
92d4839b5e Revise Telethon scraper to use the same client connection 2022-06-09 10:01:27 +02:00
Logan Williams
9a30ecb243 Stop overwriting media when a large file is detected 2022-06-08 17:01:28 +02:00
Logan Williams
708d952937 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-06-08 08:11:34 +00:00
Logan Williams
bf96f248c1 Merge pull request #57 from bellingcat/synchronization-improvements
Google sheet channel synchronization improvements
2022-06-08 10:08:23 +02:00
Logan Williams
143b20fc56 Update Pipfile to load pyexiftool from PyPi
Co-authored-by: Tristan Lee <tristan@bellingcat.com>
2022-06-08 10:08:18 +02:00
Logan Williams
f932034ab2 Add a bit more logging 2022-06-07 13:14:48 +02:00
Logan Williams
39358c7f23 Update platform ID and screenname when synchronizing with gsheet; highlight dupes 2022-06-06 16:36:39 +02:00
Logan Williams
4d22838a94 Fixes to Pipfile; easier spacy setup 2022-06-06 16:35:52 +02:00
Tristan Lee
f4072183be added transformer for Gettr 2022-05-20 02:22:34 -05:00
Tristan Lee
591f1986e8 added Rumble transformers and test 2022-05-19 19:40:48 -05:00
Tristan Lee
e2094522c9 updated Bitchute transformer and added test 2022-05-19 18:13:50 -05:00