188 Commits

Author SHA1 Message Date
Tristan Lee
6dc61af7a5 fixed problem from gspread update where empty columns raised error, fixed problem where sync tried to process empty channel 2022-10-26 14:47:59 -05:00
Tristan Lee
d9e2250c5a added country index 2022-10-26 08:42:35 -05:00
Tristan Lee
5a53ebacd0 removed special case 2022-10-26 08:22:13 -05:00
Tristan Lee
3bb5af11e6 changed ORM and Google Sheet sync to reflect converting channels.country to JSONB array, added index for detected_language 2022-10-26 08:16:49 -05:00
Logan Williams
f000c6246e Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-08-29 09:11:21 +00:00
Logan Williams
1a29c06062 Fix case where post is dummy (-1) 2022-08-29 09:11:06 +00:00
Logan Williams
86656f8ba3 Scrape snowball_it channels too 2022-08-26 15:56:46 +02:00
Logan Williams
a01d139bef Remove normalized_url column from channel creation 2022-08-24 15:35:08 +02:00
Logan Williams
4a17c3475d Add explicit source column to gsheet 2022-08-24 15:32:19 +02:00
Logan Williams
002e9458f5 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-08-01 10:00:07 +00:00
Logan Williams
f3997ff6ae Catch errors in Bitchute channel profile scraper; add multi index on posts forwarded from/channel 2022-08-01 09:58:52 +00:00
Logan Williams
7d72c0de05 Add index for network analysis 2022-07-29 12:16:17 +02:00
Logan Williams
3a04fb51d4 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-07-28 09:18:34 +00:00
Logan Williams
fee216386b Fix issue with chronological media archiving 2022-07-28 09:18:07 +00:00
Logan Williams
d05584a09f Minor bug fixes; helper tool for Telethon sessions 2022-07-28 08:42:59 +00:00
Logan Williams
ee24367caa Add features for running archive-media simultaneously 2022-07-20 09:26:47 +00:00
Logan Williams
fbb846b8d6 Fix two small bugs with media archiving 2022-07-05 13:30:39 +02:00
Logan Williams
b99958a894 Merge pull request #58 from bellingcat/media-etl: Media ETL 2022-07-05 11:51:08 +02:00
Logan Williams
51e5ca1f04 Use smaller batches for now 2022-07-05 09:48:57 +00:00
Logan Williams
6149c4279d Add some more fields to media DB, fix bugs in testing 2022-07-05 11:11:43 +02:00
Logan Williams
4ddd8d6b63 Only select untransformed media; simplify insert function 2022-07-05 10:03:38 +02:00
Logan Williams
9948af2c4a Media archiving ETL working for Telegram 2022-07-05 10:03:36 +02:00
Logan Williams
c24babb081 Fix bugs in Gettr/Rumble transformers, avoid offset in batch requests 2022-07-04 14:30:40 +00:00
Logan Williams
ed4723ed1e Fix merge error 2022-06-30 11:04:41 +00:00
Logan Williams
589ac3ba5b Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-06-24 10:34:25 +00:00
Logan Williams
7215469a74 Correct logs for transforming posts 2022-06-24 10:32:27 +00:00
Logan Williams
fe0f4f9e2c Merge pull request #62 from bellingcat/other-transformer-fixes: Fixed broken channel_info transformers, added Telegram post transformer fields 2022-06-24 11:00:50 +02:00
Tristan Lee
289a47d7b1 tested telegram transformers and implemented vk transformers 2022-06-23 15:06:10 -05:00
Tristan Lee
bb2e2806e6 got post transformers and channel_info transformers working for Rumble, Bitchute, Gettr 2022-06-21 19:05:41 -05:00
Tristan Lee
619fe42a31 got transformers for Bitchute, Rumble, and Gettr working for all raw_posts. 2022-06-20 21:45:41 -05:00
Tristan Lee
a2a7882f1c fixed Gettr and Bitchute info transformers, added missing or incorrect TelegramTransformer fields, added Telegram mentions to the transformer. 2022-06-13 13:42:33 -05:00
Logan Williams
6e962de244 Don't scrape channel info unless specifically scraping channel info 2022-06-10 08:41:45 +00:00
Logan Williams
6183972b1a Merge branch 'more-channel-info-transformers' of https://github.com/bellingcat/cisticola into main 2022-06-10 08:07:02 +00:00
Logan Williams
d83a13e0cc Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-06-09 14:11:52 +00:00
Logan Williams
fba3a661c7 Sleep after every gsheet API call 2022-06-09 14:11:29 +00:00
Logan Williams
6294ea7ea7 Increment TelegramTelethonScraper version 2022-06-09 15:30:56 +02:00
Logan Williams
92d4839b5e Revise Telethon scraper to use the same client connection 2022-06-09 10:01:27 +02:00
Logan Williams
9a30ecb243 Stop overwriting media when a large file is detected 2022-06-08 17:01:28 +02:00
Logan Williams
708d952937 Merge branch 'main' of https://github.com/bellingcat/cisticola into main 2022-06-08 08:11:34 +00:00
Logan Williams
bf96f248c1 Merge pull request #57 from bellingcat/synchronization-improvements: Google sheet channel synchronization improvements 2022-06-08 10:08:23 +02:00
Logan Williams
143b20fc56 Update Pipfile to load pyexiftool from PyPi
Co-authored-by: Tristan Lee <tristan@bellingcat.com>
2022-06-08 10:08:18 +02:00
Logan Williams
f932034ab2 Add a bit more logging 2022-06-07 13:14:48 +02:00
Logan Williams
39358c7f23 Update platform ID and screenname when synchronizing with gsheet; highlight dupes 2022-06-06 16:36:39 +02:00
Logan Williams
4d22838a94 Fixes to Pipfile; easier spacy setup 2022-06-06 16:35:52 +02:00
Tristan Lee
f4072183be added transformer for Gettr 2022-05-20 02:22:34 -05:00
Tristan Lee
591f1986e8 added Rumble transformers and test 2022-05-19 19:40:48 -05:00
Tristan Lee
e2094522c9 updated Bitchute transformer and added test 2022-05-19 18:13:50 -05:00
Tristan Lee
f0414a4f4d updated transformer tests 2022-05-19 16:34:19 -05:00
Tristan Lee
424c063ef2 Merge pull request #55 from bellingcat/channel-info-transformers: Transformers for raw channel info 2022-05-18 06:58:18 -07:00
Logan Williams
c279ced73d Minor bug fixes from testing 2022-05-18 09:29:53 +01:00