Commit Graph

34 Commits

Author SHA1 Message Date
Logan Williams
6149c4279d Add some more fields to media DB, fix bugs in testing 2022-07-05 11:11:43 +02:00
Logan Williams
6183972b1a Merge branch 'more-channel-info-transformers' of https://github.com/bellingcat/cisticola into main 2022-06-10 08:07:02 +00:00
Logan Williams
4d22838a94 Fixes to Pipfile; easier spacy setup 2022-06-06 16:35:52 +02:00
Tristan Lee
f0414a4f4d updated transformer tests 2022-05-19 16:34:19 -05:00
Logan Williams
8535a87def Ad hoc changes to transformers 2022-04-16 13:46:26 +00:00
Logan Williams
4c221d1133 Transformer for Telegram, base transformer NLP hydration; no media 2022-04-14 11:45:09 +02:00
Logan Williams
bbb9d283d5 Add RumbleScraper, YoutubeScraper, and BitchuteScraper to the active scrapers 2022-04-12 14:55:45 +02:00
Logan Williams
fccbad7a93 Remove 200 post limit; add log rotation 2022-04-03 16:32:00 +00:00
Tristan Lee
8ecb904249 merged main 2022-04-01 02:05:25 -05:00
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despites errors 2022-03-31 20:27:18 +02:00
Logan Williams
2dc9213d64 Use new RawChannelInfo class 2022-03-31 15:17:25 +02:00
Logan Williams
1c1ff7fb6f Fix bug with Telethon scraper and certain media; add media_archived flag to TwitterScraper 2022-03-31 08:15:09 +02:00
Tristan Lee
1f99e52436 refactored Gab scraper to use gabber instead of garc 2022-03-30 08:05:10 -05:00
Tristan Lee
b805d50132 made tesets work, fixed several issues with Rumble scraper 2022-03-29 16:09:51 -05:00
Tristan Lee
16870d7daa implemented methods for extracting profile metadata (still need to test) 2022-03-28 20:16:59 -05:00
Tristan Lee
ee9a8c10dd merged main into branch 2022-03-15 09:16:11 -05:00
Tristan Lee
c3eab2f176 merged main 2022-03-14 18:19:57 -05:00
Tristan Lee
e4cf9daf73 added docstrings, improved Sphinx docs 2022-03-14 18:04:27 -05:00
Tristan Lee
750f0cc887 added scraper for Instagram 2022-03-14 10:28:10 -05:00
Tristan Lee
5783206ad8 implemented method to reset database, to enable the 'contoller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers 2022-03-10 10:20:49 -06:00
Tristan Lee
739e1d8484 added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram 2022-03-09 12:12:01 -06:00
Tristan Lee
cd5f68e9e5 added basic unit tests 2022-03-04 12:36:09 -06:00
Tristan Lee
ee4d64750b added prototype Rumble scraper 2022-02-28 18:38:33 -06:00
Tristan Lee
bc840e631d added Gab scraper 2022-02-28 12:11:21 -06:00
Tristan Lee
7a257ea9f5 included comments in Odysee scraper 2022-02-28 09:15:09 -06:00
Tristan Lee
47dad8fb00 added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster 2022-02-25 20:28:00 -06:00
Tristan Lee
ef83cc4b0a converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url. 2022-02-25 13:43:30 -06:00
Logan Williams
e64d845002 Archive media in Twitter scraper 2022-02-24 18:48:48 +01:00
Logan Williams
6092e4caa5 Add method for archiving media, reoranize scraper base classes 2022-02-24 16:36:55 +01:00
Logan Williams
e3d29bf811 Add documentation generation with Sphinx 2022-02-21 17:52:38 +01:00
Tristan Lee
139459e3b2 implemented Bitchute scraper 2022-02-18 12:45:10 -06:00
Tristan Lee
4668d4df11 implemented Gettr scraper 2022-02-18 10:13:37 -06:00
Logan Williams
0e5f9f77f3 Configure pipenv 2022-02-18 15:05:02 +01:00