Commit Graph

41 Commits

Author SHA1 Message Date
Logan Williams
b6386747d4 Add indices on appropriate columns; limit # of posts to archive 2022-04-04 10:54:27 +00:00
Logan Williams
fccbad7a93 Remove 200 post limit; add log rotation 2022-04-03 16:32:00 +00:00
Logan Williams
96db662572 Don't add a timestamp to media that failed to archive 2022-04-03 14:16:03 +02:00
Logan Williams
ecae1aad05 Catch exceptions in archive_files so that archiver continues to run 2022-04-03 14:12:23 +02:00
Logan Williams
a82ec15f0e Change archived_media to be timestamp for all scrapers 2022-04-03 12:02:27 +02:00
Logan Williams
01bbabe0cb Fix issues with new datetime based 'media_archived' column 2022-04-02 18:45:08 +00:00
Logan Williams
63633617d2 Configure with Telethon and VK only 2022-04-02 18:34:14 +00:00
Tristan Lee
0bab20e371 ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null. 2022-04-01 17:03:02 -05:00
Tristan Lee
8ecb904249 merged main 2022-04-01 02:05:25 -05:00
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors 2022-03-31 20:27:18 +02:00
Logan Williams
7f87b03de5 Add option to clear registered scrapers, necessary for tests 2022-03-31 16:17:35 +02:00
Logan Williams
a5cffa615f Fix Twitter profile scraper, catch exceptions in controller 2022-03-31 15:37:58 +02:00
Logan Williams
cff1953d21 Initial CLI tool 2022-03-31 08:15:11 +02:00
Tristan Lee
b805d50132 made tests work, fixed several issues with Rumble scraper 2022-03-29 16:09:51 -05:00
Logan Williams
a80dbddbbc Add snscrape delayed media archiving support; add explicit bool 2022-03-28 11:42:15 +02:00
Logan Williams
63fdae9f1b Implement media archiving after the initial scrape for Twitter and Telethon 2022-03-24 16:52:11 +01:00
Logan Williams
2a3b5c8200 Merge branch 'main' into channel-db 2022-03-22 11:49:07 +01:00
Logan Williams
806f07f458 Add functions for scraping based on Channel database 2022-03-22 11:26:46 +01:00
Tristan Lee
d68d76c0ab added missing docstrings, created Makefile target for sphinx-apidoc, added quickstart page for installation and configuration instructions 2022-03-15 12:40:18 -05:00
Tristan Lee
ee9a8c10dd merged main into branch 2022-03-15 09:16:11 -05:00
Tristan Lee
a3c859ec79 added more docstrings and comments 2022-03-14 19:38:33 -05:00
Tristan Lee
c3eab2f176 merged main 2022-03-14 18:19:57 -05:00
Tristan Lee
e4cf9daf73 added docstrings, improved Sphinx docs 2022-03-14 18:04:27 -05:00
Tristan Lee
965bf1e2dc added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better 2022-03-11 17:19:52 -06:00
Tristan Lee
821c39004b incorporated vkontakte scraper 2022-03-10 22:32:39 -06:00
Tristan Lee
5783206ad8 implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers 2022-03-10 10:20:49 -06:00
Logan Williams
fa5037d67c Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction 2022-03-10 15:34:24 +01:00
Tristan Lee
6cf3b8842d renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test 2022-03-09 13:19:35 -06:00
Tristan Lee
739e1d8484 added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram 2022-03-09 12:12:01 -06:00
Tristan Lee
506fb54a53 added wrapper for requests that retries after encountering exception 2022-03-07 13:28:33 -06:00
Tristan Lee
c21e43ddfa refactored import structure 2022-03-04 10:55:54 -06:00
Tristan Lee
75240bb060 fixed various bugs related to archived URL creation and media downloading. Things seem to work well now 2022-03-01 15:58:18 -06:00
Tristan Lee
f3d9dc91c6 changed URL parsing to use urllib 2022-03-01 14:13:04 -06:00
Tristan Lee
bc840e631d added Gab scraper 2022-02-28 12:11:21 -06:00
Tristan Lee
47dad8fb00 added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster 2022-02-25 20:28:00 -06:00
Tristan Lee
ef83cc4b0a converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url. 2022-02-25 13:43:30 -06:00
Logan Williams
8ab56ff5ba Remove MAX_POSTS, auto detect MIME type (Co-authored-by: Tristan Lee <tristan@bellingcat.com>) 2022-02-25 08:52:42 +01:00
Logan Williams
3480452fac Fix type hints 2022-02-24 20:36:23 +01:00
Logan Williams
214287b7a8 Archive media in dictionary 2022-02-24 17:35:24 +01:00
Logan Williams
a87cfd570a Add Telegram channel scraper 2022-02-24 16:37:13 +01:00