Commit Graph

114 Commits

Author SHA1 Message Date
Logan Williams
d5f6ce485b Merge pull request #46 from bellingcat/next-release
Release 2022-04-12
v2022-04-12
2022-04-12 14:59:01 +02:00
Logan Williams
d1f9dd0e01 Limit max # of archived files per session 2022-04-12 12:57:04 +00:00
Logan Williams
bbb9d283d5 Add RumbleScraper, YoutubeScraper, and BitchuteScraper to the active scrapers 2022-04-12 14:55:45 +02:00
Logan Williams
6f11b88f94 Use Youtube cookie for Rumble too 2022-04-12 14:55:18 +02:00
Logan Williams
7b8236e6db No recursive retries 2022-04-12 14:55:05 +02:00
Logan Williams
b596d3e055 Merge pull request #42 from bellingcat/youtube-age-restricted
Enable download of age-restricted videos on YouTube
2022-04-12 11:14:45 +02:00
Logan Williams
1f7f957e62 Merge pull request #44 from bellingcat/bitchute-error
Catch errors while retrieving Bitchute videos
2022-04-12 11:13:52 +02:00
Logan Williams
e05f69bbee Merge pull request #38 from bellingcat/youtube-dl-retry
Added 'retries' argument to youtube_dl options
2022-04-12 11:11:51 +02:00
Tristan Lee
1f667d532e made get_videos_user use request_from_bitchute requests wrapper to catch errors 2022-04-06 11:40:43 -05:00
Tristan Lee
f17800b797 added required YOUTUBE_COOKIESTRING environment variable to be used by YoutubeScraper 2022-04-05 21:22:41 -05:00
Tristan Lee
a204041480 made requested changes to scraper version numbers 2022-04-05 17:03:45 -05:00
Logan Williams
36c81c8e17 Merge pull request #40 from bellingcat/next-release
Add indices on appropriate columns; limit # of posts to archive
v2022-04-04
2022-04-04 13:03:16 +02:00
Logan Williams
b6386747d4 Add indices on appropriate columns; limit # of posts to archive 2022-04-04 10:54:27 +00:00
Tristan Lee
ed74c5692b merged main 2022-04-03 19:35:16 -05:00
Tristan Lee
c7253148d1 added 'retries' argument to youtube-dl options, and made options consistent across youtube-dl instances. 2022-04-03 19:31:32 -05:00
Logan Williams
fccbad7a93 Remove 200 post limit; add log rotation v2022-04-03 2022-04-03 16:32:00 +00:00
Logan Williams
4c580519dd Remove Rumble scraper 2022-04-03 15:59:39 +02:00
Logan Williams
0140b09ee8 Release Telethon, VK, and Gettr as 0.0.1; specify unrelease 0.0.0 otherwise 2022-04-03 15:29:24 +02:00
Logan Williams
96db662572 Don't add a timestamp to media that failed to archive 2022-04-03 14:16:03 +02:00
Logan Williams
ecae1aad05 Catch exceptions in archive_files so that archiver continues to run 2022-04-03 14:12:23 +02:00
Logan Williams
9c838aae39 Update media_archived column even when TG post has no media 2022-04-03 13:29:10 +02:00
Logan Williams
57b9082271 Remove Odysee scraper due to errors 2022-04-03 13:26:05 +02:00
Logan Williams
a82ec15f0e Change archived_media to be timestamp for all scrapers 2022-04-03 12:02:27 +02:00
Logan Williams
8ee20a239c Merge branch 'main' into initial-release 2022-04-03 11:35:12 +02:00
Tristan Lee
90c99aec00 ensured that Gettr username is lowercase for API requests to work correctly 2022-04-02 22:36:25 -05:00
Tristan Lee
b0a52e5ad7 handled case where Rumble video has no view information displayed 2022-04-02 21:26:29 -05:00
Logan Williams
01bbabe0cb Fix issues with new datetime based 'media_archived' column 2022-04-02 18:45:08 +00:00
Logan Williams
63633617d2 Configure with Telethon and VK only 2022-04-02 18:34:14 +00:00
Logan Williams
0099558c68 Merge pull request #26 from bellingcat/deferred-media-archiving
Implemented deferred media archiving for all scrapers
2022-04-02 14:15:35 +02:00
Tristan Lee
0bab20e371 ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null. 2022-04-01 17:03:02 -05:00
Tristan Lee
8ecb904249 merged main 2022-04-01 02:05:25 -05:00
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors 2022-03-31 20:27:18 +02:00
Logan Williams
16aad4ef2c TelegramTelethonScraper: Using the username is fine. 2022-03-31 16:50:20 +02:00
Logan Williams
94cf6c3d84 TelegramTelethonScraper: Use channel_id when channel has been previously encountered 2022-03-31 16:37:54 +02:00
Logan Williams
061af984ee Merge pull request #20 from bellingcat/separate-media-archiving
WIP: Separate media archiving and CLI
2022-03-31 16:28:30 +02:00
Logan Williams
7f87b03de5 Add option to clear registered scrapers, necessary for tests 2022-03-31 16:17:35 +02:00
Logan Williams
c8d1b96e3f Fix bug in handling retweets without media 2022-03-31 15:51:17 +02:00
Logan Williams
a5cffa615f Fix Twitter profile scraper, catch exceptions in controller 2022-03-31 15:37:58 +02:00
Logan Williams
2dc9213d64 Use new RawChannelInfo class 2022-03-31 15:17:25 +02:00
Logan Williams
61c99d33f6 Add Postgres support with psycopg2 2022-03-31 08:15:53 +02:00
Logan Williams
cff1953d21 Initial CLI tool 2022-03-31 08:15:11 +02:00
Logan Williams
1c1ff7fb6f Fix bug with Telethon scraper and certain media; add media_archived flag to TwitterScraper 2022-03-31 08:15:09 +02:00
Logan Williams
19056a1d9a Merge pull request #23 from bellingcat/profile
Added methods for retrieving channel profile metadata, refactored Gab scraper to use gabber
2022-03-31 08:13:17 +02:00
Tristan Lee
b7871b060d added capability to scrape Gab group posts 2022-03-30 09:11:07 -05:00
Tristan Lee
1f99e52436 refactored Gab scraper to use gabber instead of garc 2022-03-30 08:05:10 -05:00
Tristan Lee
b805d50132 made tests work, fixed several issues with Rumble scraper 2022-03-29 16:09:51 -05:00
Tristan Lee
67d1abf024 added methods for extracting channel profile metadata, and tests 2022-03-28 21:11:34 -05:00
Tristan Lee
ea40ea2640 merged main 2022-03-28 20:22:34 -05:00
Tristan Lee
5d6473e946 Merge pull request #19 from bellingcat/separate-media-archiving
Separate media archiving
2022-03-28 20:20:57 -05:00