Commit Graph

121 Commits

Author SHA1 Message Date
Logan Williams
4c221d1133 Transformer for Telegram, base transformer NLP hydration; no media 2022-04-14 11:45:09 +02:00
Logan Williams
59bab0d812 Disable Youtube scraper for now 2022-04-13 10:12:20 +02:00
Logan Williams
a2e62cc489 Merge pull request #52 from bellingcat/next-release
Next release
2022-04-13 10:11:49 +02:00
Logan Williams
d96b8177a5 Merge pull request #49 from bellingcat/sync-channels
Synchronize channels as well as adding new ones
2022-04-13 10:11:34 +02:00
Logan Williams
e7c3771788 Merge pull request #51 from bellingcat/channel-info
Channel info
2022-04-13 10:11:21 +02:00
Logan Williams
a0dbe7d92b Catch errors in channel info 2022-04-13 10:10:29 +02:00
Logan Williams
209152ea69 Synchronize channels that have changed info 2022-04-12 18:13:52 +02:00
Logan Williams
d5f6ce485b Merge pull request #46 from bellingcat/next-release
Release 2022-04-12
v2022-04-12
2022-04-12 14:59:01 +02:00
Logan Williams
d1f9dd0e01 Limit max # of archived files per session 2022-04-12 12:57:04 +00:00
Logan Williams
bbb9d283d5 Add RumbleScraper, YoutubeScraper, and BitchuteScraper to the active scrapers 2022-04-12 14:55:45 +02:00
Logan Williams
6f11b88f94 Use Youtube cookie for Rumble too 2022-04-12 14:55:18 +02:00
Logan Williams
7b8236e6db No recursive retries 2022-04-12 14:55:05 +02:00
Logan Williams
b596d3e055 Merge pull request #42 from bellingcat/youtube-age-restricted
Enable download of age-restricted videos on YouTube
2022-04-12 11:14:45 +02:00
Logan Williams
1f7f957e62 Merge pull request #44 from bellingcat/bitchute-error
Catch errors while retrieving Bitchute videos
2022-04-12 11:13:52 +02:00
Logan Williams
e05f69bbee Merge pull request #38 from bellingcat/youtube-dl-retry
Added 'retries' argument to youtube_dl options
2022-04-12 11:11:51 +02:00
Tristan Lee
1f667d532e made get_videos_user use request_from_bitchute requests wrapper to catch errors 2022-04-06 11:40:43 -05:00
Tristan Lee
f17800b797 added required YOUTUBE_COOKIESTRING environment variable to be used by YoutubeScraper 2022-04-05 21:22:41 -05:00
Tristan Lee
a204041480 made requested changes to scraper version numbers 2022-04-05 17:03:45 -05:00
Logan Williams
36c81c8e17 Merge pull request #40 from bellingcat/next-release
Add indices on appropriate columns; limit # of posts to archive
v2022-04-04
2022-04-04 13:03:16 +02:00
Logan Williams
b6386747d4 Add indices on appropriate columns; limit # of posts to archive 2022-04-04 10:54:27 +00:00
Tristan Lee
ed74c5692b merged main 2022-04-03 19:35:16 -05:00
Tristan Lee
c7253148d1 added 'retries' argument to youtube-dl options, and made options consistent across youtube-dl instances. 2022-04-03 19:31:32 -05:00
Logan Williams
fccbad7a93 Remove 200 post limit; add log rotation v2022-04-03 2022-04-03 16:32:00 +00:00
Logan Williams
4c580519dd Remove Rumble scraper 2022-04-03 15:59:39 +02:00
Logan Williams
0140b09ee8 Release Telethon, VK, and Gettr as 0.0.1; specify unreleased 0.0.0 otherwise 2022-04-03 15:29:24 +02:00
Logan Williams
96db662572 Don't add a timestamp to media that failed to archive 2022-04-03 14:16:03 +02:00
Logan Williams
ecae1aad05 Catch exceptions in archive_files so that archiver continues to run 2022-04-03 14:12:23 +02:00
Logan Williams
9c838aae39 Update media_archived column even when TG post has no media 2022-04-03 13:29:10 +02:00
Logan Williams
57b9082271 Remove Odysee scraper due to errors 2022-04-03 13:26:05 +02:00
Logan Williams
a82ec15f0e Change archived_media to be timestamp for all scrapers 2022-04-03 12:02:27 +02:00
Logan Williams
8ee20a239c Merge branch 'main' into initial-release 2022-04-03 11:35:12 +02:00
Tristan Lee
90c99aec00 ensured that Gettr username is lowercase for API requests to work correctly 2022-04-02 22:36:25 -05:00
Tristan Lee
b0a52e5ad7 handled case where Rumble video has no view information displayed 2022-04-02 21:26:29 -05:00
Logan Williams
01bbabe0cb Fix issues with new datetime based 'media_archived' column 2022-04-02 18:45:08 +00:00
Logan Williams
63633617d2 Configure with Telethon and VK only 2022-04-02 18:34:14 +00:00
Logan Williams
0099558c68 Merge pull request #26 from bellingcat/deferred-media-archiving
Implemented deferred media archiving for all scrapers
2022-04-02 14:15:35 +02:00
Tristan Lee
0bab20e371 ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null. 2022-04-01 17:03:02 -05:00
Tristan Lee
8ecb904249 merged main 2022-04-01 02:05:25 -05:00
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors 2022-03-31 20:27:18 +02:00
Logan Williams
16aad4ef2c TelegramTelethonScraper: Using the username is fine. 2022-03-31 16:50:20 +02:00
Logan Williams
94cf6c3d84 TelegramTelethonScraper: Use channel_id when channel has been previously encountered 2022-03-31 16:37:54 +02:00
Logan Williams
061af984ee Merge pull request #20 from bellingcat/separate-media-archiving
WIP: Separate media archiving and CLI
2022-03-31 16:28:30 +02:00
Logan Williams
7f87b03de5 Add option to clear registered scrapers, necessary for tests 2022-03-31 16:17:35 +02:00
Logan Williams
c8d1b96e3f Fix bug in handling retweets without media 2022-03-31 15:51:17 +02:00
Logan Williams
a5cffa615f Fix Twitter profile scraper, catch exceptions in controller 2022-03-31 15:37:58 +02:00
Logan Williams
2dc9213d64 Use new RawChannelInfo class 2022-03-31 15:17:25 +02:00
Logan Williams
61c99d33f6 Add Postgres support with psycopg2 2022-03-31 08:15:53 +02:00
Logan Williams
cff1953d21 Initial CLI tool 2022-03-31 08:15:11 +02:00
Logan Williams
1c1ff7fb6f Fix bug with Telethon scraper and certain media; add media_archived flag to TwitterScraper 2022-03-31 08:15:09 +02:00