Commit Graph

113 Commits

Author SHA1 Message Date
Logan Williams
39358c7f23 Update platform ID and screenname when synchronizing with gsheet; highlight dupes 2022-06-06 16:36:39 +02:00
Tristan Lee
424c063ef2 Merge pull request #55 from bellingcat/channel-info-transformers
Transformers for raw channel info
2022-05-18 06:58:18 -07:00
Logan Williams
c279ced73d Minor bug fixes from testing 2022-05-18 09:29:53 +01:00
Logan Williams
6145fd0b6b Add Telegram transformer for channel info 2022-05-18 09:17:49 +01:00
Tristan Lee
317da2c9d4 Merge pull request #54 from bellingcat/transformers
Functional Telegram transformer
2022-05-17 10:18:12 -07:00
Logan Williams
9869612b67 Merge pull request #50 from bellingcat/odysee-refactor
Implemented Polyphemus refactoring changes into Odysee scraper
2022-05-16 12:25:35 +01:00
Logan Williams
7f55b721dd Bug fixes in transformers 2022-05-13 15:39:01 +00:00
Logan Williams
34da733e7c Add date_transformed; refinements to telethon transformer 2022-05-12 13:03:30 +00:00
Logan Williams
ab482443db Merge branch 'main' of https://github.com/bellingcat/cisticola into transformers 2022-04-16 13:55:23 +00:00
Logan Williams
8535a87def Ad hoc changes to transformers 2022-04-16 13:46:26 +00:00
Logan Williams
38e0104078 Separate logging; limit Telegram archive file size 2022-04-14 10:43:27 +00:00
Logan Williams
4c221d1133 Transformer for Telegram, base transformer NLP hydration; no media 2022-04-14 11:45:09 +02:00
Logan Williams
1ac8d6c603 Close sessions; sort channel info by least recently archived 2022-04-13 10:38:08 +00:00
Logan Williams
a0dbe7d92b Catch errors in channel info 2022-04-13 10:10:29 +02:00
Tristan Lee
27b51267a7 fixed bugs from incorporating polyphemus refactoring changes 2022-04-13 00:02:12 -05:00
Tristan Lee
ef7afc0715 Merge branch 'main' into odysee-refactor 2022-04-12 23:26:18 -05:00
Tristan Lee
dfc5b77726 incorporated polyphemus refactoring changes 2022-04-12 23:23:21 -05:00
Logan Williams
d5f6ce485b Merge pull request #46 from bellingcat/next-release
Release 2022-04-12
2022-04-12 14:59:01 +02:00
Logan Williams
d1f9dd0e01 Limit max # of archived files per session 2022-04-12 12:57:04 +00:00
Logan Williams
bbb9d283d5 Add RumbleScraper, YoutubeScraper, and BitchuteScraper to the active scrapers 2022-04-12 14:55:45 +02:00
Logan Williams
6f11b88f94 Use Youtube cookie for Rumble too 2022-04-12 14:55:18 +02:00
Logan Williams
7b8236e6db No recursive retries 2022-04-12 14:55:05 +02:00
Logan Williams
b596d3e055 Merge pull request #42 from bellingcat/youtube-age-restricted
Enable download of age-restricted videos on YouTube
2022-04-12 11:14:45 +02:00
Logan Williams
1f7f957e62 Merge pull request #44 from bellingcat/bitchute-error
Catch errors while retrieving Bitchute videos
2022-04-12 11:13:52 +02:00
Logan Williams
e05f69bbee Merge pull request #38 from bellingcat/youtube-dl-retry
Added 'retries' argument to youtube_dl options
2022-04-12 11:11:51 +02:00
Tristan Lee
1f667d532e made get_videos_user use request_from_bitchute requests wrapper to catch errors 2022-04-06 11:40:43 -05:00
Tristan Lee
f17800b797 added required YOUTUBE_COOKIESTRING environment variable to be used by YoutubeScraper 2022-04-05 21:22:41 -05:00
Tristan Lee
a204041480 made requested changes to scraper version numbers 2022-04-05 17:03:45 -05:00
Logan Williams
b6386747d4 Add indices on appropriate columns; limit # of posts to archive 2022-04-04 10:54:27 +00:00
Tristan Lee
ed74c5692b merged main 2022-04-03 19:35:16 -05:00
Tristan Lee
c7253148d1 added 'retries' argument to youtube-dl options, and made options consistent across youtube-dl instances. 2022-04-03 19:31:32 -05:00
Logan Williams
fccbad7a93 Remove 200 post limit; add log rotation 2022-04-03 16:32:00 +00:00
Logan Williams
0140b09ee8 Release Telethon, VK, and Gettr as 0.0.1; specify unreleased 0.0.0 otherwise 2022-04-03 15:29:24 +02:00
Logan Williams
96db662572 Don't add a timestamp to media that failed to archive 2022-04-03 14:16:03 +02:00
Logan Williams
ecae1aad05 Catch exceptions in archive_files so that archiver continues to run 2022-04-03 14:12:23 +02:00
Logan Williams
9c838aae39 Update media_archived column even when TG post has no media 2022-04-03 13:29:10 +02:00
Logan Williams
a82ec15f0e Change archived_media to be timestamp for all scrapers 2022-04-03 12:02:27 +02:00
Logan Williams
8ee20a239c Merge branch 'main' into initial-release 2022-04-03 11:35:12 +02:00
Tristan Lee
90c99aec00 ensured that Gettr username is lowercase for API requests to work correctly 2022-04-02 22:36:25 -05:00
Tristan Lee
b0a52e5ad7 handled case where Rumble video has no view information displayed 2022-04-02 21:26:29 -05:00
Logan Williams
01bbabe0cb Fix issues with new datetime based 'media_archived' column 2022-04-02 18:45:08 +00:00
Logan Williams
63633617d2 Configure with Telethon and VK only 2022-04-02 18:34:14 +00:00
Tristan Lee
0bab20e371 ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null. 2022-04-01 17:03:02 -05:00
Tristan Lee
8ecb904249 merged main 2022-04-01 02:05:25 -05:00
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors 2022-03-31 20:27:18 +02:00
Logan Williams
16aad4ef2c TelegramTelethonScraper: Using the username is fine. 2022-03-31 16:50:20 +02:00
Logan Williams
94cf6c3d84 TelegramTelethonScraper: Use channel_id when channel has been previously encountered 2022-03-31 16:37:54 +02:00
Logan Williams
7f87b03de5 Add option to clear registered scrapers, necessary for tests 2022-03-31 16:17:35 +02:00
Logan Williams
c8d1b96e3f Fix bug in handling retweets without media 2022-03-31 15:51:17 +02:00