Logan Williams
|
b6386747d4
|
Add indices on appropriate columns; limit # of posts to archive
|
2022-04-04 10:54:27 +00:00 |
|
Logan Williams
|
fccbad7a93
|
Remove 200 post limit; add log rotation
|
2022-04-03 16:32:00 +00:00 |
|
Logan Williams
|
96db662572
|
Don't add a timestamp to media that failed to archive
|
2022-04-03 14:16:03 +02:00 |
|
Logan Williams
|
ecae1aad05
|
Catch exceptions in archive_files so that archiver continues to run
|
2022-04-03 14:12:23 +02:00 |
|
Logan Williams
|
a82ec15f0e
|
Change archived_media to be timestamp for all scrapers
|
2022-04-03 12:02:27 +02:00 |
|
Logan Williams
|
01bbabe0cb
|
Fix issues with new datetime-based 'media_archived' column
|
2022-04-02 18:45:08 +00:00 |
|
Logan Williams
|
63633617d2
|
Configure with Telethon and VK only
|
2022-04-02 18:34:14 +00:00 |
|
Tristan Lee
|
0bab20e371
|
ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null.
|
2022-04-01 17:03:02 -05:00 |
|
Tristan Lee
|
8ecb904249
|
merged main
|
2022-04-01 02:05:25 -05:00 |
|
Tristan Lee
|
282f33eff3
|
implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method
|
2022-04-01 01:30:49 -05:00 |
|
Logan Williams
|
d20db5f828
|
Catch exceptions in get_posts so that archiving continues despite errors
|
2022-03-31 20:27:18 +02:00 |
|
Logan Williams
|
7f87b03de5
|
Add option to clear registered scrapers, necessary for tests
|
2022-03-31 16:17:35 +02:00 |
|
Logan Williams
|
a5cffa615f
|
Fix Twitter profile scraper, catch exceptions in controller
|
2022-03-31 15:37:58 +02:00 |
|
Logan Williams
|
cff1953d21
|
Initial CLI tool
|
2022-03-31 08:15:11 +02:00 |
|
Tristan Lee
|
b805d50132
|
made tests work, fixed several issues with Rumble scraper
|
2022-03-29 16:09:51 -05:00 |
|
Logan Williams
|
a80dbddbbc
|
Add snscrape delayed media archiving support; add explicit bool
|
2022-03-28 11:42:15 +02:00 |
|
Logan Williams
|
63fdae9f1b
|
Implement media archiving after the initial scrape for Twitter and Telethon
|
2022-03-24 16:52:11 +01:00 |
|
Logan Williams
|
2a3b5c8200
|
Merge branch 'main' into channel-db
|
2022-03-22 11:49:07 +01:00 |
|
Logan Williams
|
806f07f458
|
Add functions for scraping based on Channel database
|
2022-03-22 11:26:46 +01:00 |
|
Tristan Lee
|
d68d76c0ab
|
added missing docstrings, created Makefile target for sphinx-apidoc, added quickstart page for installation and configuration instructions
|
2022-03-15 12:40:18 -05:00 |
|
Tristan Lee
|
ee9a8c10dd
|
merged main into branch
|
2022-03-15 09:16:11 -05:00 |
|
Tristan Lee
|
a3c859ec79
|
added more docstrings and comments
|
2022-03-14 19:38:33 -05:00 |
|
Tristan Lee
|
c3eab2f176
|
merged main
|
2022-03-14 18:19:57 -05:00 |
|
Tristan Lee
|
e4cf9daf73
|
added docstrings, improved Sphinx docs
|
2022-03-14 18:04:27 -05:00 |
|
Tristan Lee
|
965bf1e2dc
|
added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better
|
2022-03-11 17:19:52 -06:00 |
|
Tristan Lee
|
821c39004b
|
incorporated vkontakte scraper
|
2022-03-10 22:32:39 -06:00 |
|
Tristan Lee
|
5783206ad8
|
implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers
|
2022-03-10 10:20:49 -06:00 |
|
Logan Williams
|
fa5037d67c
|
Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction
|
2022-03-10 15:34:24 +01:00 |
|
Tristan Lee
|
6cf3b8842d
|
renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test
|
2022-03-09 13:19:35 -06:00 |
|
Tristan Lee
|
739e1d8484
|
added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram
|
2022-03-09 12:12:01 -06:00 |
|
Tristan Lee
|
506fb54a53
|
added wrapper for requests that retries after encountering exception
|
2022-03-07 13:28:33 -06:00 |
|
Tristan Lee
|
c21e43ddfa
|
refactored import structure
|
2022-03-04 10:55:54 -06:00 |
|
Tristan Lee
|
75240bb060
|
fixed various bugs related to archived URL creation and media downloading. Things seem to work well now
|
2022-03-01 15:58:18 -06:00 |
|
Tristan Lee
|
f3d9dc91c6
|
changed URL parsing to use urllib
|
2022-03-01 14:13:04 -06:00 |
|
Tristan Lee
|
bc840e631d
|
added Gab scraper
|
2022-02-28 12:11:21 -06:00 |
|
Tristan Lee
|
47dad8fb00
|
added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster
|
2022-02-25 20:28:00 -06:00 |
|
Tristan Lee
|
ef83cc4b0a
|
converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url.
|
2022-02-25 13:43:30 -06:00 |
|
Logan Williams
|
8ab56ff5ba
|
Remove MAX_POSTS, auto detect MIME type
Co-authored-by: Tristan Lee <tristan@bellingcat.com>
|
2022-02-25 08:52:42 +01:00 |
|
Logan Williams
|
3480452fac
|
Fix type hints
|
2022-02-24 20:36:23 +01:00 |
|
Logan Williams
|
214287b7a8
|
Archive media in dictionary
|
2022-02-24 17:35:24 +01:00 |
|
Logan Williams
|
a87cfd570a
|
Add Telegram channel scraper
|
2022-02-24 16:37:13 +01:00 |
|