241 Commits

Author SHA1 Message Date
Tristan Lee
965bf1e2dc added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better 2022-03-11 17:19:52 -06:00
Tristan Lee
821c39004b incorporated vkontakte scraper 2022-03-10 22:32:39 -06:00
Tristan Lee
3d919316a9 added Bitchute scraper, minor change to Bitchute scraper to correctly extract author name and id 2022-03-10 13:03:01 -06:00
Tristan Lee
5783206ad8 implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers 2022-03-10 10:20:49 -06:00
Logan Williams
fa5037d67c Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction 2022-03-10 15:34:24 +01:00
Tristan Lee
6cf3b8842d renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test 2022-03-09 13:19:35 -06:00
Tristan Lee
739e1d8484 added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram 2022-03-09 12:12:01 -06:00
Tristan Lee
506fb54a53 added wrapper for requests that retries after encountering exception 2022-03-07 13:28:33 -06:00
Logan Williams
253a9bea49 Merge pull request #3 from bellingcat/media
Expanding media archiving, implementing Odysee scraper
2022-03-07 11:47:08 +01:00
Tristan Lee
cd5f68e9e5 added basic unit tests 2022-03-04 12:36:09 -06:00
Tristan Lee
c21e43ddfa refactored import structure 2022-03-04 10:55:54 -06:00
Tristan Lee
75240bb060 fixed various bugs related to archived URL creation and media downloading. Things seem to work well now 2022-03-01 15:58:18 -06:00
Tristan Lee
f3d9dc91c6 changed URL parsing to use urllib 2022-03-01 14:13:04 -06:00
Tristan Lee
ee4d64750b added prototype Rumble scraper 2022-02-28 18:38:33 -06:00
Tristan Lee
bc840e631d added Gab scraper 2022-02-28 12:11:21 -06:00
Tristan Lee
7a257ea9f5 included comments in Odysee scraper 2022-02-28 09:15:09 -06:00
Tristan Lee
36fb95d9ae Merge branch 'main' into media 2022-02-25 20:30:36 -06:00
Tristan Lee
47dad8fb00 added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster 2022-02-25 20:28:00 -06:00
Tristan Lee
ef83cc4b0a converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url. 2022-02-25 13:43:30 -06:00
Tristan Lee
bd7bbdf993 Merge pull request #2 from bellingcat/media
WIP: Archiving media, organization improvements
2022-02-25 08:26:58 -06:00
Logan Williams
8ab56ff5ba Remove MAX_POSTS, auto detect MIME type
Co-authored-by: Tristan Lee <tristan@bellingcat.com>
2022-02-25 08:52:42 +01:00
Logan Williams
e6085689b5 On second thought, don't share secrets 2022-02-24 20:47:46 +01:00
Logan Williams
3480452fac Fix type hints 2022-02-24 20:36:23 +01:00
Logan Williams
1ad7c8bc11 Search for since per-channel 2022-02-24 20:26:10 +01:00
Logan Williams
0b1c175dd9 Modify GettrScraper to yield results, archive media (videos incomplete) 2022-02-24 20:25:14 +01:00
Logan Williams
456d592792 Use user id for TwitterScraper 2022-02-24 20:24:03 +01:00
Logan Williams
d159c09aa4 yield data rather than returning a list 2022-02-24 18:58:08 +01:00
Logan Williams
d163e6b3d9 Fix logging logic in scraper controller 2022-02-24 18:49:06 +01:00
Logan Williams
e64d845002 Archive media in Twitter scraper 2022-02-24 18:48:48 +01:00
Logan Williams
214287b7a8 Archive media in dictionary 2022-02-24 17:35:24 +01:00
Logan Williams
a87cfd570a Add Telegram channel scraper 2022-02-24 16:37:13 +01:00
Logan Williams
6092e4caa5 Add method for archiving media, reorganize scraper base classes 2022-02-24 16:36:55 +01:00
Logan Williams
e09e0f5202 Merge pull request #1 from bellingcat/add-docs
Add documentation generation with Sphinx
2022-02-22 15:19:54 +01:00
Tristan Lee
9fe3d90b0b fixed warnings from sphinx-build, made build path consistent with gitignore (removed sphinx build directory from version control) 2022-02-21 16:13:16 -06:00
Logan Williams
e3d29bf811 Add documentation generation with Sphinx 2022-02-21 17:52:38 +01:00
Tristan Lee
139459e3b2 implemented Bitchute scraper 2022-02-18 12:45:10 -06:00
Tristan Lee
4668d4df11 implemented Gettr scraper 2022-02-18 10:13:37 -06:00
Logan Williams
0e5f9f77f3 Configure pipenv 2022-02-18 15:05:02 +01:00
Logan Williams
b824b98a95 Reorganize transformer definition location 2022-02-18 14:57:10 +01:00
Logan Williams
c5d49ef521 Reorganize class definitions slightly 2022-02-18 14:14:25 +01:00
Logan Williams
82ad210b8e Initial commit 2022-02-18 14:01:49 +01:00