Tristan Lee
|
965bf1e2dc
|
added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better
|
2022-03-11 17:19:52 -06:00 |
|
Tristan Lee
|
821c39004b
|
incorporated vkontakte scraper
|
2022-03-10 22:32:39 -06:00 |
|
Tristan Lee
|
3d919316a9
|
added Bitchute scraper, minor change to Bitchute scraper to correctly extract author name and id
|
2022-03-10 13:03:01 -06:00 |
|
Tristan Lee
|
5783206ad8
|
implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers
|
2022-03-10 10:20:49 -06:00 |
|
Logan Williams
|
fa5037d67c
|
Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction
|
2022-03-10 15:34:24 +01:00 |
|
Tristan Lee
|
6cf3b8842d
|
renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test
|
2022-03-09 13:19:35 -06:00 |
|
Tristan Lee
|
739e1d8484
|
added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram
|
2022-03-09 12:12:01 -06:00 |
|
Tristan Lee
|
506fb54a53
|
added wrapper for requests that retries after encountering exception
|
2022-03-07 13:28:33 -06:00 |
|
Logan Williams
|
253a9bea49
|
Merge pull request #3 from bellingcat/media
Expanding media archiving, implementing Odysee scraper
|
2022-03-07 11:47:08 +01:00 |
|
Tristan Lee
|
cd5f68e9e5
|
added basic unit tests
|
2022-03-04 12:36:09 -06:00 |
|
Tristan Lee
|
c21e43ddfa
|
refactored import structure
|
2022-03-04 10:55:54 -06:00 |
|
Tristan Lee
|
75240bb060
|
fixed various bugs related to archived URL creation and media downloading. Things seem to work well now
|
2022-03-01 15:58:18 -06:00 |
|
Tristan Lee
|
f3d9dc91c6
|
changed URL parsing to use urllib
|
2022-03-01 14:13:04 -06:00 |
|
Tristan Lee
|
ee4d64750b
|
added prototype Rumble scraper
|
2022-02-28 18:38:33 -06:00 |
|
Tristan Lee
|
bc840e631d
|
added Gab scraper
|
2022-02-28 12:11:21 -06:00 |
|
Tristan Lee
|
7a257ea9f5
|
included comments in Odysee scraper
|
2022-02-28 09:15:09 -06:00 |
|
Tristan Lee
|
36fb95d9ae
|
Merge branch 'main' into media
|
2022-02-25 20:30:36 -06:00 |
|
Tristan Lee
|
47dad8fb00
|
added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster
|
2022-02-25 20:28:00 -06:00 |
|
Tristan Lee
|
ef83cc4b0a
|
converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url.
|
2022-02-25 13:43:30 -06:00 |
|
Tristan Lee
|
bd7bbdf993
|
Merge pull request #2 from bellingcat/media
WIP: Archiving media, organization improvements
|
2022-02-25 08:26:58 -06:00 |
|
Logan Williams
|
8ab56ff5ba
|
Remove MAX_POSTS, auto detect MIME type
Co-authored-by: Tristan Lee <tristan@bellingcat.com>
|
2022-02-25 08:52:42 +01:00 |
|
Logan Williams
|
e6085689b5
|
On second thought, don't share secrets
|
2022-02-24 20:47:46 +01:00 |
|
Logan Williams
|
3480452fac
|
Fix type hints
|
2022-02-24 20:36:23 +01:00 |
|
Logan Williams
|
1ad7c8bc11
|
Search for since per-channel
|
2022-02-24 20:26:10 +01:00 |
|
Logan Williams
|
0b1c175dd9
|
Modify GettrScraper to yield results, archive media (videos incomplete)
|
2022-02-24 20:25:14 +01:00 |
|
Logan Williams
|
456d592792
|
Use user id for TwitterScraper
|
2022-02-24 20:24:03 +01:00 |
|
Logan Williams
|
d159c09aa4
|
yield data rather than returning a list
|
2022-02-24 18:58:08 +01:00 |
|
Logan Williams
|
d163e6b3d9
|
Fix logging logic in scraper controller
|
2022-02-24 18:49:06 +01:00 |
|
Logan Williams
|
e64d845002
|
Archive media in Twitter scraper
|
2022-02-24 18:48:48 +01:00 |
|
Logan Williams
|
214287b7a8
|
Archive media in dictionary
|
2022-02-24 17:35:24 +01:00 |
|
Logan Williams
|
a87cfd570a
|
Add Telegram channel scraper
|
2022-02-24 16:37:13 +01:00 |
|
Logan Williams
|
6092e4caa5
|
Add method for archiving media, reorganize scraper base classes
|
2022-02-24 16:36:55 +01:00 |
|
Logan Williams
|
e09e0f5202
|
Merge pull request #1 from bellingcat/add-docs
Add documentation generation with Sphinx
|
2022-02-22 15:19:54 +01:00 |
|
Tristan Lee
|
9fe3d90b0b
|
fixed warnings from sphinx-build, made build path consistent with gitignore (removed sphinx build directory from version control)
|
2022-02-21 16:13:16 -06:00 |
|
Logan Williams
|
e3d29bf811
|
Add documentation generation with Sphinx
|
2022-02-21 17:52:38 +01:00 |
|
Tristan Lee
|
139459e3b2
|
implemented Bitchute scraper
|
2022-02-18 12:45:10 -06:00 |
|
Tristan Lee
|
4668d4df11
|
implemented Gettr scraper
|
2022-02-18 10:13:37 -06:00 |
|
Logan Williams
|
0e5f9f77f3
|
Configure pipenv
|
2022-02-18 15:05:02 +01:00 |
|
Logan Williams
|
b824b98a95
|
Reorganize transformer definition location
|
2022-02-18 14:57:10 +01:00 |
|
Logan Williams
|
c5d49ef521
|
Reorganize class definitions slightly
|
2022-02-18 14:14:25 +01:00 |
|
Logan Williams
|
82ad210b8e
|
Initial commit
|
2022-02-18 14:01:49 +01:00 |
|