Commit Graph

24 Commits

Author SHA1 Message Date
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
2dc9213d64 Use new RawChannelInfo class 2022-03-31 15:17:25 +02:00
Logan Williams
61c99d33f6 Add Postgres support with psycopg2 2022-03-31 08:15:53 +02:00
Tristan Lee
1f99e52436 refactored Gab scraper to use gabber instead of garc 2022-03-30 08:05:10 -05:00
Logan Williams
63fdae9f1b Implement media archiving after the initial scrape for Twitter and Telethon 2022-03-24 16:52:11 +01:00
Logan Williams
2a3b5c8200 Merge branch 'main' into channel-db 2022-03-22 11:49:07 +01:00
Logan Williams
c0a094eefa Load channels from google sheet in test.py 2022-03-22 11:37:47 +01:00
Tristan Lee
ee9a8c10dd merged main into branch 2022-03-15 09:16:11 -05:00
Tristan Lee
c3eab2f176 merged main 2022-03-14 18:19:57 -05:00
Tristan Lee
e4cf9daf73 added docstrings, improved Sphinx docs 2022-03-14 18:04:27 -05:00
Tristan Lee
750f0cc887 added scraper for Instagram 2022-03-14 10:28:10 -05:00
Tristan Lee
965bf1e2dc added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better 2022-03-11 17:19:52 -06:00
Logan Williams
fa5037d67c Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction 2022-03-10 15:34:24 +01:00
Tristan Lee
739e1d8484 added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram 2022-03-09 12:12:01 -06:00
Tristan Lee
cd5f68e9e5 added basic unit tests 2022-03-04 12:36:09 -06:00
Tristan Lee
ee4d64750b added prototype Rumble scraper 2022-02-28 18:38:33 -06:00
Tristan Lee
bc840e631d added Gab scraper 2022-02-28 12:11:21 -06:00
Tristan Lee
47dad8fb00 added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster 2022-02-25 20:28:00 -06:00
Tristan Lee
ef83cc4b0a converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url. 2022-02-25 13:43:30 -06:00
Logan Williams
6092e4caa5 Add method for archiving media, reorganize scraper base classes 2022-02-24 16:36:55 +01:00
Logan Williams
e3d29bf811 Add documentation generation with Sphinx 2022-02-21 17:52:38 +01:00
Tristan Lee
139459e3b2 implemented Bitchute scraper 2022-02-18 12:45:10 -06:00
Tristan Lee
4668d4df11 implemented Gettr scraper 2022-02-18 10:13:37 -06:00
Logan Williams
0e5f9f77f3 Configure pipenv 2022-02-18 15:05:02 +01:00