82 Commits

Author SHA1 Message Date
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors 2022-03-31 20:27:18 +02:00
Logan Williams
16aad4ef2c TelegramTelethonScraper: Using the username is fine. 2022-03-31 16:50:20 +02:00
Logan Williams
94cf6c3d84 TelegramTelethonScraper: Use channel_id when channel has been previously encountered 2022-03-31 16:37:54 +02:00
Logan Williams
061af984ee Merge pull request #20 from bellingcat/separate-media-archiving
WIP: Separate media archiving and CLI
2022-03-31 16:28:30 +02:00
Logan Williams
7f87b03de5 Add option to clear registered scrapers, necessary for tests 2022-03-31 16:17:35 +02:00
Logan Williams
c8d1b96e3f Fix bug in handling retweets without media 2022-03-31 15:51:17 +02:00
Logan Williams
a5cffa615f Fix Twitter profile scraper, catch exceptions in controller 2022-03-31 15:37:58 +02:00
Logan Williams
2dc9213d64 Use new RawChannelInfo class 2022-03-31 15:17:25 +02:00
Logan Williams
61c99d33f6 Add Postgres support with psycopg2 2022-03-31 08:15:53 +02:00
Logan Williams
cff1953d21 Initial CLI tool 2022-03-31 08:15:11 +02:00
Logan Williams
1c1ff7fb6f Fix bug with Telethon scraper and certain media; add media_archived flag to TwitterScraper 2022-03-31 08:15:09 +02:00
Logan Williams
19056a1d9a Merge pull request #23 from bellingcat/profile
Added methods for retrieving channel profile metadata, refactored Gab scraper to use gabber
2022-03-31 08:13:17 +02:00
Tristan Lee
b7871b060d added capability to scrape Gab group posts 2022-03-30 09:11:07 -05:00
Tristan Lee
1f99e52436 refactored Gab scraper to use gabber instead of garc 2022-03-30 08:05:10 -05:00
Tristan Lee
b805d50132 made tests work, fixed several issues with Rumble scraper 2022-03-29 16:09:51 -05:00
Tristan Lee
67d1abf024 added methods for extracting channel profile metadata, and tests 2022-03-28 21:11:34 -05:00
Tristan Lee
ea40ea2640 merged main 2022-03-28 20:22:34 -05:00
Tristan Lee
5d6473e946 Merge pull request #19 from bellingcat/separate-media-archiving
Separate media archiving
2022-03-28 20:20:57 -05:00
Tristan Lee
16870d7daa implemented methods for extracting profile metadata (still need to test) 2022-03-28 20:16:59 -05:00
Logan Williams
a80dbddbbc Add snscrape delayed media archiving support; add explicit bool 2022-03-28 11:42:15 +02:00
Tristan Lee
d68cbd207a Merge pull request #17 from bellingcat/channel-db
Add Channel object to ORM, store in DB
2022-03-24 13:07:03 -05:00
Logan Williams
63fdae9f1b Implement media archiving after the initial scrape for Twitter and Telethon 2022-03-24 16:52:11 +01:00
Logan Williams
65edde6d20 Fix bug after merge 2022-03-22 11:56:28 +01:00
Logan Williams
2a3b5c8200 Merge branch 'main' into channel-db 2022-03-22 11:49:07 +01:00
Logan Williams
fa516da763 Rename TransformedResult to the clearer Post 2022-03-22 11:41:55 +01:00
Logan Williams
c0a094eefa Load channels from google sheet in test.py 2022-03-22 11:37:47 +01:00
Logan Williams
571b019137 Fix tests for Twitter transformer 2022-03-22 11:33:27 +01:00
Logan Williams
806f07f458 Add functions for scraping based on Channel database 2022-03-22 11:26:46 +01:00
Logan Williams
885b4687ce Add ORM for Channel class; update foreign key relations; add platform_id to TransformedResult 2022-03-22 11:25:52 +01:00
Logan Williams
d5bf3629c2 Merge pull request #16 from bellingcat/docs
Docs
2022-03-16 15:20:51 +01:00
Tristan Lee
93554b19e9 fixed typo 2022-03-15 13:05:41 -05:00
Tristan Lee
d68d76c0ab added missing docstrings, created Makefile target for sphinx-apidoc, added quickstart page for installation and configuration instructions 2022-03-15 12:40:18 -05:00
Tristan Lee
ee9a8c10dd merged main into branch 2022-03-15 09:16:11 -05:00
Tristan Lee
e287fd03d9 merged scraper into main and fixed minor merge conflict 2022-03-15 09:12:12 -05:00
Tristan Lee
a3c859ec79 added more docstrings and comments 2022-03-14 19:38:33 -05:00
Tristan Lee
c3eab2f176 merged main 2022-03-14 18:19:57 -05:00
Tristan Lee
e4cf9daf73 added docstrings, improved Sphinx docs 2022-03-14 18:04:27 -05:00
Tristan Lee
db03cbf141 Merge pull request #13 from bellingcat/transformer
Merged Transformer branch into main, including example of Transformer instance for Twitter and associated test
2022-03-14 11:13:57 -05:00
Tristan Lee
750f0cc887 added scraper for Instagram 2022-03-14 10:28:10 -05:00
Logan Williams
fe0d762df0 Add Transformer and ETLController docstrings 2022-03-14 14:02:57 +01:00
Logan Williams
fd4b617743 Add TwitterTransformer test 2022-03-14 13:39:10 +01:00
Tristan Lee
965bf1e2dc added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better 2022-03-11 17:19:52 -06:00
Tristan Lee
821c39004b incorporated vkontakte scraper 2022-03-10 22:32:39 -06:00
Tristan Lee
3d919316a9 added Bitchute scraper, minor change to Bitchute scraper to correctly extract author name and id 2022-03-10 13:03:01 -06:00
Tristan Lee
5783206ad8 implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers 2022-03-10 10:20:49 -06:00
Logan Williams
fa5037d67c Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction 2022-03-10 15:34:24 +01:00
Tristan Lee
6cf3b8842d renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test 2022-03-09 13:19:35 -06:00
Tristan Lee
739e1d8484 added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram 2022-03-09 12:12:01 -06:00
Tristan Lee
506fb54a53 added wrapper for requests that retries after encountering exception 2022-03-07 13:28:33 -06:00
Logan Williams
253a9bea49 Merge pull request #3 from bellingcat/media
Expanding media archiving, implementing Odysee scraper
2022-03-07 11:47:08 +01:00