241 Commits

Author SHA1 Message Date
Logan Williams
8ee20a239c Merge branch 'main' into initial-release 2022-04-03 11:35:12 +02:00
Tristan Lee
90c99aec00 ensured that Gettr username is lowercase for API requests to work correctly 2022-04-02 22:36:25 -05:00
Tristan Lee
b0a52e5ad7 handled case where Rumble video has no view information displayed 2022-04-02 21:26:29 -05:00
Logan Williams
01bbabe0cb Fix issues with new datetime based 'media_archived' column 2022-04-02 18:45:08 +00:00
Logan Williams
63633617d2 Configure with Telethon and VK only 2022-04-02 18:34:14 +00:00
Logan Williams
0099558c68 Merge pull request #26 from bellingcat/deferred-media-archiving
Implemented deferred media archiving for all scrapers
2022-04-02 14:15:35 +02:00
Tristan Lee
0bab20e371 ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null. 2022-04-01 17:03:02 -05:00
Tristan Lee
8ecb904249 merged main 2022-04-01 02:05:25 -05:00
Tristan Lee
282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method 2022-04-01 01:30:49 -05:00
Logan Williams
d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors 2022-03-31 20:27:18 +02:00
Logan Williams
16aad4ef2c TelegramTelethonScraper: Using the username is fine. 2022-03-31 16:50:20 +02:00
Logan Williams
94cf6c3d84 TelegramTelethonScraper: Use channel_id when channel has been previously encountered 2022-03-31 16:37:54 +02:00
Logan Williams
061af984ee Merge pull request #20 from bellingcat/separate-media-archiving
WIP: Separate media archiving and CLI
2022-03-31 16:28:30 +02:00
Logan Williams
7f87b03de5 Add option to clear registered scrapers, necessary for tests 2022-03-31 16:17:35 +02:00
Logan Williams
c8d1b96e3f Fix bug in handling retweets without media 2022-03-31 15:51:17 +02:00
Logan Williams
a5cffa615f Fix Twitter profile scraper, catch exceptions in controller 2022-03-31 15:37:58 +02:00
Logan Williams
2dc9213d64 Use new RawChannelInfo class 2022-03-31 15:17:25 +02:00
Logan Williams
61c99d33f6 Add Postgres support with psycopg2 2022-03-31 08:15:53 +02:00
Logan Williams
cff1953d21 Initial CLI tool 2022-03-31 08:15:11 +02:00
Logan Williams
1c1ff7fb6f Fix bug with Telethon scraper and certain media; add media_archived flag to TwitterScraper 2022-03-31 08:15:09 +02:00
Logan Williams
19056a1d9a Merge pull request #23 from bellingcat/profile
Added methods for retrieving channel profile metadata, refactored Gab scraper to use gabber
2022-03-31 08:13:17 +02:00
Tristan Lee
b7871b060d added capability to scrape Gab group posts 2022-03-30 09:11:07 -05:00
Tristan Lee
1f99e52436 refactored Gab scraper to use gabber instead of garc 2022-03-30 08:05:10 -05:00
Tristan Lee
b805d50132 made tests work, fixed several issues with Rumble scraper 2022-03-29 16:09:51 -05:00
Tristan Lee
67d1abf024 added methods for extracting channel profile metadata, and tests 2022-03-28 21:11:34 -05:00
Tristan Lee
ea40ea2640 merged main 2022-03-28 20:22:34 -05:00
Tristan Lee
5d6473e946 Merge pull request #19 from bellingcat/separate-media-archiving
Separate media archiving
2022-03-28 20:20:57 -05:00
Tristan Lee
16870d7daa implemented methods for extracting profile metadata (still need to test) 2022-03-28 20:16:59 -05:00
Logan Williams
a80dbddbbc Add snscrape delayed media archiving support; add explicit bool 2022-03-28 11:42:15 +02:00
Tristan Lee
d68cbd207a Merge pull request #17 from bellingcat/channel-db
Add Channel object to ORM, store in DB
2022-03-24 13:07:03 -05:00
Logan Williams
63fdae9f1b Implement media archiving after the initial scrape for Twitter and Telethon 2022-03-24 16:52:11 +01:00
Logan Williams
65edde6d20 Fix bug after merge 2022-03-22 11:56:28 +01:00
Logan Williams
2a3b5c8200 Merge branch 'main' into channel-db 2022-03-22 11:49:07 +01:00
Logan Williams
fa516da763 Rename TransformedResult to the clearer Post 2022-03-22 11:41:55 +01:00
Logan Williams
c0a094eefa Load channels from google sheet in test.py 2022-03-22 11:37:47 +01:00
Logan Williams
571b019137 Fix tests for Twitter transformer 2022-03-22 11:33:27 +01:00
Logan Williams
806f07f458 Add functions for scraping based on Channel database 2022-03-22 11:26:46 +01:00
Logan Williams
885b4687ce Add ORM for Channel class; update foreign key relations; add platform_id to TransformedResult 2022-03-22 11:25:52 +01:00
Logan Williams
d5bf3629c2 Merge pull request #16 from bellingcat/docs
Docs
2022-03-16 15:20:51 +01:00
Tristan Lee
93554b19e9 fixed typo 2022-03-15 13:05:41 -05:00
Tristan Lee
d68d76c0ab added missing docstrings, created Makefile target for sphinx-apidoc, added quickstart page for installation and configuration instructions 2022-03-15 12:40:18 -05:00
Tristan Lee
ee9a8c10dd merged main into branch 2022-03-15 09:16:11 -05:00
Tristan Lee
e287fd03d9 merged scraper into main and fixed minor merge conflict 2022-03-15 09:12:12 -05:00
Tristan Lee
a3c859ec79 added more docstrings and comments 2022-03-14 19:38:33 -05:00
Tristan Lee
c3eab2f176 merged main 2022-03-14 18:19:57 -05:00
Tristan Lee
e4cf9daf73 added docstrings, improved Sphinx docs 2022-03-14 18:04:27 -05:00
Tristan Lee
db03cbf141 Merge pull request #13 from bellingcat/transformer
Merged Transformer branch into main, including example of Transformer instance for Twitter and associated test
2022-03-14 11:13:57 -05:00
Tristan Lee
750f0cc887 added scraper for Instagram 2022-03-14 10:28:10 -05:00
Logan Williams
fe0d762df0 Add Transformer and ETLController docstrings 2022-03-14 14:02:57 +01:00
Logan Williams
fd4b617743 Add TwitterTransformer test 2022-03-14 13:39:10 +01:00