Logan Williams
|
19056a1d9a
|
Merge pull request #23 from bellingcat/profile
Added methods for retrieving channel profile metadata, refactored Gab scraper to use gabber
|
2022-03-31 08:13:17 +02:00 |
|
Tristan Lee
|
b7871b060d
|
added capability to scrape Gab group posts
|
2022-03-30 09:11:07 -05:00 |
|
Tristan Lee
|
1f99e52436
|
refactored Gab scraper to use gabber instead of garc
|
2022-03-30 08:05:10 -05:00 |
|
Tristan Lee
|
b805d50132
|
made tests work, fixed several issues with Rumble scraper
|
2022-03-29 16:09:51 -05:00 |
|
Tristan Lee
|
67d1abf024
|
added methods for extracting channel profile metadata, and tests
|
2022-03-28 21:11:34 -05:00 |
|
Tristan Lee
|
ea40ea2640
|
merged main
|
2022-03-28 20:22:34 -05:00 |
|
Tristan Lee
|
5d6473e946
|
Merge pull request #19 from bellingcat/separate-media-archiving
Separate media archiving
|
2022-03-28 20:20:57 -05:00 |
|
Tristan Lee
|
16870d7daa
|
implemented methods for extracting profile metadata (still need to test)
|
2022-03-28 20:16:59 -05:00 |
|
Logan Williams
|
a80dbddbbc
|
Add snscrape delayed media archiving support; add explicit bool
|
2022-03-28 11:42:15 +02:00 |
|
Tristan Lee
|
d68cbd207a
|
Merge pull request #17 from bellingcat/channel-db
Add Channel object to ORM, store in DB
|
2022-03-24 13:07:03 -05:00 |
|
Logan Williams
|
63fdae9f1b
|
Implement media archiving after the initial scrape for Twitter and Telethon
|
2022-03-24 16:52:11 +01:00 |
|
Logan Williams
|
65edde6d20
|
Fix bug after merge
|
2022-03-22 11:56:28 +01:00 |
|
Logan Williams
|
2a3b5c8200
|
Merge branch 'main' into channel-db
|
2022-03-22 11:49:07 +01:00 |
|
Logan Williams
|
fa516da763
|
Rename TransformedResult to the clearer Post
|
2022-03-22 11:41:55 +01:00 |
|
Logan Williams
|
c0a094eefa
|
Load channels from google sheet in test.py
|
2022-03-22 11:37:47 +01:00 |
|
Logan Williams
|
571b019137
|
Fix tests for Twitter transformer
|
2022-03-22 11:33:27 +01:00 |
|
Logan Williams
|
806f07f458
|
Add functions for scraping based on Channel database
|
2022-03-22 11:26:46 +01:00 |
|
Logan Williams
|
885b4687ce
|
Add ORM for Channel class; update foreign key relations; add platform_id to TransformedResult
|
2022-03-22 11:25:52 +01:00 |
|
Logan Williams
|
d5bf3629c2
|
Merge pull request #16 from bellingcat/docs
Docs
|
2022-03-16 15:20:51 +01:00 |
|
Tristan Lee
|
93554b19e9
|
fixed typo
|
2022-03-15 13:05:41 -05:00 |
|
Tristan Lee
|
d68d76c0ab
|
added missing docstrings, created Makefile target for sphinx-apidoc, added quickstart page for installation and configuration instructions
|
2022-03-15 12:40:18 -05:00 |
|
Tristan Lee
|
ee9a8c10dd
|
merged main into branch
|
2022-03-15 09:16:11 -05:00 |
|
Tristan Lee
|
e287fd03d9
|
merged scraper into main and fixed minor merge conflict
|
2022-03-15 09:12:12 -05:00 |
|
Tristan Lee
|
a3c859ec79
|
added more docstrings and comments
|
2022-03-14 19:38:33 -05:00 |
|
Tristan Lee
|
c3eab2f176
|
merged main
|
2022-03-14 18:19:57 -05:00 |
|
Tristan Lee
|
e4cf9daf73
|
added docstrings, improved Sphinx docs
|
2022-03-14 18:04:27 -05:00 |
|
Tristan Lee
|
db03cbf141
|
Merge pull request #13 from bellingcat/transformer
Merged Transformer branch into main, including example of Transformer instance for Twitter and associated test
|
2022-03-14 11:13:57 -05:00 |
|
Tristan Lee
|
750f0cc887
|
added scraper for Instagram
|
2022-03-14 10:28:10 -05:00 |
|
Logan Williams
|
fe0d762df0
|
Add Transformer and ETLController docstrings
|
2022-03-14 14:02:57 +01:00 |
|
Logan Williams
|
fd4b617743
|
Add TwitterTransformer test
|
2022-03-14 13:39:10 +01:00 |
|
Tristan Lee
|
965bf1e2dc
|
added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better
|
2022-03-11 17:19:52 -06:00 |
|
Tristan Lee
|
821c39004b
|
incorporated vkontakte scraper
|
2022-03-10 22:32:39 -06:00 |
|
Tristan Lee
|
3d919316a9
|
added Bitchute scraper, minor change to Bitchute scraper to correctly extract author name and id
|
2022-03-10 13:03:01 -06:00 |
|
Tristan Lee
|
5783206ad8
|
implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers
|
2022-03-10 10:20:49 -06:00 |
|
Logan Williams
|
fa5037d67c
|
Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction
|
2022-03-10 15:34:24 +01:00 |
|
Tristan Lee
|
6cf3b8842d
|
renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test
|
2022-03-09 13:19:35 -06:00 |
|
Tristan Lee
|
739e1d8484
|
added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram
|
2022-03-09 12:12:01 -06:00 |
|
Tristan Lee
|
506fb54a53
|
added wrapper for requests that retries after encountering exception
|
2022-03-07 13:28:33 -06:00 |
|
Logan Williams
|
253a9bea49
|
Merge pull request #3 from bellingcat/media
Expanding media archiving, implementing Odysee scraper
|
2022-03-07 11:47:08 +01:00 |
|
Tristan Lee
|
cd5f68e9e5
|
added basic unit tests
|
2022-03-04 12:36:09 -06:00 |
|
Tristan Lee
|
c21e43ddfa
|
refactored import structure
|
2022-03-04 10:55:54 -06:00 |
|
Tristan Lee
|
75240bb060
|
fixed various bugs related to archived URL creation and media downloading. Things seem to work well now
|
2022-03-01 15:58:18 -06:00 |
|
Tristan Lee
|
f3d9dc91c6
|
changed URL parsing to use urllib
|
2022-03-01 14:13:04 -06:00 |
|
Tristan Lee
|
ee4d64750b
|
added prototype Rumble scraper
|
2022-02-28 18:38:33 -06:00 |
|
Tristan Lee
|
bc840e631d
|
added Gab scraper
|
2022-02-28 12:11:21 -06:00 |
|
Tristan Lee
|
7a257ea9f5
|
included comments in Odysee scraper
|
2022-02-28 09:15:09 -06:00 |
|
Tristan Lee
|
36fb95d9ae
|
Merge branch 'main' into media
|
2022-02-25 20:30:36 -06:00 |
|
Tristan Lee
|
47dad8fb00
|
added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster
|
2022-02-25 20:28:00 -06:00 |
|
Tristan Lee
|
ef83cc4b0a
|
converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url.
|
2022-02-25 13:43:30 -06:00 |
|
Tristan Lee
|
bd7bbdf993
|
Merge pull request #2 from bellingcat/media
WIP: Archiving media, organization improvements
|
2022-02-25 08:26:58 -06:00 |
|