Commit Graph

  • 806f07f458 Add functions for scraping based on Channel database Logan Williams 2022-03-22 11:26:46 +01:00
  • 885b4687ce Add ORM for Channel class; update foreign key relations; add platform_id to TransformedResult Logan Williams 2022-03-22 11:25:52 +01:00
  • d5bf3629c2 Merge pull request #16 from bellingcat/docs Logan Williams 2022-03-16 15:20:51 +01:00
  • 93554b19e9 fixed typo Tristan Lee 2022-03-15 13:05:41 -05:00
  • d68d76c0ab added missing docstrings, created Makefile target for sphinx-apidoc, added quickstart page for installation and configuration instructions Tristan Lee 2022-03-15 12:40:18 -05:00
  • ee9a8c10dd merged main into branch Tristan Lee 2022-03-15 09:16:11 -05:00
  • e287fd03d9 merged scraper into main and fixed minor merge conflict Tristan Lee 2022-03-15 09:12:12 -05:00
  • a3c859ec79 added more docstrings and comments Tristan Lee 2022-03-14 19:38:33 -05:00
  • c3eab2f176 merged main Tristan Lee 2022-03-14 18:19:57 -05:00
  • e4cf9daf73 added docstrings, improved Sphinx docs Tristan Lee 2022-03-14 18:04:27 -05:00
  • db03cbf141 Merge pull request #13 from bellingcat/transformer Tristan Lee 2022-03-14 11:13:57 -05:00
  • 750f0cc887 added scraper for Instagram Tristan Lee 2022-03-14 10:28:10 -05:00
  • fe0d762df0 Add Transformer and ETLController docstrings Logan Williams 2022-03-14 14:02:57 +01:00
  • fd4b617743 Add TwitterTransformer test Logan Williams 2022-03-14 13:33:55 +01:00
  • 965bf1e2dc added youtube scraper, moved from official youtube-dl repo to using yt-dlp because download speed for youtube videos is much better Tristan Lee 2022-03-11 17:19:52 -06:00
  • 821c39004b incorporated vkontakte scraper Tristan Lee 2022-03-10 22:32:39 -06:00
  • 3d919316a9 added Bitchute scraper, minor change to Bitchute scraper to correctly extract author name and id Tristan Lee 2022-03-10 13:03:01 -06:00
  • 5783206ad8 implemented method to reset database, to enable the 'controller' fixture scope to be shared across the whole package, which will enable the transformer tests to be run without re-running the scrapers Tristan Lee 2022-03-10 10:20:49 -06:00
  • fa5037d67c Implement transformer for TwitterScraper that handles media; implement image OCR and EXIF extraction Logan Williams 2022-03-10 15:34:24 +01:00
  • 6cf3b8842d renamed 'archive_media' and 'media' to avoid name collision, changed scope of test fixture controller to 'function' so that db is fresh for each executed test Tristan Lee 2022-03-09 13:19:35 -06:00
  • 739e1d8484 added capability of running scraper without archiving media, and implemented prototype Telethon scraper for Telegram Tristan Lee 2022-03-09 12:12:01 -06:00
  • 506fb54a53 added wrapper for requests that retries after encountering exception Tristan Lee 2022-03-07 13:28:33 -06:00
  • 253a9bea49 Merge pull request #3 from bellingcat/media Logan Williams 2022-03-07 11:47:08 +01:00
  • cd5f68e9e5 added basic unit tests Tristan Lee 2022-03-04 12:36:09 -06:00
  • c21e43ddfa refactored import structure Tristan Lee 2022-03-04 10:55:54 -06:00
  • 75240bb060 fixed various bugs related to archived URL creation and media downloading. Things seem to work well now Tristan Lee 2022-03-01 15:58:18 -06:00
  • f3d9dc91c6 changed URL parsing to use urllib Tristan Lee 2022-03-01 14:13:04 -06:00
  • ee4d64750b added prototype Rumble scraper Tristan Lee 2022-02-28 18:38:33 -06:00
  • bc840e631d added Gab scraper Tristan Lee 2022-02-28 12:11:21 -06:00
  • 7a257ea9f5 included comments in Odysee scraper Tristan Lee 2022-02-28 09:15:09 -06:00
  • 36fb95d9ae Merge branch 'main' into media Tristan Lee 2022-02-25 20:30:36 -06:00
  • 47dad8fb00 added odysee scraper, minor refactoring of url_to_blob method (added url_to_key method that can be overridden by child classes while still using the parent url_to_blob method) and changed test file to include only channels with a relatively small number of posts, to make testing faster Tristan Lee 2022-02-25 20:28:00 -06:00
  • ef83cc4b0a converted bitchute to yield, got video archiving working on bitchute and gettr, added url_to_blob method that downloads media bytes blob from url and converted archive_media to take in the media bytes blob instead of the media url. Tristan Lee 2022-02-25 13:43:30 -06:00
  • bd7bbdf993 Merge pull request #2 from bellingcat/media Tristan Lee 2022-02-25 08:26:58 -06:00
  • 8ab56ff5ba Remove MAX_POSTS, auto detect MIME type Logan Williams 2022-02-25 08:52:42 +01:00
  • e6085689b5 On second thought, don't share secrets Logan Williams 2022-02-24 20:47:46 +01:00
  • 3480452fac Fix type hints Logan Williams 2022-02-24 20:36:23 +01:00
  • 1ad7c8bc11 Search for since per-channel Logan Williams 2022-02-24 20:26:10 +01:00
  • 0b1c175dd9 Modify GettrScraper to yield results, archive media (videos incomplete) Logan Williams 2022-02-24 20:25:14 +01:00
  • 456d592792 Use user id for TwitterScraper Logan Williams 2022-02-24 20:24:03 +01:00
  • d159c09aa4 yield data rather than returning a list Logan Williams 2022-02-24 18:58:08 +01:00
  • d163e6b3d9 Fix logging logic in scraper controller Logan Williams 2022-02-24 18:49:06 +01:00
  • e64d845002 Archive media in Twitter scraper Logan Williams 2022-02-24 18:48:48 +01:00
  • 214287b7a8 Archive media in dictionary Logan Williams 2022-02-24 17:35:24 +01:00
  • a87cfd570a Add Telegram channel scraper Logan Williams 2022-02-24 16:37:13 +01:00
  • 6092e4caa5 Add method for archiving media, reorganize scraper base classes Logan Williams 2022-02-24 16:36:55 +01:00
  • e09e0f5202 Merge pull request #1 from bellingcat/add-docs Logan Williams 2022-02-22 15:19:54 +01:00
  • 9fe3d90b0b fixed warnings from sphinx-build, made build path consistent with gitignore (removed sphinx build directory from version control) Tristan Lee 2022-02-21 16:13:16 -06:00
  • e3d29bf811 Add documentation generation with Sphinx Logan Williams 2022-02-21 17:52:38 +01:00
  • 139459e3b2 implemented Bitchute scraper Tristan Lee 2022-02-18 12:45:10 -06:00
  • 4668d4df11 implemented Gettr scraper Tristan Lee 2022-02-18 10:13:37 -06:00
  • 0e5f9f77f3 Configure pipenv Logan Williams 2022-02-18 15:05:02 +01:00
  • b824b98a95 Reorganize transformer definition location Logan Williams 2022-02-18 14:57:10 +01:00
  • c5d49ef521 Reorganize class definitions slightly Logan Williams 2022-02-18 14:14:25 +01:00
  • 82ad210b8e Initial commit Logan Williams 2022-02-18 14:01:49 +01:00