Commit Graph

  • d83a13e0cc Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-06-09 14:11:52 +00:00
  • fba3a661c7 Sleep after every gsheet API call Logan Williams 2022-06-09 14:11:29 +00:00
  • 6294ea7ea7 Increment TelegramTelethonScraper version Logan Williams 2022-06-09 15:30:56 +02:00
  • 92d4839b5e Revise Telethon scraper to use the same client connection telethon-same-client Logan Williams 2022-06-09 10:01:27 +02:00
  • 9a30ecb243 Stop overwriting media when a large file is detected Logan Williams 2022-06-08 17:01:28 +02:00
  • 708d952937 Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-06-08 08:11:34 +00:00
  • bf96f248c1 Merge pull request #57 from bellingcat/synchronization-improvements Logan Williams 2022-06-08 10:08:23 +02:00
  • 143b20fc56 Update Pipfile to load pyexiftool from PyPi synchronization-improvements Logan Williams 2022-06-08 10:08:18 +02:00
  • f932034ab2 Add a bit more logging Logan Williams 2022-06-07 13:14:48 +02:00
  • 39358c7f23 Update platform ID and screenname when synchronizing with gsheet; highlight dupes Logan Williams 2022-06-06 16:36:39 +02:00
  • 4d22838a94 Fixes to Pipfile; easier spacy setup Logan Williams 2022-06-06 16:35:52 +02:00
  • f4072183be added transformer for Gettr Tristan Lee 2022-05-20 02:22:34 -05:00
  • 591f1986e8 added Rumble transformers and test Tristan Lee 2022-05-19 19:40:48 -05:00
  • e2094522c9 updated Bitchute transformer and added test Tristan Lee 2022-05-19 18:13:50 -05:00
  • f0414a4f4d updated transformer tests Tristan Lee 2022-05-19 16:34:19 -05:00
  • 424c063ef2 Merge pull request #55 from bellingcat/channel-info-transformers Tristan Lee 2022-05-18 06:58:18 -07:00
  • c279ced73d Minor bug fixes from testing channel-info-transformers Logan Williams 2022-05-18 09:29:53 +01:00
  • 7c8147bb2a Add CLI for channel info transform Logan Williams 2022-05-18 09:20:33 +01:00
  • 6145fd0b6b Add Telegram transformer for channel info Logan Williams 2022-05-18 09:17:49 +01:00
  • 317da2c9d4 Merge pull request #54 from bellingcat/transformers Tristan Lee 2022-05-17 10:18:12 -07:00
  • 9869612b67 Merge pull request #50 from bellingcat/odysee-refactor Logan Williams 2022-05-16 12:25:35 +01:00
  • 7f55b721dd Bug fixes in transformers transformers Logan Williams 2022-05-13 15:39:01 +00:00
  • 34da733e7c Add date_transformed; refinements to telethon transformer Logan Williams 2022-05-12 13:03:30 +00:00
  • 4493618801 Synchronizing channels will update other info for existing channels Logan Williams 2022-05-12 13:02:14 +00:00
  • ab482443db Merge branch 'main' of https://github.com/bellingcat/cisticola into transformers Logan Williams 2022-04-16 13:55:23 +00:00
  • 8535a87def Ad hoc changes to transformers Logan Williams 2022-04-16 13:46:26 +00:00
  • 3b8b03283a Add logging to transform Logan Williams 2022-04-14 11:32:15 +00:00
  • 428af3575f Merge branch 'transformers' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-04-14 11:31:34 +00:00
  • 38e0104078 Separate logging; limit Telegram archive file size Logan Williams 2022-04-14 10:43:27 +00:00
  • 4c221d1133 Transformer for Telegram, base transformer NLP hydration; no media Logan Williams 2022-04-14 11:45:09 +02:00
  • 214a4d7d19 Merge pull request #53 from bellingcat/sync-channels v2022-04-13 Logan Williams 2022-04-13 12:39:41 +02:00
  • 1ac8d6c603 Close sessions; sort channel info by least recently archived Logan Williams 2022-04-13 10:38:08 +00:00
  • 59bab0d812 Disable Youtube scraper for now Logan Williams 2022-04-13 10:12:20 +02:00
  • a2e62cc489 Merge pull request #52 from bellingcat/next-release Logan Williams 2022-04-13 10:11:49 +02:00
  • d96b8177a5 Merge pull request #49 from bellingcat/sync-channels Logan Williams 2022-04-13 10:11:34 +02:00
  • e7c3771788 Merge pull request #51 from bellingcat/channel-info Logan Williams 2022-04-13 10:11:21 +02:00
  • a0dbe7d92b Catch errors in channel info Logan Williams 2022-04-13 10:10:29 +02:00
  • 27b51267a7 fixed bugs from incorporating polyphemus refactoring changes Tristan Lee 2022-04-13 00:02:12 -05:00
  • ef7afc0715 Merge branch 'main' into odysee-refactor Tristan Lee 2022-04-12 23:26:18 -05:00
  • dfc5b77726 incorporated polyphemus refactoring changes Tristan Lee 2022-04-12 23:23:21 -05:00
  • 209152ea69 Synchronize channels that have changed info Logan Williams 2022-04-12 18:13:52 +02:00
  • d5f6ce485b Merge pull request #46 from bellingcat/next-release v2022-04-12 Logan Williams 2022-04-12 14:59:01 +02:00
  • d1f9dd0e01 Limit max # of archived files per session Logan Williams 2022-04-12 12:57:04 +00:00
  • bbb9d283d5 Add RumbleScraper, YoutubeScraper, and BitchuteScraper to the active scrapers Logan Williams 2022-04-12 14:55:45 +02:00
  • 6f11b88f94 Use Youtube cookie for Rumble too Logan Williams 2022-04-12 14:55:18 +02:00
  • 7b8236e6db No recursive retries Logan Williams 2022-04-12 14:55:05 +02:00
  • b596d3e055 Merge pull request #42 from bellingcat/youtube-age-restricted Logan Williams 2022-04-12 11:14:45 +02:00
  • 1f7f957e62 Merge pull request #44 from bellingcat/bitchute-error Logan Williams 2022-04-12 11:13:52 +02:00
  • e05f69bbee Merge pull request #38 from bellingcat/youtube-dl-retry Logan Williams 2022-04-12 11:11:51 +02:00
  • 1f667d532e made get_videos_user use request_from_bitchute requests wrapper to catch errors Tristan Lee 2022-04-06 11:40:43 -05:00
  • f17800b797 added required YOUTUBE_COOKIESTRING environment variable to be used by YoutubeScraper Tristan Lee 2022-04-05 21:22:41 -05:00
  • a204041480 made requested changes to scraper version numbers Tristan Lee 2022-04-05 17:03:45 -05:00
  • 36c81c8e17 Merge pull request #40 from bellingcat/next-release v2022-04-04 Logan Williams 2022-04-04 13:03:16 +02:00
  • b6386747d4 Add indices on appropriate columns; limit # of posts to archive Logan Williams 2022-04-04 10:54:27 +00:00
  • ed74c5692b merged main Tristan Lee 2022-04-03 19:35:16 -05:00
  • c7253148d1 added 'retries' argument to youtube-dl options, and made options consistent across youtube-dl instances. Tristan Lee 2022-04-03 19:31:32 -05:00
  • fccbad7a93 Remove 200 post limit; add log rotation v2022-04-03 Logan Williams 2022-04-03 16:32:00 +00:00
  • 4c580519dd Remove Rumble scraper Logan Williams 2022-04-03 15:59:39 +02:00
  • 0140b09ee8 Release Telethon, VK, and Gettr as 0.0.1; specify unreleased 0.0.0 otherwise Logan Williams 2022-04-03 15:29:24 +02:00
  • 96db662572 Don't add a timestamp to media that failed to archive Logan Williams 2022-04-03 14:16:03 +02:00
  • ecae1aad05 Catch exceptions in archive_files so that archiver continues to run Logan Williams 2022-04-03 14:12:23 +02:00
  • 9c838aae39 Update media_archived column even when TG post has no media Logan Williams 2022-04-03 13:29:10 +02:00
  • 57b9082271 Remove Odysee scraper due to errors Logan Williams 2022-04-03 13:26:05 +02:00
  • a82ec15f0e Change archived_media to be timestamp for all scrapers Logan Williams 2022-04-03 12:02:27 +02:00
  • 8ee20a239c Merge branch 'main' into initial-release Logan Williams 2022-04-03 11:35:12 +02:00
  • 90c99aec00 ensured that Gettr username is lowercase for API requests to work correctly Tristan Lee 2022-04-02 22:36:25 -05:00
  • b0a52e5ad7 handled case where Rumble video has no view information displayed Tristan Lee 2022-04-02 21:26:29 -05:00
  • 01bbabe0cb Fix issues with new datetime-based 'media_archived' column Logan Williams 2022-04-02 18:45:08 +00:00
  • 63633617d2 Configure with Telethon and VK only Logan Williams 2022-04-02 18:34:14 +00:00
  • 0099558c68 Merge pull request #26 from bellingcat/deferred-media-archiving Logan Williams 2022-04-02 14:15:35 +02:00
  • 0bab20e371 ensured that before being scraped, all channels are added to the database, preventing channel.platform_id from being null. Tristan Lee 2022-04-01 17:03:02 -05:00
  • 8ecb904249 merged main Tristan Lee 2022-04-01 02:05:25 -05:00
  • 282f33eff3 implemented deferred media archiving for all scrapers, and implemented tests for them. Refactored archiving methods of Instagram and Gettr scrapers to be able to use default archiving method Tristan Lee 2022-04-01 01:30:49 -05:00
  • d20db5f828 Catch exceptions in get_posts so that archiving continues despite errors Logan Williams 2022-03-31 20:27:18 +02:00
  • 16aad4ef2c TelegramTelethonScraper: Using the username is fine. Logan Williams 2022-03-31 16:50:20 +02:00
  • 94cf6c3d84 TelegramTelethonScraper: Use channel_id when channel has been previously encountered Logan Williams 2022-03-31 16:37:54 +02:00
  • 061af984ee Merge pull request #20 from bellingcat/separate-media-archiving Logan Williams 2022-03-31 16:28:30 +02:00
  • 7f87b03de5 Add option to clear registered scrapers, necessary for tests Logan Williams 2022-03-31 16:17:35 +02:00
  • c8d1b96e3f Fix bug in handling retweets without media Logan Williams 2022-03-31 15:51:17 +02:00
  • a5cffa615f Fix Twitter profile scraper, catch exceptions in controller Logan Williams 2022-03-31 15:37:58 +02:00
  • 2dc9213d64 Use new RawChannelInfo class Logan Williams 2022-03-31 15:17:25 +02:00
  • 61c99d33f6 Add Postgres support with psycopg2 Logan Williams 2022-03-31 08:14:08 +02:00
  • cff1953d21 Initial CLI tool Logan Williams 2022-03-29 21:10:46 +02:00
  • 1c1ff7fb6f Fix bug with Telethon scraper and certain media; add media_archived flag to TwitterScraper Logan Williams 2022-03-29 21:09:55 +02:00
  • 19056a1d9a Merge pull request #23 from bellingcat/profile Logan Williams 2022-03-31 08:13:17 +02:00
  • b7871b060d added capability to scrape Gab group posts Tristan Lee 2022-03-30 09:11:07 -05:00
  • 1f99e52436 refactored Gab scraper to use gabber instead of garc Tristan Lee 2022-03-30 08:05:10 -05:00
  • b805d50132 made tests work, fixed several issues with Rumble scraper Tristan Lee 2022-03-29 16:09:51 -05:00
  • 67d1abf024 added methods for extracting channel profile metadata, and tests Tristan Lee 2022-03-28 21:11:34 -05:00
  • ea40ea2640 merged main Tristan Lee 2022-03-28 20:22:34 -05:00
  • 5d6473e946 Merge pull request #19 from bellingcat/separate-media-archiving Tristan Lee 2022-03-28 20:20:57 -05:00
  • 16870d7daa implemented methods for extracting profile metadata (still need to test) Tristan Lee 2022-03-28 20:16:59 -05:00
  • a80dbddbbc Add snscrape delayed media archiving support; add explicit bool Logan Williams 2022-03-28 11:42:15 +02:00
  • d68cbd207a Merge pull request #17 from bellingcat/channel-db Tristan Lee 2022-03-24 13:07:03 -05:00
  • 63fdae9f1b Implement media archiving after the initial scrape for Twitter and Telethon Logan Williams 2022-03-24 16:52:11 +01:00
  • 65edde6d20 Fix bug after merge Logan Williams 2022-03-22 11:56:28 +01:00
  • 2a3b5c8200 Merge branch 'main' into channel-db Logan Williams 2022-03-22 11:49:07 +01:00
  • fa516da763 Rename TransformedResult to the clearer Post Logan Williams 2022-03-22 11:41:55 +01:00
  • c0a094eefa Load channels from google sheet in test.py Logan Williams 2022-03-22 11:37:47 +01:00
  • 571b019137 Fix tests for Twitter transformer Logan Williams 2022-03-22 11:33:27 +01:00