Commit Graph

  • d27ea4d3e5 merged recent changes in main [main] [more-docs-and-tests] Tristan Lee 2023-08-07 20:42:02 -05:00
  • 8a10451a72 updated developer guide Tristan Lee 2023-08-07 20:28:52 -05:00
  • a89925b99e manually specified all steps in linting workflow Tristan Lee 2023-08-07 20:17:59 -05:00
  • 39f7dd0997 edited workflow file Tristan Lee 2023-08-07 20:14:30 -05:00
  • dd7d8861cd edited github workflow file Tristan Lee 2023-08-07 20:07:29 -05:00
  • 1eb82c5f3e sorted imports using isort and tried to add pre-commit hook for isort Tristan Lee 2023-08-07 20:04:16 -05:00
  • 1ec1d6190a implemented minor fixes recommended by pylint (unused imports, f-strings without patterns, etc.) Tristan Lee 2023-08-07 19:39:03 -05:00
  • 6f4eb21ad0 started addressing mypy issues, updated several method type annotation signatures to be consistent with changes Tristan Lee 2023-08-07 19:15:39 -05:00
  • 89b5068108 added descriptions for undocumented attributes for classes in cisticola.base module Tristan Lee 2023-08-07 17:07:45 -05:00
  • 8064310193 refactored spacy.load of nlp models so that they're loaded during ETLController initialization instead of on cisticola.base import [spacy_model_load_refactor] Tristan Lee 2023-08-07 15:55:06 -05:00
  • 1e2b62be57 Add link to documentation Logan Williams 2023-08-07 11:03:04 +02:00
  • 3aec25f74c Simplify transform method signature Logan Williams 2023-08-07 10:08:13 +02:00
  • 1f0197200e removed pipenv run prefix from commands, to not use pipenv virtual env Tristan Lee 2023-08-04 16:40:21 -05:00
  • 7fd4260d71 added sloppy workaround to avoid build error from spaCy models not being downloaded Tristan Lee 2023-08-04 16:34:34 -05:00
  • e6ca0fe515 added readthedocs job to install pipenv and use pipenv to install dependencies Tristan Lee 2023-08-04 16:18:30 -05:00
  • 4811240091 added readthedocs config file Tristan Lee 2023-08-04 15:46:21 -05:00
  • 1d7e82ae4a revised database schema diagram Tristan Lee 2023-08-04 15:39:06 -05:00
  • fab65a5d67 formatted with black, added pre-commit hook, pegged typing_extensions package version to fix spaCy issue [tests-and-docs] Tristan Lee 2023-08-04 14:51:00 -05:00
  • 070ee3391d temporarily removed Dockerfile and crontab Tristan Lee 2023-08-04 09:48:48 -05:00
  • 8421fe7c48 Merge branch 'main' into tests-and-docs Tristan Lee 2023-08-04 09:33:23 -05:00
  • 30bb4e43e4 removed broken scrapers and added basic README Tristan Lee 2023-08-04 09:15:53 -05:00
  • d55c13c95d Add chat attributes, don't overwrite from the sheet if sheet is empty Logan Williams 2023-08-04 14:40:48 +02:00
  • ef9292bc90 added table diagram, and brief developer guide and deployment info for docs Tristan Lee 2023-08-03 23:58:12 -05:00
  • d3b8e1a3b3 removed unused archive_media argument passed to methods throughout codebase Tristan Lee 2023-08-03 18:05:50 -05:00
  • edd772eb94 added and made more consistent docstrings, wrote script that makes minor edits to Sphinx apidocs to improve documentation clarity Tristan Lee 2023-08-03 17:27:33 -05:00
  • b8ddc400f3 updated documentation, minor fixes like excluding very long cookiestring from docs Tristan Lee 2023-08-03 01:59:30 -05:00
  • e2142966e7 refactored tests to reduce redundancy, got tests working for Telegram, Bitchute, Gettr, and Rumble Tristan Lee 2023-08-03 00:53:38 -05:00
  • bd67806ed2 got Telegram scraper tests all working Tristan Lee 2023-08-01 10:46:50 -05:00
  • 249f411a1d fixed some issues with Telegram tests Tristan Lee 2023-07-27 13:07:44 -05:00
  • 99cc4d80b2 Cache screenname ID lookup Logan Williams 2023-05-04 16:23:24 +02:00
  • ca6e284cb3 Cache reply_to post IDs too Logan Williams 2023-05-04 16:14:03 +02:00
  • 91de6482e0 Add rather hacky bulk insert functionality Logan Williams 2023-05-04 15:26:52 +02:00
  • f9bf2bc2ee Merge branch 'main' of github.com:bellingcat/cisticola Logan Williams 2023-05-04 14:06:59 +02:00
  • ebbc6b69dd Add new function for insert post (faster/bulk) Logan Williams 2023-05-04 14:04:55 +02:00
  • 9dbf05fccb Streamline logging; fix markdown formatting in Telegram Logan Williams 2023-05-04 10:00:14 +00:00
  • 2320ea1efd Use telethon session CLI argument always; improvements to Telegram transformer (author id/username for chats, min_id via CLI argument, use the same session) Logan Williams 2023-03-04 09:51:15 +01:00
  • 7d55eace3d Update platform_id when it is empty Logan Williams 2023-03-03 15:28:30 +01:00
  • eced79b278 Fix issue with insert_or_select Logan Williams 2023-03-03 10:47:21 +01:00
  • 793a783963 Revert to previous insert or select behavior Logan Williams 2023-03-02 23:04:34 +01:00
  • d2db83ae93 Update other channel properties too for a linked channel Logan Williams 2023-03-02 17:03:24 +01:00
  • 3a6905e9c1 Adjust logic for changing source label Logan Williams 2023-03-02 16:48:15 +01:00
  • 7b2c597a24 Update channel source, only if non-researcher Logan Williams 2023-03-02 16:45:38 +01:00
  • 64aef7238c Merge branch 'main' of github.com:bellingcat/cisticola Logan Williams 2023-03-02 16:34:42 +01:00
  • ffa8cdd8c6 Fix bitchute transformer Logan Williams 2023-03-02 16:33:48 +01:00
  • 7adf51b5d1 Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2023-03-02 15:28:24 +00:00
  • 6226a9a76e Ignore lobs Logan Williams 2023-03-02 15:26:38 +00:00
  • 531059ca02 Support related Telegram chats (associated discussion groups) Logan Williams 2023-03-02 16:21:43 +01:00
  • 351e471ff4 Change log retention and hackily improve transform speed Logan Williams 2023-01-26 13:21:07 +00:00
  • 5c4dd51435 Fix issues with Gsheet sync Logan Williams 2023-01-11 14:44:17 +00:00
  • 294f6a5172 possible performance improvement [markdown-fixes] Tristan Lee 2022-12-25 15:19:05 -08:00
  • d80ad442da specified columns to update, skipped channel lookups to increase speed Tristan Lee 2022-12-25 14:59:11 -08:00
  • aad3e67a01 support if reply_to key doesn't exist in raw post Tristan Lee 2022-12-25 13:58:09 -08:00
  • 288333cadd added edge cases support for url regex [sloppy-import-refactor] Tristan Lee 2022-12-25 13:53:08 -08:00
  • 3ec6f50213 Merge pull request #67 from bellingcat/sync-bug-fixes Logan Williams 2022-10-27 09:27:02 +02:00
  • cca3f174b1 merged main Tristan Lee 2022-10-26 19:50:48 -05:00
  • 6dc61af7a5 fixed problem from gspread update where empty columns raised error, fixed problem where sync tried to process empty channel [sync-bug-fixes] Tristan Lee 2022-10-26 14:47:59 -05:00
  • 10b33b3dbb Merge pull request #66 from bellingcat/country-language-searching Tristan Lee 2022-10-26 12:25:59 -05:00
  • d9e2250c5a added country index [country-language-searching] Tristan Lee 2022-10-26 08:42:35 -05:00
  • 5a53ebacd0 removed special case Tristan Lee 2022-10-26 08:22:13 -05:00
  • 3bb5af11e6 changed ORM and Google Sheet sync to reflect converting channels.country to JSONB array, added index for detected_language Tristan Lee 2022-10-26 08:16:49 -05:00
  • 90d1d0f29f Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-10-26 13:12:12 +00:00
  • b023e8044c Scrape snowball_complete sampled channels Logan Williams 2022-10-26 13:11:20 +00:00
  • a44fca1a04 Merge branch 'main' into markdown-fixes Tristan Lee 2022-10-26 07:20:39 -05:00
  • f29da4d5f3 added capability to retransform/update posts in database Tristan Lee 2022-10-26 07:20:19 -05:00
  • c15022402d Add an option to scrape posts older than the database record as well as newer (Telegram only) Logan Williams 2022-09-05 13:48:01 +00:00
  • f000c6246e Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-08-29 09:11:21 +00:00
  • 1a29c06062 Fix case where post is dummy (-1) Logan Williams 2022-08-29 09:11:06 +00:00
  • 86656f8ba3 Scrape snowball_it channels too Logan Williams 2022-08-26 15:56:46 +02:00
  • a01d139bef Remove normalized_url column from channel creation Logan Williams 2022-08-24 15:35:08 +02:00
  • 4a17c3475d Add explicit source column to gsheet Logan Williams 2022-08-24 15:32:19 +02:00
  • 0c2360c1dd fixed problems with markdown link insertion relating to incorrect offsets due to surrogates, and multi-line links Tristan Lee 2022-08-12 08:25:47 -05:00
  • 002e9458f5 Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-08-01 10:00:07 +00:00
  • f3997ff6ae Catch errors in Bitchute channel profile scraper; add multi index on posts forwarded from/channel Logan Williams 2022-08-01 09:58:52 +00:00
  • 7d72c0de05 Add index for network analysis Logan Williams 2022-07-29 12:16:17 +02:00
  • 3a04fb51d4 Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-07-28 09:18:34 +00:00
  • fee216386b Fix issue with chronological media archiving Logan Williams 2022-07-28 09:18:07 +00:00
  • d05584a09f Minor bug fixes; helper tool for Telethon sessions Logan Williams 2022-07-28 08:42:59 +00:00
  • ee24367caa Add features for running archive-media simultaneously Logan Williams 2022-07-20 09:26:47 +00:00
  • f828b92b8d removed unused comments and debug statements Tristan Lee 2022-07-09 23:50:22 -05:00
  • bac6240392 fixed langdetect problem (see https://github.com/Mimino666/langdetect/issues/77#issuecomment-1177974684) Tristan Lee 2022-07-09 23:49:18 -05:00
  • fbb846b8d6 Fix two small bugs with media archiving Logan Williams 2022-07-05 13:30:39 +02:00
  • b99958a894 Merge pull request #58 from bellingcat/media-etl Logan Williams 2022-07-05 11:51:08 +02:00
  • 51e5ca1f04 Use smaller batches for now [media-etl] Logan Williams 2022-07-05 09:48:57 +00:00
  • 6149c4279d Add some more fields to media DB, fix bugs in testing Logan Williams 2022-07-05 11:11:43 +02:00
  • 4ddd8d6b63 Only select untransformed media; simplify insert function Logan Williams 2022-06-08 16:52:40 +02:00
  • 9948af2c4a Media archiving ETL working for Telegram Logan Williams 2022-06-08 16:41:46 +02:00
  • c24babb081 Fix bugs in Gettr/Rumble transformers, avoid offset in batch requests Logan Williams 2022-07-04 14:30:40 +00:00
  • 09f99392ef modified langdetect detect method to decrease run-time, fixed indent error in transform_info, prototyped removal of offset in transform_all_untransformed optimizations Tristan Lee 2022-07-01 03:52:14 -05:00
  • dcf7e77446 added additional options when Telethon GetFullChannelRequest fails Tristan Lee 2022-07-01 03:19:31 -05:00
  • ed4723ed1e Fix merge error Logan Williams 2022-06-30 11:04:41 +00:00
  • fb2a6e77cc modified code to handle import of pickle-serialized Telethon message dicts Tristan Lee 2022-06-29 17:35:27 -05:00
  • 589ac3ba5b Merge branch 'main' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-06-24 10:34:25 +00:00
  • 7215469a74 Correct logs for transforming posts Logan Williams 2022-06-24 10:32:27 +00:00
  • fe0f4f9e2c Merge pull request #62 from bellingcat/other-transformer-fixes Logan Williams 2022-06-24 11:00:50 +02:00
  • 289a47d7b1 tested telegram transformers and implemented vk transformers Tristan Lee 2022-06-23 15:06:10 -05:00
  • bb2e2806e6 got post transformers and channel_info transformers working for Rumble, Bitchute, Gettr Tristan Lee 2022-06-21 19:05:41 -05:00
  • 619fe42a31 got transformers for Bitchute, Rumble, and Gettr working for all raw_posts. Tristan Lee 2022-06-20 21:45:41 -05:00
  • a2a7882f1c fixed Gettr and Bitchute info transformers, added missing or incorrect TelegramTransformer fields, added Telegram mentions to the transformer. Tristan Lee 2022-06-13 13:42:33 -05:00
  • 6e962de244 Don't scrape channel info unless specifically scraping channel info Logan Williams 2022-06-10 08:41:45 +00:00
  • 6183972b1a Merge branch 'more-channel-info-transformers' of https://github.com/bellingcat/cisticola into main Logan Williams 2022-06-10 08:07:02 +00:00