Tristan Lee
|
1eb82c5f3e
|
sorted imports using isort and tried to add pre-commit hook for isort
|
2023-08-07 20:04:16 -05:00 |
|
Tristan Lee
|
1ec1d6190a
|
implemented minor fixes recommended by pylint (unused imports, f-strings without patterns, etc.)
|
2023-08-07 19:39:03 -05:00 |
|
Tristan Lee
|
6f4eb21ad0
|
started addressing mypy issues, updated several method type annotation signatures to be consistent with changes
|
2023-08-07 19:15:39 -05:00 |
|
Tristan Lee
|
89b5068108
|
added descriptions for undocumented attributes for classes in cisticola.base module
|
2023-08-07 17:07:45 -05:00 |
|
Tristan Lee
|
fab65a5d67
|
formatted with black, added pre-commit hook, pegged typing_extensions package version to fix spaCy issue
|
2023-08-04 14:51:00 -05:00 |
|
Tristan Lee
|
8421fe7c48
|
Merge branch 'main' into tests-and-docs
|
2023-08-04 09:33:23 -05:00 |
|
Tristan Lee
|
30bb4e43e4
|
removed broken scrapers and added basic README
|
2023-08-04 09:15:53 -05:00 |
|
Logan Williams
|
d55c13c95d
|
Add chat attributes, don't overwrite from the sheet if sheet is empty
|
2023-08-04 14:40:48 +02:00 |
|
Tristan Lee
|
d3b8e1a3b3
|
removed unused archive_media argument passed to methods throughout codebase
|
2023-08-03 18:05:50 -05:00 |
|
Tristan Lee
|
edd772eb94
|
added and made more consistent docstrings, wrote script that makes minor edits to Sphinx apidocs to improve documentation clarity
|
2023-08-03 17:27:33 -05:00 |
|
Tristan Lee
|
b8ddc400f3
|
updated documentation, minor fixes like excluding very long cookiestring from docs
|
2023-08-03 01:59:30 -05:00 |
|
Tristan Lee
|
e2142966e7
|
refactored tests to reduce redundancy, got tests working for Telegram, Bitchute, Gettr, and Rumble
|
2023-08-03 00:53:38 -05:00 |
|
Tristan Lee
|
bd67806ed2
|
got Telegram scraper tests all working
|
2023-08-01 10:46:50 -05:00 |
|
Tristan Lee
|
249f411a1d
|
fixed some issues with Telegram tests
|
2023-07-27 13:07:44 -05:00 |
|
Logan Williams
|
99cc4d80b2
|
Cache screenname ID lookup
|
2023-05-04 16:23:24 +02:00 |
|
Logan Williams
|
ca6e284cb3
|
Cache reply_to post IDs too
|
2023-05-04 16:14:03 +02:00 |
|
Logan Williams
|
91de6482e0
|
Add rather hacky bulk insert functionality
|
2023-05-04 15:26:52 +02:00 |
|
Logan Williams
|
f9bf2bc2ee
|
Merge branch 'main' of github.com:bellingcat/cisticola
|
2023-05-04 14:06:59 +02:00 |
|
Logan Williams
|
ebbc6b69dd
|
Add new function for insert post (faster/bulk)
|
2023-05-04 14:04:55 +02:00 |
|
Logan Williams
|
9dbf05fccb
|
Streamline logging; fix markdown formatting in Telegram
|
2023-05-04 10:00:14 +00:00 |
|
Logan Williams
|
2320ea1efd
|
Use telethon session CLI argument always; improvements to Telegram transformer (author id/username for chats, min_id via CLI argument, use the same session)
|
2023-03-04 09:51:15 +01:00 |
|
Logan Williams
|
7d55eace3d
|
Update platform_id when it is empty
|
2023-03-03 15:28:30 +01:00 |
|
Logan Williams
|
eced79b278
|
Fix issue with insert_or_select
|
2023-03-03 10:47:21 +01:00 |
|
Logan Williams
|
793a783963
|
Revert to previous insert or select behavior
|
2023-03-02 23:04:34 +01:00 |
|
Logan Williams
|
d2db83ae93
|
Update other channel properties too for a linked channel
|
2023-03-02 17:03:24 +01:00 |
|
Logan Williams
|
3a6905e9c1
|
Adjust logic for changing source label
|
2023-03-02 16:48:15 +01:00 |
|
Logan Williams
|
7b2c597a24
|
Update channel source, only if non-researcher
|
2023-03-02 16:45:38 +01:00 |
|
Logan Williams
|
ffa8cdd8c6
|
Fix bitchute transformer
|
2023-03-02 16:33:48 +01:00 |
|
Logan Williams
|
531059ca02
|
Support related Telegram chats (associated discussion groups)
|
2023-03-02 16:21:43 +01:00 |
|
Logan Williams
|
351e471ff4
|
Change log retention and hackily improve transform speed
|
2023-01-26 13:21:07 +00:00 |
|
Tristan Lee
|
10b33b3dbb
|
Merge pull request #66 from bellingcat/country-language-searching
Updated ORM and sync to improve filtering by language and country
|
2022-10-26 12:25:59 -05:00 |
|
Tristan Lee
|
d9e2250c5a
|
added country index
|
2022-10-26 08:42:35 -05:00 |
|
Tristan Lee
|
3bb5af11e6
|
changed ORM and Google Sheet sync to reflect converting channels.country to JSONB array, added index for detected_language
|
2022-10-26 08:16:49 -05:00 |
|
Logan Williams
|
90d1d0f29f
|
Merge branch 'main' of https://github.com/bellingcat/cisticola into main
|
2022-10-26 13:12:12 +00:00 |
|
Logan Williams
|
b023e8044c
|
Scrape snowball_complete sampled channels
|
2022-10-26 13:11:20 +00:00 |
|
Logan Williams
|
c15022402d
|
Add an option to scrape posts older than the database record as well as newer (Telegram only)
|
2022-09-05 13:48:01 +00:00 |
|
Logan Williams
|
f000c6246e
|
Merge branch 'main' of https://github.com/bellingcat/cisticola into main
|
2022-08-29 09:11:21 +00:00 |
|
Logan Williams
|
1a29c06062
|
Fix case where post is dummy (-1)
|
2022-08-29 09:11:06 +00:00 |
|
Logan Williams
|
86656f8ba3
|
Scrape snowball_it channels too
|
2022-08-26 15:56:46 +02:00 |
|
Logan Williams
|
f3997ff6ae
|
Catch errors in Bitchute channel profile scraper; add multi index on posts forwarded from/channel
|
2022-08-01 09:58:52 +00:00 |
|
Logan Williams
|
3a04fb51d4
|
Merge branch 'main' of https://github.com/bellingcat/cisticola into main
|
2022-07-28 09:18:34 +00:00 |
|
Logan Williams
|
fee216386b
|
Fix issue with chronological media archiving
|
2022-07-28 09:18:07 +00:00 |
|
Logan Williams
|
d05584a09f
|
Minor bug fixes; helper tool for Telethon sessions
|
2022-07-28 08:42:59 +00:00 |
|
Logan Williams
|
ee24367caa
|
Add features for running archive-media simultaneously
|
2022-07-20 09:26:47 +00:00 |
|
Logan Williams
|
fbb846b8d6
|
Fix two small bugs with media archiving
|
2022-07-05 13:30:39 +02:00 |
|
Logan Williams
|
b99958a894
|
Merge pull request #58 from bellingcat/media-etl
Media ETL
|
2022-07-05 11:51:08 +02:00 |
|
Logan Williams
|
51e5ca1f04
|
Use smaller batches for now
|
2022-07-05 09:48:57 +00:00 |
|
Logan Williams
|
6149c4279d
|
Add some more fields to media DB, fix bugs in testing
|
2022-07-05 11:11:43 +02:00 |
|
Logan Williams
|
4ddd8d6b63
|
Only select untransformed media; simplify insert function
|
2022-07-05 10:03:38 +02:00 |
|
Logan Williams
|
9948af2c4a
|
Media archiving ETL working for Telegram
|
2022-07-05 10:03:36 +02:00 |
|