Commit Graph

73 Commits

Author SHA1 Message Date
Patrick Robertson
4bb4ebdf82 Further cleanup, abstracts 'dropins' out into generic files 2025-01-21 16:36:45 +01:00
Patrick Robertson
dff0105659 Small fixups + implement Truth code for posts with multiple media 2025-01-20 18:40:46 +01:00
Patrick Robertson
fd2e7f973b Further tidy-ups, also adds some ytdlp utils to 'utils' 2025-01-20 16:31:28 +01:00
Patrick Robertson
9c5a9e1bcd Rename BaseArchiver to GenericArchiver + some other tidyups 2025-01-17 17:06:04 +01:00
Patrick Robertson
394bcd8d47 Further refactoring of youtubedl_archiver->base_archiver
* Keep twitter_api_archiver
* Remove unit tests for obsolete archivers
* Guess filename of media using the 'Content-Type' header
* Add mechanism to run 'expensive' tests last (see conftest.py) and also flag expensive tests to fail straight off (pytest.mark.incremental)
2025-01-17 11:56:08 +01:00
Patrick Robertson
74cf1f5f23 Merge branch 'main' into youtubedlp-rewrite 2025-01-15 17:47:23 +01:00
Patrick Robertson
4f2b9baa73 refactor youtubedlp archiver to work for all valid websites
1. Extract more metadata
2. Better extract thumbnail
3. Set up a framework for specific sites to provide more granular metadata processing
2025-01-15 17:46:47 +01:00
Patrick Robertson
c3dd19f309 Sniff filetype of downloaded media and add extension
Also download in chunks - fixes 2 x TODOs
2025-01-15 17:46:47 +01:00
Patrick Robertson
20726c1116 Remove tiktok-downloader - getting info is broken
TODO: switch to using youtube-dlp
2025-01-14 17:40:45 +01:00
Patrick Robertson
1b1af2f0b1 Revert change to twitter_archiver
As per discussion at: https://github.com/bellingcat/auto-archiver/pull/165#discussion_r1905930837
2025-01-14 10:30:41 +01:00
Patrick Robertson
bdfedfcf61 Merge branch 'main' into feat/unittest 2025-01-13 19:50:47 +01:00
Erin Clark
9cdaea873b Merge pull request #164 from bellingcat/ec_add_poetry
Migrate to Poetry
2025-01-13 18:49:15 +00:00
Patrick Robertson
528b78db85 Flag tombstone tweets for twitter_syndication method 2025-01-13 18:17:24 +01:00
erinhmclark
d80b4b7557 Remove snscrape and Python 3.12 restriction. 2025-01-12 12:15:56 +00:00
Patrick Robertson
3546d4ad79 Fix 'download_syndication' method for tweet archiving (now requires a token)
Plus add in unit tests for token generation + download syndication
2025-01-12 12:55:00 +01:00
Patrick Robertson
c932fb7416 Improved logging when attempting to download an invalid/deleted tweet
Plus: unit tests for non-existent tweet + invalid tweet ID
2025-01-12 12:00:45 +01:00
Patrick Robertson
f29950905c Merge branch 'main' into small_issues 2025-01-12 11:47:55 +01:00
Patrick Robertson
add83c9650 Remove snscrape from twitter_archiver
1. snscrape twitter downloader no longer works (ref: https://github.com/JustAnotherArchivist/snscrape/issues/1045)
2. snscrape limits python to versions <3.12
2025-01-07 19:40:19 +01:00
Miguel Sozinho Ramalho
a697f0a212 adds an unauthenticated Bluesky archiver (#160)
* adds a TODO for next code iterations

* implements bsky archiver

* adds new archiver to example orchestration file

* Fix downloading media for posts with multiple images

(Images are stored in media/images)

* Set up a basic framework for unit tests

Use 'python -m unittest' from the project root to run

---------

Co-authored-by: Patrick Robertson <robertson.patrick@gmail.com>
2025-01-07 10:28:07 +00:00
Patrick Robertson
928518cda7 Allow setting cookies for yt-dl (#158) 2025-01-06 16:19:53 +00:00
Patrick Robertson
a46f9997ea Better logging when there's a timestamp parse error 2024-12-31 09:28:08 +01:00
Miguel Sozinho Ramalho
f8824691dd refactors free twitter archiver strategies (#142) 2024-05-14 16:23:33 +01:00
msramalho
012cc36609 removes deprecated datetime method 2024-05-14 15:54:50 +01:00
Miguel Sozinho Ramalho
7cfe1e39cc #135 fix cleanup of telethon session files (#139)
* closes #135

* version bump
2024-04-16 12:45:45 +01:00
Jett Chen
cf8691bad7 Add yt-dlp based archiving for TwitterArchiver (#138)
* Add ytdlp archiving capability

* Add type annotation

* version bump

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2024-04-15 19:54:55 +01:00
msramalho
f4827770e6 adds instagram no stories as success, and fix for telethon-based archivers. 2024-03-05 14:49:10 +00:00
msramalho
ccf5f857ef adds configurable limits to instagram/youtube 2024-02-25 15:14:17 +00:00
msramalho
7de317d1b5 avoiding exception 2024-02-23 15:54:33 +00:00
msramalho
70075a1e5e improving insta archiver 2024-02-23 15:37:28 +00:00
msramalho
f0158ffd9c adds tagged posts and better parsing 2024-02-23 14:08:17 +00:00
msramalho
bfb35a43a9 adds more details from yt-dlp 2024-02-23 14:08:05 +00:00
Miguel Sozinho Ramalho
7a21ae96af V0.9.0 - closes several open issues: new enrichers and bug fixes (#133)
* clean orchestrator code, add archiver cleanup logic

* improves documentation for database.py

* telethon archivers isolate sessions into copied files

* closes #127

* closes #125

* closes #84

* meta enricher applies to all media

* closes #61 adds subtitles and comments

* minor update

* minor fixes to yt-dlp subtitles and comments

* closes #17 but logic is imperfect.

* closes #85 ssl enhancer

* minifies HTML, JS refactor for preview of certificates

* closes #91 adds freetsa timestamp authority

* version bump

* simplify download_url method

* skip ssl if nothing archived

* html preview improvements

* adds retrying lib

* manual download archiver improvements

* meta only runs when relevant data available

* new metadata convenience method

* html template improvements

* removes debug message

* does not close #91 yet, will need some more certificate chain logging

* adds verbosity config

* new instagram api archiver

* adds proxy support we

* adds proxy/end support and bug fix for yt-dlp

* proxy support for webdriver

* adds socks proxy to wacz_enricher

* refactor recursivity in inner media and display

* infinite recursive display

* foolproofing timestamping authorities

* version to 0.9.0

* minor fixes from code-review
2024-02-20 18:05:29 +00:00
msramalho
2a773a25e8 better handling of telethon data display 2024-02-01 15:08:23 +00:00
Miguel Sozinho Ramalho
e6b6b83007 0.8.0 new features and dependency updates (#119)
* wacz can extract_screenshot only

* new meta enricher

* twitter api can use multiple authentication tokens in sequence

* cleanup non-dup logic

* meta info on archive duration

* minor html report update

* updated dependencies

* new version
2023-12-20 14:13:22 +00:00
Miguel Sozinho Ramalho
3e56ef137d reduce s3 duplicating while keeping random urls via hash (#112) 2023-12-12 19:12:03 +00:00
msramalho
ddb9dc87d7 unfortunately needed twitter->x 2023-09-20 10:17:31 +01:00
Miguel Sozinho Ramalho
21d7d2e16c format youtubedl_archiver.py 2023-08-28 11:09:03 +01:00
Dave Mateer
0bbb4c9b08 Added noplaylist true to youtubedl so that videos in playlists will work 2023-08-27 17:26:36 +01:00
msramalho
419eaef449 fixes unused tmp_dir 2023-07-28 12:50:52 +01:00
msramalho
59551b3b20 minor improvements: finding best twitter image quality 2023-07-27 21:36:15 +01:00
msramalho
3dd3775cbd removes rearchiving logic 2023-07-27 20:14:50 +01:00
msramalho
e8f44b652e minor improvements 2023-07-27 15:42:23 +01:00
msramalho
888ad8f004 fix: twitter hack videos extension detection 2023-07-26 16:12:56 +01:00
msramalho
086a9e6c84 fix: remove unnecessary log 2023-07-11 12:17:15 +01:00
msramalho
4d80ee6f02 Bump version to v0.5.27 for release 2023-07-11 12:16:06 +01:00
msramalho
92569ae6be fix: telegram archiver was outdated for images 2023-07-11 12:15:56 +01:00
msramalho
8005a1955a fixes #82 twitter api walls 2023-07-02 18:42:43 +02:00
msramalho
9191b38cf2 tbot archiver works 2023-05-02 19:04:51 +01:00
msramalho
906ed0f6e0 creating global context and refactoring tmp_dir logic 2023-03-23 11:17:38 +00:00
msramalho
7497bc08c0 Bump version to v0.4.2 for release 2023-02-23 17:14:29 +01:00