300 Commits

Author SHA1 Message Date
Patrick Robertson
0b03f54f4e Fix up config validation, and allow for custom 'validators' 2025-01-27 11:00:52 +01:00
Patrick Robertson
3fc6ddfe85 Tweaks to logging strings 2025-01-24 15:30:00 +01:00
Patrick Robertson
f1e9ab6751 Merge branch 'main' into load_modules 2025-01-24 15:23:15 +01:00
Patrick Robertson
e8138eac1c Add ubuntu-latest to the matrix of test runners (#181)
* Don't clutter logs with info about generic dropin

* Add ubuntu-latest to unit tests

This is currently failing due to an issue with oscrypto and newer openssl https://github.com/wbond/oscrypto/issues/78#issuecomment-1756317472

* fix oscrypto version for ubuntu 24 compatibility (boto3 too; see #180)

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2025-01-24 14:03:55 +00:00
Miguel Sozinho Ramalho
a6fc4e1bb1 modifies base docker image to use browsertrix 1.4.2 (#182)
* modifies base image to newest browsertrix version

* modify browsertrix cmd args based on recent experience
2025-01-24 13:59:29 +00:00
Patrick Robertson
9befb9776c Fix loading modules when entry_point isn't set 2025-01-23 21:08:54 +01:00
Patrick Robertson
b27bf8ffeb Fix up loading/storing configs + unit tests 2025-01-23 20:32:19 +01:00
Patrick Robertson
65ef46d01e Fix loading already loaded modules - don't load them twice 2025-01-23 00:09:39 +01:00
Patrick Robertson
550097ab7b Get module loading working properly 2025-01-22 23:54:21 +01:00
Patrick Robertson
ade5ea0f6f Tidy up imports + start on loading modules - program now starts much faster 2025-01-22 18:45:58 +01:00
Patrick Robertson
b6b085854c Switch back to using yaml with dot notation
(two simple helper functions to convert between dot and dict notation)
2025-01-22 17:40:51 +01:00
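Note on b6b085854c: the commit message mentions two helper functions that convert between dot and dict notation. The repository's actual helpers are not shown in this log, so the following is only a minimal sketch of what such a pair might look like. The function names `dict_to_dot` and `dot_to_dict` and the sample keys are hypothetical.

```python
def dict_to_dot(nested, prefix=""):
    """Flatten a nested dict into a flat dict with dot-separated keys.

    Hypothetical helper, not the project's actual implementation.
    """
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty sub-dicts, carrying the key path along.
            flat.update(dict_to_dot(value, full_key))
        else:
            # Leaves (and empty dicts) are stored as-is under the full path.
            flat[full_key] = value
    return flat


def dot_to_dict(flat):
    """Rebuild a nested dict from a flat dict with dot-separated keys."""
    nested = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(".")
        node = nested
        for part in parts[:-1]:
            # Create intermediate dicts on demand.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# Example config with made-up keys, round-tripped through both helpers.
config = {"steps": {"feeder": "cli"}, "logging": {"level": "INFO"}}
flat = dict_to_dot(config)
restored = dot_to_dict(flat)
```

This sketch assumes that keys themselves contain no dots; a key such as `"a.b"` would not survive the round trip.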
Patrick Robertson
54995ad6ab Further tweaks based on __manifest__.py files
Loading configs now works
2025-01-22 13:11:43 +01:00
erinhmclark
7b3a1468cd Create manifest files for archiver modules. 2025-01-22 10:21:27 +01:00
Patrick Robertson
4830f99300 Get parsing of manifest and combining with config file working 2025-01-21 20:03:10 +01:00
Patrick Robertson
241b35002c Initial changes to move to '__manifest__' format 2025-01-21 19:02:38 +01:00
Patrick Robertson
03f3770223 Add __manifest__.py for generic_extractor 2025-01-21 18:00:45 +01:00
Patrick Robertson
c41d93a634 Use already implemented helper to get version 2025-01-21 17:53:37 +01:00
Patrick Robertson
cd2ae3763f Minor adjustments
Co-authored-by: Miguel Sozinho Ramalho <19508417+msramalho@users.noreply.github.com>
2025-01-21 16:24:37 +00:00
Patrick Robertson
d3e3eb7639 unit tests for loading dropins 2025-01-21 16:59:45 +01:00
Patrick Robertson
9dde9b26d0 Patch in upstream changes to ytdlp for now
Seems like ytdlp may not merge https://github.com/yt-dlp/yt-dlp/pull/12098 anytime soon
2025-01-21 16:49:49 +01:00
Patrick Robertson
7c0dcbfd81 Re-add doc string to generic_archiver
(renamed from youtube_archiver)
2025-01-21 16:49:30 +01:00
Patrick Robertson
6388983815 Merge branch 'main' into youtubedlp-rewrite 2025-01-21 16:43:14 +01:00
Patrick Robertson
4bb4ebdf82 Further cleanup, abstracts 'dropins' out into generic files 2025-01-21 16:36:45 +01:00
Patrick Robertson
dff0105659 Small fixups + implement Truth code for posts with multiple media 2025-01-20 18:40:46 +01:00
Patrick Robertson
fd2e7f973b Further tidy-ups, also adds some ytdlp utils to 'utils' 2025-01-20 16:31:28 +01:00
Patrick Robertson
9c5a9e1bcd Rename BaseArchiver to GenericArchiver + some other tidyups 2025-01-17 17:06:04 +01:00
Patrick Robertson
5b20288d06 Add a 'version' arg to get the current running version 2025-01-17 16:59:57 +01:00
Patrick Robertson
394bcd8d47 Further refactoring of youtubedl_archiver->base_archiver
* Keep twitter_api_archiver
* Remove unit tests for obsolete archivers
* Guess filename of media using the 'Content-Type' header
* Add mechanism to run 'expensive' tests last (see conftest.py) and also flag expensive tests to fail straight off (pytest.mark.incremental)
2025-01-17 11:56:08 +01:00
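Note on 394bcd8d47: one of the bullet points describes guessing the filename of media from the 'Content-Type' header. How the project does this is not shown here. As a rough sketch, the standard library's `mimetypes.guess_extension` can map a MIME type to a file extension. The function name `extension_from_content_type` and the `default` parameter are hypothetical.

```python
import mimetypes


def extension_from_content_type(content_type, default=""):
    """Return a file extension (with leading dot) for a Content-Type value.

    Hypothetical helper: strips parameters such as '; charset=utf-8'
    before looking up the MIME type.
    """
    mime_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None for unknown types, so fall back to default.
    return mimetypes.guess_extension(mime_type) or default


# Build a filename from a made-up base name and a header value.
filename = "media_0001" + extension_from_content_type("image/png")
```

The header only reflects what the server claims to be sending, so it can be wrong or missing; inspecting the downloaded bytes is a useful complement.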
erinhmclark
6fabe2a189 Fixed twitter_archiver.py changes. 2025-01-16 09:56:54 +00:00
erinhmclark
bbb3269c2b Changes from main. 2025-01-16 09:30:32 +00:00
erinhmclark
d3eec5d90f Basic docs structure for RTD 2025-01-15 21:45:29 +00:00
Patrick Robertson
74cf1f5f23 Merge branch 'main' into youtubedlp-rewrite 2025-01-15 17:47:23 +01:00
Patrick Robertson
4f2b9baa73 refactor youtubedlp archiver to work for all valid websites
1. Extract more metadata
2. Better extract thumbnail
3. Setup framework for specific sites to provide more granular metadata processing
2025-01-15 17:46:47 +01:00
Patrick Robertson
c3dd19f309 Sniff filetype of downloaded media and add extension
Also download in chunks - fixes 2 x TODOs
2025-01-15 17:46:47 +01:00
Patrick Robertson
306df62a98 Fix all instances of utcnow() 2025-01-14 17:51:41 +01:00
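Note on 306df62a98: `datetime.utcnow()` is deprecated as of Python 3.12 because it returns a naive datetime. The exact replacement used in the repository is not shown in this log; the usual fix, sketched below, is to request a timezone-aware value. The wrapper name `utc_now` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def utc_now():
    """Return the current time as a timezone-aware UTC datetime."""
    # Replaces the deprecated, naive datetime.utcnow().
    return datetime.now(timezone.utc)


now = utc_now()
# An aware datetime carries its offset, so isoformat() includes '+00:00'.
stamp = now.isoformat()
```

Aware and naive datetimes cannot be compared or subtracted, so any stored naive timestamps need converting when they meet values produced this way.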
Patrick Robertson
20726c1116 Remove tiktok-downloader - getting info is broken
TODO: switch to using youtube-dlp
2025-01-14 17:40:45 +01:00
Patrick Robertson
2eb2ab9ac9 Merge branch 'main' into remove_dependencies 2025-01-14 17:39:20 +01:00
Patrick Robertson
080f474d49 Remove minify_html package - HTML file is no longer minified
Savings were 5K (~15KB vs ~20KB) for the generated .html file, but minify_html is currently not compatible with python3.13+
2025-01-14 11:36:10 +01:00
Patrick Robertson
4e13a09a87 Fix deprecation warning about utcnow 2025-01-14 11:01:40 +01:00
Patrick Robertson
1b1af2f0b1 Revert change to twitter_archiver
As per discussion at: https://github.com/bellingcat/auto-archiver/pull/165#discussion_r1905930837
2025-01-14 10:30:41 +01:00
Patrick Robertson
bdfedfcf61 Merge branch 'main' into feat/unittest 2025-01-13 19:50:47 +01:00
Erin Clark
9cdaea873b Merge pull request #164 from bellingcat/ec_add_poetry
Migrate to Poetry
2025-01-13 18:49:15 +00:00
Patrick Robertson
528b78db85 Flag tombstone tweets for twitter_syndication method 2025-01-13 18:17:24 +01:00
Patrick Robertson
57eacdc24a Merge branch 'main' into feat/unittest 2025-01-13 18:06:55 +01:00
Patrick Robertson
63973e2ce7 switch to pytest and pytest-recording 2025-01-13 16:23:20 +01:00
erinhmclark
d80b4b7557 Remove snscrape and Python 3.12 restriction. 2025-01-12 12:15:56 +00:00
erinhmclark
6d5b0090d9 Pull version from pyproject.toml file. 2025-01-12 12:15:56 +00:00
erinhmclark
6da837b374 Add note to update dynamic versioning and references to version. 2025-01-12 12:15:56 +00:00
Patrick Robertson
3546d4ad79 Fix 'download_syndication' method for tweet archiving (now requires a token)
Plus add in unit tests for token generation + download syndication
2025-01-12 12:55:00 +01:00
Patrick Robertson
c932fb7416 Improved logging when attempting to download an invalid/deleted tweet
Plus: unit tests for non-existent tweet + invalid tweet ID
2025-01-12 12:00:45 +01:00