Commit Graph

299 Commits

Author SHA1 Message Date
Patrick Robertson
74cf1f5f23 Merge branch 'main' into youtubedlp-rewrite 2025-01-15 17:47:23 +01:00
Patrick Robertson
4f2b9baa73 refactor youtubedlp archiver to work for all valid websites
1. Extract more metadata
2. Better extract thumbnail
3. Setup framework for specific sites to provide more granular metadata processing
2025-01-15 17:46:47 +01:00
Patrick Robertson
c3dd19f309 Sniff filetype of downloaded media and add extension
Also download in chunks - fixes 2 x TODOs
2025-01-15 17:46:47 +01:00
Patrick Robertson
306df62a98 Fix all instances of utcnow() 2025-01-14 17:51:41 +01:00
Patrick Robertson
20726c1116 Remove tiktok-downloader - getting info is broken
TODO: switch to using youtube-dlp
2025-01-14 17:40:45 +01:00
Patrick Robertson
2eb2ab9ac9 Merge branch 'main' into remove_dependencies 2025-01-14 17:39:20 +01:00
Patrick Robertson
080f474d49 Remove minify_html package - HTML file is no longer minified
Savings were 5K (~15KB vs ~20KB) for the generated .html file, but minify_html is currently not compatible with python3.13+
2025-01-14 11:36:10 +01:00
Patrick Robertson
4e13a09a87 Fix deprecation warning about utcnow 2025-01-14 11:01:40 +01:00
Patrick Robertson
1b1af2f0b1 Revert change to twitter_archiver
As per discussion at: https://github.com/bellingcat/auto-archiver/pull/165#discussion_r1905930837
2025-01-14 10:30:41 +01:00
Patrick Robertson
bdfedfcf61 Merge branch 'main' into feat/unittest 2025-01-13 19:50:47 +01:00
Erin Clark
9cdaea873b Merge pull request #164 from bellingcat/ec_add_poetry
Migrate to Poetry
2025-01-13 18:49:15 +00:00
Patrick Robertson
528b78db85 Flag tombstone tweets for twitter_syndication method 2025-01-13 18:17:24 +01:00
Patrick Robertson
57eacdc24a Merge branch 'main' into feat/unittest 2025-01-13 18:06:55 +01:00
Patrick Robertson
63973e2ce7 switch to pytest and pytest-recording 2025-01-13 16:23:20 +01:00
erinhmclark
d80b4b7557 Remove snscrape and Python 3.12 restriction. 2025-01-12 12:15:56 +00:00
erinhmclark
6d5b0090d9 Pull version from pyproject.toml file. 2025-01-12 12:15:56 +00:00
erinhmclark
6da837b374 Add note to update dynamic versioning and references to version. 2025-01-12 12:15:56 +00:00
Patrick Robertson
3546d4ad79 Fix 'download_syndication' method for tweet archiving (now requires a token)
Plus add in unit tests for token generation + download syndication
2025-01-12 12:55:00 +01:00
Patrick Robertson
c932fb7416 Improved logging when an invalid/deleted tweet is attempted to be downloaded
Plus: unit tests for non-existent tweet + invalid tweet ID
2025-01-12 12:00:45 +01:00
Patrick Robertson
f29950905c Merge branch 'main' into small_issues 2025-01-12 11:47:55 +01:00
Patrick Robertson
add83c9650 Remove snscrape from twitter_archiver
1. snscrape twitter downloader no longer works (ref: https://github.com/JustAnotherArchivist/snscrape/issues/1045)
2. snscrape limits python to versions <3.12
2025-01-07 19:40:19 +01:00
Miguel Sozinho Ramalho
a697f0a212 adds an unauthenticated Bluesky archiver (#160)
* adds a TODO for next code iterations

* implements bsky archiver

* adds new archiver to example orchestration file

* Fix downloading media for posts with multiple images

(Images are stored in media/images)

* Setup a basic framework for unit tests

Use 'python -m unittest' from the project root to run

---------

Co-authored-by: Patrick Robertson <robertson.patrick@gmail.com>
2025-01-07 10:28:07 +00:00
Patrick Robertson
bffa3a6254 Merge pull request #159 from bellingcat/print_pdf
Add 'print_pdf' option to the screenshot enricher. Fixes #132
2025-01-06 18:13:38 +01:00
Miguel Sozinho Ramalho
ef471f41e1 adds better debug for wayback failures (#161) 2025-01-06 16:49:11 +00:00
Patrick Robertson
928518cda7 Allow setting cookies for yt-dl (#158) 2025-01-06 16:19:53 +00:00
Patrick Robertson
0c803f15a5 Fix showing preview images in the .html file when using local storage
Local storage media urls are prefixed with '/', previously only http(s) media preview src were displayed
2024-12-31 09:29:31 +01:00
Patrick Robertson
a46f9997ea Better logging when there's a timestamp parse error 2024-12-31 09:28:08 +01:00
msramalho
83da9ae089 adds pdf preview support for html formatter 2024-12-23 18:19:26 +00:00
Patrick Robertson
663c8ad93a Add 'print_pdf' option to the screenshot enricher. Fixes #132 2024-12-20 07:14:03 +01:00
msramalho
e49550163f adds proxy_server option to wacz 2024-10-06 10:45:34 +06:00
msramalho
e6f5981afc numpy version downgrade 2024-10-06 10:10:04 +06:00
msramalho
c62bf1a34d yt-dlp version bump 2024-10-05 17:43:07 +06:00
msramalho
b166d57e61 v0.12.0 bump 2024-08-21 13:34:34 +01:00
msramalho
004143a58a version bump v0.11.6 2024-07-18 11:27:39 +01:00
msramalho
1e375bd740 version bump 2024-05-14 16:42:15 +01:00
Miguel Sozinho Ramalho
f8824691dd refactors free twitter archiver strategies (#142) 2024-05-14 16:23:33 +01:00
msramalho
012cc36609 removes deprecated datetime method 2024-05-14 15:54:50 +01:00
Miguel Sozinho Ramalho
7cfe1e39cc #135 fix cleanup of telethon session files (#139)
* closes #135

* version bump
2024-04-16 12:45:45 +01:00
Jett Chen
cf8691bad7 Add yt-dlp based archiving for TwitterArchiver (#138)
* Add ytdlp archiving capability

* Add type annotation

* version bump

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2024-04-15 19:54:55 +01:00
R. Miles McCain
f603400d0d Add direct Atlos integration (#137)
* Add Atlos feeder

* Add Atlos db

* Add Atlos storage

* Fix Atlos storages

* Fix Atlos feeder

* Only include URLs in Atlos feeder once they're processed

* Remove print

* Add Atlos documentation to README

* Formatting fixes

* Don't archive existing material

* avoid KeyError in atlos_db

* version bump

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2024-04-15 19:25:17 +01:00
msramalho
eb37f0b45b version bump 2024-04-15 19:02:54 +01:00
msramalho
75497f5773 minor bug fix when using an archiver_enricher in enrichers only 2024-04-15 19:02:40 +01:00
msramalho
9c7824de57 browsertrix docker updates 2024-04-15 19:01:55 +01:00
msramalho
f4827770e6 adds instagram no stories as success, and fix for telethon-based archivers. 2024-03-05 14:49:10 +00:00
msramalho
601572d76e strip url 2024-02-29 11:54:01 +00:00
msramalho
d21e79a272 general security updates 2024-02-29 11:40:30 +00:00
msramalho
ccf5f857ef adds configurable limits to instagram/youtube 2024-02-25 15:14:17 +00:00
msramalho
7de317d1b5 avoiding exception 2024-02-23 15:54:33 +00:00
msramalho
70075a1e5e improving insta archiver 2024-02-23 15:37:28 +00:00
msramalho
5b9bc4919a version bump 2024-02-23 14:08:23 +00:00
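
Note on the utcnow() commits (306df62a98, 4e13a09a87, 012cc36609): the log does not show how the calls were replaced. As a minimal sketch, assuming the usual standard-library migration, the naive `datetime.utcnow()` (deprecated since Python 3.12) can be swapped for a timezone-aware `datetime.now(timezone.utc)`. The helper names `utc_now` and `utc_timestamp_iso` are hypothetical and do not come from the repository.

```python
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime.

    Hypothetical helper: replaces the deprecated, naive datetime.utcnow().
    """
    return datetime.now(timezone.utc)


def utc_timestamp_iso() -> str:
    """Return the current UTC time as an ISO 8601 string with an explicit offset."""
    return utc_now().isoformat()


if __name__ == "__main__":
    print(utc_timestamp_iso())
```

Unlike `utcnow()`, the value returned here carries `tzinfo`, so comparing it with a naive datetime raises a `TypeError`. Code that stored naive timestamps may need to be updated at the same time.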