msramalho
bc06de8e5c
fixes incomplete yt-dlp parts download
2026-04-27 12:34:47 +01:00
msramalho
096c9d09ef
fix for unexpected types for json.dump
2026-01-08 15:18:19 +00:00
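A common way to make `json.dump` tolerate unexpected types is a `default` fallback. The sketch below is an assumption about the general technique, not the code from this commit; `to_jsonable` and `dumps_safe` are hypothetical names, and `json.dumps` is used only because it accepts the same `default` parameter as `json.dump`.

```python
import json
from datetime import datetime


def to_jsonable(value):
    """Fallback called by json only for objects it cannot serialize natively."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, (set, frozenset)):
        # Sets have no JSON equivalent; emit a deterministic list instead.
        return sorted(value, key=str)
    # Last resort (hypothetical policy): fall back to the string form.
    return str(value)


def dumps_safe(data) -> str:
    """Serialize data, routing unsupported types through to_jsonable."""
    return json.dumps(data, default=to_jsonable)
```

Falling back to `str` keeps a log or metadata write from failing, at the cost of losing the original type on reload.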
msramalho
a936921c4e
updates new utils file and test
2026-01-08 14:54:06 +00:00
m4cd4r4
d02e7e0f02
Add comprehensive deletion detection for removed/unavailable content
Implements issue #335: improve detection of deleted/missing posts
## Changes
### New Deletion Detection System
- Created `deletion_detection.py` utility module with platform-specific
indicators for Twitter, Facebook, Instagram, TikTok, YouTube, Reddit,
VK, and Telegram
- Detects deletion via HTML content, page titles, error messages, and
video metadata
- Stores detailed deletion context (indicator, source, platform) in
metadata for investigators
### Integration Points
- **Antibot Extractor**: Checks HTML and page titles after page load;
resolves TODO about detecting deleted videos
- **Generic Extractor**: Checks yt-dlp video data and error messages
for deletion indicators
- **Twitter Dropin**: Enhanced detection when user/created_at fields
are missing
### Test Coverage
- Comprehensive test suite covering all platforms
- Tests for HTML, title, error message, and metadata detection
- Validates that normal content is not falsely flagged
## Impact for Conflict Documentation
This fix is critical for evidence preservation in war-torn regions:
- Investigators can now document that evidence existed but was deleted
- Prevents wasted archival attempts on deleted content
- Tracks patterns of content removal
- Preserves metadata about what was deleted and when
Twitter example: Detects "Hmm...this page doesn't exist. Try searching
for something else" and flags content as deleted_or_unavailable.
2025-12-17 18:40:58 +08:00
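The indicator matching described in the commit above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of `deletion_detection.py`: the function name `detect_deletion`, the `DELETION_INDICATORS` table, and the shape of the returned dictionary are all hypothetical. Only the Twitter phrase and the `deleted_or_unavailable` flag are taken from the commit message.

```python
from typing import Optional

# Hypothetical indicator table; only the Twitter phrase comes from the commit message.
DELETION_INDICATORS = {
    "twitter": ["hmm...this page doesn't exist"],
}


def detect_deletion(platform: str, html: str = "", title: str = "") -> Optional[dict]:
    """Return deletion context if a known indicator is found, else None."""
    indicators = DELETION_INDICATORS.get(platform, [])
    # Check each text source in turn and remember where the match came from.
    for source, text in (("html", html), ("title", title)):
        lowered = text.lower()
        for indicator in indicators:
            if indicator in lowered:
                return {
                    "status": "deleted_or_unavailable",
                    "platform": platform,
                    "source": source,
                    "indicator": indicator,
                }
    return None
```

Returning the matched indicator and its source, rather than a bare boolean, mirrors the commit's stated goal of storing deletion context in the metadata.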
msramalho
7c9475cde2
allow for human readable console logs, but defaults to JSON on file logs.
2025-06-30 00:53:10 +01:00
msramalho
afd9090a4c
concludes logging standardization refactor
2025-06-26 17:20:04 +01:00
msramalho
ce4d7ac649
WIP refactor logging
2025-06-21 15:54:51 +01:00
msramalho
cd19181d8f
minor improvements
2025-06-11 16:51:42 +01:00
msramalho
3cf51dd874
adds tracker remove feature and tests
2025-06-11 11:56:42 +01:00
msramalho
8314833ae8
removes exclude_media_extensions option
2025-06-10 18:34:33 +01:00
msramalho
287e823f43
improves twitter URL cleaning and introduces another bestquality check
2025-06-10 16:09:38 +01:00
msramalho
c815488daa
adds new URLs to ignore
2025-06-10 15:44:52 +01:00
msramalho
07ff5baf07
adds Dropin flexible integration for antibot
2025-06-07 19:09:37 +01:00
msramalho
c7a84bc97a
generalizes ydl info to filename method for reusing
2025-06-07 18:14:08 +01:00
msramalho
5f68c151a0
removes webdriver utils used by screenshot enricher
2025-06-04 14:17:19 +01:00
Patrick Robertson
aacb874b56
removeprefix for www. is required here
2025-03-21 12:23:45 +04:00
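One plausible reason `removeprefix` matters for `www.` is that `str.lstrip` strips a set of characters rather than a literal prefix. The sketch below illustrates that difference; `strip_www` is a hypothetical helper, not a function from the repository, and `str.removeprefix` requires Python 3.9 or later.

```python
def strip_www(hostname: str) -> str:
    """Drop a single leading 'www.' and leave everything else untouched."""
    return hostname.removeprefix("www.")


# lstrip treats its argument as a character set, so it can remove
# leading letters that are part of the real domain name.
broken = "www.wired.com".lstrip("www.")
fixed = strip_www("www.wired.com")
```

Here `broken` loses the leading `w` of `wired`, while `fixed` keeps the domain intact.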
Patrick Robertson
14c56f4916
Provide better logs for screenshot enricher when auth is/isn't supported (cookies only)
2025-03-21 12:05:47 +04:00
Patrick Robertson
168dfb6254
Unit tests for url utils
2025-03-21 11:53:47 +04:00
Patrick Robertson
99e9ac2465
Fix 'Syntax Error' warning in python3.12+
2025-03-17 09:29:51 +00:00
Patrick Robertson
19715c8ec2
Merge branch 'main' into webdriver-cookies
2025-03-14 12:44:48 +00:00
Patrick Robertson
f6b13327f0
Tweaks and additional debug logging
2025-03-13 17:41:41 +00:00
Patrick Robertson
0efeaaabb1
Revert to using time.sleep and .click() - since we only want to be waiting the first time (for the page to load)
2025-03-13 17:41:16 +00:00
Patrick Robertson
7a81ab617a
Better checking of cookies to add to webdriver
2025-03-11 11:57:25 +00:00
erinhmclark
e7fa88f1c7
Implementing ruff suggestions.
2025-03-10 21:45:30 +00:00
erinhmclark
ca44a40b88
Ruff fix on src.
2025-03-10 19:03:45 +00:00
erinhmclark
85abe1837a
Ruff format with defaults.
2025-03-10 18:44:54 +00:00
Patrick Robertson
e519ba2433
Add 'reject all' cookie button
2025-03-07 16:40:34 +00:00
Patrick Robertson
dba44b1ac1
Use WebDriverWait when waiting for elements in screenshot enricher
2025-03-07 12:07:54 +00:00
Patrick Robertson
0dfab2d1bc
Add some code to attempt to click the cookies banners on various websites
2025-03-03 15:55:04 +00:00
Patrick Robertson
dea0a49600
Download correct gecko-driver for the platform + fix setting executable path when running in Docker
Fixes #232
2025-03-03 15:41:44 +00:00
erinhmclark
8124bb831d
Merge branch 'main' into small_issues
# Conflicts:
# src/auto_archiver/core/base_module.py
# src/auto_archiver/utils/misc.py
2025-02-26 13:19:49 +00:00
erinhmclark
9157846930
Add docstrings to explain date formats.
2025-02-26 10:01:52 +00:00
erinhmclark
83a08dd215
Update date parsing to use dateutil.parser in misc.py
2025-02-25 20:17:31 +00:00
Patrick Robertson
eda359a1ef
Fix json loader - it should go in 'validators' not 'utils'
Fixes #214
2025-02-20 13:10:39 +00:00
Patrick Robertson
7734a551fa
Move 'assert_valid_url' out into utils, don't use assert but raise
assert is recommended only for debugging
2025-02-20 11:19:29 +00:00
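The reasoning in the commit above (raise instead of `assert`) can be illustrated with a small validator. This is a sketch only: `check_url` and its rules are hypothetical and do not reproduce the repository's `assert_valid_url`. The underlying point is that `assert` statements are skipped when Python runs with `-O`, so validation that must always happen should raise explicitly.

```python
from urllib.parse import urlparse


def check_url(url: str) -> str:
    """Return the URL unchanged if it looks valid, otherwise raise ValueError."""
    parsed = urlparse(url)
    # Explicit raises still run under `python -O`, unlike assert statements.
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid URL scheme: {url!r}")
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    return url
```

Callers can then catch `ValueError` and report the bad URL instead of relying on an `AssertionError` that may never fire.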
erinhmclark
ce5a200d1f
Added tests, updated instagram_tbot_extractor.py raise failure.
2025-02-18 12:59:10 +00:00
Patrick Robertson
1b976f4c09
Remove unused atlos util functions
2025-02-11 18:49:54 +00:00
Patrick Robertson
756f46012b
Remove empty file
2025-02-11 18:47:54 +00:00
erinhmclark
8d894066f2
Merge branch 'load_modules' into add_module_tests
# Conflicts:
# src/auto_archiver/modules/gsheet_feeder/gsheet_feeder.py
# src/auto_archiver/utils/misc.py
2025-02-10 19:00:05 +00:00
msramalho
ab6cf52533
fixes bad hash initialization
2025-02-10 16:45:28 +00:00
erinhmclark
c4bb667cec
Merge branch 'load_modules' into add_module_tests
# Conflicts:
# src/auto_archiver/modules/s3_storage/s3_storage.py
# src/auto_archiver/utils/gsheet.py
# src/auto_archiver/utils/misc.py
2025-02-10 16:17:08 +00:00
erinhmclark
f311621e58
Small fixes.
Add timestamp helper method.
2025-02-10 15:57:42 +00:00
msramalho
15abf686b1
decouples s3_storage from hash_enricher
2025-02-10 15:48:54 +00:00
Patrick Robertson
63aba6ad39
Fix sphinx-autoapi imports
2025-02-07 21:54:49 +01:00
erinhmclark
266c7a14e6
Context related fixes, some more tests.
2025-02-06 16:53:00 +00:00
Patrick Robertson
0633e17998
Close the facebook 'login' window if it's there - to allow for proper screenshots
2025-02-04 14:18:46 +01:00
Patrick Robertson
72b5ea9ab6
Restore headless arg
2025-02-03 17:40:40 +01:00
Patrick Robertson
c574b694ed
Set up screenshot enricher to use authentication/cookies
2025-02-03 17:25:59 +01:00
Patrick Robertson
b7d9145f6c
Further tidyups + refactoring for new structure
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
2025-01-30 13:21:10 +01:00
Patrick Robertson
3d37c494aa
Tidy ups + unit tests:
1. Allow loading modules from --module_paths=/extra/path/here
2. Improved unit tests for module loading
3. Further small tidy ups/clean ups
2025-01-29 18:42:49 +01:00