msramalho
bc06de8e5c
fixes incomplete yt-dlp parts download
2026-04-27 12:34:47 +01:00
msramalho
a57a5ee005
adds an extra check when calling pypi as it's led to uncaught ssl errors
2026-04-23 14:20:07 +01:00
msramalho
ae0e53e434
adds tests for new ghostarchive enricher feature
2026-04-06 17:15:32 +01:00
msramalho
3194fee95d
fix telethon bug when running in celery workers that close the event loop
2026-03-12 10:20:11 +00:00
msramalho
23a88e3cf4
ci issues
2026-03-02 17:07:09 +00:00
msramalho
e9a92272c5
bug fix: missing filename on url download
2026-03-02 17:01:16 +00:00
msramalho
077b03fc61
minor tests change to work in gh actions
2026-03-02 14:08:14 +00:00
msramalho
bc66dd4f2a
fxtwitter working instead of nitter
2026-03-02 12:31:28 +00:00
msramalho
f465b570cd
adding missing tests (no download)
2026-03-02 12:14:47 +00:00
msramalho
3e2c0b564b
wiki fix
2026-01-08 15:49:42 +00:00
msramalho
536cbd905f
puts tests file in correct directory
2026-01-08 14:55:40 +00:00
msramalho
a936921c4e
updates new utils file and test
2026-01-08 14:54:06 +00:00
Miguel Sozinho Ramalho
68f672a4fa
Merge branch 'dev' into fix/improve-deleted-post-detection
2026-01-08 14:36:17 +00:00
msramalho
bac809451c
expands tests to include non-predefined metadata keys
2026-01-08 14:33:16 +00:00
msramalho
53dc9904ce
refactors PR to obey standard code approach
2026-01-08 14:30:26 +00:00
Miguel Sozinho Ramalho
c1f312d42a
Merge branch 'dev' into specify-medatada-feature
2026-01-08 14:04:42 +00:00
m4cd4r4
d02e7e0f02
Add comprehensive deletion detection for removed/unavailable content
Implements issue #335: improve detection of deleted/missing posts
## Changes
### New Deletion Detection System
- Created `deletion_detection.py` utility module with platform-specific
indicators for Twitter, Facebook, Instagram, TikTok, YouTube, Reddit,
VK, and Telegram
- Detects deletion via HTML content, page titles, error messages, and
video metadata
- Stores detailed deletion context (indicator, source, platform) in
metadata for investigators
### Integration Points
- **Antibot Extractor**: Checks HTML and page titles after page load;
resolves TODO about detecting deleted videos
- **Generic Extractor**: Checks yt-dlp video data and error messages
for deletion indicators
- **Twitter Dropin**: Enhanced detection when user/created_at fields
are missing
### Test Coverage
- Comprehensive test suite covering all platforms
- Tests for HTML, title, error message, and metadata detection
- Validates that normal content is not falsely flagged
## Impact for Conflict Documentation
This fix is critical for evidence preservation in war-torn regions:
- Investigators can now document that evidence existed but was deleted
- Prevents wasted archival attempts on deleted content
- Tracks patterns of content removal
- Preserves metadata about what was deleted and when
Twitter example: Detects "Hmm...this page doesn't exist. Try searching
for something else" and flags content as deleted_or_unavailable.
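As a rough illustration of the Twitter example above, the check could be sketched as follows. This is not the contents of `deletion_detection.py`: the function name `detect_deletion`, the `DELETION_INDICATORS` table, and the shape of the returned dictionary are all hypothetical, and only the Twitter phrase and the `deleted_or_unavailable` flag are taken from this commit message.

```python
from typing import Optional

# Hypothetical indicator table (assumption). Only the Twitter phrase below
# comes from the commit message; other platforms would need their own entries.
DELETION_INDICATORS = {
    "twitter": ["hmm...this page doesn't exist"],
}


def detect_deletion(platform: str, html: str = "", title: str = "") -> Optional[dict]:
    """Return deletion context if a known indicator is found, else None."""
    key = platform.lower()
    indicators = DELETION_INDICATORS.get(key, [])
    # Check each source in turn so the result records where the match was found.
    for source, text in (("html", html), ("title", title)):
        lowered = text.lower()
        for indicator in indicators:
            if indicator in lowered:
                return {
                    "status": "deleted_or_unavailable",
                    "indicator": indicator,
                    "source": source,
                    "platform": key,
                }
    return None
```

Returning the matched indicator together with its source and platform mirrors the "detailed deletion context" described above, while a `None` result leaves normal content unflagged.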
2025-12-17 18:40:58 +08:00
msramalho
3a22cc28c0
skip tiktok antibot test in CI
2025-10-23 10:17:14 +01:00
msramalho
dbb3dfa04f
fixes wikipedia test
2025-10-23 10:04:44 +01:00
msramalho
43cbc6ac56
generic extractor improvements
2025-10-23 09:51:14 +01:00
mgaughan
94e0803fb3
implementing default metadata omission/user metadata selection
2025-09-22 20:16:40 -05:00
msramalho
d3efd7121c
avoid empty metadata comments
2025-07-06 14:05:17 +01:00
msramalho
9d3cd5774b
an improved approach for #295
2025-07-06 14:04:01 +01:00
msramalho
3a34a49822
adds antibot tiktok logic for photos closes #295
2025-07-05 18:31:12 +01:00
msramalho
c2c9718f73
make python api tests work on gh when no env is set
2025-06-30 02:20:51 +01:00
msramalho
afd9090a4c
concludes logging standardization refactor
2025-06-26 17:20:04 +01:00
msramalho
ce4d7ac649
WIP refactor logging
2025-06-21 15:54:51 +01:00
msramalho
12b457706b
closes #166 adds story URL feature to telethon extractor
2025-06-18 17:37:44 +01:00
msramalho
592dc30415
closes #330
2025-06-18 16:40:55 +01:00
msramalho
4a36e6f6b0
fix tests
2025-06-18 13:50:21 +01:00
msramalho
76fd329fe5
twitter tests fix
2025-06-17 23:51:03 +01:00
msramalho
23b781c866
new check for edge case
2025-06-17 20:36:22 +01:00
msramalho
2aec240128
thumbnail enricher always runs probe by default
2025-06-17 20:28:20 +01:00
msramalho
cd6a2b6031
generic_extractor download tests adaptations
2025-06-11 20:05:35 +01:00
msramalho
d7a48e465b
fix copypasta
2025-06-11 18:04:49 +01:00
msramalho
f5be7a50c1
Testing Linkedin Dropin for Antibot
2025-06-11 16:52:03 +01:00
msramalho
3cf51dd874
adds tracker remove feature and tests
2025-06-11 11:56:42 +01:00
msramalho
69ddb72146
separate reddit tests
2025-06-11 11:27:11 +01:00
msramalho
1039e9631f
new reddit tests with .env.test
2025-06-11 11:22:23 +01:00
msramalho
8314833ae8
removes exclude_media_extensions option
2025-06-10 18:34:33 +01:00
msramalho
6bbc7fb47a
improves antibot flow and makes auth_wall detection optional
2025-06-10 16:29:07 +01:00
msramalho
287e823f43
improves twitter URL cleaning and introduces another bestquality check
2025-06-10 16:09:38 +01:00
msramalho
c815488daa
adds new URLs to ignore
2025-06-10 15:44:52 +01:00
msramalho
6f02493ff1
adds clips extraction to VK, though generic_extractor should still be run for those
2025-06-08 14:36:55 +01:00
msramalho
d13a5ef003
adds tests and minor improvements
2025-06-07 19:58:18 +01:00
msramalho
5491f3e9e7
fixing s3 storage tests
2025-06-04 14:41:00 +01:00
msramalho
264ba82ea0
finish removing screenshot_enricher references
2025-06-04 14:31:07 +01:00
msramalho
2c6be4447f
linting
2025-06-04 14:17:38 +01:00
msramalho
22408e2a98
adds test for antibot
2025-06-04 11:59:59 +01:00
msramalho
cbd189c97d
general cleanup
2025-06-04 11:53:01 +01:00