Commit Graph

721 Commits

Author SHA1 Message Date
msramalho
a739361e12 bug fix: wacz screenshots leak in shared session 2026-02-23 16:26:36 +00:00
msramalho
2d13077fad bumping ruff version 2026-02-23 12:36:53 +00:00
msramalho
a09927c507 minor docs fix 2026-02-23 12:18:47 +00:00
msramalho
5fd23baa55 this is ruff 2026-01-08 15:48:08 +00:00
msramalho
096c9d09ef fix for unexpected types for json.dump 2026-01-08 15:18:19 +00:00
msramalho
a936921c4e updates new utils file and test 2026-01-08 14:54:06 +00:00
Miguel Sozinho Ramalho
68f672a4fa Merge branch 'dev' into fix/improve-deleted-post-detection 2026-01-08 14:36:17 +00:00
msramalho
53dc9904ce refactors PR to obey standard code approach 2026-01-08 14:30:26 +00:00
Miguel Sozinho Ramalho
c1f312d42a Merge branch 'dev' into specify-medatada-feature 2026-01-08 14:04:42 +00:00
msramalho
23c9dfe717 updating dependencies 2026-01-08 13:53:44 +00:00
m4cd4r4
d02e7e0f02 Add comprehensive deletion detection for removed/unavailable content
Implements issue #335: improve detection of deleted/missing posts

## Changes

### New Deletion Detection System
- Created `deletion_detection.py` utility module with platform-specific
  indicators for Twitter, Facebook, Instagram, TikTok, YouTube, Reddit,
  VK, and Telegram
- Detects deletion via HTML content, page titles, error messages, and
  video metadata
- Stores detailed deletion context (indicator, source, platform) in
  metadata for investigators

### Integration Points
- **Antibot Extractor**: Checks HTML and page titles after page load;
  resolves TODO about detecting deleted videos
- **Generic Extractor**: Checks yt-dlp video data and error messages
  for deletion indicators
- **Twitter Dropin**: Enhanced detection when user/created_at fields
  are missing

### Test Coverage
- Comprehensive test suite covering all platforms
- Tests for HTML, title, error message, and metadata detection
- Validates that normal content is not falsely flagged

## Impact for Conflict Documentation

This fix is critical for evidence preservation in war-torn regions:
- Investigators can now document that evidence existed but was deleted
- Prevents wasted archival attempts on deleted content
- Tracks patterns of content removal
- Preserves metadata about what was deleted and when

Twitter example: Detects "Hmm...this page doesn't exist. Try searching
for something else" and flags content as deleted_or_unavailable.
2025-12-17 18:40:58 +08:00
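As a rough illustration of the detection flow described in the commit message above, the following is a minimal sketch and not the actual contents of `deletion_detection.py`. The names `DELETION_INDICATORS`, `DeletionMatch`, and `detect_deletion` are hypothetical, and the indicator table is an assumption apart from the Twitter phrase quoted in the message. It checks HTML, page title, and error text for a platform-specific indicator and returns the context (indicator, source, platform) that the message says is stored in metadata.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical indicator table. Only the Twitter phrase is taken from the
# commit message; a real module would cover the other listed platforms.
DELETION_INDICATORS = {
    "twitter": ["hmm...this page doesn't exist"],
}


@dataclass
class DeletionMatch:
    """Context recorded when a deletion indicator is found (assumed shape)."""
    platform: str
    source: str  # which input matched: "html", "title", or "error"
    indicator: str
    status: str = "deleted_or_unavailable"


def detect_deletion(platform: str, html: str = "", title: str = "",
                    error: str = "") -> Optional[DeletionMatch]:
    """Return a DeletionMatch if any known indicator appears, else None."""
    sources = {"html": html, "title": title, "error": error}
    for indicator in DELETION_INDICATORS.get(platform, []):
        for source, text in sources.items():
            # Case-insensitive substring match against each input.
            if indicator in text.lower():
                return DeletionMatch(platform, source, indicator)
    return None
```

Returning `None` for unmatched content mirrors the stated goal that normal content is not falsely flagged; an extractor could copy the fields of a returned match into its result metadata.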
msramalho
43cbc6ac56 generic extractor improvements 2025-10-23 09:51:14 +01:00
mgaughan
94e0803fb3 implementing default metadata omission/user metadata selection 2025-09-22 20:16:40 -05:00
msramalho
2081c16555 embed retry into timestamping 2025-07-10 14:49:53 +01:00
msramalho
d3efd7121c avoid empty metadata comments 2025-07-06 14:05:17 +01:00
msramalho
9d3cd5774b an improved approach for #295 2025-07-06 14:04:01 +01:00
msramalho
c1506ee1cf some wayback errors are expected and should be warnings 2025-07-05 18:31:39 +01:00
msramalho
3a34a49822 adds antibot tiktok logic for photos closes #295 2025-07-05 18:31:12 +01:00
msramalho
37c6d97275 new auth wall check logic and escaped CSS selector in selenium 2025-07-05 18:30:31 +01:00
msramalho
7234eda85f expands Sheets API retries for really large spreadsheets 2025-07-05 18:29:33 +01:00
msramalho
a8c1ef3912 generic_extractor config to use proxy only when needed to avoid overzealousness 2025-07-05 16:54:58 +01:00
msramalho
2051e8e491 adds further exponential backoff for Sheets API worksheet enumeration 2025-07-05 16:02:07 +01:00
msramalho
21255db86a stops using service that is not up for timestamping 2025-07-05 16:00:46 +01:00
msramalho
eae0da08b3 fix issue with two runs of antibot extractor 2025-07-05 16:00:03 +01:00
msramalho
649412053e exclude non-ready code 2025-06-30 02:27:21 +01:00
msramalho
b2648fa3cd follow docs advice on exponential backoff of SheetsAPI 2025-06-30 01:47:12 +01:00
msramalho
4ad71b3589 adds retry to worksheet read for slow worksheets 2025-06-30 01:42:34 +01:00
msramalho
7c9475cde2 allow for human readable console logs, but defaults to JSON on file logs. 2025-06-30 00:53:10 +01:00
msramalho
afd9090a4c concludes logging standardization refactor 2025-06-26 17:20:04 +01:00
msramalho
ad29cb4447 adds post_data to metadata for instagram 2025-06-26 15:48:10 +01:00
msramalho
ce4d7ac649 WIP refactor logging 2025-06-21 15:54:51 +01:00
msramalho
12b457706b closes #166 adds story URL feature to telethon extractor 2025-06-18 17:37:44 +01:00
msramalho
592dc30415 closes #330 2025-06-18 16:40:55 +01:00
msramalho
d46eeee9b6 docs improved 2025-06-18 13:35:51 +01:00
msramalho
302e6f4258 logs improved 2025-06-18 13:35:43 +01:00
msramalho
76fd329fe5 twitter tests fix 2025-06-17 23:51:03 +01:00
msramalho
a3ae9ebbb3 log level updates 2025-06-17 20:36:33 +01:00
msramalho
23b781c866 new check for edge case 2025-06-17 20:36:22 +01:00
msramalho
2aec240128 thumbnail enricher always run probe by default 2025-06-17 20:28:20 +01:00
msramalho
c5a2fd45f9 log levels updated 2025-06-17 20:04:40 +01:00
msramalho
ad168785e7 retry for Google API 503s 2025-06-17 19:22:09 +01:00
msramalho
74a1561c3d logging and clean up 2025-06-17 19:21:40 +01:00
msramalho
55d9ffaacd typo 2025-06-17 18:51:21 +01:00
msramalho
f19fb575a7 logging updates 2025-06-17 18:50:54 +01:00
msramalho
f53b2075ba fixes gdrive error 2025-06-17 18:45:55 +01:00
msramalho
6085a66c58 revert metadata json renaming 2025-06-17 16:10:24 +01:00
msramalho
33cca734d9 original_url changes still constitute empty result 2025-06-17 16:06:25 +01:00
msramalho
2f1a07abbf renaming and code improvements to json_enricher 2025-06-17 16:06:04 +01:00
msramalho
664ee8d037 fixes bugs and limited configuration of multi-level logs 2025-06-17 14:10:46 +01:00
msramalho
1b260788de do not add commit comments to code 2025-06-17 13:18:12 +01:00