msramalho
a739361e12
bug fix: wacz screenshots leak in shared session
2026-02-23 16:26:36 +00:00
msramalho
2d13077fad
bumping ruff version
2026-02-23 12:36:53 +00:00
msramalho
a09927c507
minor docs fix
2026-02-23 12:18:47 +00:00
msramalho
5fd23baa55
this is ruff
2026-01-08 15:48:08 +00:00
msramalho
096c9d09ef
fix for unexpected types for json.dump
2026-01-08 15:18:19 +00:00
msramalho
a936921c4e
updates new utils file and test
2026-01-08 14:54:06 +00:00
Miguel Sozinho Ramalho
68f672a4fa
Merge branch 'dev' into fix/improve-deleted-post-detection
2026-01-08 14:36:17 +00:00
msramalho
53dc9904ce
refactors PR to obey standard code approach
2026-01-08 14:30:26 +00:00
Miguel Sozinho Ramalho
c1f312d42a
Merge branch 'dev' into specify-medatada-feature
2026-01-08 14:04:42 +00:00
msramalho
23c9dfe717
updating dependencies
2026-01-08 13:53:44 +00:00
m4cd4r4
d02e7e0f02
Add comprehensive deletion detection for removed/unavailable content
Implements issue #335: improve detection of deleted/missing posts
## Changes
### New Deletion Detection System
- Created `deletion_detection.py` utility module with platform-specific
indicators for Twitter, Facebook, Instagram, TikTok, YouTube, Reddit,
VK, and Telegram
- Detects deletion via HTML content, page titles, error messages, and
video metadata
- Stores detailed deletion context (indicator, source, platform) in
metadata for investigators
### Integration Points
- **Antibot Extractor**: Checks HTML and page titles after page load;
resolves TODO about detecting deleted videos
- **Generic Extractor**: Checks yt-dlp video data and error messages
for deletion indicators
- **Twitter Dropin**: Enhanced detection when user/created_at fields
are missing
### Test Coverage
- Comprehensive test suite covering all platforms
- Tests for HTML, title, error message, and metadata detection
- Validates that normal content is not falsely flagged
## Impact for Conflict Documentation
This fix is critical for evidence preservation in war-torn regions:
- Investigators can now document that evidence existed but was deleted
- Prevents wasted archival attempts on deleted content
- Tracks patterns of content removal
- Preserves metadata about what was deleted and when
Twitter example: Detects "Hmm...this page doesn't exist. Try searching
for something else" and flags content as deleted_or_unavailable.
2025-12-17 18:40:58 +08:00
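To illustrate the deletion check described in commit d02e7e0f02, here is a minimal sketch, not the actual contents of `deletion_detection.py`. The table name `DELETION_INDICATORS`, the function `detect_deletion`, and the YouTube phrase are assumptions made for illustration; only the Twitter phrase and the `deleted_or_unavailable` flag come from the commit message.

```python
from typing import Optional

# Hypothetical indicator table. Only the Twitter phrase is taken from the
# commit message; the YouTube entry is an assumed placeholder.
DELETION_INDICATORS = {
    "twitter": ["this page doesn't exist"],
    "youtube": ["video unavailable"],
}


def detect_deletion(text: str, platform: str, source: str = "html") -> Optional[dict]:
    """Return deletion context if `text` contains a known indicator, else None."""
    lowered = text.lower()
    for indicator in DELETION_INDICATORS.get(platform, []):
        if indicator in lowered:
            # Keep the context (indicator, source, platform) so it can be
            # stored in metadata, as the commit message describes.
            return {
                "status": "deleted_or_unavailable",
                "indicator": indicator,
                "source": source,
                "platform": platform,
            }
    return None


page = "Hmm...this page doesn't exist. Try searching for something else"
result = detect_deletion(page, "twitter")
```

Matching is case-insensitive substring search, which is the simplest approach consistent with the description; the real module may match differently.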
msramalho
43cbc6ac56
generic extractor improvements
2025-10-23 09:51:14 +01:00
mgaughan
94e0803fb3
implementing default metadata omission/user metadata selection
2025-09-22 20:16:40 -05:00
msramalho
2081c16555
embed retry into timestamping
2025-07-10 14:49:53 +01:00
msramalho
d3efd7121c
avoid empty metadata comments
2025-07-06 14:05:17 +01:00
msramalho
9d3cd5774b
an improved approach for #295
2025-07-06 14:04:01 +01:00
msramalho
c1506ee1cf
some wayback errors are expected and should be warnings
2025-07-05 18:31:39 +01:00
msramalho
3a34a49822
adds antibot tiktok logic for photos closes #295
2025-07-05 18:31:12 +01:00
msramalho
37c6d97275
new auth wall check logic and escaped CSS selector in selenium
2025-07-05 18:30:31 +01:00
msramalho
7234eda85f
expands Sheets API retries for really large spreadsheets
2025-07-05 18:29:33 +01:00
msramalho
a8c1ef3912
generic_extractor config to use proxy only when needed to avoid overzealousness
2025-07-05 16:54:58 +01:00
msramalho
2051e8e491
adds further exponential backoff for Sheets API worksheet enumeration
2025-07-05 16:02:07 +01:00
msramalho
21255db86a
stops using service that is not up for timestamping
2025-07-05 16:00:46 +01:00
msramalho
eae0da08b3
fix issue with two runs of antibot extractor
2025-07-05 16:00:03 +01:00
msramalho
649412053e
exclude non-ready code
2025-06-30 02:27:21 +01:00
msramalho
b2648fa3cd
follow docs advice on exponential backoff of SheetsAPI
2025-06-30 01:47:12 +01:00
msramalho
4ad71b3589
adds retry to worksheet read for slow worksheets
2025-06-30 01:42:34 +01:00
msramalho
7c9475cde2
allow for human readable console logs, but defaults to JSON on file logs.
2025-06-30 00:53:10 +01:00
msramalho
afd9090a4c
concludes logging standardization refactor
2025-06-26 17:20:04 +01:00
msramalho
ad29cb4447
adds post_data to metadata for instagram
2025-06-26 15:48:10 +01:00
msramalho
ce4d7ac649
WIP refactor logging
2025-06-21 15:54:51 +01:00
msramalho
12b457706b
closes #166 adds story URL feature to telethon extractor
2025-06-18 17:37:44 +01:00
msramalho
592dc30415
closes #330
2025-06-18 16:40:55 +01:00
msramalho
d46eeee9b6
docs improved
2025-06-18 13:35:51 +01:00
msramalho
302e6f4258
logs improved
2025-06-18 13:35:43 +01:00
msramalho
76fd329fe5
twitter tests fix
2025-06-17 23:51:03 +01:00
msramalho
a3ae9ebbb3
log level updates
2025-06-17 20:36:33 +01:00
msramalho
23b781c866
new check for edge case
2025-06-17 20:36:22 +01:00
msramalho
2aec240128
thumbnail enricher always runs probe by default
2025-06-17 20:28:20 +01:00
msramalho
c5a2fd45f9
log levels updated
2025-06-17 20:04:40 +01:00
msramalho
ad168785e7
retry for Google API 503s
2025-06-17 19:22:09 +01:00
msramalho
74a1561c3d
logging and clean up
2025-06-17 19:21:40 +01:00
msramalho
55d9ffaacd
typo
2025-06-17 18:51:21 +01:00
msramalho
f19fb575a7
logging updates
2025-06-17 18:50:54 +01:00
msramalho
f53b2075ba
fixes gdrive error
2025-06-17 18:45:55 +01:00
msramalho
6085a66c58
revert metadata json renaming
2025-06-17 16:10:24 +01:00
msramalho
33cca734d9
original_url changes still constitute empty result
2025-06-17 16:06:25 +01:00
msramalho
2f1a07abbf
renaming and code improvements to json_enricher
2025-06-17 16:06:04 +01:00
msramalho
664ee8d037
fixes bugs and limited configuration of multi-level logs
2025-06-17 14:10:46 +01:00
msramalho
1b260788de
do not add commit comments to code
2025-06-17 13:18:12 +01:00