Miguel Sozinho Ramalho
c1f312d42a
Merge branch 'dev' into specify-medatada-feature
2026-01-08 14:04:42 +00:00
msramalho
23c9dfe717
updating dependencies
2026-01-08 13:53:44 +00:00
m4cd4r4
d02e7e0f02
Add comprehensive deletion detection for removed/unavailable content
Implements issue #335: improve detection of deleted/missing posts
## Changes
### New Deletion Detection System
- Created `deletion_detection.py` utility module with platform-specific
indicators for Twitter, Facebook, Instagram, TikTok, YouTube, Reddit,
VK, and Telegram
- Detects deletion via HTML content, page titles, error messages, and
video metadata
- Stores detailed deletion context (indicator, source, platform) in
metadata for investigators
### Integration Points
- **Antibot Extractor**: Checks HTML and page titles after page load;
resolves TODO about detecting deleted videos
- **Generic Extractor**: Checks yt-dlp video data and error messages
for deletion indicators
- **Twitter Dropin**: Enhanced detection when user/created_at fields
are missing
### Test Coverage
- Comprehensive test suite covering all platforms
- Tests for HTML, title, error message, and metadata detection
- Validates that normal content is not falsely flagged
## Impact for Conflict Documentation
This fix is critical for evidence preservation in war-torn regions:
- Investigators can now document that evidence existed but was deleted
- Prevents wasted archival attempts on deleted content
- Tracks patterns of content removal
- Preserves metadata about what was deleted and when
Twitter example: Detects "Hmm...this page doesn't exist. Try searching
for something else" and flags content as deleted_or_unavailable.
2025-12-17 18:40:58 +08:00
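
The detection flow described in commit d02e7e0f02 can be sketched roughly as follows. This is a minimal illustration, not the contents of `deletion_detection.py`: the names `DELETION_INDICATORS`, `DeletionResult`, and `detect_deletion` are hypothetical, and the YouTube phrase is an assumed placeholder. Only the Twitter phrase and the `deleted_or_unavailable` flag come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

# Status flag named in the commit message.
STATUS_DELETED = "deleted_or_unavailable"

# Hypothetical indicator table. Only the Twitter phrase is taken from the
# commit message; the YouTube phrase is an assumed placeholder.
DELETION_INDICATORS = {
    "twitter": ["this page doesn't exist"],
    "youtube": ["video unavailable"],
}


@dataclass
class DeletionResult:
    """Context kept for investigators: what matched, where, and on which platform."""
    indicator: str
    source: str
    platform: str
    status: str = STATUS_DELETED


def _normalize(text: str) -> str:
    # Lowercase and fold curly apostrophes so phrase matching is forgiving.
    return text.replace("\u2019", "'").lower()


def detect_deletion(platform: str, html: str = "", title: str = "",
                    error: str = "") -> Optional[DeletionResult]:
    """Return a DeletionResult if any known indicator appears, else None."""
    sources = {"html": html, "title": title, "error": error}
    for phrase in DELETION_INDICATORS.get(platform, []):
        for source, text in sources.items():
            if phrase in _normalize(text):
                return DeletionResult(indicator=phrase, source=source, platform=platform)
    return None
```

Returning `None` for unmatched input mirrors the commit's stated goal that normal content is not falsely flagged; a real indicator table would need per-platform tuning.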
Miguel Sozinho Ramalho
56526a9ac7
Merge pull request #365 from bellingcat/dev
Facebook reels fix
v1.1.6
2025-10-23 10:40:43 +01:00
msramalho
3a22cc28c0
skip tiktok antibot test in CI
2025-10-23 10:17:14 +01:00
msramalho
dbb3dfa04f
fixes wikipedia test
2025-10-23 10:04:44 +01:00
msramalho
01bdb35f5d
version bump
2025-10-23 09:51:31 +01:00
msramalho
43cbc6ac56
generic extractor improvements
2025-10-23 09:51:14 +01:00
msramalho
9c7cab1ae2
dependencies update
2025-10-22 21:07:12 +01:00
msramalho
a9a0bae083
dependencies update
v1.1.5
2025-10-22 18:11:36 +01:00
Miguel Sozinho Ramalho
97d133ce79
Merge pull request #357 from bellingcat/dev
small improvements on tiktok and version bumps
v1.1.4
2025-10-22 16:02:26 +01:00
msramalho
432ee3dcfd
version bump
2025-10-22 15:50:50 +01:00
mgaughan
94e0803fb3
implementing default metadata omission/user metadata selection
2025-09-22 20:16:40 -05:00
msramalho
794b4f6052
Merge branch 'dev' of https://github.com/bellingcat/auto-archiver into dev
2025-09-11 15:06:27 +01:00
msramalho
965d7d41dd
dependency updates
2025-09-11 15:06:25 +01:00
Miguel Sozinho Ramalho
e73faa70cc
Merge pull request #352 from mjgaughan/developer-documentation-updates
updating the style-checking code in the documentation
2025-08-11 10:42:53 +01:00
mgaughan
80beab9f23
ruff-fix -> ruff-clean: there is no ruff-fix in the Makefile. Maybe the command should be ruff-fix to align with the underlying ruff command, but that is for later discussion. This at least reconciles the documentation with the Makefile.
2025-08-05 21:36:32 -04:00
Miguel Sozinho Ramalho
200cea4e12
Merge pull request #345 from mjgaughan/main
Correction of small documentation typos
2025-07-29 09:36:10 +01:00
mgaughan
1256fde159
updating location of .env.test.example in documentation
2025-07-23 13:04:48 -04:00
mgaughan
65e222e177
fixing typo in documentation pytest -> poetry
2025-07-22 17:20:59 -04:00
mgaughan
f2eb9ef784
correcting to double-dash in the poetry install documentation
2025-07-21 17:55:48 -04:00
msramalho
2081c16555
embed retry into timestamping
2025-07-10 14:49:53 +01:00
msramalho
d3efd7121c
avoid empty metadata comments
2025-07-06 14:05:17 +01:00
msramalho
9d3cd5774b
an improved approach for #295
2025-07-06 14:04:01 +01:00
Miguel Sozinho Ramalho
80d61e8b85
Merge pull request #341 from bellingcat/dev
Addresses several small bugs; includes TikTok photo extraction and data saving for proxy usage in generic_extractor.
v1.1.2
2025-07-05 20:28:00 +01:00
msramalho
d36cdbfa87
fixing pypaperclip see issue #339
2025-07-05 19:07:23 +01:00
msramalho
c1506ee1cf
some wayback errors are expected and should be warnings
2025-07-05 18:31:39 +01:00
msramalho
3a34a49822
adds antibot tiktok logic for photos closes #295
2025-07-05 18:31:12 +01:00
msramalho
37c6d97275
new auth wall check logic and escaped CSS selector in selenium
2025-07-05 18:30:31 +01:00
msramalho
7234eda85f
expands Sheets API retries for really large spreadsheets
2025-07-05 18:29:33 +01:00
msramalho
a8c1ef3912
generic_extractor config to use proxy only when needed to avoid overzealousness
2025-07-05 16:54:58 +01:00
msramalho
52ed8196a5
updates dependencies
2025-07-05 16:03:47 +01:00
msramalho
2051e8e491
adds further exponential backoff for Sheets API worksheet enumeration
2025-07-05 16:02:07 +01:00
msramalho
21255db86a
stops using service that is not up for timestamping
2025-07-05 16:00:46 +01:00
msramalho
eae0da08b3
fix issue with two runs of antibot extractor
2025-07-05 16:00:03 +01:00
msramalho
0d1447117c
updates docs to reflect new general approach extractor
2025-07-05 15:56:13 +01:00
Miguel Sozinho Ramalho
0f56a5aae5
Merge pull request #331 from bellingcat/dev
1.1.1 multiple small fixes, and new logging strategy
v1.1.1
2025-06-30 02:36:25 +01:00
msramalho
649412053e
exclude non-ready code
2025-06-30 02:27:21 +01:00
msramalho
c2c9718f73
make python api tests work on gh when no env is set
2025-06-30 02:20:51 +01:00
msramalho
30ea8a0ba4
bumps dependencies
2025-06-30 02:20:09 +01:00
msramalho
73c8dc583f
closes #333
2025-06-30 01:52:22 +01:00
msramalho
b2648fa3cd
follow docs advice on exponential backoff of SheetsAPI
2025-06-30 01:47:12 +01:00
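
For readers unfamiliar with the retry pattern that commit b2648fa3cd refers to, here is a minimal sketch of truncated exponential backoff with jitter. It is an illustration under stated assumptions, not the project's code: `retry_with_backoff` and its parameters are hypothetical, and the real change may catch narrower exception types specific to the Sheets client.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=64.0,
                       sleep=time.sleep):
    """Call fn(), retrying on failure with truncated exponential backoff.

    The wait before retry n (0-based) is min(base_delay * 2**n, max_delay)
    plus up to one second of random jitter. The `sleep` parameter is
    injectable so the function can be exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay) + random.uniform(0, 1)
            sleep(delay)
```

Catching `Exception` broadly keeps the sketch short; in practice only rate-limit and transient server errors should be retried.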
msramalho
4ad71b3589
adds retry to worksheet read for slow worksheets
2025-06-30 01:42:34 +01:00
msramalho
7c9475cde2
allow for human readable console logs, but defaults to JSON on file logs.
2025-06-30 00:53:10 +01:00
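
The split described in commit 7c9475cde2 (human-readable console output, JSON file logs by default) could look roughly like the sketch below. It uses Python's standard `logging` module purely for illustration, which may not be the logging library the project uses; `JsonFormatter`, `build_logger`, and the `console_json` flag are hypothetical names.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name, logfile, console_json=False):
    """Console gets plain text unless console_json is set; the file always gets JSON."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    console = logging.StreamHandler(sys.stderr)
    console.setFormatter(
        JsonFormatter() if console_json
        else logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(console)

    file_handler = logging.FileHandler(logfile, encoding="utf-8")
    file_handler.setFormatter(JsonFormatter())
    logger.addHandler(file_handler)
    return logger
```

Keeping the file output machine-readable while leaving the console readable for humans is the design choice the commit message names; everything else here is an assumption.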
msramalho
afd9090a4c
concludes logging standardization refactor
2025-06-26 17:20:04 +01:00
msramalho
ad29cb4447
adds post_data to metadata for instagram
2025-06-26 15:48:10 +01:00
msramalho
ce4d7ac649
WIP refactor logging
2025-06-21 15:54:51 +01:00
msramalho
ade7feb5a0
version bump
2025-06-18 17:38:17 +01:00
msramalho
12b457706b
closes #166 adds story URL feature to telethon extractor
2025-06-18 17:37:44 +01:00
msramalho
592dc30415
closes #330
2025-06-18 16:40:55 +01:00