Compare commits

109 Commits

Author SHA1 Message Date
msramalho
ac4c09810b experimental feature for one-click deployment 2026-03-12 11:47:20 +00:00
msramalho
3194fee95d fix telethon bug when running in celery workers that close the event loop 2026-03-12 10:20:11 +00:00
msramalho
0040810e2e dependencies bump 2026-03-10 14:33:25 +00:00
msramalho
23a88e3cf4 ci issues 2026-03-02 17:07:09 +00:00
msramalho
3cac160cc1 version bump 2026-03-02 17:01:33 +00:00
msramalho
e9a92272c5 bug fix: missing filename on url download 2026-03-02 17:01:16 +00:00
Miguel Sozinho Ramalho
5d6c5ac2b1 Merge pull request #406 from bellingcat/dev
1.2.3
2026-03-02 15:42:08 +00:00
msramalho
f1de07c9aa version bump 2026-03-02 15:41:03 +00:00
msramalho
1e1e060a77 closes #342 2026-03-02 15:37:55 +00:00
msramalho
b43d229326 closes #358 2026-03-02 14:27:48 +00:00
msramalho
077b03fc61 minor tests change to work in gh actions 2026-03-02 14:08:14 +00:00
Miguel Sozinho Ramalho
cf77cfa64d Merge pull request #405 from bellingcat/feat/nitter-alternative
closes #400 Feat twitter drop-in alternative
2026-03-02 12:33:34 +00:00
msramalho
bc66dd4f2a fxtwitter working instead of nitter 2026-03-02 12:31:28 +00:00
msramalho
139d647197 Merge branch 'dev' into feat/nitter-alternative 2026-03-02 12:16:22 +00:00
msramalho
f465b570cd adding missing tests (no download) 2026-03-02 12:14:47 +00:00
Miguel Sozinho Ramalho
52a7cabaf1 Merge pull request #402 from bellingcat/dev
bug fix: wacz screenshots leak in shared session
2026-02-25 10:39:54 +00:00
msramalho
a739361e12 bug fix: wacz screenshots leak in shared session 2026-02-23 16:26:36 +00:00
Miguel Sozinho Ramalho
9a97fede43 Merge pull request #401 from bellingcat/dev
Dependencies maintenance.
2026-02-23 13:27:51 +00:00
msramalho
2d13077fad bumping ruff version 2026-02-23 12:36:53 +00:00
msramalho
8a4a314cf9 ruff python version to dev version 2026-02-23 12:32:24 +00:00
msramalho
75e8b788ae revert ruff workflow changes 2026-02-23 12:31:20 +00:00
msramalho
defe2315bf docs updates 2026-02-23 12:28:25 +00:00
msramalho
b9ab26ed5a see #400 WIP nitter not working as of now 2026-02-23 12:20:10 +00:00
msramalho
ba0dffdd5e Merge branch 'dev' of github.com:bellingcat/auto-archiver into dev 2026-02-23 12:18:58 +00:00
msramalho
a09927c507 minor docs fix 2026-02-23 12:18:47 +00:00
Miguel Sozinho Ramalho
6c938c489a Merge pull request #392 from bellingcat/dependabot/github_actions/actions-bc0df0c757
Bump the actions group with 5 updates
2026-02-23 11:28:24 +00:00
msramalho
0e39768da9 version bumping settings script 2026-02-23 11:27:12 +00:00
msramalho
1e5d6ec4a6 version bump: minor 2026-02-23 11:23:40 +00:00
msramalho
3385d004cf yt-dlp to latest version 2026-02-23 11:23:26 +00:00
msramalho
7f27f7fce0 closes #383 fixing browsertrix-crawler at 1.11.4 2026-02-23 11:23:06 +00:00
msramalho
a6e3240af1 closes #399 and global dependency updates 2026-02-23 11:13:31 +00:00
dependabot[bot]
bf4c196cc2 Bump the actions group with 5 updates
Bumps the actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [docker/login-action](https://github.com/docker/login-action) | `3.4.0` | `3.7.0` |
| [docker/metadata-action](https://github.com/docker/metadata-action) | `5.7.0` | `5.10.0` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/cache](https://github.com/actions/cache) | `4` | `5` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

Updates `docker/login-action` from 3.4.0 to 3.7.0
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](74a5d14239...c94ce9fb46)

Updates `docker/metadata-action` from 5.7.0 to 5.10.0
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](902fa8ec7d...c299e40c65)

Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)

Updates `actions/cache` from 4 to 5
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: docker/login-action
  dependency-version: 3.7.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: docker/metadata-action
  dependency-version: 5.10.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/cache
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-02-01 20:17:43 +00:00
Miguel Sozinho Ramalho
c640cc898a Merge pull request #385 from bellingcat/dev
1.2.0 dependencies, small bugs, 1st time contributors
2026-01-08 15:55:40 +00:00
msramalho
3e2c0b564b wiki fix 2026-01-08 15:49:42 +00:00
msramalho
5fd23baa55 this is ruff 2026-01-08 15:48:08 +00:00
msramalho
8a450310c7 version bump for new release 2026-01-08 15:41:27 +00:00
msramalho
bef8a14089 pyperclip version bump closes #339 2026-01-08 15:40:17 +00:00
msramalho
cd0b093e7a browsertrix-crawler to 1.9.2 see #383 2026-01-08 15:33:40 +00:00
msramalho
096c9d09ef fix for unexpected types for json.dump 2026-01-08 15:18:19 +00:00
Miguel Sozinho Ramalho
df3521e9ca Merge pull request #377 from m4cd4r4/fix/improve-deleted-post-detection
Fix #335: Add comprehensive deletion detection for removed/unavailable content
2026-01-08 15:06:21 +00:00
msramalho
a89d0193e4 removes patch file 2026-01-08 15:02:00 +00:00
msramalho
536cbd905f puts tests file in correct directory 2026-01-08 14:55:40 +00:00
msramalho
a936921c4e updates new utils file and test 2026-01-08 14:54:06 +00:00
Miguel Sozinho Ramalho
68f672a4fa Merge branch 'dev' into fix/improve-deleted-post-detection 2026-01-08 14:36:17 +00:00
Miguel Sozinho Ramalho
4ee0ad1cf8 Merge pull request #359 from mjgaughan/specify-medatada-feature
implementing default metadata omission/user metadata selection
2026-01-08 14:34:50 +00:00
msramalho
bac809451c expands tests to included non predefined metadata keys 2026-01-08 14:33:16 +00:00
msramalho
53dc9904ce refactorws PR to obey standard code approach 2026-01-08 14:30:26 +00:00
Miguel Sozinho Ramalho
c1f312d42a Merge branch 'dev' into specify-medatada-feature 2026-01-08 14:04:42 +00:00
msramalho
23c9dfe717 updating dependencies 2026-01-08 13:53:44 +00:00
m4cd4r4
d02e7e0f02 Add comprehensive deletion detection for removed/unavailable content
Implements issue #335: improve detection of deleted/missing posts

## Changes

### New Deletion Detection System
- Created `deletion_detection.py` utility module with platform-specific
  indicators for Twitter, Facebook, Instagram, TikTok, YouTube, Reddit,
  VK, and Telegram
- Detects deletion via HTML content, page titles, error messages, and
  video metadata
- Stores detailed deletion context (indicator, source, platform) in
  metadata for investigators

### Integration Points
- **Antibot Extractor**: Checks HTML and page titles after page load;
  resolves TODO about detecting deleted videos
- **Generic Extractor**: Checks yt-dlp video data and error messages
  for deletion indicators
- **Twitter Dropin**: Enhanced detection when user/created_at fields
  are missing

### Test Coverage
- Comprehensive test suite covering all platforms
- Tests for HTML, title, error message, and metadata detection
- Validates that normal content is not falsely flagged

## Impact for Conflict Documentation

This fix is critical for evidence preservation in war-torn regions:
- Investigators can now document that evidence existed but was deleted
- Prevents wasted archival attempts on deleted content
- Tracks patterns of content removal
- Preserves metadata about what was deleted and when

Twitter example: Detects "Hmm...this page doesn't exist. Try searching
for something else" and flags content as deleted_or_unavailable.
2025-12-17 18:40:58 +08:00
Miguel Sozinho Ramalho
56526a9ac7 Merge pull request #365 from bellingcat/dev
Facebook reels fix
2025-10-23 10:40:43 +01:00
msramalho
3a22cc28c0 skip tiktok antibot test in CI 2025-10-23 10:17:14 +01:00
msramalho
dbb3dfa04f fixes wikipedia test 2025-10-23 10:04:44 +01:00
msramalho
01bdb35f5d version bump 2025-10-23 09:51:31 +01:00
msramalho
43cbc6ac56 generic extractor improvements 2025-10-23 09:51:14 +01:00
msramalho
9c7cab1ae2 dependencies update 2025-10-22 21:07:12 +01:00
msramalho
a9a0bae083 dependencies update 2025-10-22 18:11:36 +01:00
Miguel Sozinho Ramalho
97d133ce79 Merge pull request #357 from bellingcat/dev
small improvements on tiktok and verison bumps
2025-10-22 16:02:26 +01:00
msramalho
432ee3dcfd version bump 2025-10-22 15:50:50 +01:00
mgaughan
94e0803fb3 implementing default metadata omission/user metadata selection 2025-09-22 20:16:40 -05:00
msramalho
794b4f6052 Merge branch 'dev' of https://github.com/bellingcat/auto-archiver into dev 2025-09-11 15:06:27 +01:00
msramalho
965d7d41dd dependency updates 2025-09-11 15:06:25 +01:00
Miguel Sozinho Ramalho
e73faa70cc Merge pull request #352 from mjgaughan/developer-documentation-updates
updating the style-checking code in the documentation
2025-08-11 10:42:53 +01:00
mgaughan
80beab9f23 ruff-fix -> ruff-clean; there is no ruff-fix in the Makefile. Maybe the command /should/ be ruff-fix to align with the underlying ruff command; for later discussion. This at least reconciles the documentation to the Makefile 2025-08-05 21:36:32 -04:00
Miguel Sozinho Ramalho
200cea4e12 Merge pull request #345 from mjgaughan/main
Correction of small documentation typos
2025-07-29 09:36:10 +01:00
mgaughan
1256fde159 updating location of .env.test.example in documentation 2025-07-23 13:04:48 -04:00
mgaughan
65e222e177 fixing typo in documentation pytest -> poetry 2025-07-22 17:20:59 -04:00
mgaughan
f2eb9ef784 correcting to double-dash in the poetry install documentation 2025-07-21 17:55:48 -04:00
msramalho
2081c16555 embed retry into timestamping 2025-07-10 14:49:53 +01:00
msramalho
d3efd7121c avoid empty metadata comments 2025-07-06 14:05:17 +01:00
msramalho
9d3cd5774b an improved approach for #295 2025-07-06 14:04:01 +01:00
Miguel Sozinho Ramalho
80d61e8b85 Merge pull request #341 from bellingcat/dev
Address several small bugs, includes tiktok photos extraction, and data-saving for proxy usage in generic_extractor.
2025-07-05 20:28:00 +01:00
msramalho
d36cdbfa87 fixing pypaperclip see issue #339 2025-07-05 19:07:23 +01:00
msramalho
c1506ee1cf some wayback errors are expected and should be warnings 2025-07-05 18:31:39 +01:00
msramalho
3a34a49822 adds antibot tiktok logic for photos closes #295 2025-07-05 18:31:12 +01:00
msramalho
37c6d97275 new auth wall check logic and escaped CSS selector in selenium 2025-07-05 18:30:31 +01:00
msramalho
7234eda85f expands Sheets API retries for really large spreadsheets 2025-07-05 18:29:33 +01:00
msramalho
a8c1ef3912 generic_extractor config to use proxy only when needed to avoid overzealousness 2025-07-05 16:54:58 +01:00
msramalho
52ed8196a5 updates dependencies 2025-07-05 16:03:47 +01:00
msramalho
2051e8e491 adds further exponential backoff for Sheets API worksheet enumeration 2025-07-05 16:02:07 +01:00
msramalho
21255db86a stops using service that is not up for timestamping 2025-07-05 16:00:46 +01:00
msramalho
eae0da08b3 fix issue with two runs of anitbot extractor 2025-07-05 16:00:03 +01:00
msramalho
0d1447117c updates docs to reflect new general approach extractor 2025-07-05 15:56:13 +01:00
Miguel Sozinho Ramalho
0f56a5aae5 Merge pull request #331 from bellingcat/dev
1.1.1 multiple small fixes, and new logging strategy
2025-06-30 02:36:25 +01:00
msramalho
649412053e exclude non-ready code 2025-06-30 02:27:21 +01:00
msramalho
c2c9718f73 make python api tests work on gh when no env is set 2025-06-30 02:20:51 +01:00
msramalho
30ea8a0ba4 bumps dependencies 2025-06-30 02:20:09 +01:00
msramalho
73c8dc583f closes #333 2025-06-30 01:52:22 +01:00
msramalho
b2648fa3cd follow docs advice on exponential backoff of SheetsAPI 2025-06-30 01:47:12 +01:00
msramalho
4ad71b3589 adds retry to worksheet read for slow worksheets 2025-06-30 01:42:34 +01:00
msramalho
7c9475cde2 allow for human readable console logs, but defaults to JSON on file logs. 2025-06-30 00:53:10 +01:00
msramalho
afd9090a4c concludes logging standardization refactor 2025-06-26 17:20:04 +01:00
msramalho
ad29cb4447 adds post_data to metadata for instagram 2025-06-26 15:48:10 +01:00
msramalho
ce4d7ac649 WIP refactor logging 2025-06-21 15:54:51 +01:00
msramalho
ade7feb5a0 version bump 2025-06-18 17:38:17 +01:00
msramalho
12b457706b closes #166 adds story URL feature to telethon extractor 2025-06-18 17:37:44 +01:00
msramalho
592dc30415 closes #330 2025-06-18 16:40:55 +01:00
msramalho
4a36e6f6b0 fix tests 2025-06-18 13:50:21 +01:00
msramalho
d46eeee9b6 docs improved 2025-06-18 13:35:51 +01:00
msramalho
302e6f4258 logs improved 2025-06-18 13:35:43 +01:00
Miguel Sozinho Ramalho
e803c5d0e3 Merge branch 'main' into dev 2025-06-18 13:35:21 +01:00
msramalho
e1d0314a9e Merge branch 'dev' of https://github.com/bellingcat/auto-archiver into dev 2025-06-18 13:26:48 +01:00
Miguel Sozinho Ramalho
5d5119e053 Merge pull request #329 from bellingcat/dev
installs ffmpeg in readthedocs
2025-06-18 00:31:09 +01:00
msramalho
d6c90d87f1 installs ffmpeg in readthedocs 2025-06-18 00:30:45 +01:00
msramalho
212bf67ab1 installs ffmpeg in readthedocs 2025-06-18 00:29:36 +01:00
Miguel Sozinho Ramalho
6abe2edb13 Merge pull request #328 from bellingcat/dev
fix to configuration editor npm versions
2025-06-18 00:22:39 +01:00
msramalho
03c0cf09ae fix issue with grid in scripts/config_editor @mui lib upgrade 2025-06-18 00:20:31 +01:00
Miguel Sozinho Ramalho
0db77c7e68 Merge pull request #326 from bellingcat/dependabot/npm_and_yarn/scripts/settings/actions-27795ad889
Bump @types/react from 19.1.7 to 19.1.8 in /scripts/settings in the actions group across 1 directory
2025-06-18 00:12:51 +01:00
dependabot[bot]
cd6607943d Bump @types/react
Bumps the actions group with 1 update in the /scripts/settings directory: [@types/react](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react).


Updates `@types/react` from 19.1.7 to 19.1.8
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react)

---
updated-dependencies:
- dependency-name: "@types/react"
  dependency-version: 19.1.8
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-06-17 22:58:23 +00:00
121 changed files with 6391 additions and 2466 deletions
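
Commit d02e7e0f02 above describes the deletion detection only in prose. The following is a minimal sketch of that idea, not the code from `deletion_detection.py`: the `detect_deletion` helper, the `DELETION_INDICATORS` table, and the shape of the returned dict are all hypothetical, and only the Twitter phrase and the `deleted_or_unavailable` flag are taken from the commit message.

```python
from typing import Dict, List, Optional

# Hypothetical indicator table. Only the Twitter phrase is quoted in the
# commit message; a real table would cover the other listed platforms.
DELETION_INDICATORS: Dict[str, List[str]] = {
    "twitter": ["this page doesn't exist"],
}


def detect_deletion(platform: str, html: str = "", title: str = "", error: str = "") -> Optional[dict]:
    """Return deletion context if any text source contains a known indicator, else None."""
    sources = {"html": html, "title": title, "error": error}
    for indicator in DELETION_INDICATORS.get(platform, []):
        for source, text in sources.items():
            # Case-insensitive substring match against each text source.
            if indicator.lower() in text.lower():
                return {
                    "status": "deleted_or_unavailable",
                    "indicator": indicator,
                    "source": source,
                    "platform": platform,
                }
    return None
```

Returning the matched indicator and its source alongside the status corresponds to the "detailed deletion context" the commit message says is stored in metadata; content that matches no indicator yields `None`, so normal pages are not flagged.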


@@ -22,7 +22,7 @@ jobs:
runs-on: ubuntu-latest runs-on: ubuntu-latest
steps: steps:
- name: Check out the repo - name: Check out the repo
uses: actions/checkout@v4 uses: actions/checkout@v6
- name: Set up QEMU - name: Set up QEMU
uses: docker/setup-qemu-action@v3 uses: docker/setup-qemu-action@v3
@@ -33,14 +33,14 @@ jobs:
uses: docker/setup-buildx-action@v3 uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub - name: Log in to Docker Hub
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9
with: with:
username: ${{ secrets.DOCKER_USERNAME }} username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }} password: ${{ secrets.DOCKER_PASSWORD }}
- name: Extract metadata (tags, labels) for Docker - name: Extract metadata (tags, labels) for Docker
id: meta id: meta
uses: docker/metadata-action@902fa8ec7d6ecbf8d84d538b9b233a880e428804 uses: docker/metadata-action@c299e40c65443455700f0fdfc63efafe5b349051
with: with:
images: bellingcat/auto-archiver images: bellingcat/auto-archiver


@@ -22,10 +22,10 @@ jobs:
steps: steps:
- name: Checkout Repository - name: Checkout Repository
uses: actions/checkout@v4 uses: actions/checkout@v6
- name: Set up Python - name: Set up Python
uses: actions/setup-python@v5 uses: actions/setup-python@v6
with: with:
python-version-file: pyproject.toml python-version-file: pyproject.toml


@@ -20,11 +20,11 @@ jobs:
build: build:
runs-on: ubuntu-latest runs-on: ubuntu-latest
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v6
- name: Install Python - name: Install Python
uses: actions/setup-python@v5 uses: actions/setup-python@v6
with: with:
python-version: "3.11" python-version: "3.12"
- name: Install dependencies - name: Install dependencies
run: | run: |
python -m pip install --upgrade pip python -m pip install --upgrade pip


@@ -26,13 +26,13 @@ jobs:
working-directory: ./ working-directory: ./
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v6
- name: Install ffmpeg - name: Install ffmpeg
run: sudo apt-get update && sudo apt-get install -y ffmpeg run: sudo apt-get update && sudo apt-get install -y ffmpeg
- name: Set up Python ${{ matrix.python-version }} - name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5 uses: actions/setup-python@v6
with: with:
python-version: ${{ matrix.python-version }} python-version: ${{ matrix.python-version }}
@@ -40,7 +40,7 @@ jobs:
run: pipx install poetry run: pipx install poetry
- name: Cache Poetry and pip artifacts - name: Cache Poetry and pip artifacts
uses: actions/cache@v4 uses: actions/cache@v5
with: with:
path: | path: |
~/.cache/pypoetry ~/.cache/pypoetry

.github/workflows/tests-deploy.yaml

@@ -0,0 +1,29 @@
name: Deploy Tests

on:
  push:
    branches: [ main ]
    paths:
      - deploy/**
  pull_request:
    paths:
      - deploy/**

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Set up Python 3.12
        uses: actions/setup-python@v6
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: pip install pytest fastapi httpx python-multipart pyyaml

      - name: Run Deploy Tests
        working-directory: deploy
        run: python -m pytest tests/ -v


@@ -20,13 +20,13 @@ jobs:
working-directory: ./ working-directory: ./
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v6
- name: Install ffmpeg - name: Install ffmpeg
run: sudo apt-get update && sudo apt-get install -y ffmpeg run: sudo apt-get update && sudo apt-get install -y ffmpeg
- name: Set up Python ${{ matrix.python-version }} - name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5 uses: actions/setup-python@v6
with: with:
python-version: ${{ matrix.python-version }} python-version: ${{ matrix.python-version }}
@@ -34,7 +34,7 @@ jobs:
run: pipx install poetry run: pipx install poetry
- name: Cache Poetry and pip artifacts - name: Cache Poetry and pip artifacts
uses: actions/cache@v4 uses: actions/cache@v5
with: with:
path: | path: |
~/.cache/pypoetry ~/.cache/pypoetry
@@ -47,4 +47,4 @@ jobs:
- name: Run Download Tests - name: Run Download Tests
run: poetry run pytest -ra -v -x -m "download" run: poetry run pytest -ra -v -x -m "download"
env: env:
TWITTER_BEARER_TOKEN: ${{ secrets.TWITTER_BEARER_TOKEN }} TWITTER_BEARER_TOKEN: ${{ secrets.TWITTER_BEARER_TOKEN || '' }}


@@ -7,6 +7,8 @@ version: 2
build: build:
os: ubuntu-22.04 os: ubuntu-22.04
apt_packages:
- ffmpeg
tools: tools:
python: "3.10" python: "3.10"
nodejs: "22" nodejs: "22"


@@ -1,4 +1,4 @@
FROM webrecorder/browsertrix-crawler:1.6.3 AS base FROM webrecorder/browsertrix-crawler:1.11.4 AS base
ENV RUNNING_IN_DOCKER=1 \ ENV RUNNING_IN_DOCKER=1 \
LANG=C.UTF-8 \ LANG=C.UTF-8 \
@@ -41,11 +41,21 @@ COPY ./src/ .
RUN /poetry-venv/bin/poetry install --only main --no-cache RUN /poetry-venv/bin/poetry install --only main --no-cache
# Run as non-root user to avoid permission issues with mounted volumes (see #342)
# The base image already has an 'ubuntu' user at UID/GID 1000.
# Ensure directories that need write access at runtime are writable.
RUN chown 1000:1000 /app && \
chown -R 1000:1000 /app/.venv/lib/python3.12/site-packages/seleniumbase/drivers/ && \
mkdir -p /app/local_archive /app/secrets /tmp/archive && \
chown -R 1000:1000 /app/local_archive /app/secrets /tmp/archive
# Update PATH to include virtual environment binaries # Update PATH to include virtual environment binaries
# Allowing entry point to run the application directly with Python # Allowing entry point to run the application directly with Python
ENV VIRTUAL_ENV=/app/.venv \ ENV VIRTUAL_ENV=/app/.venv \
PATH="/app/.venv/bin:$PATH" PATH="/app/.venv/bin:$PATH"
USER 1000
ENTRYPOINT ["python3", "-m", "auto_archiver"] ENTRYPOINT ["python3", "-m", "auto_archiver"]
# should be executed with 2 volumes (3 if local_storage is used) # should be executed with 2 volumes (3 if local_storage is used)


@@ -22,7 +22,40 @@ Auto Archiver is a Python tool to automatically archive content on the web in a
Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/). Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/).
## Installation ## One-Click Cloud Deploy
Deploy your own Auto Archiver instance to the cloud — no coding required:
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/new/template?template=https://github.com/bellingcat/auto-archiver&envs=AUTH_PASSWORD,GSHEET_URL,GOOGLE_SERVICE_ACCOUNT_JSON,POLL_INTERVAL,S3_BUCKET,S3_KEY,S3_SECRET,S3_REGION,TELEGRAM_API_ID,TELEGRAM_API_HASH,TELEGRAM_BOT_TOKEN,ENABLE_SCREENSHOTS,LOG_LEVEL&optionalEnvs=GSHEET_URL,GOOGLE_SERVICE_ACCOUNT_JSON,POLL_INTERVAL,S3_BUCKET,S3_KEY,S3_SECRET,S3_REGION,TELEGRAM_API_ID,TELEGRAM_API_HASH,TELEGRAM_BOT_TOKEN,ENABLE_SCREENSHOTS,LOG_LEVEL&AUTH_PASSWORDDesc=Password+to+access+your+archiver+web+interface&GSHEET_URLDesc=Google+Sheet+URL+to+monitor+for+new+URLs+(leave+empty+to+disable)&POLL_INTERVALDesc=Seconds+between+Google+Sheet+checks+(min+60)&POLL_INTERVALDefault=300&S3_BUCKETDesc=S3+bucket+name+for+storage+(leave+empty+for+local+only)&S3_REGIONDefault=us-east-1&LOG_LEVELDefault=INFO)
**What you get:** A web interface where you can paste URLs and archive them instantly. Optionally connect a Google Sheet for automated monitoring, S3 for cloud storage, and Telegram for archiving channels.
**Only required setting:** `AUTH_PASSWORD` — everything else is optional and can be configured later via the Railway dashboard.
<details>
<summary>📋 Environment variables reference</summary>
| Variable | Required | Description |
|----------|----------|-------------|
| `AUTH_PASSWORD` | **Yes** | Password to access the web interface |
| `GSHEET_URL` | No | Google Sheet URL to monitor for new URLs ([use this template](https://docs.google.com/spreadsheets/d/1NJZo_XZUBKTI1Ghlgi4nTPVvCfb0HXAs6j5tNGas72k/edit?gid=0#gid=0)) |
| `GOOGLE_SERVICE_ACCOUNT_JSON` | No | Google service account JSON, required when using Sheets ([follow these instructions](https://auto-archiver.readthedocs.io/en/v1.0.1/how_to/gsheets_setup.html)) |
| `POLL_INTERVAL` | No | Seconds between Sheet checks (default: 300) |
| `S3_BUCKET` | No | S3 bucket name for archived content. Optional, but recommended for cloud-hosted archives; any S3-compatible storage works |
| `S3_KEY` / `S3_SECRET` | No | S3 credentials |
| `S3_REGION` | No | S3 region (default: us-east-1) |
| `S3_ENDPOINT` | No | S3 endpoint URL |
| `TELEGRAM_API_ID` / `TELEGRAM_API_HASH` | No | Telegram API credentials |
| `TELEGRAM_BOT_TOKEN` | No | Telegram bot token |
| `ENABLE_SCREENSHOTS` | No | Set to `true` for full-page screenshots |
| `ENABLE_THUMBNAILS` | No | Set to `true` for video thumbnails |
| `ENABLE_CSV_DB` | No | Set to `true` for CSV logging |
| `LOG_LEVEL` | No | DEBUG, INFO, WARNING, ERROR (default: INFO) |
</details>
## Traditional Installation
View the [Installation Guide](https://auto-archiver.readthedocs.io/en/latest/installation/installation.html) for full instructions View the [Installation Guide](https://auto-archiver.readthedocs.io/en/latest/installation/installation.html) for full instructions

deploy/Dockerfile

@@ -0,0 +1,34 @@
# ── Cloud Deploy ──────────────────────────────────────────────────────
# Thin web UI + config generator layer on top of the published
# auto-archiver Docker image. Used by the Railway one-click deploy.
#
# Build:
# docker build -f deploy/Dockerfile -t auto-archiver-deploy .
#
# Run:
# docker run -p 8080:8080 -e PORT=8080 -e AUTH_PASSWORD=secret auto-archiver-deploy
# ──────────────────────────────────────────────────────────────────────
FROM bellingcat/auto-archiver:latest
USER root
# Install the lightweight web layer dependencies
RUN pip install --no-cache-dir fastapi uvicorn[standard] python-multipart pyyaml
# Copy deploy scripts into the image
COPY deploy/ /app/deploy/
# Ensure writable dirs exist
RUN mkdir -p /app/local_archive /app/secrets && \
chown -R 1000:1000 /app/local_archive /app/secrets /app/deploy
USER 1000
# Railway sets PORT; default to 8080
ENV PORT=8080
EXPOSE ${PORT}
# Override the CLI entrypoint with the web server
ENTRYPOINT ["python3", "-m", "deploy.start"]

deploy/__init__.py

@@ -0,0 +1 @@
# Cloud deployment layer for auto-archiver

deploy/generate_config.py

@@ -0,0 +1,163 @@
#!/usr/bin/env python3
"""
Generates orchestration.yaml from environment variables.

This script bridges Railway's env-var-based configuration with
auto-archiver's YAML-based configuration system. It runs at container
startup before the web UI server starts.
"""

import os
from pathlib import Path

import yaml

CONFIG_PATH = Path("/app/secrets/orchestration.yaml")
SECRETS_DIR = Path("/app/secrets")


def build_config() -> dict:
    """Build an orchestration config dict from environment variables."""
    # -- Base config: always present ------------------------------------
    config = {
        "steps": {
            "feeders": ["cli_feeder"],
            "extractors": ["generic_extractor"],
            "enrichers": ["hash_enricher"],
            "databases": ["console_db"],
            "storages": ["local_storage"],
            "formatters": ["html_formatter"],
        },
        "logging": {
            "level": os.environ.get("LOG_LEVEL", "INFO"),
        },
        "local_storage": {
            "save_to": "/app/local_archive",
            "path_generator": "flat",
            "filename_generator": "static",
        },
        "generic_extractor": {
            "subtitles": os.environ.get("SUBTITLES", "false").lower() == "true",
            "comments": False,
            "livestreams": False,
            "live_from_start": False,
            "end_means_success": True,
            "allow_playlist": False,
        },
        "hash_enricher": {
            "algorithm": "SHA-256",
        },
        "html_formatter": {
            "detect_thumbnails": True,
        },
        "authentication": {},
    }

    # -- Google Sheets feeder (optional) --------------------------------
    gsheet_url = os.environ.get("GSHEET_URL", "")
    if gsheet_url:
        config["steps"]["feeders"].append("gsheet_feeder")
        config["steps"]["databases"].append("gsheet_db")
        config["gsheet_feeder"] = {
            "sheet": gsheet_url,
            "header": 1,
            "service_account": str(SECRETS_DIR / "service_account.json"),
            "use_sheet_names_in_stored_paths": False,
            "columns": {
                "url": "link",
                "status": "archive status",
                "folder": "destination folder",
                "archive": "archive location",
                "date": "archive date",
                "thumbnail": "thumbnail",
                "timestamp": "upload timestamp",
                "title": "upload title",
                "text": "textual content",
                "screenshot": "screenshot",
                "hash": "hash",
                "pdq_hash": "perceptual hashes",
            },
        }

    # -- Google service account JSON (optional) -------------------------
    sa_json = os.environ.get("GOOGLE_SERVICE_ACCOUNT_JSON", "")
    if sa_json:
        SECRETS_DIR.mkdir(parents=True, exist_ok=True)
        sa_path = SECRETS_DIR / "service_account.json"
        sa_path.write_text(sa_json)
        print(f"[deploy] Wrote Google service account to {sa_path}")

    # -- S3 storage (optional) ------------------------------------------
    s3_bucket = os.environ.get("S3_BUCKET", "")
    if s3_bucket:
        config["steps"]["storages"].append("s3_storage")
        config["s3_storage"] = {
            "bucket": s3_bucket,
            "region": os.environ.get("S3_REGION", "us-east-1"),
            "key": os.environ.get("S3_KEY", ""),
            "secret": os.environ.get("S3_SECRET", ""),
            "endpoint_url": os.environ.get("S3_ENDPOINT", "https://s3.{region}.amazonaws.com"),
            "cdn_url": os.environ.get(
                "S3_CDN_URL",
                "https://{bucket}.s3.{region}.amazonaws.com/{key}",
            ),
            "private": os.environ.get("S3_PRIVATE", "false").lower() == "true",
            "random_no_duplicate": True,
            "key_path": "random",
        }

    # -- Telegram extractor (optional) ----------------------------------
    tg_api_id = os.environ.get("TELEGRAM_API_ID", "")
    tg_api_hash = os.environ.get("TELEGRAM_API_HASH", "")
    if tg_api_id and tg_api_hash:
        config["steps"]["extractors"].append("telegram_extractor")
        config["telegram_extractor"] = {
            "api_id": tg_api_id,
            "api_hash": tg_api_hash,
        }
        bot_token = os.environ.get("TELEGRAM_BOT_TOKEN", "")
        if bot_token:
            config["telegram_extractor"]["bot_token"] = bot_token

    # -- Screenshot enricher (optional) ---------------------------------
    if os.environ.get("ENABLE_SCREENSHOTS", "").lower() == "true":
        config["steps"]["enrichers"].append("screenshot_enricher")
        config["screenshot_enricher"] = {
            "width": 1280,
            "height": 7200,
            "save_to_pdf": True,
        }

    # -- Thumbnail enricher (optional) ----------------------------------
    if os.environ.get("ENABLE_THUMBNAILS", "").lower() == "true":
        config["steps"]["enrichers"].append("thumbnail_enricher")
        config["thumbnail_enricher"] = {
            "thumbnails_per_minute": 60,
            "max_thumbnails": 16,
        }

    # -- CSV database (optional) ----------------------------------------
    if os.environ.get("ENABLE_CSV_DB", "").lower() == "true":
        config["steps"]["databases"].append("csv_db")
        config["csv_db"] = {
            "csv_file": "/app/local_archive/db.csv",
        }

    return config


def main():
    config = build_config()
    CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(CONFIG_PATH, "w") as f:
        yaml.dump(config, f, default_flow_style=False, sort_keys=False)
    print(f"[deploy] Generated config at {CONFIG_PATH}")
    print(f"[deploy] Active steps: {config['steps']}")


if __name__ == "__main__":
    main()
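
The mapping from environment variables to active pipeline steps in `build_config()` can be condensed into a small standalone sketch. This is an illustration under stated assumptions, not part of the diff: `enabled_steps` and `BOOLEAN_FLAGS` are hypothetical names, the function takes a plain mapping instead of reading `os.environ`, and it reproduces only the `steps` section, not the per-module settings.

```python
from typing import Dict, List, Mapping

# Hypothetical table: flag name -> (step kind, step name), mirroring the
# ENABLE_* checks in build_config().
BOOLEAN_FLAGS = {
    "ENABLE_SCREENSHOTS": ("enrichers", "screenshot_enricher"),
    "ENABLE_THUMBNAILS": ("enrichers", "thumbnail_enricher"),
    "ENABLE_CSV_DB": ("databases", "csv_db"),
}


def enabled_steps(env: Mapping[str, str]) -> Dict[str, List[str]]:
    """Return the steps that would be active for a given set of env vars."""
    # Base steps are always present, as in build_config().
    steps = {
        "feeders": ["cli_feeder"],
        "extractors": ["generic_extractor"],
        "enrichers": ["hash_enricher"],
        "databases": ["console_db"],
        "storages": ["local_storage"],
        "formatters": ["html_formatter"],
    }
    if env.get("GSHEET_URL"):
        steps["feeders"].append("gsheet_feeder")
        steps["databases"].append("gsheet_db")
    if env.get("S3_BUCKET"):
        steps["storages"].append("s3_storage")
    # Telegram needs both credentials; one alone leaves the extractor off.
    if env.get("TELEGRAM_API_ID") and env.get("TELEGRAM_API_HASH"):
        steps["extractors"].append("telegram_extractor")
    # Boolean flags are compared case-insensitively against "true".
    for flag, (kind, step) in BOOLEAN_FLAGS.items():
        if env.get(flag, "").lower() == "true":
            steps[kind].append(step)
    return steps
```

Calling `enabled_steps({"ENABLE_SCREENSHOTS": "true"})` adds `screenshot_enricher` to the enrichers and leaves every other list at its base value. Passing the environment in as an argument is a choice made here so the sketch can be exercised without patching `os.environ`.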

deploy/gsheet_poller.py

@@ -0,0 +1,71 @@
#!/usr/bin/env python3
"""
Background Google Sheets poller for auto-archiver cloud deployments.
When GSHEET_URL is set, periodically runs auto-archiver with gsheet_feeder
to check for new URLs in the configured spreadsheet. Runs as a daemon thread
alongside the web UI.
"""
import logging
import os
import subprocess
import threading
import time
logger = logging.getLogger("gsheet_poller")
CONFIG_PATH = "/app/secrets/orchestration.yaml"
def _poll_once():
"""Run auto-archiver once to process any new rows in the Google Sheet."""
logger.info("Polling Google Sheet for new URLs...")
try:
result = subprocess.run(
["python3", "-m", "auto_archiver", "--config", CONFIG_PATH],
capture_output=True,
text=True,
cwd="/app",
timeout=600, # 10 minute timeout per poll
)
if result.returncode == 0:
logger.info("Sheet poll completed successfully.")
else:
logger.warning("Sheet poll exited with code %d: %s", result.returncode, result.stderr[-500:])
except subprocess.TimeoutExpired:
logger.error("Sheet poll timed out after 600s")
except Exception:
logger.exception("Sheet poll failed")
def _poll_loop(interval: int):
    """Run the poll loop at the given interval (seconds)."""
    logger.info("Google Sheets poller started (interval=%ds)", interval)
    while True:
        _poll_once()
        time.sleep(interval)

def start_poller():
    """
    Start the Google Sheets poller as a daemon thread if GSHEET_URL is set.
    Call this once at application startup.
    """
    gsheet_url = os.environ.get("GSHEET_URL", "")
    if not gsheet_url:
        logger.info("GSHEET_URL not set; Sheet poller disabled.")
        return
    interval = int(os.environ.get("POLL_INTERVAL", "300"))
    if interval < 60:
        interval = 60  # minimum 1 minute
    thread = threading.Thread(
        target=_poll_loop,
        args=(interval,),
        daemon=True,
        name="gsheet-poller",
    )
    thread.start()
    logger.info("Google Sheets poller thread started.")
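
Review note: `start_poller()` reads `POLL_INTERVAL` with a bare `int()` call, so a non-numeric value would raise `ValueError` at startup. One possible hardening is sketched below, assuming that falling back to the default is preferable to failing. `parse_interval` is a hypothetical name, not something in this diff; the 300 s default and 60 s floor are taken from the code above.

```python
def parse_interval(raw, default: int = 300, minimum: int = 60) -> int:
    """Parse a poll interval in seconds, falling back to `default` on bad input.

    Hypothetical helper; start_poller() in this diff calls int() directly.
    """
    try:
        interval = int(raw)
    except (TypeError, ValueError):
        # Unset (None) or non-numeric input: use the default instead of crashing.
        interval = default
    # Clamp to the minimum so a tiny value cannot hammer the Sheets API.
    return max(interval, minimum)


print(parse_interval("600"))  # 600
print(parse_interval("10"))   # 60
print(parse_interval("abc"))  # 300
```

If failing fast on a misconfigured interval is the intended behaviour, the current code is fine as it stands and this helper is unnecessary.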

2
deploy/pytest.ini Normal file

@@ -0,0 +1,2 @@
[pytest]
testpaths = tests

37
deploy/start.py Normal file

@@ -0,0 +1,37 @@
#!/usr/bin/env python3
"""
Startup entrypoint for cloud deployments.
1. Generates orchestration.yaml from environment variables
2. Starts the Google Sheets poller (if GSHEET_URL is set)
3. Starts the FastAPI web UI
"""
import os
import logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
)
# Generate config from env vars
from deploy.generate_config import main as generate_config # noqa: E402
generate_config()
# Start gsheet poller (no-op if GSHEET_URL not set)
from deploy.gsheet_poller import start_poller # noqa: E402
start_poller()
# Start web server
import uvicorn # noqa: E402
port = int(os.environ.get("PORT", "8080"))
uvicorn.run(
"deploy.web_ui:app",
host="0.0.0.0",
port=port,
log_level="info",
)

0
deploy/tests/__init__.py Normal file
View File

View File

@@ -0,0 +1,354 @@
"""Tests for deploy/generate_config.py: config generation from env vars."""
import json
import os
from unittest.mock import patch
import yaml
from deploy.generate_config import build_config, main
# ── Helpers ───────────────────────────────────────────────────────────
def _env(**overrides):
    """Return a clean env dict with only the given overrides (no leak from host)."""
    # Clear all deploy-relevant env vars, then apply overrides
    deploy_vars = [
        "LOG_LEVEL",
        "SUBTITLES",
        "GSHEET_URL",
        "GOOGLE_SERVICE_ACCOUNT_JSON",
        "S3_BUCKET",
        "S3_KEY",
        "S3_SECRET",
        "S3_REGION",
        "S3_ENDPOINT",
        "S3_CDN_URL",
        "S3_PRIVATE",
        "TELEGRAM_API_ID",
        "TELEGRAM_API_HASH",
        "TELEGRAM_BOT_TOKEN",
        "ENABLE_SCREENSHOTS",
        "ENABLE_THUMBNAILS",
        "ENABLE_CSV_DB",
    ]
    clean = {k: v for k, v in os.environ.items() if k not in deploy_vars}
    clean.update(overrides)
    return clean

# ── Base config (no optional env vars) ────────────────────────────────
class TestBaseConfig:
"""When no optional env vars are set, build_config returns a minimal working config."""
def test_base_steps(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
steps = cfg["steps"]
assert steps["feeders"] == ["cli_feeder"]
assert steps["extractors"] == ["generic_extractor"]
assert steps["enrichers"] == ["hash_enricher"]
assert steps["databases"] == ["console_db"]
assert steps["storages"] == ["local_storage"]
assert steps["formatters"] == ["html_formatter"]
def test_base_has_required_module_configs(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
assert "local_storage" in cfg
assert "generic_extractor" in cfg
assert "hash_enricher" in cfg
assert "html_formatter" in cfg
def test_default_log_level_is_info(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
assert cfg["logging"]["level"] == "INFO"
def test_custom_log_level(self):
with patch.dict(os.environ, _env(LOG_LEVEL="DEBUG"), clear=True):
cfg = build_config()
assert cfg["logging"]["level"] == "DEBUG"
def test_authentication_present_and_empty(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
assert cfg["authentication"] == {}
def test_local_storage_defaults(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
ls = cfg["local_storage"]
assert ls["save_to"] == "/app/local_archive"
assert ls["path_generator"] == "flat"
assert ls["filename_generator"] == "static"
def test_subtitles_default_false(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
assert cfg["generic_extractor"]["subtitles"] is False
def test_subtitles_enabled(self):
with patch.dict(os.environ, _env(SUBTITLES="true"), clear=True):
cfg = build_config()
assert cfg["generic_extractor"]["subtitles"] is True
def test_subtitles_case_insensitive(self):
with patch.dict(os.environ, _env(SUBTITLES="True"), clear=True):
cfg = build_config()
assert cfg["generic_extractor"]["subtitles"] is True
def test_no_optional_modules_present(self):
"""Ensure optional modules don't appear when their env vars are absent."""
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
assert "gsheet_feeder" not in cfg
assert "s3_storage" not in cfg
assert "telegram_extractor" not in cfg
assert "screenshot_enricher" not in cfg
assert "thumbnail_enricher" not in cfg
assert "csv_db" not in cfg
def test_config_is_valid_yaml(self):
"""The output dict should round-trip through YAML cleanly."""
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
dumped = yaml.dump(cfg)
reloaded = yaml.safe_load(dumped)
assert reloaded == cfg
# ── Google Sheets ─────────────────────────────────────────────────────
class TestGSheetConfig:
def test_gsheet_adds_feeder_and_db(self):
with patch.dict(os.environ, _env(GSHEET_URL="https://docs.google.com/spreadsheets/d/abc"), clear=True):
cfg = build_config()
assert "gsheet_feeder" in cfg["steps"]["feeders"]
assert "gsheet_db" in cfg["steps"]["databases"]
def test_gsheet_feeder_config(self):
url = "https://docs.google.com/spreadsheets/d/abc123"
with patch.dict(os.environ, _env(GSHEET_URL=url), clear=True):
cfg = build_config()
gf = cfg["gsheet_feeder"]
assert gf["sheet"] == url
assert gf["header"] == 1
assert "service_account" in gf
assert gf["columns"]["url"] == "link"
assert gf["columns"]["status"] == "archive status"
def test_gsheet_preserves_cli_feeder(self):
"""cli_feeder should still be present even when gsheet is added."""
with patch.dict(os.environ, _env(GSHEET_URL="https://example.com/sheet"), clear=True):
cfg = build_config()
assert "cli_feeder" in cfg["steps"]["feeders"]
def test_service_account_json_written(self, tmp_path):
"""When GOOGLE_SERVICE_ACCOUNT_JSON is set, it writes the file."""
sa_data = json.dumps({"type": "service_account", "project_id": "test"})
secrets_dir = tmp_path / "secrets"
with (
patch.dict(os.environ, _env(GOOGLE_SERVICE_ACCOUNT_JSON=sa_data), clear=True),
patch("deploy.generate_config.SECRETS_DIR", secrets_dir),
):
build_config()
sa_path = secrets_dir / "service_account.json"
assert sa_path.exists()
assert json.loads(sa_path.read_text())["project_id"] == "test"
# ── S3 storage ────────────────────────────────────────────────────────
class TestS3Config:
def test_s3_adds_storage(self):
with patch.dict(os.environ, _env(S3_BUCKET="my-bucket"), clear=True):
cfg = build_config()
assert "s3_storage" in cfg["steps"]["storages"]
assert "local_storage" in cfg["steps"]["storages"] # local still there
def test_s3_config_values(self):
env = _env(
S3_BUCKET="my-bucket",
S3_KEY="AKID",
S3_SECRET="shhh",
S3_REGION="eu-west-1",
)
with patch.dict(os.environ, env, clear=True):
cfg = build_config()
s3 = cfg["s3_storage"]
assert s3["bucket"] == "my-bucket"
assert s3["key"] == "AKID"
assert s3["secret"] == "shhh"
assert s3["region"] == "eu-west-1"
assert s3["private"] is False
assert s3["random_no_duplicate"] is True
def test_s3_defaults(self):
with patch.dict(os.environ, _env(S3_BUCKET="b"), clear=True):
cfg = build_config()
s3 = cfg["s3_storage"]
assert s3["region"] == "us-east-1"
assert "{region}" in s3["endpoint_url"]
def test_s3_private_flag(self):
with patch.dict(os.environ, _env(S3_BUCKET="b", S3_PRIVATE="true"), clear=True):
cfg = build_config()
assert cfg["s3_storage"]["private"] is True
def test_s3_custom_endpoint(self):
endpoint = "https://nyc3.digitaloceanspaces.com"
with patch.dict(os.environ, _env(S3_BUCKET="b", S3_ENDPOINT=endpoint), clear=True):
cfg = build_config()
assert cfg["s3_storage"]["endpoint_url"] == endpoint
# ── Telegram ──────────────────────────────────────────────────────────
class TestTelegramConfig:
def test_telegram_added_when_both_set(self):
env = _env(TELEGRAM_API_ID="12345", TELEGRAM_API_HASH="abc")
with patch.dict(os.environ, env, clear=True):
cfg = build_config()
assert "telegram_extractor" in cfg["steps"]["extractors"]
assert cfg["telegram_extractor"]["api_id"] == "12345"
assert cfg["telegram_extractor"]["api_hash"] == "abc"
def test_telegram_not_added_if_only_id(self):
with patch.dict(os.environ, _env(TELEGRAM_API_ID="12345"), clear=True):
cfg = build_config()
assert "telegram_extractor" not in cfg["steps"]["extractors"]
def test_telegram_not_added_if_only_hash(self):
with patch.dict(os.environ, _env(TELEGRAM_API_HASH="abc"), clear=True):
cfg = build_config()
assert "telegram_extractor" not in cfg["steps"]["extractors"]
def test_telegram_bot_token_optional(self):
env = _env(TELEGRAM_API_ID="12345", TELEGRAM_API_HASH="abc", TELEGRAM_BOT_TOKEN="bot:tok")
with patch.dict(os.environ, env, clear=True):
cfg = build_config()
assert cfg["telegram_extractor"]["bot_token"] == "bot:tok"
def test_telegram_no_bot_token(self):
env = _env(TELEGRAM_API_ID="12345", TELEGRAM_API_HASH="abc")
with patch.dict(os.environ, env, clear=True):
cfg = build_config()
assert "bot_token" not in cfg["telegram_extractor"]
# ── Optional enrichers / databases ────────────────────────────────────
class TestOptionalModules:
def test_screenshots_disabled_by_default(self):
with patch.dict(os.environ, _env(), clear=True):
cfg = build_config()
assert "screenshot_enricher" not in cfg["steps"]["enrichers"]
def test_screenshots_enabled(self):
with patch.dict(os.environ, _env(ENABLE_SCREENSHOTS="true"), clear=True):
cfg = build_config()
assert "screenshot_enricher" in cfg["steps"]["enrichers"]
assert cfg["screenshot_enricher"]["width"] == 1280
def test_thumbnails_enabled(self):
with patch.dict(os.environ, _env(ENABLE_THUMBNAILS="true"), clear=True):
cfg = build_config()
assert "thumbnail_enricher" in cfg["steps"]["enrichers"]
assert cfg["thumbnail_enricher"]["max_thumbnails"] == 16
def test_csv_db_enabled(self):
with patch.dict(os.environ, _env(ENABLE_CSV_DB="true"), clear=True):
cfg = build_config()
assert "csv_db" in cfg["steps"]["databases"]
assert cfg["csv_db"]["csv_file"] == "/app/local_archive/db.csv"
def test_case_insensitive_boolean(self):
with patch.dict(os.environ, _env(ENABLE_SCREENSHOTS="TRUE"), clear=True):
cfg = build_config()
assert "screenshot_enricher" in cfg["steps"]["enrichers"]
# ── Combined / full config ────────────────────────────────────────────
class TestCombinedConfig:
def test_all_optional_modules_together(self):
"""Enable everything at once and verify no conflicts."""
env = _env(
GSHEET_URL="https://example.com/sheet",
S3_BUCKET="bucket",
S3_KEY="key",
S3_SECRET="secret",
TELEGRAM_API_ID="123",
TELEGRAM_API_HASH="abc",
TELEGRAM_BOT_TOKEN="tok",
ENABLE_SCREENSHOTS="true",
ENABLE_THUMBNAILS="true",
ENABLE_CSV_DB="true",
)
with patch.dict(os.environ, env, clear=True):
cfg = build_config()
steps = cfg["steps"]
assert "gsheet_feeder" in steps["feeders"]
assert "telegram_extractor" in steps["extractors"]
assert "screenshot_enricher" in steps["enrichers"]
assert "thumbnail_enricher" in steps["enrichers"]
assert "csv_db" in steps["databases"]
assert "gsheet_db" in steps["databases"]
assert "s3_storage" in steps["storages"]
assert "local_storage" in steps["storages"]
# All module configs present
for key in [
"gsheet_feeder",
"s3_storage",
"telegram_extractor",
"screenshot_enricher",
"thumbnail_enricher",
"csv_db",
]:
assert key in cfg, f"{key} config missing"
def test_full_config_valid_yaml(self):
env = _env(
GSHEET_URL="https://example.com/sheet",
S3_BUCKET="bucket",
TELEGRAM_API_ID="123",
TELEGRAM_API_HASH="abc",
ENABLE_SCREENSHOTS="true",
ENABLE_CSV_DB="true",
)
with patch.dict(os.environ, env, clear=True):
cfg = build_config()
dumped = yaml.dump(cfg)
reloaded = yaml.safe_load(dumped)
assert reloaded == cfg
# ── main() writes file ───────────────────────────────────────────────
class TestMainFunction:
def test_main_writes_config_file(self, tmp_path):
config_path = tmp_path / "orchestration.yaml"
with patch.dict(os.environ, _env(), clear=True), patch("deploy.generate_config.CONFIG_PATH", config_path):
main()
assert config_path.exists()
cfg = yaml.safe_load(config_path.read_text())
assert cfg["steps"]["feeders"] == ["cli_feeder"]
def test_main_creates_parent_dirs(self, tmp_path):
config_path = tmp_path / "nested" / "dir" / "orchestration.yaml"
with patch.dict(os.environ, _env(), clear=True), patch("deploy.generate_config.CONFIG_PATH", config_path):
main()
assert config_path.exists()

View File

@@ -0,0 +1,124 @@
"""Tests for deploy/gsheet_poller.py: background Google Sheets polling."""
import os
from unittest.mock import patch, MagicMock
from deploy.gsheet_poller import start_poller, _poll_once
# ── start_poller ──────────────────────────────────────────────────────
class TestStartPoller:
    def test_disabled_when_no_gsheet_url(self):
        """No thread should be started when GSHEET_URL is empty."""
        with (
            patch.dict(os.environ, {"GSHEET_URL": ""}, clear=False),
            patch("deploy.gsheet_poller.threading.Thread") as mock_thread,
        ):
            start_poller()
            mock_thread.assert_not_called()

    def test_disabled_when_gsheet_url_absent(self):
        env = {k: v for k, v in os.environ.items() if k != "GSHEET_URL"}
        with patch.dict(os.environ, env, clear=True), patch("deploy.gsheet_poller.threading.Thread") as mock_thread:
            start_poller()
            mock_thread.assert_not_called()

    def test_starts_thread_when_gsheet_url_set(self):
        with (
            patch.dict(os.environ, {"GSHEET_URL": "https://example.com/sheet"}, clear=False),
            patch("deploy.gsheet_poller.threading.Thread") as mock_thread,
        ):
            mock_instance = MagicMock()
            mock_thread.return_value = mock_instance
            start_poller()
            mock_thread.assert_called_once()
            assert mock_thread.call_args.kwargs["daemon"] is True
            assert mock_thread.call_args.kwargs["name"] == "gsheet-poller"
            mock_instance.start.assert_called_once()

    def test_default_interval_300(self):
        env = {"GSHEET_URL": "https://example.com/sheet"}
        # Remove POLL_INTERVAL if present
        clean_env = {k: v for k, v in os.environ.items() if k != "POLL_INTERVAL"}
        clean_env.update(env)
        with (
            patch.dict(os.environ, clean_env, clear=True),
            patch("deploy.gsheet_poller.threading.Thread") as mock_thread,
        ):
            mock_thread.return_value = MagicMock()
            start_poller()
            # interval is passed to _poll_loop through the Thread "args" keyword
            assert mock_thread.call_args.kwargs["args"] == (300,)

    def test_custom_interval(self):
        with (
            patch.dict(os.environ, {"GSHEET_URL": "x", "POLL_INTERVAL": "600"}, clear=False),
            patch("deploy.gsheet_poller.threading.Thread") as mock_thread,
        ):
            mock_thread.return_value = MagicMock()
            start_poller()
            assert mock_thread.call_args.kwargs["args"] == (600,)

    def test_interval_minimum_enforced(self):
        """Intervals below 60 should be clamped to 60."""
        with (
            patch.dict(os.environ, {"GSHEET_URL": "x", "POLL_INTERVAL": "10"}, clear=False),
            patch("deploy.gsheet_poller.threading.Thread") as mock_thread,
        ):
            mock_thread.return_value = MagicMock()
            start_poller()
            assert mock_thread.call_args.kwargs["args"] == (60,)

# ── _poll_once ────────────────────────────────────────────────────────
class TestPollOnce:
def test_calls_subprocess_with_config(self):
with patch("deploy.gsheet_poller.subprocess.run") as mock_run:
mock_run.return_value = MagicMock(returncode=0, stderr="")
_poll_once()
mock_run.assert_called_once()
cmd = mock_run.call_args[0][0]
assert "auto_archiver" in " ".join(cmd)
assert "--config" in cmd
def test_handles_nonzero_exit(self):
"""Should not raise on non-zero exit, just log a warning."""
with patch("deploy.gsheet_poller.subprocess.run") as mock_run:
mock_run.return_value = MagicMock(returncode=1, stderr="some error")
_poll_once() # should not raise
def test_handles_timeout(self):
"""Should not raise on timeout, just log."""
import subprocess
with patch("deploy.gsheet_poller.subprocess.run") as mock_run:
mock_run.side_effect = subprocess.TimeoutExpired(cmd="test", timeout=600)
_poll_once() # should not raise
def test_handles_exception(self):
"""Should not raise on arbitrary exceptions."""
with patch("deploy.gsheet_poller.subprocess.run") as mock_run:
mock_run.side_effect = OSError("broken")
_poll_once() # should not raise
def test_uses_correct_config_path(self):
with patch("deploy.gsheet_poller.subprocess.run") as mock_run:
mock_run.return_value = MagicMock(returncode=0, stderr="")
_poll_once()
cmd = mock_run.call_args[0][0]
config_idx = cmd.index("--config")
assert cmd[config_idx + 1] == "/app/secrets/orchestration.yaml"
def test_timeout_set(self):
with patch("deploy.gsheet_poller.subprocess.run") as mock_run:
mock_run.return_value = MagicMock(returncode=0, stderr="")
_poll_once()
assert mock_run.call_args[1]["timeout"] == 600

310
deploy/tests/test_web_ui.py Normal file

@@ -0,0 +1,310 @@
"""Tests for deploy/web_ui.py: FastAPI web interface."""
from unittest.mock import patch, AsyncMock
import pytest
from fastapi.testclient import TestClient
# ── Fixtures ──────────────────────────────────────────────────────────
@pytest.fixture(autouse=True)
def _reset_state():
    """Reset in-memory state between tests."""
    import deploy.web_ui as mod

    mod._valid_sessions.clear()
    mod._jobs.clear()
    yield
    mod._valid_sessions.clear()
    mod._jobs.clear()


@pytest.fixture
def client_no_auth():
    """Test client with auth disabled (no AUTH_PASSWORD)."""
    with patch("deploy.web_ui.AUTH_PASSWORD", ""):
        from deploy.web_ui import app

        yield TestClient(app, raise_server_exceptions=False)


@pytest.fixture
def client_with_auth():
    """Test client with auth enabled."""
    with patch("deploy.web_ui.AUTH_PASSWORD", "secret123"):
        from deploy.web_ui import app

        yield TestClient(app, raise_server_exceptions=False)


def _login(client, password="secret123"):
    """Helper: log in and return the session cookie."""
    resp = client.post("/login", data={"password": password}, follow_redirects=False)
    return resp.cookies.get("aa_session")

# ── Health check ──────────────────────────────────────────────────────
class TestHealthCheck:
def test_status_returns_ok(self, client_no_auth):
resp = client_no_auth.get("/status")
assert resp.status_code == 200
assert resp.json() == {"status": "ok"}
def test_status_no_auth_required(self, client_with_auth):
resp = client_with_auth.get("/status")
assert resp.status_code == 200
assert resp.json() == {"status": "ok"}
# ── Auth disabled ─────────────────────────────────────────────────────
class TestNoAuth:
def test_index_accessible(self, client_no_auth):
resp = client_no_auth.get("/")
assert resp.status_code == 200
assert "Auto Archiver" in resp.text
def test_login_page_redirects_to_index(self, client_no_auth):
resp = client_no_auth.get("/login", follow_redirects=False)
assert resp.status_code == 302
assert resp.headers["location"] == "/"
def test_login_post_redirects_to_index(self, client_no_auth):
resp = client_no_auth.post("/login", data={"password": "anything"}, follow_redirects=False)
assert resp.status_code == 302
def test_no_logout_link_shown(self, client_no_auth):
resp = client_no_auth.get("/")
assert "Logout" not in resp.text
# ── Auth enabled ──────────────────────────────────────────────────────
class TestAuth:
def test_index_redirects_to_login(self, client_with_auth):
resp = client_with_auth.get("/", follow_redirects=False)
assert resp.status_code == 307
assert resp.headers["location"] == "/login"
def test_login_page_renders(self, client_with_auth):
resp = client_with_auth.get("/login")
assert resp.status_code == 200
assert "Password" in resp.text
def test_wrong_password_returns_401(self, client_with_auth):
resp = client_with_auth.post("/login", data={"password": "wrong"})
assert resp.status_code == 401
assert "Wrong password" in resp.text
def test_correct_password_sets_cookie(self, client_with_auth):
resp = client_with_auth.post("/login", data={"password": "secret123"}, follow_redirects=False)
assert resp.status_code == 302
assert "aa_session" in resp.cookies
def test_authenticated_access(self, client_with_auth):
cookie = _login(client_with_auth)
client_with_auth.cookies.set("aa_session", cookie)
resp = client_with_auth.get("/")
assert resp.status_code == 200
assert "Auto Archiver" in resp.text
def test_logout_clears_session(self, client_with_auth):
cookie = _login(client_with_auth)
client_with_auth.cookies.set("aa_session", cookie)
resp = client_with_auth.get("/logout", follow_redirects=False)
assert resp.status_code == 302
# After logout, index should redirect to login again
client_with_auth.cookies.clear()
resp = client_with_auth.get("/", follow_redirects=False)
assert resp.status_code == 307
def test_logout_link_shown_when_auth_enabled(self, client_with_auth):
cookie = _login(client_with_auth)
client_with_auth.cookies.set("aa_session", cookie)
resp = client_with_auth.get("/")
assert "Logout" in resp.text
def test_results_requires_auth(self, client_with_auth):
resp = client_with_auth.get("/results", follow_redirects=False)
assert resp.status_code == 307
def test_invalid_session_rejected(self, client_with_auth):
client_with_auth.cookies.set("aa_session", "bogus-token")
resp = client_with_auth.get("/", follow_redirects=False)
assert resp.status_code == 307
# ── Archive submission ────────────────────────────────────────────────
class TestArchive:
def test_archive_creates_job(self, client_no_auth):
with patch("deploy.web_ui._run_archive", new_callable=AsyncMock):
resp = client_no_auth.post(
"/archive",
data={"urls": "https://example.com\nhttps://example.org"},
follow_redirects=False,
)
assert resp.status_code == 303
assert resp.headers["location"] == "/"
from deploy.web_ui import _jobs
assert len(_jobs) == 1
assert _jobs[0]["urls"] == ["https://example.com", "https://example.org"]
assert _jobs[0]["status"] == "running"
def test_archive_empty_urls_returns_400(self, client_no_auth):
resp = client_no_auth.post("/archive", data={"urls": " \n \n"})
assert resp.status_code == 400
def test_archive_strips_whitespace(self, client_no_auth):
with patch("deploy.web_ui._run_archive", new_callable=AsyncMock):
client_no_auth.post(
"/archive",
data={"urls": " https://example.com \n\n https://example.org \n"},
follow_redirects=False,
)
from deploy.web_ui import _jobs
assert _jobs[0]["urls"] == ["https://example.com", "https://example.org"]
def test_archive_requires_auth(self, client_with_auth):
resp = client_with_auth.post(
"/archive",
data={"urls": "https://example.com"},
follow_redirects=False,
)
assert resp.status_code == 307
# ── Results page ──────────────────────────────────────────────────────
class TestResults:
def test_results_empty(self, client_no_auth, tmp_path):
with patch("deploy.web_ui.ARCHIVE_DIR", tmp_path):
resp = client_no_auth.get("/results")
assert resp.status_code == 200
assert "No archived files yet" in resp.text
def test_results_lists_files(self, client_no_auth, tmp_path):
(tmp_path / "test.html").write_text("<html>archived</html>")
(tmp_path / "video.mp4").write_bytes(b"\x00" * 10)
with patch("deploy.web_ui.ARCHIVE_DIR", tmp_path):
resp = client_no_auth.get("/results")
assert resp.status_code == 200
assert "test.html" in resp.text
assert "video.mp4" in resp.text
def test_results_nonexistent_dir(self, client_no_auth, tmp_path):
with patch("deploy.web_ui.ARCHIVE_DIR", tmp_path / "nonexistent"):
resp = client_no_auth.get("/results")
assert resp.status_code == 200
assert "No archived files yet" in resp.text
# ── File serving ──────────────────────────────────────────────────────
class TestFileServing:
def test_serve_existing_file(self, client_no_auth, tmp_path):
(tmp_path / "report.html").write_text("<html>done</html>")
with patch("deploy.web_ui.ARCHIVE_DIR", tmp_path):
resp = client_no_auth.get("/files/report.html")
assert resp.status_code == 200
def test_serve_nonexistent_file(self, client_no_auth, tmp_path):
with patch("deploy.web_ui.ARCHIVE_DIR", tmp_path):
resp = client_no_auth.get("/files/nope.txt")
assert resp.status_code == 404
def test_path_traversal_blocked(self, client_no_auth, tmp_path):
# Create a file outside the archive dir
outside = tmp_path / "outside"
outside.mkdir()
(outside / "secret.txt").write_text("secret")
archive = tmp_path / "archive"
archive.mkdir()
# Symlink into archive pointing outside
(archive / "escape").symlink_to(outside / "secret.txt")
with patch("deploy.web_ui.ARCHIVE_DIR", archive):
resp = client_no_auth.get("/files/escape")
assert resp.status_code == 403
# ── Job rendering ─────────────────────────────────────────────────────
class TestJobRendering:
def test_no_jobs_shows_message(self, client_no_auth):
resp = client_no_auth.get("/")
assert "No archiving jobs yet" in resp.text
def test_jobs_shown_in_table(self, client_no_auth):
from deploy.web_ui import _jobs
_jobs.append(
{
"id": 1,
"urls": ["https://example.com"],
"status": "done",
"started": "2026-01-01 00:00 UTC",
"output": "",
}
)
resp = client_no_auth.get("/")
assert "example.com" in resp.text
assert "done" in resp.text
def test_many_urls_truncated(self, client_no_auth):
from deploy.web_ui import _jobs
_jobs.append(
{
"id": 1,
"urls": [f"https://example.com/{i}" for i in range(10)],
"status": "running",
"started": "2026-01-01 00:00 UTC",
"output": "",
}
)
resp = client_no_auth.get("/")
assert "+7 more" in resp.text
# ── HTML template rendering ──────────────────────────────────────────
class TestTemplates:
"""Verify HTML templates can be .format()-ed without KeyError."""
def test_login_html_renders(self):
from deploy.web_ui import LOGIN_HTML
result = LOGIN_HTML.format(error="")
assert "Auto Archiver" in result
def test_login_html_renders_with_error(self):
from deploy.web_ui import LOGIN_HTML
result = LOGIN_HTML.format(error='<p class="err">Nope</p>')
assert "Nope" in result
def test_main_html_renders(self):
from deploy.web_ui import MAIN_HTML
result = MAIN_HTML.format(logout="", jobs_html="")
assert "Auto Archiver" in result
def test_results_html_renders(self):
from deploy.web_ui import RESULTS_HTML
result = RESULTS_HTML.format(file_list="<p>empty</p>")
assert "Archived Files" in result
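
Review note: `test_path_traversal_blocked` in this file depends on the server resolving symlinks before comparing a path against the archive root. A minimal standalone sketch of that containment check follows, using only `pathlib`. `is_within` is a hypothetical name for illustration; `deploy/web_ui.py` performs the equivalent check inline.

```python
import tempfile
from pathlib import Path


def is_within(root: Path, candidate: Path) -> bool:
    """Return True if `candidate`, after resolving symlinks and "..", lies under `root`."""
    try:
        candidate.resolve().relative_to(root.resolve())
    except ValueError:
        # relative_to() raises ValueError when the path is not below root.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive"
    archive.mkdir()
    secret = Path(tmp) / "secret.txt"
    secret.write_text("secret")
    (archive / "ok.txt").write_text("fine")
    (archive / "escape").symlink_to(secret)  # symlink pointing outside the archive

    print(is_within(archive, archive / "ok.txt"))             # True
    print(is_within(archive, archive / "escape"))             # False
    print(is_within(archive, archive / ".." / "secret.txt"))  # False
```

Resolving both sides matters: comparing unresolved paths would accept the `escape` symlink, because its literal path sits inside the archive directory.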

269
deploy/web_ui.py Normal file

@@ -0,0 +1,269 @@
#!/usr/bin/env python3
"""
Minimal web UI for auto-archiver cloud deployments.
Provides:
- GET / → HTML form to submit URLs for archiving
- POST /archive → Runs auto-archiver on submitted URLs
- GET /results → Lists archived files available for download
- GET /files/{path} → Serves archived files
- GET /status → Health check
"""
import asyncio
import html
import os
import secrets
from datetime import datetime, timezone
from pathlib import Path
from fastapi import Depends, FastAPI, Form, HTTPException, Request, status
from fastapi.responses import FileResponse, HTMLResponse, RedirectResponse
AUTH_PASSWORD = os.environ.get("AUTH_PASSWORD", "")
ARCHIVE_DIR = Path("/app/local_archive")
CONFIG_PATH = Path("/app/secrets/orchestration.yaml")
COOKIE_NAME = "aa_session"
# In-memory session tokens (reset on restart, which is fine for this use case)
_valid_sessions: set[str] = set()
# In-memory job log
_jobs: list[dict] = []
app = FastAPI(title="Auto Archiver", docs_url=None, redoc_url=None)
# ── Auth helpers ──────────────────────────────────────────────────────
def _check_auth(request: Request):
    """Dependency: redirect to /login if auth is enabled and session is missing."""
    if not AUTH_PASSWORD:
        return  # auth disabled
    token = request.cookies.get(COOKIE_NAME, "")
    if token not in _valid_sessions:
        raise HTTPException(
            status_code=status.HTTP_307_TEMPORARY_REDIRECT,
            headers={"Location": "/login"},
        )

# ── Pages ─────────────────────────────────────────────────────────────
LOGIN_HTML = """<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1">
<title>Auto Archiver Login</title>
<style>
body {{ font-family: system-ui, sans-serif; max-width: 420px; margin: 80px auto; padding: 0 1rem; }}
h1 {{ font-size: 1.4rem; }}
input[type=password], button {{ font-size: 1rem; padding: .5rem .8rem; }}
input[type=password] {{ width: 100%; box-sizing: border-box; margin: .5rem 0; }}
button {{ cursor: pointer; background: #2563eb; color: #fff; border: none; border-radius: 4px; }}
.err {{ color: #dc2626; }}
</style></head><body>
<h1>🔐 Auto Archiver</h1>
<form method="POST" action="/login">
<label>Password<br><input type="password" name="password" autofocus required></label><br>
<button type="submit">Log in</button>
{error}
</form></body></html>"""
MAIN_HTML = """<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1">
<title>Auto Archiver</title>
<style>
body {{ font-family: system-ui, sans-serif; max-width: 700px; margin: 2rem auto; padding: 0 1rem; line-height: 1.6; }}
h1 {{ font-size: 1.5rem; }}
textarea {{ width: 100%; box-sizing: border-box; font-size: .95rem; font-family: monospace; }}
button {{ font-size: 1rem; padding: .5rem 1.2rem; cursor: pointer; background: #2563eb; color: #fff; border: none; border-radius: 4px; margin-top: .5rem; }}
table {{ border-collapse: collapse; width: 100%; margin-top: 1rem; }}
th, td {{ border: 1px solid #e5e7eb; padding: .4rem .6rem; text-align: left; font-size: .9rem; }}
th {{ background: #f9fafb; }}
.status {{ padding: 2px 8px; border-radius: 4px; font-size: .85rem; }}
.running {{ background: #fef3c7; color: #92400e; }}
.done {{ background: #d1fae5; color: #065f46; }}
.failed {{ background: #fee2e2; color: #991b1b; }}
a {{ color: #2563eb; }}
.info {{ color: #6b7280; font-size: .9rem; }}
nav {{ display: flex; gap: 1rem; align-items: center; }}
nav a {{ text-decoration: none; }}
</style></head><body>
<nav>
<h1>📦 Auto Archiver</h1>
<a href="/results">Browse files</a>
{logout}
</nav>
<form method="POST" action="/archive">
<label for="urls"><strong>URLs to archive</strong> (one per line)</label><br>
<textarea id="urls" name="urls" rows="5" placeholder="https://example.com/post&#10;https://youtube.com/watch?v=..." required></textarea><br>
<button type="submit">Archive</button>
</form>
{jobs_html}
</body></html>"""
RESULTS_HTML = """<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1">
<title>Auto Archiver Files</title>
<style>
body {{ font-family: system-ui, sans-serif; max-width: 700px; margin: 2rem auto; padding: 0 1rem; }}
h1 {{ font-size: 1.4rem; }}
a {{ color: #2563eb; }}
li {{ margin: .3rem 0; font-family: monospace; font-size: .9rem; }}
</style></head><body>
<h1>📁 Archived Files</h1>
<p><a href="/">← Back</a></p>
{file_list}
</body></html>"""
# ── Routes ────────────────────────────────────────────────────────────
@app.get("/login", response_class=HTMLResponse)
async def login_page():
    if not AUTH_PASSWORD:
        return RedirectResponse("/", status_code=302)
    return LOGIN_HTML.format(error="")


@app.post("/login")
async def login_submit(password: str = Form(...)):
    if not AUTH_PASSWORD:
        return RedirectResponse("/", status_code=302)
    # Constant-time comparison so response timing does not leak password contents
    if not secrets.compare_digest(password.encode(), AUTH_PASSWORD.encode()):
        return HTMLResponse(
            LOGIN_HTML.format(error='<p class="err">Wrong password.</p>'),
            status_code=401,
        )
    token = secrets.token_urlsafe(32)
    _valid_sessions.add(token)
    resp = RedirectResponse("/", status_code=302)
    resp.set_cookie(COOKIE_NAME, token, httponly=True, samesite="lax", max_age=86400 * 30)
    return resp

@app.get("/", response_class=HTMLResponse)
async def index(request: Request, _=Depends(_check_auth)):
    logout = '<a href="/logout">Logout</a>' if AUTH_PASSWORD else ""
    jobs_html = _render_jobs()
    return MAIN_HTML.format(logout=logout, jobs_html=jobs_html)

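# Strong references to background tasks; the event loop only keeps weak ones,
# so an unreferenced task can be garbage-collected before it finishes.
_background_tasks: set[asyncio.Task] = set()


@app.post("/archive")
async def archive(request: Request, urls: str = Form(...), _=Depends(_check_auth)):
    url_list = [u.strip() for u in urls.strip().splitlines() if u.strip()]
    if not url_list:
        raise HTTPException(400, "No URLs provided")
    job = {
        "id": len(_jobs) + 1,
        "urls": url_list,
        "status": "running",
        "started": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC"),
        "output": "",
    }
    _jobs.insert(0, job)
    # Run in background so the user sees the page immediately
    task = asyncio.create_task(_run_archive(job))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return RedirectResponse("/", status_code=303)
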
@app.get("/results", response_class=HTMLResponse)
async def results(request: Request, _=Depends(_check_auth)):
    from urllib.parse import quote  # percent-encodes file names used in hrefs

    if not ARCHIVE_DIR.exists():
        return RESULTS_HTML.format(file_list="<p>No archived files yet.</p>")
    files = sorted(ARCHIVE_DIR.rglob("*"), key=lambda p: p.stat().st_mtime, reverse=True)
    files = [f for f in files if f.is_file()]
    if not files:
        return RESULTS_HTML.format(file_list="<p>No archived files yet.</p>")
    items = []
    for f in files[:200]:  # cap listing
        rel = f.relative_to(ARCHIVE_DIR)
        # Quote the href so names containing quotes, spaces, '#' or '?' cannot break the link or the markup
        items.append(f'<li><a href="/files/{quote(rel.as_posix())}">{html.escape(str(rel))}</a></li>')
    return RESULTS_HTML.format(file_list="<ul>" + "\n".join(items) + "</ul>")


@app.get("/files/{path:path}")
async def serve_file(path: str, request: Request, _=Depends(_check_auth)):
    full = ARCHIVE_DIR / path
    # Security: ensure the resolved path is within ARCHIVE_DIR before reporting
    # whether the file exists, so outside paths cannot be probed
    try:
        full.resolve().relative_to(ARCHIVE_DIR.resolve())
    except ValueError:
        raise HTTPException(403, "Forbidden")
    if not full.is_file():
        raise HTTPException(404, "File not found")
    return FileResponse(full)
@app.get("/status")
async def health():
    return {"status": "ok"}


@app.get("/logout")
async def logout(request: Request):
    token = request.cookies.get(COOKIE_NAME, "")
    _valid_sessions.discard(token)
    resp = RedirectResponse("/login", status_code=302)
    resp.delete_cookie(COOKIE_NAME)
    return resp
# ── Helpers ───────────────────────────────────────────────────────────
async def _run_archive(job: dict):
    """Run auto-archiver as a subprocess for the given URLs."""
    cmd = [
        "python3",
        "-m",
        "auto_archiver",
        "--config",
        str(CONFIG_PATH),
    ] + job["urls"]
    try:
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.STDOUT,
            cwd="/app",
        )
        stdout, _ = await proc.communicate()
        job["output"] = stdout.decode(errors="replace")[-5000:]  # keep last 5k chars
        job["status"] = "done" if proc.returncode == 0 else "failed"
    except Exception as e:
        job["output"] = str(e)
        job["status"] = "failed"
def _render_jobs() -> str:
    if not _jobs:
        return '<p class="info">No archiving jobs yet. Submit URLs above to get started.</p>'
    rows = []
    for j in _jobs[:50]:
        urls_str = html.escape(", ".join(j["urls"][:3]))
        if len(j["urls"]) > 3:
            urls_str += f" (+{len(j['urls']) - 3} more)"
        status_cls = j["status"]
        rows.append(
            f"<tr><td>{j['id']}</td>"
            f"<td>{urls_str}</td>"
            f'<td><span class="status {status_cls}">{j["status"]}</span></td>'
            f"<td>{j['started']}</td></tr>"
        )
    return (
        "<h2>Recent Jobs</h2>"
        "<table><thead><tr><th>#</th><th>URLs</th><th>Status</th><th>Started</th></tr></thead>"
        "<tbody>" + "\n".join(rows) + "</tbody></table>"
    )
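
The containment check used when serving archived files can be tried in isolation. The following is a minimal sketch, assuming a POSIX-style filesystem; `resolve_inside` and the temporary `archive` directory are hypothetical names introduced for illustration and are not part of the deploy script.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_inside(base: Path, requested: str) -> Optional[Path]:
    """Return the resolved path if it stays inside base, otherwise None."""
    base_resolved = base.resolve()
    candidate = (base / requested).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base_resolved
        candidate.relative_to(base_resolved)
    except ValueError:
        return None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    archive_dir = Path(tmp) / "archive"  # hypothetical archive root
    archive_dir.mkdir()
    (archive_dir / "ok.txt").write_text("hello")

    inside = resolve_inside(archive_dir, "ok.txt")
    outside = resolve_inside(archive_dir, "../secret.txt")
    absolute = resolve_inside(archive_dir, "/etc/passwd")
```

Resolving both sides before comparing is what makes `..` segments and symlinks harmless: the comparison happens on the final filesystem location rather than on the raw request string.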

View File

@@ -6,6 +6,9 @@ services:
context: . context: .
dockerfile: Dockerfile dockerfile: Dockerfile
container_name: auto-archiver container_name: auto-archiver
# Override user to match host UID/GID and avoid permission issues on volumes.
# Set USER_ID and GROUP_ID env vars, or defaults to 1000:1000.
user: "${USER_ID:-1000}:${GROUP_ID:-1000}"
volumes: volumes:
- ./secrets:/app/secrets - ./secrets:/app/secrets
- ./local_archive:/app/local_archive - ./local_archive:/app/local_archive

View File

@@ -21,7 +21,7 @@ This allows you to run the auto-archiver without the `poetry run` prefix.
### Optional Development Packages ### Optional Development Packages
Install development packages (used for unit tests etc.) using: Install development packages (used for unit tests etc.) using:
`poetry install -with dev` `poetry install --with dev`
```{toctree} ```{toctree}
@@ -33,4 +33,4 @@ docs
release release
settings_page settings_page
style_guide style_guide
``` ```

View File

@@ -50,7 +50,7 @@ Note not all warnings can be fixed automatically.
Most fixes are safe, but some non-standard practices such as dynamic loading are not picked up by linters. Ensure you check any modifications by this before committing them. Most fixes are safe, but some non-standard practices such as dynamic loading are not picked up by linters. Ensure you check any modifications by this before committing them.
```shell ```shell
make ruff-fix make ruff-clean
``` ```
**Changing Configurations ⚙️** **Changing Configurations ⚙️**
@@ -67,4 +67,4 @@ One example is to extend the selected rules for linting the `pyproject.toml` fil
extend-select = ["B"] extend-select = ["B"]
``` ```
Then re-run the `make ruff-check` command to see the new rules in action. Then re-run the `make ruff-check` command to see the new rules in action.

View File

@@ -8,7 +8,7 @@
## Running Tests ## Running Tests
1. Make sure you've installed the dev dependencies with `pytest install --with dev` 1. Make sure you've installed the dev dependencies with `poetry install --with dev`
2. Tests can be run as follows: 2. Tests can be run as follows:
```{code} bash ```{code} bash
#### Command prefix of 'poetry run' removed here for simplicity #### Command prefix of 'poetry run' removed here for simplicity
@@ -26,7 +26,7 @@ pytest -ra -v tests/test_file.py
pytest -ra -v tests/test_file.py::test_function_name pytest -ra -v tests/test_file.py::test_function_name
``` ```
3. Some tests require environment variables to be set. You can use the example `.env.test.example` file as a template. Copy it to `.env.test` and fill in the required values. This file will be loaded automatically by `pytest`. 3. Some tests require environment variables to be set. You can use the example `tests/.env.test.example` file as a template. Copy it to `tests/.env.test` and fill in the required values. This file will be loaded automatically by `pytest`.
```{code} bash ```{code} bash
cp .env.test.example .env.test cp tests/.env.test.example tests/.env.test
``` ```

View File

@@ -24,7 +24,7 @@ This will disable all logs from Auto Archiver, but it does not disable logs for
#### Logging Level #### Logging Level
There are 7 logging levels in total, with 5 of them used in this tool. They are: `DEBUG`, `INFO`, `SUCCESS`, `WARNING` and `ERROR`. There are 7 logging levels in total, with 5 of them used in this tool. They are: `DEBUG`, `INFO`, `SUCCESS`, `WARNING` and `ERROR`. If you select a level, only that and higher (more serious) levels will be included. `DEBUG` is the most verbose, while `ERROR` is the least verbose.
Change the logging level by setting the value in your orchestration config file: Change the logging level by setting the value in your orchestration config file:
@@ -42,6 +42,20 @@ For normal usage, it is recommended to use the `INFO` level, or if you prefer qu
```{note} To learn about all logging levels, see the [loguru documentation](https://loguru.readthedocs.io/en/stable/api/logger.html) ```{note} To learn about all logging levels, see the [loguru documentation](https://loguru.readthedocs.io/en/stable/api/logger.html)
``` ```
### Logging Format
By default, the console logs are formatted in a human-readable way and the file logs are formatted in JSON. This is new as of version 1.1.1. If you want the console logs in JSON as well, set the `format:` option in your logging settings.
```{code} yaml
:caption: orchestration.yaml
logging:
format: json
```
When the Auto Archiver is writing logs it will include context about specific tasks, so if you are archiving a URL from a Google Sheet, both the URL (and a unique `trace_id` for that URL's archiving attempt) and the Spreadsheet name and row will be included in the logs. This is useful for debugging and understanding what the Auto Archiver is doing.
Using JSON allows you to easily parse the logs and extract specific information. Tools like [`jq`](https://jqlang.org/) can be used to filter and search through the logs.
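
JSON logs can also be filtered from Python. The sketch below assumes each log line is a single JSON object; the field names `level`, `url` and `message` are assumptions made for illustration, so check a real log line for the actual keys before relying on them.

```python
import json


def filter_logs(lines, **wanted):
    """Yield parsed log entries whose fields equal every key/value in wanted."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. human-readable output)
        if not isinstance(entry, dict):
            continue
        if all(entry.get(key) == value for key, value in wanted.items()):
            yield entry


# Hypothetical sample lines; real entries may use different keys.
sample = [
    '{"level": "INFO", "url": "https://example.com/a", "message": "Started processing"}',
    '{"level": "ERROR", "url": "https://example.com/a", "message": "Extractor failed"}',
    "not json at all",
    '{"level": "ERROR", "url": "https://example.com/b", "message": "Timeout"}',
]

errors = list(filter_logs(sample, level="ERROR"))
errors_for_a = list(filter_logs(sample, level="ERROR", url="https://example.com/a"))
```

To run it against a log file, pass an open file object instead of `sample`, since iterating a file yields its lines.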
### Logging to a file ### Logging to a file
By default, auto-archiver will log to the console. But if you wish to store your logs for future reference, or you are running the auto-archiver from within a code implementation, then you may wish to enable file logging. This can be done by setting the `file:` config value in the logging settings. By default, auto-archiver will log to the console. But if you wish to store your logs for future reference, or you are running the auto-archiver from within a code implementation, then you may wish to enable file logging. This can be done by setting the `file:` config value in the logging settings.
@@ -84,6 +98,7 @@ The below example logs only `DEBUG` logs to the console and to the file `/my/fil
logging: logging:
level: DEBUG level: DEBUG
format: json
file: /my/file.log file: /my/file.log
rotation: 1 week rotation: 1 week
``` ```

View File

@@ -4,8 +4,9 @@ Extractor modules are used to extract the content of a given URL. Typically, one
Extractors that are able to extract content from a wide range of websites include: Extractors that are able to extract content from a wide range of websites include:
1. Generic Extractor: parses videos and images on sites using the powerful yt-dlp library. 1. Generic Extractor: parses videos and images on sites using the powerful yt-dlp library.
2. Wayback Machine Extractor: sends pages to the Wayback machine for archiving, and stores the link. 2. Antibot Extractor: uses a headless browser to bypass bot detection and extract content.
3. WACZ Extractor: runs a web browser to 'browse' the URL and save a copy of the page in WACZ format. 3. WACZ Extractor: runs a web browser to 'browse' the URL and save a copy of the page in WACZ format.
4. Wayback Machine Extractor: sends pages to the Wayback machine for archiving, and stores the archived link.
```{include} autogen/extractor.md ```{include} autogen/extractor.md
``` ```

2979
poetry.lock generated

File diff suppressed because it is too large

View File

@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
[project] [project]
name = "auto-archiver" name = "auto-archiver"
version = "1.1.0" version = "1.2.5"
description = "Automatically archive links to videos, images, and social media content from Google Sheets (and more)." description = "Automatically archive links to videos, images, and social media content from Google Sheets (and more)."
requires-python = ">=3.10,<3.13" requires-python = ">=3.10,<3.13"
@@ -50,14 +50,15 @@ dependencies = [
"retrying (>=0.0.0)", "retrying (>=0.0.0)",
"rich-argparse (>=1.6.0,<2.0.0)", "rich-argparse (>=1.6.0,<2.0.0)",
"ruamel-yaml (>=0.18.10,<0.19.0)", "ruamel-yaml (>=0.18.10,<0.19.0)",
"rfc3161-client (>=1.0.1,<2.0.0)", "rfc3161-client (>=1.0.5)",
"cryptography (>44.0.1,<45.0.0)", "cryptography (>=46.0.3)",
"opentimestamps (>=0.4.5,<0.5.0)", "opentimestamps (>=0.4.5,<0.5.0)",
"bgutil-ytdlp-pot-provider (>=1.0.0)", "bgutil-ytdlp-pot-provider (>=1.0.0)",
"yt-dlp[curl-cffi,default] (>=2025.5.22,<2026.0.0)", "yt-dlp[curl-cffi,default] (>=2025.5.22)",
"secretstorage (>=3.3.3,<4.0.0)", "secretstorage (>=3.3.3,<4.0.0)",
"seleniumbase (>=4.36.4,<5.0.0)", "seleniumbase (>=4.36.4,<5.0.0)",
"pyautogui (>=0.9.54,<0.10.0)", "pyautogui (>=0.9.54,<0.10.0)",
"pyperclip (>=1.9.0)",
] ]
[tool.poetry.group.dev.dependencies] [tool.poetry.group.dev.dependencies]
@@ -65,7 +66,7 @@ pytest = "^8.3.4"
autopep8 = "^2.3.1" autopep8 = "^2.3.1"
pytest-loguru = "^0.4.0" pytest-loguru = "^0.4.0"
pytest-mock = "^3.14.0" pytest-mock = "^3.14.0"
ruff = "^0.9.10" ruff = "^0.15.2"
pre-commit = "^4.1.0" pre-commit = "^4.1.0"
[tool.poetry.group.docs.dependencies] [tool.poetry.group.docs.dependencies]

99
railway.json Normal file
View File

@@ -0,0 +1,99 @@
{
"$schema": "https://railway.app/railway.schema.json",
"build": {
"dockerfilePath": "deploy/Dockerfile"
},
"deploy": {
"startCommand": "python3 -m deploy.start",
"healthcheckPath": "/status",
"healthcheckTimeout": 30,
"restartPolicyType": "ON_FAILURE",
"restartPolicyMaxRetries": 5
},
"variables": {
"AUTH_PASSWORD": {
"description": "Password to access your archiver web interface",
"required": true
},
"GSHEET_URL": {
"description": "Google Sheet URL to monitor for new URLs (leave empty to disable)",
"required": false,
"default": ""
},
"GOOGLE_SERVICE_ACCOUNT_JSON": {
"description": "Full JSON contents of your Google service account key (required for Sheets)",
"required": false,
"default": ""
},
"POLL_INTERVAL": {
"description": "Seconds between Google Sheet checks (min 60)",
"required": false,
"default": "300"
},
"S3_BUCKET": {
"description": "S3 bucket name for storage (leave empty for local-only)",
"required": false,
"default": ""
},
"S3_KEY": {
"description": "S3 access key ID",
"required": false,
"default": ""
},
"S3_SECRET": {
"description": "S3 secret access key",
"required": false,
"default": ""
},
"S3_REGION": {
"description": "S3 region (e.g. us-east-1, nyc3 for DO Spaces)",
"required": false,
"default": "us-east-1"
},
"S3_ENDPOINT": {
"description": "S3 endpoint URL template",
"required": false,
"default": "https://s3.{region}.amazonaws.com"
},
"S3_CDN_URL": {
"description": "Public CDN URL template for archived files",
"required": false,
"default": "https://{bucket}.s3.{region}.amazonaws.com/{key}"
},
"TELEGRAM_API_ID": {
"description": "Telegram API ID from https://my.telegram.org",
"required": false,
"default": ""
},
"TELEGRAM_API_HASH": {
"description": "Telegram API hash from https://my.telegram.org",
"required": false,
"default": ""
},
"TELEGRAM_BOT_TOKEN": {
"description": "Telegram bot token from @BotFather",
"required": false,
"default": ""
},
"ENABLE_SCREENSHOTS": {
"description": "Set to true to capture full-page screenshots",
"required": false,
"default": "false"
},
"ENABLE_THUMBNAILS": {
"description": "Set to true to generate video thumbnails",
"required": false,
"default": "false"
},
"ENABLE_CSV_DB": {
"description": "Set to true to save a CSV log of archived items",
"required": false,
"default": "false"
},
"LOG_LEVEL": {
"description": "Logging level: DEBUG, INFO, WARNING, ERROR",
"required": false,
"default": "INFO"
}
}
}
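
The variables above all arrive as strings, so a start script has to coerce them. The following is a minimal sketch of how that could look, not the actual `deploy.start` code: the helper names are hypothetical, and it is an assumption that the `{region}` placeholder in `S3_ENDPOINT` is filled from `S3_REGION`.

```python
MIN_POLL_INTERVAL = 60       # "min 60" from the POLL_INTERVAL description
DEFAULT_POLL_INTERVAL = 300  # default from railway.json


def read_poll_interval(env) -> int:
    """Parse POLL_INTERVAL, falling back to the default and clamping to the minimum."""
    raw = env.get("POLL_INTERVAL") or str(DEFAULT_POLL_INTERVAL)
    try:
        value = int(raw)
    except ValueError:
        value = DEFAULT_POLL_INTERVAL
    return max(MIN_POLL_INTERVAL, value)


def read_flag(env, name: str) -> bool:
    """Treat only the string 'true' (any case) as enabled."""
    return (env.get(name) or "false").strip().lower() == "true"


def s3_endpoint(env):
    """Return the endpoint URL, or None when no bucket is set (local-only storage)."""
    if not env.get("S3_BUCKET"):
        return None
    region = env.get("S3_REGION") or "us-east-1"
    template = env.get("S3_ENDPOINT") or "https://s3.{region}.amazonaws.com"
    # Assumption: only the {region} placeholder is substituted here.
    return template.replace("{region}", region)


example_env = {
    "POLL_INTERVAL": "10",
    "ENABLE_SCREENSHOTS": "True",
    "S3_BUCKET": "my-bucket",
    "S3_REGION": "nyc3",
    "S3_ENDPOINT": "https://{region}.digitaloceanspaces.com",
}
interval = read_poll_interval(example_env)
screenshots = read_flag(example_env, "ENABLE_SCREENSHOTS")
endpoint = s3_endpoint(example_env)
```

In a real process the same helpers would be called with `os.environ` instead of `example_env`.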

File diff suppressed because it is too large

View File

@@ -14,7 +14,7 @@
"@emotion/react": "latest", "@emotion/react": "latest",
"@emotion/styled": "latest", "@emotion/styled": "latest",
"@mui/icons-material": "^7.1.1", "@mui/icons-material": "^7.1.1",
"@mui/material": "latest", "@mui/material": "^7.1.1",
"react": "19.1.0", "react": "19.1.0",
"react-dom": "19.1.0", "react-dom": "19.1.0",
"react-markdown": "^10.0.0", "react-markdown": "^10.0.0",

View File

@@ -31,7 +31,7 @@ import {
Stack, Stack,
Button, Button,
} from '@mui/material'; } from '@mui/material';
import Grid from '@mui/material/Grid2'; import Grid from '@mui/material/Grid';
import { parseDocument, Document, YAMLSeq, YAMLMap, Scalar } from 'yaml' import { parseDocument, Document, YAMLSeq, YAMLMap, Scalar } from 'yaml'
import StepCard from './StepCard'; import StepCard from './StepCard';

View File

@@ -25,7 +25,7 @@ import {
Typography, Typography,
InputAdornment, InputAdornment,
} from '@mui/material'; } from '@mui/material';
import Grid from '@mui/material/Grid2'; import Grid from '@mui/material/Grid';
import DragIndicatorIcon from '@mui/icons-material/DragIndicator'; import DragIndicatorIcon from '@mui/icons-material/DragIndicator';
import Visibility from '@mui/icons-material/Visibility'; import Visibility from '@mui/icons-material/Visibility';
import VisibilityOff from '@mui/icons-material/VisibilityOff'; import VisibilityOff from '@mui/icons-material/VisibilityOff';

View File

@@ -14,7 +14,7 @@ You will need to provide your phone number and a 2FA code the first time you run
import os import os
from telethon.sync import TelegramClient from telethon.sync import TelegramClient
from loguru import logger from auto_archiver.utils.custom_logger import logger
# Create a # Create a
@@ -24,4 +24,4 @@ SESSION_FILE = "secrets/anon-insta"
os.makedirs("secrets", exist_ok=True) os.makedirs("secrets", exist_ok=True)
with TelegramClient(SESSION_FILE, API_ID, API_HASH) as client: with TelegramClient(SESSION_FILE, API_ID, API_HASH) as client:
logger.success(f"New session file created: {SESSION_FILE}.session") logger.success(f"new session file created: {SESSION_FILE}.session")

View File

@@ -7,7 +7,7 @@ from tempfile import TemporaryDirectory
from auto_archiver.utils import url as UrlUtil from auto_archiver.utils import url as UrlUtil
from auto_archiver.core.consts import MODULE_TYPES as CONF_MODULE_TYPES from auto_archiver.core.consts import MODULE_TYPES as CONF_MODULE_TYPES
from loguru import logger from auto_archiver.utils.custom_logger import logger
if TYPE_CHECKING: if TYPE_CHECKING:
from .module import ModuleFactory from .module import ModuleFactory

View File

@@ -10,7 +10,7 @@ from ruamel.yaml import YAML, CommentedMap
import json import json
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from copy import deepcopy from copy import deepcopy
from auto_archiver.core.consts import MODULE_TYPES from auto_archiver.core.consts import MODULE_TYPES
@@ -118,8 +118,7 @@ class DefaultValidatingParser(argparse.ArgumentParser):
""" """
Override of error to format a nicer looking error message using logger Override of error to format a nicer looking error message using logger
""" """
logger.error("Problem with configuration file (tip: use --help to see the available options):") logger.error(f"Problem with configuration file (tip: use --help to see the available options): \n{message}")
logger.error(message)
self.exit(2) self.exit(2)
def parse_known_args(self, args=None, namespace=None): def parse_known_args(self, args=None, namespace=None):
@@ -136,8 +135,7 @@ class DefaultValidatingParser(argparse.ArgumentParser):
try: try:
self._check_value(action, action.default) self._check_value(action, action.default)
except argparse.ArgumentError as e: except argparse.ArgumentError as e:
logger.error(f"You have an invalid setting in your configuration file ({action.dest}):") logger.error(f"You have an invalid setting in your configuration file ({action.dest}):\n {e}")
logger.error(e)
exit() exit()
return super().parse_known_args(args, namespace) return super().parse_known_args(args, namespace)

View File

@@ -12,7 +12,7 @@ from contextlib import suppress
import mimetypes import mimetypes
import os import os
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from retrying import retry from retrying import retry
import re import re
@@ -94,7 +94,7 @@ class Extractor(BaseModule):
to_filename = to_filename[-64:] to_filename = to_filename[-64:]
to_filename = os.path.join(self.tmp_dir, to_filename) to_filename = os.path.join(self.tmp_dir, to_filename)
if verbose: if verbose:
logger.debug(f"downloading {url[0:50]=} {to_filename=}") logger.debug(f"Downloading {to_filename=}")
headers = { headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
} }
@@ -117,7 +117,7 @@ class Extractor(BaseModule):
return to_filename return to_filename
except requests.RequestException as e: except requests.RequestException as e:
logger.warning(f"Failed to fetch the Media URL: {str(e)[:250]}") logger.warning(f"Failed to fetch the Media URL: {e}")
if try_best_quality: if try_best_quality:
return None, url return None, url

View File

@@ -11,7 +11,7 @@ from dataclasses import dataclass, field
from dataclasses_json import dataclass_json, config from dataclasses_json import dataclass_json, config
import mimetypes import mimetypes
from loguru import logger from auto_archiver.utils.custom_logger import logger
@dataclass_json # annotation order matters @dataclass_json # annotation order matters
@@ -86,7 +86,7 @@ class Media:
@property # getter .mimetype @property # getter .mimetype
def mimetype(self) -> str: def mimetype(self) -> str:
if not self.filename or len(self.filename) == 0: if not self.filename or len(self.filename) == 0:
logger.warning(f"cannot get mimetype from media without filename: {self}") logger.warning(f"Cannot get mimetype from media without filename: {self}")
return "" return ""
if not self._mimetype: if not self._mimetype:
self._mimetype = mimetypes.guess_type(self.filename)[0] self._mimetype = mimetypes.guess_type(self.filename)[0]
@@ -116,13 +116,12 @@ class Media:
# self.is_video() should be used together with this method # self.is_video() should be used together with this method
try: try:
streams = ffmpeg.probe(self.filename, select_streams="v")["streams"] streams = ffmpeg.probe(self.filename, select_streams="v")["streams"]
logger.debug(f"STREAMS FOR {self.filename} {streams}") logger.debug(f"Streams for {self.filename}: {streams}")
return any(s.get("duration_ts", 0) > 0 for s in streams) return any(s.get("duration_ts", 0) > 0 for s in streams)
except Error: except Error:
return False # ffmpeg errors when reading bad files return False # ffmpeg errors when reading bad files
except Exception as e: except Exception as e:
logger.error(e) logger.error(f"{e}: {traceback.format_exc()}")
logger.error(traceback.format_exc())
try: try:
fsize = os.path.getsize(self.filename) fsize = os.path.getsize(self.filename)
return fsize > 20_000 return fsize > 20_000

View File

@@ -17,7 +17,7 @@ from dataclasses_json import dataclass_json
import datetime import datetime
from urllib.parse import urlparse from urllib.parse import urlparse
from dateutil.parser import parse as parse_dt from dateutil.parser import parse as parse_dt
from loguru import logger from auto_archiver.utils.custom_logger import logger
from .media import Media from .media import Media
@@ -181,6 +181,9 @@ class Metadata:
media_hashes = set() media_hashes = set()
new_media = [] new_media = []
for m in self.media: for m in self.media:
if not m.filename:
new_media.append(m)
continue
h = m.get("hash") h = m.get("hash")
if not h: if not h:
h = calculate_hash_in_chunks(hashlib.sha256(), int(1.6e7), m.filename) h = calculate_hash_in_chunks(hashlib.sha256(), int(1.6e7), m.filename)
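
The effect of the new guard is that media without a file on disk are kept as they are instead of being hashed. A self-contained sketch of that deduplication idea follows; `Item`, `sha256_of` and `dedupe` are hypothetical stand-ins, not the project's `Media` class or `calculate_hash_in_chunks` helper.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Item:
    """Hypothetical stand-in for a media entry."""
    filename: Optional[str]


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe(items: List[Item]) -> List[Item]:
    seen = set()
    kept = []
    for item in items:
        if not item.filename:
            kept.append(item)  # nothing to hash, keep as-is
            continue
        file_hash = sha256_of(item.filename)
        if file_hash in seen:
            continue  # same bytes already kept
        seen.add(file_hash)
        kept.append(item)
    return kept


with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, data in [("a.bin", b"same"), ("b.bin", b"same"), ("c.bin", b"other")]:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "wb") as fh:
            fh.write(data)
    items = [Item(paths["a.bin"]), Item(None), Item(paths["b.bin"]), Item(paths["c.bin"])]
    result = dedupe(items)
    kept_names = [os.path.basename(i.filename) if i.filename else None for i in result]
```

Without the filename check, the entry with `filename=None` would reach `open()` and raise, which is the failure the guard avoids.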

View File

@@ -16,7 +16,7 @@ import sys
from importlib.util import find_spec from importlib.util import find_spec
import os import os
from os.path import join from os.path import join
from loguru import logger from auto_archiver.utils.custom_logger import logger
import auto_archiver import auto_archiver
from auto_archiver.core.consts import DEFAULT_MANIFEST, MANIFEST_FILE, SetupError from auto_archiver.core.consts import DEFAULT_MANIFEST, MANIFEST_FILE, SetupError

View File

@@ -15,9 +15,11 @@ import traceback
from copy import copy from copy import copy
from rich_argparse import RichHelpFormatter from rich_argparse import RichHelpFormatter
from loguru import logger from auto_archiver.utils.custom_logger import format_for_human_readable_console, logger
import requests import requests
from auto_archiver.utils.misc import random_str
from .metadata import Metadata, Media from .metadata import Metadata, Media
from auto_archiver.version import __version__ from auto_archiver.version import __version__
from .config import ( from .config import (
@@ -342,7 +344,14 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
# add other logging info # add other logging info
if self.logger_id is None: # note - need direct comparison to None since need to consider falsy value 0 if self.logger_id is None: # note - need direct comparison to None since need to consider falsy value 0
use_level = logging_config["level"] use_level = logging_config["level"]
self.logger_id = logger.add(sys.stderr, level=use_level) self.logger_id = logger.add(
sys.stderr,
level=use_level,
catch=True,
format="<level>{extra[serialized]}</level>"
if logging_config.get("format", "").lower() == "json"
else format_for_human_readable_console(),
)
rotation = logging_config["rotation"] rotation = logging_config["rotation"]
log_file = logging_config["file"] log_file = logging_config["file"]
@@ -356,9 +365,10 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
f"{log_file}.{i}_{level.lower()}", f"{log_file}.{i}_{level.lower()}",
filter=lambda rec, lvl=level: rec["level"].name == lvl, filter=lambda rec, lvl=level: rec["level"].name == lvl,
rotation=rotation, rotation=rotation,
format="{extra[serialized]}",
) )
elif log_file: elif log_file:
logger.add(log_file, rotation=rotation, level=use_level) logger.add(log_file, rotation=rotation, level=use_level, format="{extra[serialized]}")
def install_modules(self, modules_by_type): def install_modules(self, modules_by_type):
""" """
@@ -466,13 +476,9 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
update_cmd = "`docker pull bellingcat/auto-archiver:latest`" update_cmd = "`docker pull bellingcat/auto-archiver:latest`"
else: else:
update_cmd = "`pip install --upgrade auto-archiver`" update_cmd = "`pip install --upgrade auto-archiver`"
logger.warning("")
logger.warning("********* IMPORTANT: UPDATE AVAILABLE ********")
logger.warning( logger.warning(
f"A new version of auto-archiver is available (v{latest_version}, you have v{current_version})" f"\n********* IMPORTANT: UPDATE AVAILABLE ********\nA new version of auto-archiver is available (v{latest_version}, you have v{current_version})\nMake sure to update to the latest version using: {update_cmd}\n"
) )
logger.warning(f"Make sure to update to the latest version using: {update_cmd}")
logger.warning("")
def setup(self, args: list): def setup(self, args: list):
""" """
@@ -522,7 +528,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
self.setup(args) self.setup(args)
return self.feed() return self.feed()
except Exception as e: except Exception as e:
logger.error(e) logger.error(f"{e}: {traceback.format_exc()}")
exit(1) exit(1)
def cleanup(self) -> None: def cleanup(self) -> None:
@@ -534,8 +540,10 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
url_count = 0 url_count = 0
for feeder in self.feeders: for feeder in self.feeders:
for item in feeder: for item in feeder:
yield self.feed_item(item) with logger.contextualize(url=item.get_url(), trace=random_str(12)):
url_count += 1 logger.info("Started processing")
yield self.feed_item(item)
url_count += 1
logger.info(f"Processed {url_count} URL(s)") logger.info(f"Processed {url_count} URL(s)")
self.cleanup() self.cleanup()
@@ -555,13 +563,13 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
return self.archive(item) return self.archive(item)
except KeyboardInterrupt: except KeyboardInterrupt:
# catches keyboard interruptions to do a clean exit # catches keyboard interruptions to do a clean exit
logger.warning(f"caught interrupt on {item=}") logger.warning("Caught interrupt")
for d in self.databases: for d in self.databases:
d.aborted(item) d.aborted(item)
self.cleanup() self.cleanup()
exit() exit()
except Exception as e: except Exception as e:
logger.error(f"Got unexpected error on item {item}: {e}\n{traceback.format_exc()}") logger.error(f"Got unexpected error: {e}\n{traceback.format_exc()}")
for d in self.databases: for d in self.databases:
if isinstance(e, AssertionError): if isinstance(e, AssertionError):
d.failed(item, str(e)) d.failed(item, str(e))
@@ -589,7 +597,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try: try:
check_url_or_raise(original_url) check_url_or_raise(original_url)
except ValueError as e: except ValueError as e:
logger.error(f"Error archiving URL {original_url}: {e}") logger.error(f"Error archiving: {e}")
raise e raise e
# 1 - sanitize - each archiver is responsible for cleaning/expanding its own URLs # 1 - sanitize - each archiver is responsible for cleaning/expanding its own URLs
@@ -599,7 +607,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
result.set_url(url) result.set_url(url)
if original_url != url: if original_url != url:
logger.debug(f"Sanitized URL from {original_url} to {url}") logger.debug(f"Sanitized URL to {url}")
result.set("original_url", original_url) result.set("original_url", original_url)
# 2 - notify start to DBs, propagate already archived if feature enabled in DBs # 2 - notify start to DBs, propagate already archived if feature enabled in DBs
@@ -614,25 +622,25 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try: try:
d.done(cached_result, cached=True) d.done(cached_result, cached=True)
except Exception as e: except Exception as e:
logger.error(f"ERROR database {d.name}: {e}: {traceback.format_exc()}") logger.error(f"Database {d.name}: {e}: {traceback.format_exc()}")
return cached_result return cached_result
# 3 - call extractors until one succeeds # 3 - call extractors until one succeeds
for a in self.extractors: for a in self.extractors:
logger.info(f"Trying extractor {a.name} for {url}") logger.info(f"Trying extractor {a.name}")
try: try:
result.merge(a.download(result)) result.merge(a.download(result))
if result.is_success(): if result.is_success():
break break
except Exception as e: except Exception as e:
logger.error(f"ERROR archiver {a.name}: {e}: {traceback.format_exc()}") logger.error(f"Extractor {a.name}: {e}: {traceback.format_exc()}")
# 4 - call enrichers to work with archived content # 4 - call enrichers to work with archived content
for e in self.enrichers: for e in self.enrichers:
try: try:
e.enrich(result) e.enrich(result)
except Exception as exc: except Exception as exc:
logger.error(f"ERROR enricher {e.name}: {exc}: {traceback.format_exc()}") logger.error(f"Enricher {e.name}: {exc}: {traceback.format_exc()}")
# 5 - store all downloaded/generated media # 5 - store all downloaded/generated media
result.store(storages=self.storages) result.store(storages=self.storages)
@@ -651,7 +659,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try: try:
d.done(result) d.done(result)
except Exception as e: except Exception as e:
logger.error(f"ERROR database {d.name}: {e}: {traceback.format_exc()}") logger.error(f"Database {d.name}: {e}: {traceback.format_exc()}")
return result return result
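
Several messages in this file drop the URL from the message text because the feed loop now binds it once as log context. The sketch below illustrates that pattern using only the standard library (`contextvars` plus `logging`) as a stand-in for loguru's `logger.contextualize`; the `contextualize` helper, the `ContextFilter` class and the output format are assumptions made for illustration and do not reproduce the project's `custom_logger`.

```python
import contextvars
import io
import logging
import secrets
from contextlib import contextmanager

# Holds the fields bound for the current unit of work (never mutated in place).
_context = contextvars.ContextVar("log_context", default={})


@contextmanager
def contextualize(**fields):
    """Bind fields to every log record emitted inside the with-block."""
    token = _context.set({**_context.get(), **fields})
    try:
        yield
    finally:
        _context.reset(token)


class ContextFilter(logging.Filter):
    """Copy the bound fields onto each record so the formatter can print them."""

    def filter(self, record):
        record.context = " ".join(f"{k}={v}" for k, v in _context.get().items())
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [%(context)s] %(message)s"))
handler.addFilter(ContextFilter())

log = logging.getLogger("context_sketch")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

urls = ["https://example.com/a", "https://example.com/b"]
for url in urls:
    with contextualize(url=url, trace=secrets.token_hex(6)):
        log.info("Started processing")  # message itself no longer repeats the URL
log.info(f"Processed {len(urls)} URL(s)")

output = stream.getvalue().splitlines()
```

Because the URL and trace value travel with the record rather than the message, every line logged while one item is being processed can be grouped afterwards, even when the message text is generic.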

View File

@@ -24,7 +24,7 @@ from abc import abstractmethod
from typing import IO from typing import IO
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from slugify import slugify from slugify import slugify
from auto_archiver.utils.misc import random_str from auto_archiver.utils.misc import random_str

View File

@@ -7,7 +7,7 @@ from urllib.parse import urljoin
import glob import glob
import importlib.util import importlib.util
from loguru import logger from auto_archiver.utils.custom_logger import logger
import selenium import selenium
from seleniumbase import SB from seleniumbase import SB
@@ -16,6 +16,7 @@ from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
from auto_archiver.modules.antibot_extractor_enricher.dropins.default import DefaultDropin from auto_archiver.modules.antibot_extractor_enricher.dropins.default import DefaultDropin
from auto_archiver.utils.misc import random_str from auto_archiver.utils.misc import random_str
from auto_archiver.utils.url import is_relevant_url from auto_archiver.utils.url import is_relevant_url
from auto_archiver.utils.deletion_detection import detect_deletion, flag_as_deleted
class AntibotExtractorEnricher(Extractor, Enricher): class AntibotExtractorEnricher(Extractor, Enricher):
@@ -57,7 +58,7 @@ class AntibotExtractorEnricher(Extractor, Enricher):
continue # Skip imported modules/classes/functions continue # Skip imported modules/classes/functions
if isinstance(obj, type) and issubclass(obj, Dropin): if isinstance(obj, type) and issubclass(obj, Dropin):
dropins.append(obj) dropins.append(obj)
logger.debug(f"ANTIBOT loaded drop-in classes: {', '.join([d.__name__ for d in dropins])}") logger.debug(f"Loaded drop-in classes: {', '.join([d.__name__ for d in dropins])}")
return dropins return dropins
def sanitize_url(self, url: str) -> str: def sanitize_url(self, url: str) -> str:
@@ -72,6 +73,7 @@ class AntibotExtractorEnricher(Extractor, Enricher):
if self.enrich(result): if self.enrich(result):
result.status = "antibot" result.status = "antibot"
return result return result
return False
def _prepare_user_data_dir(self): def _prepare_user_data_dir(self):
if self.user_data_dir: if self.user_data_dir:
@@ -81,30 +83,59 @@ class AntibotExtractorEnricher(Extractor, Enricher):
os.makedirs(self.user_data_dir, exist_ok=True) os.makedirs(self.user_data_dir, exist_ok=True)
def enrich(self, to_enrich: Metadata, custom_data_dir: bool = True) -> bool: def enrich(self, to_enrich: Metadata, custom_data_dir: bool = True) -> bool:
if to_enrich.get_media_by_id("html_source_code"):
logger.info("Antibot has already been executed, skipping.")
return True
using_user_data_dir = self.user_data_dir if custom_data_dir else None using_user_data_dir = self.user_data_dir if custom_data_dir else None
url = to_enrich.get_url() url = to_enrich.get_url()
url_sample = url[:75]
# Use xvfb in Docker environments where no display is available
use_xvfb = bool(os.environ.get("RUNNING_IN_DOCKER"))
try: try:
with SB(uc=True, agent=self.agent, headed=None, user_data_dir=using_user_data_dir, proxy=self.proxy) as sb: with SB(
logger.info(f"ANTIBOT selenium browser is up with agent {self.agent}, opening {url_sample}...") uc=True,
agent=self.agent,
headed=None,
user_data_dir=using_user_data_dir,
proxy=self.proxy,
xvfb=use_xvfb,
) as sb:
logger.info(f"Selenium browser is up with agent {self.agent}, opening url...")
sb.uc_open_with_reconnect(url, 4) sb.uc_open_with_reconnect(url, 4)
logger.debug(f"ANTIBOT handling CAPTCHAs for {url_sample}...") logger.debug("Handling CAPTCHAs...")
sb.uc_gui_handle_cf() sb.uc_gui_handle_cf()
sb.uc_gui_click_rc() # NB: using handle instead of click breaks some sites like reddit, for now we separate here but can have dropins deciding this in the future sb.uc_gui_click_rc() # NB: using handle instead of click breaks some sites like reddit, for now we separate here but can have dropins deciding this in the future
dropin = self._get_suitable_dropin(url, sb) dropin = self._get_suitable_dropin(url, sb)
dropin.open_page(url) if not dropin.open_page(url):
# Check for deletion indicators
page_title = sb.get_title()
html_source = sb.get_page_source()
deletion_info = detect_deletion(html_content=html_source, page_title=page_title, url=url)
if deletion_info:
flag_as_deleted(to_enrich, deletion_info)
return to_enrich
logger.warning("Failed to open drop-in page (not detected as deleted)")
return False
if self.detect_auth_wall and self._hit_auth_wall(sb): if self.detect_auth_wall and (dropin.hit_auth_wall() and self._hit_auth_wall(sb)):
logger.warning(f"ANTIBOT SKIP since auth wall or CAPTCHA was detected for {url_sample}") logger.warning("Skipping since auth wall or CAPTCHA was detected")
return False return False
sb.wait_for_ready_state_complete() sb.wait_for_ready_state_complete()
sb.sleep(1) # margin for the page to load completely sb.sleep(1) # margin for the page to load completely
to_enrich.set_title(sb.get_title()) page_title = sb.get_title()
html_source = sb.get_page_source()
# Check if the page indicates content was deleted
deletion_info = detect_deletion(html_content=html_source, page_title=page_title, url=url)
if deletion_info:
flag_as_deleted(to_enrich, deletion_info)
to_enrich.set_title(page_title)
self._enrich_html_source_code(sb, to_enrich) self._enrich_html_source_code(sb, to_enrich)
self._enrich_full_page_screenshot(sb, to_enrich) self._enrich_full_page_screenshot(sb, to_enrich)
@@ -125,18 +156,18 @@ class AntibotExtractorEnricher(Extractor, Enricher):
js_css_selector=dropin.js_for_video_css_selectors(), js_css_selector=dropin.js_for_video_css_selectors(),
max_media=self.max_download_videos - downloaded_videos, max_media=self.max_download_videos - downloaded_videos,
) )
logger.info(f"ANTIBOT completed for {url_sample}") logger.info("Completed")
return to_enrich return to_enrich
except selenium.common.exceptions.SessionNotCreatedException as e: except selenium.common.exceptions.SessionNotCreatedException as e:
if custom_data_dir: # the retry logic only works once if custom_data_dir: # the retry logic only works once
logger.error( logger.error(
f"ANTIBOT session not created error: {e}. Please remove the user_data_dir {self.user_data_dir} and try again, will retry without user data dir though." f"Session not created error: {e}. Please remove the user_data_dir {self.user_data_dir} and try again, will retry without user data dir though."
) )
return self.enrich(to_enrich, custom_data_dir=False) return self.enrich(to_enrich, custom_data_dir=False)
raise e # re-raise raise e # re-raise
except Exception as e: except Exception as e:
logger.error(f"ANTIBOT runtime error: {e}: {traceback.format_exc()}") logger.error(f"Runtime error: {e}: {traceback.format_exc()}")
return False return False
def _get_suitable_dropin(self, url: str, sb: SB): def _get_suitable_dropin(self, url: str, sb: SB):
@@ -146,7 +177,7 @@ class AntibotExtractorEnricher(Extractor, Enricher):
""" """
for dropin in self.dropins: for dropin in self.dropins:
if dropin.suitable(url): if dropin.suitable(url):
logger.debug(f"ANTIBOT using drop-in {dropin.__name__} for {url}") logger.debug(f"Using drop-in {dropin.__name__}")
return dropin(sb, self) return dropin(sb, self)
return DefaultDropin(sb, self) return DefaultDropin(sb, self)
@@ -275,8 +306,14 @@ class AntibotExtractorEnricher(Extractor, Enricher):
return return
url = to_enrich.get_url() url = to_enrich.get_url()
all_urls = set() all_urls = set()
logger.debug(f"Extracting media for {js_css_selector=}")
try:
sources = sb.execute_script(js_css_selector)
except selenium.common.exceptions.JavascriptException as e:
logger.error(f"Error executing JavaScript selector {js_css_selector}: {e}")
return
sources = sb.execute_script(js_css_selector)
# js_for_css_selectors # js_for_css_selectors
for src in sources: for src in sources:
if len(all_urls) >= max_media: if len(all_urls) >= max_media:

View File

@@ -0,0 +1 @@
*.py

View File

@@ -1,6 +1,8 @@
import json
import os import os
import traceback
from typing import Mapping from typing import Mapping
from loguru import logger from auto_archiver.utils.custom_logger import logger
from seleniumbase import SB from seleniumbase import SB
import yt_dlp import yt_dlp
@@ -73,8 +75,11 @@ class Dropin:
You can overwrite this instead of `images_selector` for more control over scraped images. You can overwrite this instead of `images_selector` for more control over scraped images.
""" """
if not self.images_selectors():
return "return [];"
safe_selector = json.dumps(self.images_selectors())
return f""" return f"""
return Array.from(document.querySelectorAll("{self.images_selectors()}")).map(el => el.src || el.href).filter(Boolean); return Array.from(document.querySelectorAll({safe_selector})).map(el => el.src || el.href).filter(Boolean);
""" """
def js_for_video_css_selectors(self) -> str: def js_for_video_css_selectors(self) -> str:
@@ -83,8 +88,11 @@ class Dropin:
You can overwrite this instead of `video_selector` for more control over scraped videos. You can overwrite this instead of `video_selector` for more control over scraped videos.
""" """
if not self.video_selectors():
return "return [];"
safe_selector = json.dumps(self.video_selectors())
return f""" return f"""
return Array.from(document.querySelectorAll("{self.video_selectors()}")).map(el => el.src || el.href).filter(Boolean); return Array.from(document.querySelectorAll({safe_selector})).map(el => el.src || el.href).filter(Boolean);
""" """
def open_page(self, url) -> bool: def open_page(self, url) -> bool:
@@ -102,6 +110,12 @@ class Dropin:
""" """
return 0, 0 return 0, 0
def hit_auth_wall(self) -> bool:
"""
Custom check for whether the current page may be behind an authentication wall. If True is returned, the default global auth wall detector makes the final decision. If False is returned, no auth wall is assumed and the page is considered open.
"""
return True
def _get_username_password(self, site) -> tuple[str, str]: def _get_username_password(self, site) -> tuple[str, str]:
""" """
Get the username and password for the site from the extractor's auth data. Get the username and password for the site from the extractor's auth data.
@@ -143,7 +157,7 @@ class Dropin:
with yt_dlp.YoutubeDL(validated_options) as ydl: with yt_dlp.YoutubeDL(validated_options) as ydl:
for url in video_urls: for url in video_urls:
try: try:
logger.debug(f"Downloading video from URL: {url}") logger.debug(f"Downloading video from url: {url}")
info = ydl.extract_info(url, download=True) info = ydl.extract_info(url, download=True)
filename = ydl_entry_to_filename(ydl, info) filename = ydl_entry_to_filename(ydl, info)
if not filename: # Failed to download video. if not filename: # Failed to download video.
@@ -155,5 +169,5 @@ class Dropin:
to_enrich.add_media(media) to_enrich.add_media(media)
downloaded += 1 downloaded += 1
except Exception as e: except Exception as e:
logger.error(f"Error downloading {url}: {e}") logger.error(f"Download failed: {e} {traceback.format_exc()}")
return downloaded return downloaded
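Note on the `json.dumps` change in `js_for_images_css_selectors` and `js_for_video_css_selectors`: a selector such as `[data-e2e="detail-photo"] img` contains double quotes, so interpolating it between literal `"` characters would end the JavaScript string early. The sketch below shows the idea in isolation. `build_selector_js` is a hypothetical stand-alone helper, not a function from the repository.

```python
import json


def build_selector_js(selector):
    """Hypothetical stand-alone version of the drop-in helper (illustration only)."""
    # An empty or missing selector yields a script that returns no elements.
    if not selector:
        return "return [];"
    # json.dumps emits a double-quoted string literal with inner quotes and
    # backslashes escaped, so it can be pasted directly into the script.
    safe_selector = json.dumps(selector)
    return (
        f"return Array.from(document.querySelectorAll({safe_selector}))"
        ".map(el => el.src || el.href).filter(Boolean);"
    )


if __name__ == "__main__":
    print(build_selector_js('[data-e2e="detail-photo"] img'))
```

The early `return [];` also covers drop-ins whose selector method returns `None`, as the TikTok drop-in does for videos.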

View File

@@ -1,5 +1,5 @@
from typing import Mapping from typing import Mapping
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
@@ -62,7 +62,7 @@ class LinkedinDropin(Dropin):
self.sb.wait_for_ready_state_complete() self.sb.wait_for_ready_state_complete()
username, password = self._get_username_password("linkedin.com") username, password = self._get_username_password("linkedin.com")
logger.debug("LinkedinDropin Logging in to Linkedin with username: {}", username) logger.debug("Logging in to Linkedin with username: {}", username)
self.sb.type("#username", username) self.sb.type("#username", username)
self.sb.type("#password", password) self.sb.type("#password", password)
self.sb.click_if_visible("#password-visibility-toggle", timeout=0.5) self.sb.click_if_visible("#password-visibility-toggle", timeout=0.5)

View File

@@ -3,7 +3,7 @@ from typing import Mapping
from auto_archiver.core.metadata import Metadata from auto_archiver.core.metadata import Metadata
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
from loguru import logger from auto_archiver.utils.custom_logger import logger
class RedditDropin(Dropin): class RedditDropin(Dropin):
@@ -50,7 +50,7 @@ class RedditDropin(Dropin):
self._close_cookies_banner() self._close_cookies_banner()
username, password = self._get_username_password("reddit.com") username, password = self._get_username_password("reddit.com")
logger.debug("RedditDropin Logging in to Reddit with username: {}", username) logger.debug("Logging in to Reddit with username: {}", username)
self.sb.type("#login-username", username) self.sb.type("#login-username", username)
self.sb.type("#login-password", password) self.sb.type("#login-password", password)
@@ -68,7 +68,7 @@ class RedditDropin(Dropin):
self.sb.click_link_text("Log in") self.sb.click_link_text("Log in")
self.sb.wait_for_ready_state_complete() self.sb.wait_for_ready_state_complete()
if self.sb.is_text_visible("Welcome back"): if self.sb.is_text_visible("Welcome back"):
logger.debug("RedditDropin Login successful") logger.debug("Login successful")
self.sb.click_if_visible("this link") self.sb.click_if_visible("this link")
def _close_cookies_banner(self): def _close_cookies_banner(self):
@@ -88,5 +88,5 @@ class RedditDropin(Dropin):
.map(el => el.src || el.href) .map(el => el.src || el.href)
.filter(url => url && /\.(m3u8|mpd|ism)$/.test(url)); .filter(url => url && /\.(m3u8|mpd|ism)$/.test(url));
""") """)
logger.debug("RedditDropin Found {} video URLs", len(filtered_urls)) logger.debug("Found {} video URLs", len(filtered_urls))
return 0, self._download_videos_with_ytdlp(filtered_urls, to_enrich) return 0, self._download_videos_with_ytdlp(filtered_urls, to_enrich)

View File

@@ -0,0 +1,56 @@
from contextlib import suppress
from typing import Mapping
from auto_archiver.utils.custom_logger import logger
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
class TikTokDropin(Dropin):
"""
A class to handle TikTok drop-in functionality for the antibot extractor enricher module.
"""
def documentation() -> Mapping[str, str]:
return {
"name": "TikTok Dropin",
"description": "Handles TikTok posts and works without authentication.\nNOTE: This dropin is highly susceptible to TikTok's bot detection mechanisms and may not work reliably if you reuse the same IP. The GenericExtractor is recommended for TikTok posts, as it handles video/image download more reliably. In the future we plan to implement better anti-CAPTCHA measures for this dropin.",
"site": "tiktok.com",
}
@staticmethod
def suitable(url: str) -> bool:
return "tiktok.com" in url
@staticmethod
def images_selectors() -> str:
return '[data-e2e="detail-photo"] img'
@staticmethod
def video_selectors() -> str:
return None # TikTok videos should be handled by the generic extractor
def open_page(self, url) -> bool:
self.sb.wait_for_ready_state_complete()
self._close_cookies_banner()
# TODO: implement login logic
if url != self.sb.get_current_url():
return False
if self.sb.is_text_visible("Video currently unavailable"):
logger.debug("Video may have been removed or is private.")
return False
return True
def hit_auth_wall(self) -> bool:
return False # TikTok does not require authentication for public posts
def _close_cookies_banner(self):
with suppress(Exception): # selenium.common.exceptions.JavascriptException
self.sb.execute_script("""
document
.querySelector("tiktok-cookie-banner")
.shadowRoot.querySelector("faceplate-dialog")
.querySelector("button")
.click()
""")
self.sb.click_if_visible("Skip")
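Note on how `hit_auth_wall` interacts with the global detector: in `enrich`, the condition is `self.detect_auth_wall and (dropin.hit_auth_wall() and self._hit_auth_wall(sb))`, so a drop-in that returns False (as this TikTok drop-in does) skips the global check entirely. A minimal sketch of that gating, using hypothetical callables in place of the real methods:

```python
from typing import Callable


def should_skip_for_auth_wall(
    detect_auth_wall: bool,
    dropin_check: Callable[[], bool],
    global_check: Callable[[], bool],
) -> bool:
    """Mirror of the `and` chain in the diff; all names here are hypothetical."""
    # `and` short-circuits: the global detector only runs when detection is
    # enabled and the drop-in check returns True.
    return bool(detect_auth_wall and dropin_check() and global_check())
```

With the base-class default of `return True`, behaviour is unchanged for existing drop-ins, since the decision then rests on the global detector alone.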

View File

@@ -4,7 +4,7 @@ from typing import Mapping
from auto_archiver.core.metadata import Metadata from auto_archiver.core.metadata import Metadata
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
from loguru import logger from auto_archiver.utils.custom_logger import logger
class VkDropin(Dropin): class VkDropin(Dropin):

View File

@@ -2,7 +2,7 @@ from typing import Union
import os import os
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Database from auto_archiver.core import Database
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -36,9 +36,9 @@ class AAApiDb(Database):
if not self.store_results: if not self.store_results:
return return
if cached: if cached:
logger.debug(f"skipping saving archive of {item.get_url()} to the AA API because it was cached") logger.debug("Skipping saving archive to AA API because it was cached")
return return
logger.debug(f"saving archive of {item.get_url()} to the AA API.") logger.debug("Saving archive to the AA API.")
payload = { payload = {
"author_id": self.author_id, "author_id": self.author_id,

View File

@@ -3,7 +3,7 @@ import os
from typing import IO, Iterator, Optional, Union from typing import IO, Iterator, Optional, Union
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Database, Feeder, Media, Metadata, Storage from auto_archiver.core import Database, Feeder, Media, Metadata, Storage
from auto_archiver.utils import calculate_file_hash from auto_archiver.utils import calculate_file_hash
@@ -66,13 +66,13 @@ class AtlosFeederDbStorage(Feeder, Database, Storage):
"""Mark an item as failed in Atlos, if the ID exists.""" """Mark an item as failed in Atlos, if the ID exists."""
atlos_id = item.metadata.get("atlos_id") atlos_id = item.metadata.get("atlos_id")
if not atlos_id: if not atlos_id:
logger.info(f"Item {item.get_url()} has no Atlos ID, skipping") logger.info("No Atlos ID available, skipping")
return return
self._post( self._post(
f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver", f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver",
json={"metadata": {"processed": True, "status": "error", "error": reason}}, json={"metadata": {"processed": True, "status": "error", "error": reason}},
) )
logger.info(f"Stored failure for {item.get_url()} (ID {atlos_id}) on Atlos: {reason}") logger.info(f"Stored failure ID {atlos_id} on Atlos: {reason}")
def fetch(self, item: Metadata) -> Union[Metadata, bool]: def fetch(self, item: Metadata) -> Union[Metadata, bool]:
"""check and fetch if the given item has been archived already, each """check and fetch if the given item has been archived already, each
@@ -88,7 +88,7 @@ class AtlosFeederDbStorage(Feeder, Database, Storage):
"""Mark an item as successfully archived in Atlos.""" """Mark an item as successfully archived in Atlos."""
atlos_id = item.metadata.get("atlos_id") atlos_id = item.metadata.get("atlos_id")
if not atlos_id: if not atlos_id:
logger.info(f"Item {item.get_url()} has no Atlos ID, skipping") logger.info("Item has no Atlos ID, skipping")
return return
self._post( self._post(
f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver", f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver",
@@ -100,7 +100,7 @@ class AtlosFeederDbStorage(Feeder, Database, Storage):
} }
}, },
) )
logger.info(f"Stored success for {item.get_url()} (ID {atlos_id}) on Atlos") logger.info(f"Stored success ID {atlos_id} on Atlos")
# ! Atlos Module - Storage Methods # ! Atlos Module - Storage Methods

View File

@@ -1,5 +1,3 @@
from loguru import logger
from auto_archiver.core.feeder import Feeder from auto_archiver.core.feeder import Feeder
from auto_archiver.core.metadata import Metadata from auto_archiver.core.metadata import Metadata
from auto_archiver.core.consts import SetupError from auto_archiver.core.consts import SetupError
@@ -16,8 +14,5 @@ class CLIFeeder(Feeder):
def __iter__(self) -> Metadata: def __iter__(self) -> Metadata:
urls = self.config["urls"] urls = self.config["urls"]
for url in urls: for url in urls:
logger.debug(f"Processing {url}")
m = Metadata().set_url(url) m = Metadata().set_url(url)
yield m yield m
logger.success(f"Processed {len(urls)} URL(s)")

View File

@@ -1,4 +1,4 @@
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Database from auto_archiver.core import Database
from auto_archiver.core import Metadata from auto_archiver.core import Metadata

View File

@@ -1,5 +1,5 @@
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from csv import DictWriter from csv import DictWriter
from dataclasses import asdict from dataclasses import asdict

View File

@@ -1,4 +1,4 @@
from loguru import logger from auto_archiver.utils.custom_logger import logger
import csv import csv
from auto_archiver.core import Feeder from auto_archiver.core import Feeder
@@ -35,5 +35,4 @@ class CSVFeeder(Feeder):
logger.warning(f"Not a valid URL in row: {row}, skipping") logger.warning(f"Not a valid URL in row: {row}, skipping")
continue continue
url = row[url_column] url = row[url_column]
logger.debug(f"Processing {url}")
yield Metadata().set_url(url) yield Metadata().set_url(url)

View File

@@ -8,7 +8,7 @@ from google.oauth2 import service_account
from google.oauth2.credentials import Credentials from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload from googleapiclient.http import MediaFileUpload
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Media from auto_archiver.core import Media
from auto_archiver.core import Storage from auto_archiver.core import Storage
@@ -62,7 +62,7 @@ class GDriveStorage(Storage):
parent_id, folder_id = self.root_folder_id, None parent_id, folder_id = self.root_folder_id, None
path_parts = media.key.split(os.path.sep) path_parts = media.key.split(os.path.sep)
filename = path_parts[-1] filename = path_parts[-1]
logger.info(f"looking for folders for {path_parts[0:-1]} before getting url for {filename=}") logger.info(f"Looking for folders for {path_parts[0:-1]} before getting url for {filename=}")
for folder in path_parts[0:-1]: for folder in path_parts[0:-1]:
folder_id = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=True) folder_id = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=True)
parent_id = folder_id parent_id = folder_id
@@ -70,7 +70,7 @@ class GDriveStorage(Storage):
file_id = self._get_id_from_parent_and_name(folder_id, filename, raise_on_missing=True) file_id = self._get_id_from_parent_and_name(folder_id, filename, raise_on_missing=True)
if not file_id: if not file_id:
# #
logger.info(f"file {filename} not found in folder {folder_id}") logger.info(f"File {filename} not found in folder {folder_id}")
return None return None
return f"https://drive.google.com/file/d/{file_id}/view?usp=sharing" return f"https://drive.google.com/file/d/{file_id}/view?usp=sharing"
@@ -83,7 +83,7 @@ class GDriveStorage(Storage):
parent_id, upload_to = self.root_folder_id, None parent_id, upload_to = self.root_folder_id, None
path_parts = media.key.split(os.path.sep) path_parts = media.key.split(os.path.sep)
filename = path_parts[-1] filename = path_parts[-1]
logger.info(f"checking folders {path_parts[0:-1]} exist (or creating) before uploading {filename=}") logger.info(f"Checking folders {path_parts[0:-1]} exist (or creating) before uploading {filename=}")
for folder in path_parts[0:-1]: for folder in path_parts[0:-1]:
upload_to = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=False) upload_to = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=False)
if upload_to is None: if upload_to is None:
@@ -91,7 +91,7 @@ class GDriveStorage(Storage):
parent_id = upload_to parent_id = upload_to
# upload file to gd # upload file to gd
logger.debug(f"uploading {filename=} to folder id {upload_to}") logger.debug(f"Uploading {filename=} to folder id {upload_to}")
file_metadata = {"name": [filename], "parents": [upload_to]} file_metadata = {"name": [filename], "parents": [upload_to]}
try: try:
media = MediaFileUpload(media.filename, resumable=True) media = MediaFileUpload(media.filename, resumable=True)
@@ -100,11 +100,11 @@ class GDriveStorage(Storage):
.create(supportsAllDrives=True, body=file_metadata, media_body=media, fields="id") .create(supportsAllDrives=True, body=file_metadata, media_body=media, fields="id")
.execute() .execute()
) )
logger.debug(f"uploadf: uploaded file {gd_file['id']} successfully in folder={upload_to}") logger.debug(f"Uploadf: uploaded file {gd_file['id']} successfully in folder={upload_to}")
except FileNotFoundError as e: except FileNotFoundError as e:
logger.error(f"gd uploadf: file not found {media.filename=} - {e}") logger.error(f"GD uploadf: file not found {media.filename=} - {e}")
except Exception as e: except Exception as e:
logger.error(f"gd uploadf: error uploading {media.filename=} to {upload_to} - {e}") logger.error(f"GD uploadf: error uploading {media.filename=} to {upload_to} - {e}")
# must be implemented even if unused # must be implemented even if unused
def uploadf(self, file: IO[bytes], key: str, **kwargs: dict) -> bool: def uploadf(self, file: IO[bytes], key: str, **kwargs: dict) -> bool:
@@ -133,7 +133,7 @@ class GDriveStorage(Storage):
self.api_cache = getattr(self, "api_cache", {}) self.api_cache = getattr(self, "api_cache", {})
cache_key = f"{parent_id}_{name}_{use_mime_type}" cache_key = f"{parent_id}_{name}_{use_mime_type}"
if cache_key in self.api_cache: if cache_key in self.api_cache:
logger.debug(f"cache hit for {cache_key=}") logger.debug(f"Cache hit for {cache_key=}")
return self.api_cache[cache_key] return self.api_cache[cache_key]
# API logic # API logic
@@ -168,7 +168,7 @@ class GDriveStorage(Storage):
else: else:
logger.debug(f"{debug_header} not found, attempt {attempt + 1}/{retries}.") logger.debug(f"{debug_header} not found, attempt {attempt + 1}/{retries}.")
if attempt < retries - 1: if attempt < retries - 1:
logger.debug(f"sleeping for {sleep_seconds} second(s)") logger.debug(f"Sleeping for {sleep_seconds} second(s)")
time.sleep(sleep_seconds) time.sleep(sleep_seconds)
if raise_on_missing: if raise_on_missing:

View File

@@ -58,7 +58,11 @@ If you are having issues with the extractor, you can review the version of `yt-d
}, },
"proxy": { "proxy": {
"default": "", "default": "",
"help": "http/socks (https seems to not work atm) proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port", "help": "http/https/socks proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port",
},
"proxy_on_failure_only": {
"default": True,
"help": "Applies only if a proxy is set. In that case if this setting is True, the extractor will only use the proxy if the initial request fails; if it is False, the extractor will always use the proxy.",
}, },
"end_means_success": { "end_means_success": {
"default": True, "default": True,

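Note on `proxy_on_failure_only`: the help text describes three cases (no proxy set, proxy always, proxy only after a failed first attempt). The extractor code that implements this is not part of this hunk, so the following is only a sketch of the described policy. `run_with_proxy_policy` and the `attempt` callable are hypothetical names.

```python
from typing import Callable, Optional


def run_with_proxy_policy(
    attempt: Callable[[Optional[str]], bool],
    proxy: str = "",
    proxy_on_failure_only: bool = True,
) -> bool:
    """Sketch of the policy described in the help text (not the real extractor code)."""
    # No proxy configured: the setting does not apply.
    if not proxy:
        return attempt(None)
    # Proxy configured and always wanted.
    if not proxy_on_failure_only:
        return attempt(proxy)
    # Proxy configured as a fallback: try directly first, then via the proxy.
    return attempt(None) or attempt(proxy)
```

Here `attempt` stands in for whatever performs the extraction and reports success as a boolean.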
View File

@@ -1,4 +1,4 @@
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core.extractor import Extractor from auto_archiver.core.extractor import Extractor
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media
@@ -39,12 +39,18 @@ class Bluesky(GenericDropin):
media_url = "https://bsky.social/xrpc/com.atproto.sync.getBlob?cid={}&did={}" media_url = "https://bsky.social/xrpc/com.atproto.sync.getBlob?cid={}&did={}"
for image_media in image_medias: for image_media in image_medias:
url = media_url.format(image_media["image"]["ref"]["$link"], post["author"]["did"]) url = media_url.format(image_media["image"]["ref"]["$link"], post["author"]["did"])
image_media = archiver.download_from_url(url) filename = archiver.download_from_url(url)
media.append(Media(image_media)) if filename:
media.append(Media(filename))
else:
logger.warning(f"Failed to download Bluesky image from {url}")
for video_media in video_medias: for video_media in video_medias:
url = media_url.format(video_media["ref"]["$link"], post["author"]["did"]) url = media_url.format(video_media["ref"]["$link"], post["author"]["did"])
video_media = archiver.download_from_url(url) filename = archiver.download_from_url(url)
media.append(Media(video_media)) if filename:
media.append(Media(filename))
else:
logger.warning(f"Failed to download Bluesky video from {url}")
return media return media
def _get_post_data(self, post: dict) -> dict: def _get_post_data(self, post: dict) -> dict:
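Note on the new `if filename:` guards: they imply that `download_from_url` can return a falsy value when a download fails, and that such a value should not be wrapped in `Media`. The same pattern, pulled out as a small sketch with a hypothetical `download` callable standing in for `archiver.download_from_url`:

```python
from typing import Callable, Iterable, List, Optional, Tuple


def download_all(
    urls: Iterable[str],
    download: Callable[[str], Optional[str]],
) -> Tuple[List[str], List[str]]:
    """Return (downloaded filenames, urls that failed); illustration only."""
    filenames: List[str] = []
    failed: List[str] = []
    for url in urls:
        filename = download(url)
        if filename:
            filenames.append(filename)
        else:
            # Assumption: a falsy return value signals a failed download.
            failed.append(url)
    return filenames, failed
```

Collecting the failed URLs separately keeps the caller free to log them, as the diff does with `logger.warning`.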

View File

@@ -34,7 +34,7 @@ def _extract_metadata(self, webpage, video_id):
..., ...,
"attachments", "attachments",
..., ...,
lambda k, v: (k == "media" and str(v["id"]) == video_id and v["__typename"] == "Video"), lambda k, v: k == "media" and str(v["id"]) == video_id and v["__typename"] == "Video",
), ),
expected_type=dict, expected_type=dict,
) )

View File

@@ -4,6 +4,7 @@ import datetime
import os import os
import importlib import importlib
import subprocess import subprocess
import traceback
import zipfile import zipfile
from typing import Generator, Type from typing import Generator, Type
@@ -14,12 +15,13 @@ from yt_dlp.extractor.common import InfoExtractor
from yt_dlp.utils import MaxDownloadsReached from yt_dlp.utils import MaxDownloadsReached
import pysubs2 import pysubs2
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core.extractor import Extractor from auto_archiver.core.extractor import Extractor
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
from auto_archiver.utils import get_datetime_from_str from auto_archiver.utils import get_datetime_from_str
from auto_archiver.utils.misc import ydl_entry_to_filename from auto_archiver.utils.misc import ydl_entry_to_filename
from auto_archiver.utils.deletion_detection import detect_deletion, flag_as_deleted
from .dropin import GenericDropin from .dropin import GenericDropin
@@ -63,8 +65,7 @@ class GenericExtractor(Extractor):
if os.environ.get("AUTO_ARCHIVER_ALLOW_RESTART", "1") != "1": if os.environ.get("AUTO_ARCHIVER_ALLOW_RESTART", "1") != "1":
logger.warning("yt-dlp or plugin was updated — please restart auto-archiver manually") logger.warning("yt-dlp or plugin was updated — please restart auto-archiver manually")
else: else:
logger.warning("yt-dlp or plugin was updated — restarting auto-archiver") logger.warning("yt-dlp or plugin was updated — restarting auto-archiver\n ======= RESTARTING ======= ")
logger.warning(" ======= RESTARTING ======= ")
os.execv(sys.executable, [sys.executable] + sys.argv) os.execv(sys.executable, [sys.executable] + sys.argv)
def update_package(self, package_name: str) -> bool: def update_package(self, package_name: str) -> bool:
@@ -80,7 +81,7 @@ class GenericExtractor(Extractor):
return True return True
logger.info(f"{package_name} already up to date") logger.info(f"{package_name} already up to date")
except Exception as e: except Exception as e:
logger.error(f"Error updating {package_name}: {e}") logger.error(f"Failed to update {package_name}: {e}")
return False return False
def setup_po_tokens(self) -> None: def setup_po_tokens(self) -> None:
@@ -203,10 +204,13 @@ class GenericExtractor(Extractor):
if thumbnail_url: if thumbnail_url:
try: try:
cover_image_path = self.download_from_url(thumbnail_url) cover_image_path = self.download_from_url(thumbnail_url)
media = Media(cover_image_path) if cover_image_path:
metadata.add_media(media, id="cover") media = Media(cover_image_path)
metadata.add_media(media, id="cover")
else:
logger.warning(f"Failed to download cover image from {thumbnail_url}")
except Exception as e: except Exception as e:
logger.error(f"Error downloading cover image {thumbnail_url}: {e}") logger.error(f"Could not download cover image {thumbnail_url}: {e}")
dropin = self.dropin_for_name(info_extractor.ie_key()) dropin = self.dropin_for_name(info_extractor.ie_key())
if dropin: if dropin:
@@ -306,9 +310,9 @@ class GenericExtractor(Extractor):
result.set_url(url) result.set_url(url)
if "description" in video_data and not result.get("content"): if "description" in video_data and not result.get("content"):
result.set_content(video_data.get("description")) result.set_content(video_data.pop("description"))
# extract comments if enabled # extract comments if enabled
if self.comments and video_data.get("comments", []) is not None: if self.comments and video_data.get("comments", None) is not None:
result.set( result.set(
"comments", "comments",
[ [
@@ -354,7 +358,7 @@ class GenericExtractor(Extractor):
if not dropin: if not dropin:
# TODO: add a proper link to 'how to create your own dropin' # TODO: add a proper link to 'how to create your own dropin'
logger.debug(f"""Could not find valid dropin for {info_extractor.ie_key()}. logger.debug(f"""Could not find valid dropin for {info_extractor.ie_key()}.
Why not try creating your own, and make sure it has a valid function called 'create_metadata'. Learn more: https://auto-archiver.readthedocs.io/en/latest/user_guidelines.html#""") Why not try creating your own, and make sure it has a valid function called 'create_metadata'. Learn more: https://auto-archiver.readthedocs.io/en/latest/modules/autogen/extractor/generic_extractor.html#dropins""")
return False return False
post_data = dropin.extract_post(url, ie_instance) post_data = dropin.extract_post(url, ie_instance)
@@ -375,7 +379,7 @@ class GenericExtractor(Extractor):
if "entries" in data: if "entries" in data:
entries = data.get("entries", []) entries = data.get("entries", [])
if not len(entries): if not len(entries):
logger.info("YoutubeDLArchiver could not find any video") logger.info("GenericExtractor could not find any video")
return False return False
else: else:
entries = [data] entries = [data]
@@ -407,9 +411,9 @@ class GenericExtractor(Extractor):
logger.error(f"Error loading subtitle file {val.get('filepath')}: {e}") logger.error(f"Error loading subtitle file {val.get('filepath')}: {e}")
result.add_media(new_media) result.add_media(new_media)
except Exception as e: except Exception as e:
logger.error(f"Error processing entry {entry}: {e}") logger.error(f"Error processing entry {str(entry)[:256]}: {e} {traceback.format_exc()}")
if not len(result.media): if not len(result.media):
logger.info(f"No media found for entry {entry}, skipping.") logger.info(f"No media found for entry {str(entry)[:256]}, skipping.")
return False return False
return self.add_metadata(data, info_extractor, url, result) return self.add_metadata(data, info_extractor, url, result)
@@ -484,6 +488,13 @@ class GenericExtractor(Extractor):
# don't download since it can be a live stream # don't download since it can be a live stream
data = ydl.extract_info(url, ie_key=info_extractor.ie_key(), download=False) data = ydl.extract_info(url, ie_key=info_extractor.ie_key(), download=False)
# Check for deletion indicators in video data
deletion_info = detect_deletion(video_data=data, url=url)
if deletion_info:
result = Metadata()
flag_as_deleted(result, deletion_info)
return result
result = _helper_for_successful_extract_info(data, info_extractor, url, ydl) result = _helper_for_successful_extract_info(data, info_extractor, url, ydl)
except MaxDownloadsReached: except MaxDownloadsReached:
@@ -503,6 +514,16 @@ class GenericExtractor(Extractor):
try: try:
result = self.get_metadata_for_post(info_extractor, url, ydl) result = self.get_metadata_for_post(info_extractor, url, ydl)
except (yt_dlp.utils.DownloadError, yt_dlp.utils.ExtractorError) as post_e: except (yt_dlp.utils.DownloadError, yt_dlp.utils.ExtractorError) as post_e:
# Check if the error indicates deletion
deletion_info = detect_deletion(error_message=str(post_e), url=url)
if deletion_info:
result = Metadata()
flag_as_deleted(result, deletion_info)
return result
if "NSFW tweet requires authentication." in str(post_e):
logger.warning(str(post_e))
return False
logger.error("Error downloading metadata for post: {error}", error=str(post_e)) logger.error("Error downloading metadata for post: {error}", error=str(post_e))
return False return False
except Exception as generic_e: except Exception as generic_e:
@@ -514,7 +535,7 @@ class GenericExtractor(Extractor):
) )
return False return False
if result: if result and not result.is_success():
extractor_name = "yt-dlp" extractor_name = "yt-dlp"
if info_extractor: if info_extractor:
extractor_name += f"_{info_extractor.ie_key()}" extractor_name += f"_{info_extractor.ie_key()}"
@@ -526,7 +547,7 @@ class GenericExtractor(Extractor):
return result return result
def download(self, item: Metadata) -> Metadata: def download(self, item: Metadata, skip_proxy: bool = False) -> Metadata:
url = item.get_url() url = item.get_url()
# TODO: this is a temporary hack until this issue is closed: https://github.com/yt-dlp/yt-dlp/issues/11025 # TODO: this is a temporary hack until this issue is closed: https://github.com/yt-dlp/yt-dlp/issues/11025
@@ -534,6 +555,16 @@ class GenericExtractor(Extractor):
url = url.replace("https://ya.ru", "https://yandex.ru") url = url.replace("https://ya.ru", "https://yandex.ru")
item.set("replaced_url", url) item.set("replaced_url", url)
# proxy_on_failure_only logic
if self.proxy and self.proxy_on_failure_only and not skip_proxy:
# when proxy_on_failure_only is True, we first try to download without a proxy and only continue with execution if that fails
try:
if without_proxy := self.download(item, skip_proxy=True):
logger.info("Downloaded successfully without proxy.")
return without_proxy
except Exception:
logger.debug("Download without proxy failed, trying with proxy...")
ydl_options = [ ydl_options = [
"-o", "-o",
os.path.join(self.tmp_dir, "%(id)s.%(ext)s"), os.path.join(self.tmp_dir, "%(id)s.%(ext)s"),
@@ -547,7 +578,7 @@ class GenericExtractor(Extractor):
] ]
# proxy handling # proxy handling
if self.proxy: if self.proxy and not skip_proxy:
ydl_options.extend(["--proxy", self.proxy]) ydl_options.extend(["--proxy", self.proxy])
# max_downloads handling # max_downloads handling
@@ -560,17 +591,17 @@ class GenericExtractor(Extractor):
# order of importance: username/password -> api_key -> cookie -> cookies_from_browser -> cookies_file # order of importance: username/password -> api_key -> cookie -> cookies_from_browser -> cookies_file
if auth: if auth:
if "username" in auth and "password" in auth: if "username" in auth and "password" in auth:
logger.debug(f"Using provided auth username and password for {url}") logger.debug("Using provided auth username and password")
ydl_options.extend(("--username", auth["username"])) ydl_options.extend(("--username", auth["username"]))
ydl_options.extend(("--password", auth["password"])) ydl_options.extend(("--password", auth["password"]))
elif "cookie" in auth: elif "cookie" in auth:
logger.debug(f"Using provided auth cookie for {url}") logger.debug("Using provided auth cookie")
yt_dlp.utils.std_headers["cookie"] = auth["cookie"] yt_dlp.utils.std_headers["cookie"] = auth["cookie"]
elif "cookies_from_browser" in auth: elif "cookies_from_browser" in auth:
logger.debug(f"Using extracted cookies from browser {auth['cookies_from_browser']} for {url}") logger.debug(f"Using extracted cookies from browser {auth['cookies_from_browser']}")
ydl_options.extend(("--cookies-from-browser", auth["cookies_from_browser"])) ydl_options.extend(("--cookies-from-browser", auth["cookies_from_browser"]))
elif "cookies_file" in auth: elif "cookies_file" in auth:
logger.debug(f"Using cookies from file {auth['cookies_file']} for {url}") logger.debug(f"Using cookies from file {auth['cookies_file']}")
ydl_options.extend(("--cookies", auth["cookies_file"])) ydl_options.extend(("--cookies", auth["cookies_file"]))
# Applying user-defined extractor_args # Applying user-defined extractor_args
@@ -584,7 +615,7 @@ class GenericExtractor(Extractor):
ydl_options.extend(["--extractor-args", f"{key}:{arg_str}"]) ydl_options.extend(["--extractor-args", f"{key}:{arg_str}"])
if self.ytdlp_args: if self.ytdlp_args:
logger.debug("Adding additional ytdlp arguments: {self.ytdlp_args}") logger.debug(f"Adding additional ytdlp arguments: {self.ytdlp_args}")
ydl_options += self.ytdlp_args.split(" ") ydl_options += self.ytdlp_args.split(" ")
*_, validated_options = yt_dlp.parse_options(ydl_options) *_, validated_options = yt_dlp.parse_options(ydl_options)
@@ -592,9 +623,9 @@ class GenericExtractor(Extractor):
validated_options validated_options
) # allsubtitles and subtitleslangs not working as expected, so default lang is always "en" ) # allsubtitles and subtitleslangs not working as expected, so default lang is always "en"
result: Metadata = None
for info_extractor in self.suitable_extractors(url): for info_extractor in self.suitable_extractors(url):
result = self.download_for_extractor(info_extractor, url, ydl) local_result: Metadata = self.download_for_extractor(info_extractor, url, ydl)
if result: if local_result:
return result result = result.merge(local_result) if result else local_result
return result if result else False
return False
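The `proxy_on_failure_only` change above boils down to "try a direct download first, use the proxy only if that fails". A minimal standalone sketch of that control flow follows; `download_with_proxy_fallback` and the `fetch` callable are hypothetical names used for illustration and are not part of the extractor, which instead re-enters `download` with `skip_proxy=True`.

```python
from typing import Callable, Optional


def download_with_proxy_fallback(
    fetch: Callable[[Optional[str]], Optional[str]],
    proxy: Optional[str],
    proxy_on_failure_only: bool,
) -> Optional[str]:
    """Hypothetical helper: fetch(proxy) returns a result or a falsy value."""
    if proxy and proxy_on_failure_only:
        try:
            # First attempt goes out without any proxy.
            if direct := fetch(None):
                return direct
        except Exception:
            # A failed direct attempt is not fatal, fall through to the proxy.
            pass
    # Either no fallback mode is configured or the direct attempt failed.
    return fetch(proxy)
```

With this shape, a proxy is only contacted when the direct attempt raises or returns nothing, which is the behaviour the diff describes in its comment.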

View File

@@ -1,5 +1,6 @@
import re
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from yt_dlp.extractor.tiktok import TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE from yt_dlp.extractor.tiktok import TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE
@@ -14,70 +15,109 @@ class Tiktok(GenericDropin):
It's useful for capturing content that requires a login, like sensitive content. It's useful for capturing content that requires a login, like sensitive content.
""" """
# Regex pattern to match TikTok photo post URLs
PHOTO_URL_REGEX = r"https?://(?:www\.)?tiktok\.com/@[\w\.-]+/photo/\d+"
TIKWM_ENDPOINT = "https://www.tikwm.com/api/?url={url}" TIKWM_ENDPOINT = "https://www.tikwm.com/api/?url={url}"
def suitable(self, url, info_extractor) -> bool: def suitable(self, url, info_extractor) -> bool:
"""This dropin (which uses Tikwm) is suitable for *all* Tiktok type URLs - videos, lives, VMs, and users. """This dropin (which uses Tikwm) is suitable for *all* Tiktok type URLs - videos, lives, VMs, and users.
Return the 'suitable' method from the TikTokIE class.""" Return the 'suitable' method from the TikTokIE class."""
return any(extractor().suitable(url) for extractor in (TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE)) return any(extractor().suitable(url) for extractor in (TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE)) or (
re.match(self.PHOTO_URL_REGEX, url) is not None
)
def extract_post(self, url: str, ie_instance): def extract_post(self, url: str, ie_instance):
logger.debug(f"Using Tikwm API to attempt to download tiktok video from {url=}") logger.debug("Using Tikwm API to attempt to download tiktok video")
endpoint = self.TIKWM_ENDPOINT.format(url=url) endpoint = self.TIKWM_ENDPOINT.format(url=url)
r = requests.get(endpoint) r = requests.get(endpoint)
if r.status_code != 200: if r.status_code != 200:
raise ValueError(f"unexpected status code '{r.status_code}' from tikwm.com for {url=}:") raise ValueError(f"Unexpected status code '{r.status_code}' from tikwm.com")
try: try:
json_response = r.json() json_response = r.json()
except ValueError: except ValueError:
raise ValueError(f"failed to parse JSON response from tikwm.com for {url=}") raise ValueError("Failed to parse JSON response from tikwm.com")
if not json_response.get("msg") == "success" or not (api_data := json_response.get("data", {})): if not json_response.get("msg") == "success" or not (api_data := json_response.get("data", {})):
raise ValueError(f"failed to get a valid response from tikwm.com for {url=}: {repr(json_response)}") raise ValueError(f"Unable to download with tikwm.com: {repr(json_response)}")
# tries to get the non-watermarked version first # tries to get the non-watermarked version first
video_url = api_data.pop("play", api_data.pop("wmplay", None)) play_url = api_data.pop("play", api_data.pop("wmplay", None))
if not video_url: if play_url and "mime_type=audio" in play_url:
raise ValueError(f"no valid video URL found in response from tikwm.com for {url=}") play_url = None
if play_url:
api_data["video_url"] = video_url api_data["video_url"] = play_url
return api_data return api_data
def keys_to_clean(self, video_data: dict, info_extractor): def keys_to_clean(self, video_data: dict, info_extractor):
return ["video_url", "title", "create_time", "author", "cover", "origin_cover", "ai_dynamic_cover", "duration"] return [
"video_url",
"title",
"create_time",
"author",
"cover",
"origin_cover",
"ai_dynamic_cover",
"duration",
"size",
"wm_size",
"music",
"music_info",
"play_count",
"digg_count",
"comment_count",
"share_count",
"download_count",
"collect_count",
"anchors",
"anchors_extras",
"is_ad",
"commerce_info",
"commercial_video_info",
"item_comment_settings",
"mentioned_users",
] # all of these will be added via api_data in a single metadata field vs individual ones in the generic extractor
def create_metadata(self, post: dict, ie_instance, archiver, url): def create_metadata(self, post: dict, ie_instance, archiver, url):
# prepare result, start by downloading video # prepare result, start by downloading video
result = Metadata() result = Metadata()
video_url = post.pop("video_url") is_success = False
# get the cover if possible # get the cover if possible
cover_url = post.pop("origin_cover", post.pop("cover", post.pop("ai_dynamic_cover", None))) cover_url = post.pop("origin_cover", post.pop("cover", post.pop("ai_dynamic_cover", None)))
if cover_url and (cover_downloaded := archiver.download_from_url(cover_url)): if cover_url and (cover_downloaded := archiver.download_from_url(cover_url)):
result.add_media(Media(cover_downloaded)) result.add_media(Media(cover_downloaded))
# get the video or fail for image_url in post.pop("images", []):
video_downloaded = archiver.download_from_url(video_url, f"vid_{post.get('id', '')}") if image_downloaded := archiver.download_from_url(image_url):
if not video_downloaded: result.add_media(Media(image_downloaded))
logger.error(f"failed to download video from {video_url}") is_success = True # this is an images post and we got it/them
return False
video_media = Media(video_downloaded) # get the video if present, could be an image post
if duration := post.get("duration", None): if video_url := post.pop("video_url", None):
video_media.set("duration", duration) video_downloaded = archiver.download_from_url(video_url, f"vid_{post.get('id', '')}")
result.add_media(video_media) if not video_downloaded:
logger.error("Failed to download video")
return False
video_media = Media(video_downloaded)
if duration := post.pop("duration", None):
video_media.set("duration", duration)
result.add_media(video_media)
is_success = True # this is a video post and we got it
# add remaining metadata # add remaining metadata
result.set_title(post.get("title", "")) result.set_title(post.pop("title", ""))
if created_at := post.get("create_time", None): if created_at := post.pop("create_time", None):
result.set_timestamp(datetime.fromtimestamp(created_at, tz=timezone.utc)) result.set_timestamp(datetime.fromtimestamp(created_at, tz=timezone.utc))
if author := post.get("author", None): if author := post.pop("author", None):
result.set("author", author) result.set("author", author)
result.set("api_data", post) result.set("api_data", {k: v for k, v in post.items() if v})
if is_success:
result.success("yt-dlp_TikTok")
else:
raise ValueError("Unable to download any media from TikTok post, possibly deleted or private.")
return result return result
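The new `PHOTO_URL_REGEX` is what lets the dropin accept TikTok photo posts in addition to the URLs the yt-dlp extractors already recognise. The sketch below reuses the pattern from the diff in a standalone check; `is_photo_post` is a hypothetical helper name and the URLs are made-up examples.

```python
import re

# Pattern copied from the diff above.
PHOTO_URL_REGEX = r"https?://(?:www\.)?tiktok\.com/@[\w\.-]+/photo/\d+"


def is_photo_post(url: str) -> bool:
    """Return True when the URL looks like a TikTok photo post."""
    # re.match anchors at the start of the string, so other hosts are rejected.
    return re.match(PHOTO_URL_REGEX, url) is not None
```

Because the pattern requires the `/photo/<digits>` segment, ordinary video URLs still rely on the yt-dlp `suitable` checks rather than on this regex.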

View File

@@ -1,6 +1,7 @@
from typing import Type from typing import Type
from auto_archiver.utils import traverse_obj from auto_archiver.utils import traverse_obj
from auto_archiver.utils.custom_logger import logger
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media
from auto_archiver.core.extractor import Extractor from auto_archiver.core.extractor import Extractor
from yt_dlp.extractor.common import InfoExtractor from yt_dlp.extractor.common import InfoExtractor
@@ -58,6 +59,9 @@ class Truth(GenericDropin):
# add the media # add the media
for media in post.get("media_attachments", []): for media in post.get("media_attachments", []):
filename = archiver.download_from_url(media["url"]) filename = archiver.download_from_url(media["url"])
if not filename:
logger.warning(f"Failed to download media from {media['url']}")
continue
result.add_media(Media(filename), id=media.get("id")) result.add_media(Media(filename), id=media.get("id"))
return result return result

View File

@@ -1,13 +1,16 @@
import re import re
import mimetypes import mimetypes
from loguru import logger from auto_archiver.utils.custom_logger import logger
from slugify import slugify from slugify import slugify
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media
from auto_archiver.utils import url as UrlUtil, get_datetime_from_str from auto_archiver.utils import url as UrlUtil, get_datetime_from_str
from auto_archiver.core.extractor import Extractor from auto_archiver.core.extractor import Extractor
from auto_archiver.utils.deletion_detection import detect_deletion, flag_as_deleted
from auto_archiver.modules.generic_extractor.dropin import GenericDropin, InfoExtractor from auto_archiver.modules.generic_extractor.dropin import GenericDropin, InfoExtractor
import requests
from retrying import retry
class Twitter(GenericDropin): class Twitter(GenericDropin):
@@ -28,7 +31,85 @@ class Twitter(GenericDropin):
def extract_post(self, url: str, ie_instance: InfoExtractor): def extract_post(self, url: str, ie_instance: InfoExtractor):
twid = ie_instance._match_valid_url(url).group("id") twid = ie_instance._match_valid_url(url).group("id")
return ie_instance._extract_status(twid=twid) try:
post_data = ie_instance._extract_status(twid=twid)
if not post_data or not post_data.get("user") or not post_data.get("created_at"):
raise ValueError("Error retrieving post with twitter dropin")
return post_data
except Exception as e:
logger.debug(f"yt-dlp twitter extraction failed: {e}")
# try fxtwitter API as fallback
return self._fetch_fxtwitter(twid)
def _fetch_fxtwitter(self, twid: str) -> dict:
"""Fetch tweet data from fxtwitter API and convert to expected format."""
fxtwitter_url = f"https://api.fxtwitter.com/status/{twid}"
logger.info(f"Falling back to fxtwitter API for tweet extraction: {fxtwitter_url}")
@retry(wait_random_min=500, wait_random_max=2000, stop_max_attempt_number=3)
def fetch_fxtwitter_data(url):
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/112.0"}
resp = requests.get(url, headers=headers, timeout=15)
if resp.status_code != 200:
raise ValueError(f"Failed to retrieve tweet from fxtwitter API: {resp.status_code}")
data = resp.json()
if "tweet" not in data:
raise ValueError(f"No tweet data in fxtwitter response: {data.get('message', 'Unknown error')}")
return data["tweet"]
tweet = fetch_fxtwitter_data(fxtwitter_url)
# Convert fxtwitter format to expected format
author = tweet.get("author", {}).get("name", "")
created_at = tweet.get("created_at", "") # Format: "Sun Feb 08 18:45:00 +0000 2026"
full_text = tweet.get("text", "") or tweet.get("raw_text", "")
# Convert media format
media = []
fx_media = tweet.get("media", {})
# Handle photos
for photo in fx_media.get("photos", []):
media.append({"type": "photo", "media_url_https": photo.get("url", "")})
# Handle videos
for video in fx_media.get("videos", []):
variants = video.get("variants", [])
# Convert to expected variant format
converted_variants = []
for var in variants:
converted_variants.append(
{
"url": var.get("url", ""),
"content_type": var.get("content_type", "video/mp4"),
"bitrate": var.get("bitrate", 0),
}
)
if converted_variants:
media.append({"type": "video", "video_info": {"variants": converted_variants}})
# Handle animated gifs (fxtwitter may include these in videos)
for item in fx_media.get("all", []):
if item.get("type") == "gif":
variants = item.get("variants", [])
converted_variants = []
for var in variants:
converted_variants.append(
{
"url": var.get("url", ""),
"content_type": var.get("content_type", "video/mp4"),
"bitrate": var.get("bitrate", 0),
}
)
if converted_variants:
media.append({"type": "animated_gif", "video_info": {"variants": converted_variants}})
return {
"user": {"name": author},
"created_at": created_at,
"full_text": full_text,
"entities": {"media": media},
}
def keys_to_clean(self, video_data, info_extractor): def keys_to_clean(self, video_data, info_extractor):
return ["user", "created_at", "entities", "favorited", "translator_type"] return ["user", "created_at", "entities", "favorited", "translator_type"]
@@ -37,7 +118,15 @@ class Twitter(GenericDropin):
result = Metadata() result = Metadata()
try: try:
if not tweet.get("user") or not tweet.get("created_at"): if not tweet.get("user") or not tweet.get("created_at"):
raise ValueError("Error retreiving post. Are you sure it exists?") # Check for deletion indicators
deletion_info = detect_deletion(
video_data=tweet, url=url, error_message="Missing user or created_at fields"
)
if deletion_info:
flag_as_deleted(result, deletion_info)
return result
raise ValueError("Error retrieving post. Are you sure it exists?")
timestamp = get_datetime_from_str(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y") timestamp = get_datetime_from_str(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
except (ValueError, KeyError) as ex: except (ValueError, KeyError) as ex:
logger.warning(f"Unable to parse tweet: {str(ex)}\nRetrieved tweet data: {tweet}") logger.warning(f"Unable to parse tweet: {str(ex)}\nRetrieved tweet data: {tweet}")
@@ -68,5 +157,8 @@ class Twitter(GenericDropin):
mimetype = variant["content_type"] mimetype = variant["content_type"]
ext = mimetypes.guess_extension(mimetype) ext = mimetypes.guess_extension(mimetype)
media.filename = archiver.download_from_url(media.get("src"), f"{slugify(url)}_{i}{ext}") media.filename = archiver.download_from_url(media.get("src"), f"{slugify(url)}_{i}{ext}")
if not media.filename:
logger.warning(f"Failed to download media from {media.get('src')}")
continue
result.add_media(media) result.add_media(media)
return result return result
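The fxtwitter fallback above converts the API's media payload into the `entities.media` shape that `create_metadata` expects. The video and gif branches perform the same variant conversion, so the mapping can be sketched with one shared helper. This is only an illustration under the field names visible in the diff (`photos`, `videos`, `all`, `variants`); `convert_variants` and `convert_media` are hypothetical names, not functions in the dropin.

```python
def convert_variants(variants: list[dict]) -> list[dict]:
    """Map fxtwitter variants to the url/content_type/bitrate triples used above."""
    return [
        {
            "url": var.get("url", ""),
            "content_type": var.get("content_type", "video/mp4"),
            "bitrate": var.get("bitrate", 0),
        }
        for var in variants
    ]


def convert_media(fx_media: dict) -> list[dict]:
    """Build the entities.media list from an fxtwitter 'media' object."""
    media = [
        {"type": "photo", "media_url_https": photo.get("url", "")}
        for photo in fx_media.get("photos", [])
    ]
    for video in fx_media.get("videos", []):
        # Entries without any variant are skipped, as in the diff.
        if variants := convert_variants(video.get("variants", [])):
            media.append({"type": "video", "video_info": {"variants": variants}})
    for item in fx_media.get("all", []):
        if item.get("type") == "gif":
            if variants := convert_variants(item.get("variants", [])):
                media.append({"type": "animated_gif", "video_info": {"variants": variants}})
    return media
```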

View File

@@ -10,11 +10,12 @@ The filtered rows are processed into `Metadata` objects.
""" """
import os import os
import traceback
from typing import Tuple, Union, Iterator from typing import Tuple, Union, Iterator
from urllib.parse import quote from urllib.parse import quote
import gspread import gspread
from loguru import logger from auto_archiver.utils.custom_logger import logger
from slugify import slugify from slugify import slugify
from retrying import retry from retrying import retry
@@ -31,28 +32,39 @@ class GsheetsFeederDB(Feeder, Database):
if not self.sheet and not self.sheet_id: if not self.sheet and not self.sheet_id:
raise ValueError("You need to define either a 'sheet' name or a 'sheet_id' in your manifest.") raise ValueError("You need to define either a 'sheet' name or a 'sheet_id' in your manifest.")
def open_sheet(self): @retry(
wait_exponential_multiplier=1,
stop_max_attempt_number=6,
)
def open_sheet(self) -> gspread.Spreadsheet:
if self.sheet: if self.sheet:
return self.gsheets_client.open(self.sheet) return self.gsheets_client.open(self.sheet)
else: else:
return self.gsheets_client.open_by_key(self.sheet_id) return self.gsheets_client.open_by_key(self.sheet_id)
def __iter__(self) -> Iterator[Metadata]: @retry(
sh = self.open_sheet() wait_exponential_multiplier=1,
for ii, worksheet in enumerate(sh.worksheets()): stop_max_attempt_number=6,
if not self.should_process_sheet(worksheet.title): )
logger.debug(f"SKIPPED worksheet '{worksheet.title}' due to allow/block rules") def enumerate_sheets(self, sheet) -> Iterator[gspread.Worksheet]:
continue for worksheet in sheet.worksheets():
logger.info(f"Opening worksheet {ii=}: {worksheet.title=} header={self.header}") yield worksheet
gw = GWorksheet(worksheet, header_row=self.header, columns=self.columns)
if len(missing_cols := self.missing_required_columns(gw)):
logger.debug(
f"SKIPPED worksheet '{worksheet.title}' due to missing required column(s) for {missing_cols}"
)
continue
# process and yield metadata here: def __iter__(self) -> Iterator[Metadata]:
yield from self._process_rows(gw) spreadsheet = self.open_sheet()
for worksheet in self.enumerate_sheets(spreadsheet):
with logger.contextualize(worksheet=f"{spreadsheet.title}:{worksheet.title}"):
if not self.should_process_sheet(worksheet.title):
logger.debug("Skipped worksheet due to allow/block rules")
continue
logger.info(f"Opening worksheet header={self.header}")
gw = GWorksheet(worksheet, header_row=self.header, columns=self.columns)
if len(missing_cols := self.missing_required_columns(gw)):
logger.debug(f"Skipped worksheet due to missing required column(s) for {missing_cols}")
continue
# process and yield metadata here:
yield from self._process_rows(gw)
logger.info(f"Finished worksheet {worksheet.title}") logger.info(f"Finished worksheet {worksheet.title}")
def _process_rows(self, gw: GWorksheet): def _process_rows(self, gw: GWorksheet):
@@ -69,7 +81,9 @@ class GsheetsFeederDB(Feeder, Database):
# All checks done - archival process starts here # All checks done - archival process starts here
m = Metadata().set_url(url) m = Metadata().set_url(url)
self._set_context(m, gw, row) self._set_context(m, gw, row)
yield m
with logger.contextualize(row=row):
yield m
def _set_context(self, m: Metadata, gw: GWorksheet, row: int) -> Metadata: def _set_context(self, m: Metadata, gw: GWorksheet, row: int) -> Metadata:
# TODO: Check folder value not being recognised # TODO: Check folder value not being recognised
@@ -99,16 +113,16 @@ class GsheetsFeederDB(Feeder, Database):
return missing return missing
def started(self, item: Metadata) -> None: def started(self, item: Metadata) -> None:
logger.info(f"STARTED {item}") logger.info("STARTED")
gw, row = self._retrieve_gsheet(item) gw, row = self._retrieve_gsheet(item)
gw.set_cell(row, "status", "Archive in progress") gw.set_cell(row, "status", "Archive in progress")
def failed(self, item: Metadata, reason: str) -> None: def failed(self, item: Metadata, reason: str) -> None:
logger.error(f"FAILED {item}") logger.error("FAILED")
self._safe_status_update(item, f"Archive failed {reason}") self._safe_status_update(item, f"Archive failed {reason}")
def aborted(self, item: Metadata) -> None: def aborted(self, item: Metadata) -> None:
logger.warning(f"ABORTED {item}") logger.warning("ABORTED")
self._safe_status_update(item, "") self._safe_status_update(item, "")
def fetch(self, item: Metadata) -> Union[Metadata, bool]: def fetch(self, item: Metadata) -> Union[Metadata, bool]:
@@ -117,13 +131,13 @@ class GsheetsFeederDB(Feeder, Database):
def done(self, item: Metadata, cached: bool = False) -> None: def done(self, item: Metadata, cached: bool = False) -> None:
"""archival result ready - should be saved to DB""" """archival result ready - should be saved to DB"""
logger.success(f"DONE {item.get_url()}")
gw, row = self._retrieve_gsheet(item) gw, row = self._retrieve_gsheet(item)
# self._safe_status_update(item, 'done')
cell_updates = [] cell_updates = []
row_values = gw.get_row(row) row_values = gw.get_row(row)
logger.info("DONE")
def batch_if_valid(col, val, final_value=None): def batch_if_valid(col, val, final_value=None):
final_value = final_value or val final_value = final_value or val
try: try:
@@ -175,9 +189,7 @@ class GsheetsFeederDB(Feeder, Database):
) )
@retry( @retry(
wait_incrementing_start=1000, wait_exponential_multiplier=1,
wait_incrementing_increment=3000,
wait_incrementing_max=20_000,
stop_max_attempt_number=5, stop_max_attempt_number=5,
) )
def batch_set_cell_with_retry(gw, cell_updates: list): def batch_set_cell_with_retry(gw, cell_updates: list):
@@ -190,15 +202,13 @@ class GsheetsFeederDB(Feeder, Database):
gw, row = self._retrieve_gsheet(item) gw, row = self._retrieve_gsheet(item)
gw.set_cell(row, "status", new_status) gw.set_cell(row, "status", new_status)
except Exception as e: except Exception as e:
logger.debug(f"Unable to update sheet: {e}") logger.debug(f"Unable to update sheet: {e}: {traceback.format_exc()}")
def _retrieve_gsheet(self, item: Metadata) -> Tuple[GWorksheet, int]: def _retrieve_gsheet(self, item: Metadata) -> Tuple[GWorksheet, int]:
if gsheet := item.get_context("gsheet"): if gsheet := item.get_context("gsheet"):
gw: GWorksheet = gsheet.get("worksheet") gw: GWorksheet = gsheet.get("worksheet")
row: int = gsheet.get("row") row: int = gsheet.get("row")
elif self.sheet_id: elif self.sheet_id:
logger.error( logger.error("Unable to retrieve Gsheet, GsheetDB must be used alongside GsheetFeeder.")
f"Unable to retrieve Gsheet for {item.get_url()}, GsheetDB must be used alongside GsheetFeeder."
)
return gw, row return gw, row

View File

@@ -1,4 +1,5 @@
from gspread import utils from gspread import utils
from retrying import retry
class GWorksheet: class GWorksheet:
@@ -26,6 +27,10 @@ class GWorksheet:
"replaywebpage": "replaywebpage", "replaywebpage": "replaywebpage",
} }
@retry(
wait_exponential_multiplier=1,
stop_max_attempt_number=6,
)
def __init__(self, worksheet, columns=COLUMN_NAMES, header_row=1): def __init__(self, worksheet, columns=COLUMN_NAMES, header_row=1):
self.wks = worksheet self.wks = worksheet
self.columns = columns self.columns = columns

View File

@@ -9,7 +9,7 @@ making it suitable for handling large files efficiently.
""" """
import hashlib import hashlib
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -22,10 +22,12 @@ class HashEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug(f"Calculating media hashes with algo={self.algorithm}")
logger.debug(f"calculating media hashes for {url=} (using {self.algorithm})")
for i, m in enumerate(to_enrich.media): for i, m in enumerate(to_enrich.media):
if not m.filename:
logger.warning(f"Skipping hash for media without filename: {m}")
continue
if len(hd := self.calculate_hash(m.filename)): if len(hd := self.calculate_hash(m.filename)):
to_enrich.media[i].set("hash", f"{self.algorithm}:{hd}") to_enrich.media[i].set("hash", f"{self.algorithm}:{hd}")
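The new guard above skips media that has no filename before hashing. The body of `calculate_hash` is not part of this diff, so the following is only a sketch of how a chunked file hash with the same guard could look, assuming the standard `hashlib` module; the function name, `chunksize` default, and empty-string return for a missing filename are assumptions.

```python
import hashlib
from typing import Optional


def hash_file(filename: Optional[str], algorithm: str = "sha256", chunksize: int = 1 << 16) -> str:
    """Hypothetical helper: hash a file in fixed-size chunks."""
    if not filename:
        # Mirrors the guard in the diff: nothing to hash without a file.
        return ""
    hasher = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        # Reading in chunks keeps memory use flat for large files.
        while chunk := f.read(chunksize):
            hasher.update(chunk)
    return hasher.hexdigest()
```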

View File

@@ -4,7 +4,7 @@ import os
import pathlib import pathlib
from jinja2 import Environment, FileSystemLoader from jinja2 import Environment, FileSystemLoader
from urllib.parse import quote from urllib.parse import quote
from loguru import logger from auto_archiver.utils.custom_logger import logger
import json import json
import base64 import base64
@@ -35,7 +35,7 @@ class HtmlFormatter(Formatter):
def format(self, item: Metadata) -> Media: def format(self, item: Metadata) -> Media:
url = item.get_url() url = item.get_url()
if item.is_empty(): if item.is_empty():
logger.debug(f"[SKIP] FORMAT there is no media or metadata to format: {url=}") logger.debug("Nothing to format, skipping")
return return
content = self.template.render( content = self.template.render(

View File

@@ -22,7 +22,7 @@
"full_profile_max_posts": { "full_profile_max_posts": {
"default": 0, "default": 0,
"type": "int", "type": "int",
"help": "Use to limit the number of posts to download when full_profile is true. 0 means no limit. limit is applied softly since posts are fetched in batch, once to: posts, tagged posts, and highlights", "help": "Use to limit the number of posts to download when full_profile is true or when a URL for multiple posts is passed (like /stories /highlights ...). 0 means no limit. when full_profile is true the order of downloaded content is stories -> posts -> tagged posts -> highlights, so a value of 10 could download 2 stories, 7 posts, 1 tagged posts, and 0 highlights.",
}, },
"minimize_json_output": { "minimize_json_output": {
"default": True, "default": True,
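The updated help text for `full_profile_max_posts` describes a single budget consumed in the order stories -> posts -> tagged posts -> highlights, with 0 meaning no limit. The sketch below illustrates that ordering with plain counts. It is a simplified model, not the extractor's code: `normalize_limit` and `allocate_budget` are hypothetical names, and the real extractor fetches in batches.

```python
import math


def normalize_limit(raw) -> float:
    """Treat 0 (or an unset value) as 'no limit', as setup() does in the diff."""
    limit = int(raw or 0)
    return math.inf if limit == 0 else limit


def allocate_budget(limit, available: dict[str, int]) -> dict[str, int]:
    """Spend one shared budget across content types in a fixed order."""
    order = ("stories", "posts", "tagged", "highlights")
    remaining = normalize_limit(limit)
    taken = {}
    for kind in order:
        # Take as many items as exist, capped by what is left of the budget.
        count = min(available.get(kind, 0), remaining)
        taken[kind] = int(count)
        remaining -= count
    return taken
```

For a profile with 2 stories, 7 posts, 5 tagged posts, and 3 highlights, a limit of 10 yields 2, 7, 1, and 0 respectively, which matches the example given in the help text.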

View File

@@ -8,11 +8,13 @@ data, reducing JSON output size, and handling large profiles.
""" """
import math
import re import re
from datetime import datetime from datetime import datetime
import traceback
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from retrying import retry from retrying import retry
from tqdm import tqdm from tqdm import tqdm
@@ -35,17 +37,19 @@ class InstagramAPIExtractor(Extractor):
def setup(self) -> None: def setup(self) -> None:
if self.api_endpoint[-1] == "/": if self.api_endpoint[-1] == "/":
self.api_endpoint = self.api_endpoint[:-1] self.api_endpoint = self.api_endpoint[:-1]
self.full_profile_max_posts = int(self.full_profile_max_posts or 0)
if self.full_profile_max_posts == 0:
self.full_profile_max_posts = math.inf
def download(self, item: Metadata) -> Metadata: def download(self, item: Metadata) -> Metadata:
url = item.get_url() url = item.get_url()
url = url.replace("instagr.com", "instagram.com").replace("instagr.am", "instagram.com") url = url.replace("instagr.com", "instagram.com").replace("instagr.am", "instagram.com")
insta_matches = self.valid_url.findall(url) insta_matches = self.valid_url.findall(url)
logger.info(f"{insta_matches=}")
if not len(insta_matches) or len(insta_matches[0]) != 3: if not len(insta_matches) or len(insta_matches[0]) != 3:
return return
if len(insta_matches) > 1: if len(insta_matches) > 1:
logger.warning(f"Multiple instagram matches found in {url=}, using the first one") logger.debug("Multiple instagram matches found, using the first one")
return return
g1, g2, g3 = insta_matches[0][0], insta_matches[0][1], insta_matches[0][2] g1, g2, g3 = insta_matches[0][0], insta_matches[0][1], insta_matches[0][2]
if g1 == "": if g1 == "":
@@ -61,13 +65,13 @@ class InstagramAPIExtractor(Extractor):
return self.download_post(item, id=g3, context="story") return self.download_post(item, id=g3, context="story")
return self.download_stories(item, g2) return self.download_stories(item, g2)
else: else:
logger.warning(f"Unknown instagram regex group match {g1=} found in {url=}") logger.warning(f"Unknown instagram regex group match {g1=}")
return return
@retry(wait_random_min=1000, wait_random_max=3000, stop_max_attempt_number=5) @retry(wait_random_min=1000, wait_random_max=3000, stop_max_attempt_number=5)
def call_api(self, path: str, params: dict) -> dict: def call_api(self, path: str, params: dict) -> dict:
headers = {"accept": "application/json", "x-access-key": self.access_token} headers = {"accept": "application/json", "x-access-key": self.access_token}
logger.debug(f"calling {self.api_endpoint}/{path} with {params=}") logger.debug(f"Calling {self.api_endpoint}/{path} with {params=}")
return requests.get(f"{self.api_endpoint}/{path}", headers=headers, params=params).json() return requests.get(f"{self.api_endpoint}/{path}", headers=headers, params=params).json()
def cleanup_dict(self, d: dict | list) -> dict: def cleanup_dict(self, d: dict | list) -> dict:
@@ -95,67 +99,89 @@ class InstagramAPIExtractor(Extractor):
result.set_title(user.get("full_name", username)).set("data", user) result.set_title(user.get("full_name", username)).set("data", user)
if pic_url := user.get("profile_pic_url_hd", user.get("profile_pic_url")): if pic_url := user.get("profile_pic_url_hd", user.get("profile_pic_url")):
filename = self.download_from_url(pic_url) filename = self.download_from_url(pic_url)
result.add_media(Media(filename=filename), id="profile_picture") if filename:
result.add_media(Media(filename=filename), id="profile_picture")
else:
logger.warning(f"Failed to download profile picture from {pic_url}")
count_posts = 0
if self.full_profile: if self.full_profile:
user_id = user.get("pk") user_id = user.get("pk")
# download all stories # download all stories
try: try:
stories = self._download_stories_reusable(result, username) stories = self._download_stories_reusable(
result, username, max_to_download=self.full_profile_max_posts - count_posts
)
count_posts += len(stories)
result.set("#stories", len(stories)) result.set("#stories", len(stories))
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading stories for {username}") result.append("errors", f"Error downloading stories for {username}")
logger.error(f"Error downloading stories for {username}: {e}") logger.error(f"Error downloading stories for {username}: {e} {traceback.format_exc()}")
# download all posts # download all posts
try: try:
self.download_all_posts(result, user_id) if count_posts < self.full_profile_max_posts:
count_posts += self.download_all_posts(
result, user_id, max_to_download=self.full_profile_max_posts - count_posts
)
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading posts for {username}") result.append("errors", f"Error downloading posts for {username}")
logger.error(f"Error downloading posts for {username}: {e}") logger.error(f"Error downloading posts for {username}: {e} {traceback.format_exc()}")
# download all tagged # download all tagged
try: try:
self.download_all_tagged(result, user_id) if count_posts < self.full_profile_max_posts:
count_posts += self.download_all_tagged(
result, user_id, max_to_download=self.full_profile_max_posts - count_posts
)
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading tagged posts for {username}") result.append("errors", f"Error downloading tagged posts for {username}")
logger.error(f"Error downloading tagged posts for {username}: {e}") logger.error(f"Error downloading tagged posts for {username}: {e} {traceback.format_exc()}")
# download all highlights # download all highlights
try: try:
self.download_all_highlights(result, username, user_id) if count_posts < self.full_profile_max_posts:
count_posts += self.download_all_highlights(
result, username, user_id, max_to_download=self.full_profile_max_posts - count_posts
)
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading highlights for {username}") result.append("errors", f"Error downloading highlights for {username}")
logger.error(f"Error downloading highlights for {username}: {e}") logger.error(f"Error downloading highlights for {username}: {e} {traceback.format_exc()}")
result.set_url(url) # reset as scrape_item modifies it result.set_url(url) # reset as scrape_item modifies it
return result.success("insta profile") return result.success("insta profile")
def download_all_highlights(self, result, username, user_id): def download_all_highlights(self, result, username, user_id, max_to_download: int) -> int:
count_highlights = 0 count_highlights = 0
highlights = self.call_api("v1/user/highlights", {"user_id": user_id}) highlights = self.call_api("v1/user/highlights", {"user_id": user_id})
highlights = highlights[: min(max_to_download, len(highlights))] # newest to oldest
for h in highlights: for h in highlights:
try: try:
h_info = self._download_highlights_reusable(result, h.get("pk")) h_info = self._download_highlights_reusable(result, h.get("pk"), max_to_download=max_to_download)
count_highlights += len(h_info.get("items", [])) count_highlights += len(h_info.get("items", []))
except Exception as e: except Exception as e:
result.append( result.append(
"errors", "errors",
f"Error downloading highlight id{h.get('pk')} for {username}", f"Error downloading highlight id{h.get('pk')} for {username}",
) )
logger.error(f"Error downloading highlight id{h.get('pk')} for {username}: {e}") logger.error(
if self.full_profile_max_posts and count_highlights >= self.full_profile_max_posts: f"Error downloading highlight id{h.get('pk')} for {username}: {e} {traceback.format_exc()}"
logger.info(f"HIGHLIGHTS reached full_profile_max_posts={self.full_profile_max_posts}") )
if count_highlights >= max_to_download:
logger.debug(f"HIGHLIGHTS reached {max_to_download=}")
break break
result.set("#highlights", count_highlights) result.set("#highlights", count_highlights)
return count_highlights
def download_post(self, result: Metadata, code: str = None, id: str = None, context: str = None) -> Metadata: def download_post(self, result: Metadata, code: str = None, id: str = None, context: str = "") -> Metadata:
if id: if id:
post = self.call_api("v1/media/by/id", {"id": id}) post = self.call_api("v1/media/by/id", {"id": id})
else: else:
post = self.call_api("v1/media/by/code", {"code": code}) post = self.call_api("v1/media/by/code", {"code": code})
assert post, f"Post {id or code} not found" assert post, f"Post {id or code} not found"
result.set(f"{context}_data", post)
if caption_text := post.get("caption_text"): if caption_text := post.get("caption_text"):
result.set_title(caption_text) result.set_title(caption_text)
@@ -166,54 +192,58 @@ class InstagramAPIExtractor(Extractor):
return result.success(f"insta {context or 'post'}") return result.success(f"insta {context or 'post'}")
def download_highlights(self, result: Metadata, id: str) -> Metadata: def download_highlights(self, result: Metadata, id: str) -> Metadata:
h_info = self._download_highlights_reusable(result, id) h_info = self._download_highlights_reusable(result, id, self.full_profile_max_posts)
items = len(h_info.get("items", [])) items = len(h_info.get("items", []))
del h_info["items"] del h_info["items"]
result.set_title(h_info.get("title")).set("data", h_info).set("#reels", items) result.set_title(h_info.get("title")).set("data", h_info).set("#reels", items)
return result.success("insta highlights") return result.success("insta highlights")
def _download_highlights_reusable(self, result: Metadata, id: str) -> dict: def _download_highlights_reusable(self, result: Metadata, id: str, max_to_download: int) -> dict:
full_h = self.call_api("v2/highlight/by/id", {"id": id}) full_h = self.call_api("v2/highlight/by/id", {"id": id})
h_info = full_h.get("response", {}).get("reels", {}).get(f"highlight:{id}") h_info = full_h.get("response", {}).get("reels", {}).get(f"highlight:{id}")
assert h_info, f"Highlight {id} not found: {full_h=}" assert h_info, f"Highlight {id} not found: {full_h=}"
if cover_media := h_info.get("cover_media", {}).get("cropped_image_version", {}).get("url"): if cover_media := h_info.get("cover_media", {}).get("cropped_image_version", {}).get("url"):
filename = self.download_from_url(cover_media) filename = self.download_from_url(cover_media)
result.add_media(Media(filename=filename), id=f"cover_media highlight {id}") if filename:
result.add_media(Media(filename=filename), id=f"cover_media highlight {id}")
else:
logger.warning(f"Failed to download cover media from {cover_media}")
items = h_info.get("items", [])[::-1] # newest to oldest items = h_info.get("items", [])[::-1] # newest to oldest
items = items[: min(max_to_download, len(items))]
for h in tqdm(items, desc="downloading highlights", unit="highlight"): for h in tqdm(items, desc="downloading highlights", unit="highlight"):
try: try:
self.scrape_item(result, h, "highlight") self.scrape_item(result, h, "highlight")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading highlight {h.get('id')}") result.append("errors", f"Error downloading highlight {h.get('id')}")
logger.error(f"Error downloading highlight, skipping {h.get('id')}: {e}") logger.error(f"Error downloading highlight, skipping {h.get('id')}: {e} {traceback.format_exc()}")
return h_info return h_info
def download_stories(self, result: Metadata, username: str) -> Metadata: def download_stories(self, result: Metadata, username: str) -> Metadata:
now = datetime.now().strftime("%Y-%m-%d_%H-%M") now = datetime.now().strftime("%Y-%m-%d_%H-%M")
stories = self._download_stories_reusable(result, username) stories = self._download_stories_reusable(result, username, max_to_download=self.full_profile_max_posts)
if stories == []: if stories == []:
return result.success("insta no story") return result.success("insta no story")
result.set_title(f"stories {username} at {now}").set("#stories", len(stories)) result.set_title(f"stories {username} at {now}").set("#stories", len(stories))
return result.success(f"insta stories {now}") return result.success(f"insta stories {now}")
def _download_stories_reusable(self, result: Metadata, username: str) -> list[dict]: def _download_stories_reusable(self, result: Metadata, username: str, max_to_download: int) -> list[dict]:
stories = self.call_api("v1/user/stories/by/username", {"username": username}) stories = self.call_api("v1/user/stories/by/username", {"username": username})
if not stories or not len(stories): if not stories or not len(stories):
return [] return []
stories = stories[::-1] # newest to oldest stories = stories[::-1][: min(max_to_download, len(stories))] # newest to oldest
for s in tqdm(stories, desc="downloading stories", unit="story"): for s in tqdm(stories, desc="downloading stories", unit="story"):
try: try:
self.scrape_item(result, s, "story") self.scrape_item(result, s, "story")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading story {s.get('id')}") result.append("errors", f"Error downloading story {s.get('id')}")
logger.error(f"Error downloading story, skipping {s.get('id')}: {e}") logger.error(f"Error downloading story, skipping {s.get('id')}: {e} {traceback.format_exc()}")
return stories return stories
def download_all_posts(self, result: Metadata, user_id: str): def download_all_posts(self, result: Metadata, user_id: str, max_to_download: int) -> int:
end_cursor = None end_cursor = None
pbar = tqdm(desc="downloading posts") pbar = tqdm(desc="downloading posts")
@@ -223,22 +253,23 @@ class InstagramAPIExtractor(Extractor):
if not posts or not isinstance(posts, list) or len(posts) != 2: if not posts or not isinstance(posts, list) or len(posts) != 2:
break break
posts, end_cursor = posts[0], posts[1] posts, end_cursor = posts[0], posts[1]
logger.info(f"parsing {len(posts)} posts, next {end_cursor=}") posts = posts[: min(max_to_download, len(posts))]
logger.info(f"Parsing {len(posts)} posts, next {end_cursor=} {post_count=} {max_to_download=}")
for p in posts: for p in posts:
try: try:
self.scrape_item(result, p, "post") self.scrape_item(result, p, "post")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading post {p.get('id')}") result.append("errors", f"Error downloading post {p.get('id')}")
logger.error(f"Error downloading post, skipping {p.get('id')}: {e}") logger.error(f"Error downloading post, skipping {p.get('id')}: {e} {traceback.format_exc()}")
pbar.update(1) pbar.update(1)
post_count += 1 post_count += 1
if self.full_profile_max_posts and post_count >= self.full_profile_max_posts: if post_count >= max_to_download:
logger.info(f"POSTS reached full_profile_max_posts={self.full_profile_max_posts}") logger.info(f"POSTS reached {max_to_download=}")
break break
result.set("#posts", post_count) result.set("#posts", post_count)
return post_count
def download_all_tagged(self, result: Metadata, user_id: str): def download_all_tagged(self, result: Metadata, user_id: str, max_to_download: int) -> int:
next_page_id = "" next_page_id = ""
pbar = tqdm(desc="downloading tagged posts") pbar = tqdm(desc="downloading tagged posts")
@@ -250,22 +281,23 @@ class InstagramAPIExtractor(Extractor):
break break
next_page_id = resp.get("next_page_id") next_page_id = resp.get("next_page_id")
logger.info(f"parsing {len(posts)} tagged posts, next {next_page_id=}") logger.info(f"Parsing {len(posts)} tagged posts, next {next_page_id=}")
posts = posts[: min(max_to_download, len(posts))]
for p in posts: for p in posts:
try: try:
self.scrape_item(result, p, "tagged") self.scrape_item(result, p, "tagged")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading tagged post {p.get('id')}") result.append("errors", f"Error downloading tagged post {p.get('id')}")
logger.error(f"Error downloading tagged post, skipping {p.get('id')}: {e}") logger.error(f"Error downloading tagged post, skipping {p.get('id')}: {e} {traceback.format_exc()}")
pbar.update(1) pbar.update(1)
tagged_count += 1 tagged_count += 1
if self.full_profile_max_posts and tagged_count >= self.full_profile_max_posts: if tagged_count >= max_to_download:
logger.info(f"TAGS reached full_profile_max_posts={self.full_profile_max_posts}") logger.info(f"TAGS reached {max_to_download=}")
break break
result.set("#tagged", tagged_count) result.set("#tagged", tagged_count)
return tagged_count
### reusable parsing utils below # reusable parsing utils below
def scrape_item(self, result: Metadata, item: dict, context: str = None) -> dict: def scrape_item(self, result: Metadata, item: dict, context: str = None) -> dict:
""" """
@@ -319,7 +351,10 @@ class InstagramAPIExtractor(Extractor):
image_media = None image_media = None
if image_url := item.get("thumbnail_url"): if image_url := item.get("thumbnail_url"):
filename = self.download_from_url(image_url, verbose=False) filename = self.download_from_url(image_url, verbose=False)
image_media = Media(filename=filename) if filename:
image_media = Media(filename=filename)
else:
logger.warning(f"Failed to download thumbnail from {image_url}")
# retrieve video info # retrieve video info
best_id = item.get("id", item.get("pk")) best_id = item.get("id", item.get("pk"))
@@ -331,16 +366,19 @@ class InstagramAPIExtractor(Extractor):
if video_url := item.get("video_url"): if video_url := item.get("video_url"):
filename = self.download_from_url(video_url, verbose=False) filename = self.download_from_url(video_url, verbose=False)
video_media = Media(filename=filename) if filename:
if taken_at: video_media = Media(filename=filename)
video_media.set("date", taken_at) if taken_at:
if code: video_media.set("date", taken_at)
video_media.set("url", f"https://www.instagram.com/p/{code}") if code:
if caption_text: video_media.set("url", f"https://www.instagram.com/p/{code}")
video_media.set("text", caption_text) if caption_text:
video_media.set("preview", [image_media]) video_media.set("text", caption_text)
video_media.set("data", [item]) video_media.set("preview", [image_media])
return item, video_media, f"{context or 'video'} {best_id}" video_media.set("data", [item])
return item, video_media, f"{context or 'video'} {best_id}"
else:
logger.warning(f"Failed to download video from {video_url}")
elif image_media: elif image_media:
if taken_at: if taken_at:
image_media.set("date", taken_at) image_media.set("date", taken_at)
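
The `full_profile_max_posts` changes above treat the limit as a single budget shared across stories, posts, tagged posts, and highlights, in that order, with `0` mapped to `math.inf`. A minimal sketch of that accounting follows, assuming plain lists stand in for the API responses. `normalize_limit`, `take`, and `plan_downloads` are hypothetical names used only for illustration and are not part of the extractor.

```python
import math


def normalize_limit(raw) -> float:
    """Mirror the setup() logic in the diff: 0 (or empty) means no limit."""
    limit = int(raw or 0)
    return math.inf if limit == 0 else limit


def take(items: list, budget: float) -> list:
    """Slice at most `budget` items; min() keeps the slice bound an int."""
    return items[: min(budget, len(items))]


def plan_downloads(available: dict, raw_limit) -> dict:
    """Hypothetical helper: how many items each stage would fetch."""
    budget = normalize_limit(raw_limit)
    downloaded = 0
    counts = {}
    # Same stage order as the extractor: stories, posts, tagged, highlights.
    for stage in ("stories", "posts", "tagged", "highlights"):
        remaining = budget - downloaded
        if remaining <= 0:
            counts[stage] = 0
            continue
        chosen = take(available.get(stage, []), remaining)
        counts[stage] = len(chosen)
        downloaded += len(chosen)
    return counts


# Assumed sample data: 2 stories, 7 posts, 5 tagged posts, 3 highlights.
available = {
    "stories": list(range(2)),
    "posts": list(range(7)),
    "tagged": list(range(5)),
    "highlights": list(range(3)),
}
print(plan_downloads(available, 10))
print(plan_downloads(available, 0))
```

With a limit of 10 this sample reproduces the split described in the help text (2 stories, 7 posts, 1 tagged post, 0 highlights). With a limit of 0 every stage is fetched in full.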

@@ -7,8 +7,9 @@ highlights, and tagged posts. Authentication is required via username/password o
import re import re
import os import os
import shutil import shutil
import traceback
import instaloader import instaloader
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -29,8 +30,9 @@ class InstagramExtractor(Extractor):
# TODO: links to stories # TODO: links to stories
def setup(self) -> None: def setup(self) -> None:
logger.warning("Instagram Extractor is not actively maintained, and may not work as expected.") logger.warning(
logger.warning("Please consider using the Instagram Tbot Extractor or Instagram API Extractor instead.") "Instagram Extractor is not actively maintained, and may not work as expected.\nPlease consider using the Instagram Tbot Extractor or Instagram API Extractor instead."
)
self.insta = instaloader.Instaloader( self.insta = instaloader.Instaloader(
download_geotags=True, download_geotags=True,
@@ -43,8 +45,7 @@ class InstagramExtractor(Extractor):
self.insta.load_session_from_file(self.username, self.session_file) self.insta.load_session_from_file(self.username, self.session_file)
except Exception: except Exception:
try: try:
logger.debug("Session file failed", exc_info=True) logger.info("No valid session file found - Attempting login with username and password.")
logger.info("No valid session file found - Attempting login with use and password.")
self.insta.login(self.username, self.password) self.insta.login(self.username, self.password)
self.insta.save_session_to_file(self.session_file) self.insta.save_session_to_file(self.session_file)
except Exception as e: except Exception as e:
@@ -79,7 +80,7 @@ class InstagramExtractor(Extractor):
return result return result
def download_post(self, url: str, post_id: str) -> Metadata: def download_post(self, url: str, post_id: str) -> Metadata:
logger.debug(f"Instagram {post_id=} detected in {url=}") logger.debug(f"Instagram {post_id=} detected")
post = instaloader.Post.from_shortcode(self.insta.context, post_id) post = instaloader.Post.from_shortcode(self.insta.context, post_id)
if self.insta.download_post(post, target=post.owner_username): if self.insta.download_post(post, target=post.owner_username):
@@ -87,7 +88,7 @@ class InstagramExtractor(Extractor):
def download_profile(self, url: str, username: str) -> Metadata: def download_profile(self, url: str, username: str) -> Metadata:
# gets posts, posts where username is tagged, igtv posts, stories, and highlights # gets posts, posts where username is tagged, igtv posts, stories, and highlights
logger.debug(f"Instagram {username=} detected in {url=}") logger.debug(f"Instagram {username=} detected")
profile = instaloader.Profile.from_username(self.insta.context, username) profile = instaloader.Profile.from_username(self.insta.context, username)
try: try:
@@ -95,27 +96,27 @@ class InstagramExtractor(Extractor):
try: try:
self.insta.download_post(post, target=f"profile_post_{post.owner_username}") self.insta.download_post(post, target=f"profile_post_{post.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download post: {post.shortcode}: {e}") logger.error(f"Failed to download post: {post.shortcode}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed profile.get_posts: {e}") logger.error(f"Failed profile.get_posts: {e}: {traceback.format_exc()}")
try: try:
for post in profile.get_tagged_posts(): for post in profile.get_tagged_posts():
try: try:
self.insta.download_post(post, target=f"tagged_post_{post.owner_username}") self.insta.download_post(post, target=f"tagged_post_{post.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download tagged post: {post.shortcode}: {e}") logger.error(f"Failed to download tagged post: {post.shortcode}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed profile.get_tagged_posts: {e}") logger.error(f"Failed profile.get_tagged_posts: {e} {traceback.format_exc()}")
try: try:
for post in profile.get_igtv_posts(): for post in profile.get_igtv_posts():
try: try:
self.insta.download_post(post, target=f"igtv_post_{post.owner_username}") self.insta.download_post(post, target=f"igtv_post_{post.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download igtv post: {post.shortcode}: {e}") logger.error(f"Failed to download igtv post: {post.shortcode}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed profile.get_igtv_posts: {e}") logger.error(f"Failed profile.get_igtv_posts: {e} {traceback.format_exc()}")
try: try:
for story in self.insta.get_stories([profile.userid]): for story in self.insta.get_stories([profile.userid]):
@@ -123,9 +124,9 @@ class InstagramExtractor(Extractor):
try: try:
self.insta.download_storyitem(item, target=f"story_item_{story.owner_username}") self.insta.download_storyitem(item, target=f"story_item_{story.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download story item: {item}: {e}") logger.error(f"Failed to download story item: {item}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed get_stories: {e}") logger.error(f"Failed get_stories: {e} {traceback.format_exc()}")
try: try:
for highlight in self.insta.get_highlights(profile.userid): for highlight in self.insta.get_highlights(profile.userid):
@@ -133,9 +134,9 @@ class InstagramExtractor(Extractor):
try: try:
self.insta.download_storyitem(item, target=f"highlight_item_{highlight.owner_username}") self.insta.download_storyitem(item, target=f"highlight_item_{highlight.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download highlight item: {item}: {e}") logger.error(f"Failed to download highlight item: {item}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed get_highlights: {e}") logger.error(f"Failed get_highlights: {e} {traceback.format_exc()}")
return self.process_downloads(url, f"@{username}", profile._asdict(), None) return self.process_downloads(url, f"@{username}", profile._asdict(), None)
@@ -158,4 +159,4 @@ class InstagramExtractor(Extractor):
return result.success("instagram") return result.success("instagram")
except Exception as e: except Exception as e:
logger.error(f"Could not fetch instagram post {url} due to: {e}") logger.error(f"Could not fetch instagram post due to: {e} {traceback.format_exc()}")

@@ -12,7 +12,7 @@ import shutil
import time import time
from sqlite3 import OperationalError from sqlite3 import OperationalError
from loguru import logger from auto_archiver.utils.custom_logger import logger
from telethon.sync import TelegramClient from telethon.sync import TelegramClient
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
@@ -32,7 +32,7 @@ class InstagramTbotExtractor(Extractor):
1. makes a copy of session_file that is removed in cleanup 1. makes a copy of session_file that is removed in cleanup
2. checks if the session file is valid 2. checks if the session file is valid
""" """
logger.info(f"SETUP {self.name} checking login...") logger.debug(f"SETUP {self.name} checking login...")
self._prepare_session_file() self._prepare_session_file()
self._initialize_telegram_client() self._initialize_telegram_client()
@@ -58,10 +58,10 @@ class InstagramTbotExtractor(Extractor):
"If you do, disable at least one of the archivers for the first-time setup of the telethon session: {e}" "If you do, disable at least one of the archivers for the first-time setup of the telethon session: {e}"
) )
with self.client.start(): with self.client.start():
logger.info(f"SETUP {self.name} login works.") logger.debug(f"SETUP {self.name} login works.")
def cleanup(self) -> None: def cleanup(self) -> None:
logger.info(f"CLEANUP {self.name}.") logger.debug(f"CLEANUP {self.name}.")
session_file_name = self.session_file + ".session" session_file_name = self.session_file + ".session"
if os.path.exists(session_file_name): if os.path.exists(session_file_name):
os.remove(session_file_name) os.remove(session_file_name)
@@ -79,17 +79,17 @@ class InstagramTbotExtractor(Extractor):
# This may be outdated and replaced by the below message, but keeping until confirmed # This may be outdated and replaced by the below message, but keeping until confirmed
if "You must enter a URL to a post" in message: if "You must enter a URL to a post" in message:
logger.debug(f"invalid link {url=} for {self.name}: {message}") logger.debug(f"Invalid link for {self.name}: {message}")
return False return False
if "Media not found or unavailable" in message: if "Media not found or unavailable" in message:
logger.debug(f"No media found for link {url=} for {self.name}: {message}") logger.debug(f"No media found for {self.name}: {message}")
return False return False
if message: if message:
result.set_content(message).set_title(message[:128]) result.set_content(message).set_title(message[:128])
elif result.is_empty(): elif result.is_empty():
logger.debug(f"No media found for link {url=} for {self.name}: {message}") logger.debug(f"No media found for {self.name}: {message}")
return False return False
return result.success("insta-via-bot") return result.success("insta-via-bot")

@@ -1,5 +1,5 @@
import json import json
from loguru import logger from auto_archiver.utils.custom_logger import logger
import os import os
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
@@ -8,9 +8,7 @@ from auto_archiver.core import Media, Metadata
class JsonEnricher(Enricher): class JsonEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("Enriching as JSON")
logger.debug(f"JSON Enricher for {url=}")
item_path = os.path.join(self.tmp_dir, "metadata.json") item_path = os.path.join(self.tmp_dir, "metadata.json")
with open(item_path, mode="w", encoding="utf-8") as outf: with open(item_path, mode="w", encoding="utf-8") as outf:

@@ -1,7 +1,7 @@
import shutil import shutil
from typing import IO from typing import IO
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Media from auto_archiver.core import Media
from auto_archiver.core import Storage from auto_archiver.core import Storage
@@ -38,8 +38,7 @@ class LocalStorage(Storage):
os.makedirs(os.path.dirname(dest), exist_ok=True) os.makedirs(os.path.dirname(dest), exist_ok=True)
logger.debug(f"[{self.__class__.__name__}] storing file {media.filename} with key {media.key} to {dest}") logger.debug(f"[{self.__class__.__name__}] storing file {media.filename} with key {media.key} to {dest}")
res = shutil.copy2(media.filename, dest) shutil.copy2(media.filename, dest)
logger.info(res)
return True return True
# must be implemented even if unused # must be implemented even if unused

@@ -1,6 +1,6 @@
import datetime import datetime
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -12,22 +12,22 @@ class MetaEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url()
if to_enrich.is_empty(): if to_enrich.is_empty():
logger.debug(f"[SKIP] META_ENRICHER there is no media or metadata to enrich: {url=}") logger.debug("[SKIP] META_ENRICHER there is no media or metadata to enrich")
return return
logger.debug(f"calculating archive metadata information for {url=}") logger.debug("Calculating archive metadata information")
self.enrich_file_sizes(to_enrich) self.enrich_file_sizes(to_enrich)
self.enrich_archive_duration(to_enrich) self.enrich_archive_duration(to_enrich)
def enrich_file_sizes(self, to_enrich: Metadata): def enrich_file_sizes(self, to_enrich: Metadata):
logger.debug( logger.debug(f"Calculating archive file sizes for {len(to_enrich.media)} media files")
f"calculating archive file sizes for url={to_enrich.get_url()} ({len(to_enrich.media)} media files)"
)
total_size = 0 total_size = 0
for media in to_enrich.get_all_media(): for media in to_enrich.get_all_media():
if not media.filename:
logger.warning(f"Skipping file size for media without filename: {media}")
continue
file_stats = os.stat(media.filename) file_stats = os.stat(media.filename)
media.set("bytes", file_stats.st_size) media.set("bytes", file_stats.st_size)
media.set("size", self.human_readable_bytes(file_stats.st_size)) media.set("size", self.human_readable_bytes(file_stats.st_size))
@@ -44,7 +44,7 @@ class MetaEnricher(Enricher):
size /= 1024 size /= 1024
def enrich_archive_duration(self, to_enrich): def enrich_archive_duration(self, to_enrich):
logger.debug(f"calculating archive duration for url={to_enrich.get_url()} ") logger.debug("Calculating archive duration")
archive_duration = datetime.datetime.now(datetime.timezone.utc) - to_enrich.get("_processed_at") archive_duration = datetime.datetime.now(datetime.timezone.utc) - to_enrich.get("_processed_at")
to_enrich.set("archive_duration_seconds", archive_duration.seconds) to_enrich.set("archive_duration_seconds", archive_duration.seconds)
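
`enrich_archive_duration` stores `archive_duration.seconds`. In Python's `datetime` module, `timedelta.seconds` holds only the seconds component (0 to 86399) and excludes whole days, while `total_seconds()` returns the full span as a float. The sketch below shows the difference using made-up timestamps. Whether an archive run can exceed one day in practice is not something this diff establishes.

```python
import datetime

# Made-up timestamps for illustration: one day and 30 seconds apart.
processed_at = datetime.datetime(2026, 3, 1, 12, 0, 0, tzinfo=datetime.timezone.utc)
finished_at = datetime.datetime(2026, 3, 2, 12, 0, 30, tzinfo=datetime.timezone.utc)

duration = finished_at - processed_at

component_seconds = duration.seconds     # seconds component only, days excluded
full_seconds = duration.total_seconds()  # complete duration as a float

print(component_seconds, full_seconds)
```

For runs shorter than a day the two values agree apart from the fractional part, so the distinction only matters for very long archives.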

@@ -3,6 +3,13 @@
"type": ["enricher"], "type": ["enricher"],
"requires_setup": True, "requires_setup": True,
"dependencies": {"python": ["loguru"], "bin": ["exiftool"]}, "dependencies": {"python": ["loguru"], "bin": ["exiftool"]},
"configs": {
"look_for_keys": {
"default": [],
"help": "list of lowercased metadata keys that will be included in the enriched metadata. Special keys: 'author', 'datetimes', 'location' to include related metadata fields. The default empty list `[]` means all metadata will be included.",
"type": "list",
},
},
"description": """ "description": """
Extracts metadata information from files using ExifTool. Extracts metadata information from files using ExifTool.
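
The `look_for_keys` option can be pictured as a filter over the dictionary that the ExifTool parsing step produces. The sketch below is an illustration under stated assumptions, not the enricher's own `select_metadata`. The function name `filter_metadata`, the substring matching, and the `SPECIAL_KEYS` table are hypothetical. Only the `author` term list (`author`, `producer`, `creator`) is taken from the diff, and the `datetimes` and `location` groups are left out because their terms are not shown.

```python
# Hypothetical expansion table; only the "author" terms appear in the diff.
SPECIAL_KEYS = {"author": ["author", "producer", "creator"]}


def filter_metadata(all_md: dict, look_for_keys: list) -> dict:
    """Keep only fields whose lowercased name contains a requested term."""
    # The documented default: an empty list means all metadata is kept.
    if not look_for_keys:
        return dict(all_md)
    terms = []
    for key in look_for_keys:
        # Special keys expand to a group of terms; others are used as-is.
        terms.extend(SPECIAL_KEYS.get(key, [key]))
    # Assumption: substring match against the lowercased field name.
    return {
        field: value
        for field, value in all_md.items()
        if any(term in field.lower() for term in terms)
    }


# Assumed sample of parsed ExifTool output.
sample = {"Creator": "Jane", "File Size": "1 MB", "Producer": "X"}
print(filter_metadata(sample, ["author"]))
print(filter_metadata(sample, []))
```

Passing `["author"]` keeps the `Creator` and `Producer` fields in this sample, while the default empty list returns everything unchanged.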

@@ -1,6 +1,6 @@
import subprocess import subprocess
import traceback import traceback
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -12,11 +12,12 @@ class MetadataEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("Extracting EXIF metadata")
logger.debug(f"extracting EXIF metadata for {url=}")
for i, m in enumerate(to_enrich.media): for i, m in enumerate(to_enrich.media):
if len(md := self.get_metadata(m.filename)): if len(md := self.get_metadata(m.filename)):
if self.look_for_keys != []:
md = self.select_metadata(md, self.look_for_keys)
to_enrich.media[i].set("metadata", md) to_enrich.media[i].set("metadata", md)
def get_metadata(self, filename: str) -> dict: def get_metadata(self, filename: str) -> dict:
@@ -24,15 +25,44 @@ class MetadataEnricher(Enricher):
# Run ExifTool command to extract metadata from the file # Run ExifTool command to extract metadata from the file
cmd = ["exiftool", filename] cmd = ["exiftool", filename]
result = subprocess.run(cmd, capture_output=True, text=True) result = subprocess.run(cmd, capture_output=True, text=True)
# Process the output to extract individual metadata fields # Process the output to extract individual metadata fields
metadata = {} metadata = {}
for line in result.stdout.splitlines(): for line in result.stdout.splitlines():
field, value = line.strip().split(":", 1) field, value = line.strip().split(":", 1)
metadata[field.strip()] = value.strip() metadata[field.strip()] = value.strip()
return metadata return metadata
except FileNotFoundError: except FileNotFoundError as e:
logger.error("[exif_enricher] ExifTool not found. Make sure ExifTool is installed and added to PATH.") logger.error(f"ExifTool not found. Make sure ExifTool is installed and added to PATH. {e}")
except Exception as e: except Exception as e:
logger.error(f"Error occurred: {e}: {traceback.format_exc()}") logger.error(f"Error occurred: {e}: {traceback.format_exc()}")
return {} return {}
def select_metadata(self, all_md, requested_metadata_keys):
"""
coordinates the selection of metadata from the general exiftool output to the user-specified grocery list
"""
# defining the batches of metadata that get pulled for special terms
author_key_terms = ["author", "producer", "creator"]
datetime_key_terms = ["date", "time"]
location_key_terms = ["gps", "latitude", "longitude"]
specified_md = {}
for md_key in all_md.keys():
md_key_lower = md_key.lower()
# checking for special baskets within the grocery list of requested metadata
if ("author" in requested_metadata_keys) and any(
term in md_key_lower and len(all_md[md_key]) for term in author_key_terms
):
specified_md[md_key] = all_md[md_key]
if ("datetime" in requested_metadata_keys) and any(
term in md_key_lower and len(all_md[md_key]) for term in datetime_key_terms
):
specified_md[md_key] = all_md[md_key]
if ("location" in requested_metadata_keys) and any(
term in md_key_lower and len(all_md[md_key]) for term in location_key_terms
):
specified_md[md_key] = all_md[md_key]
# if the metadata value is requested directly
if (md_key_lower in requested_metadata_keys or md_key in requested_metadata_keys) and len(all_md[md_key]):
specified_md[md_key] = all_md[md_key]
return specified_md
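
Review note: the selection logic added in `select_metadata` can be tried in isolation. The following is a minimal sketch, assuming ExifTool output has already been parsed into a `dict` of strings; the `SPECIAL_TERMS` table and the sample values are hypothetical. It checks for non-empty values once, up front, so the check applies to every branch, and it uses the basket name `datetime` that the code above looks for.

```python
# Hypothetical standalone version of the key selection, for illustration only.
SPECIAL_TERMS = {
    "author": ["author", "producer", "creator"],
    "datetime": ["date", "time"],
    "location": ["gps", "latitude", "longitude"],
}


def select_metadata(all_md: dict, requested_keys: list) -> dict:
    selected = {}
    for key, value in all_md.items():
        if not len(value):
            continue  # empty values are skipped in every branch
        key_lower = key.lower()
        # special baskets: a requested basket pulls in every key containing one of its terms
        for basket, terms in SPECIAL_TERMS.items():
            if basket in requested_keys and any(term in key_lower for term in terms):
                selected[key] = value
        # keys requested directly, either lowercased or verbatim
        if key_lower in requested_keys or key in requested_keys:
            selected[key] = value
    return selected


# Made-up sample resembling parsed ExifTool output
sample = {
    "Create Date": "2026:03:02 12:00:00",
    "GPS Latitude": "38 deg 43' N",
    "Creator": "",
    "File Size": "12 kB",
}
print(select_metadata(sample, ["datetime", "file size"]))
```

With this sample, requesting `["datetime", "file size"]` keeps `Create Date` and `File Size`; `Creator` is dropped because its value is empty.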


@@ -1,6 +1,7 @@
import os import os
import traceback
from loguru import logger from auto_archiver.utils.custom_logger import logger
import opentimestamps import opentimestamps
from opentimestamps.calendar import RemoteCalendar, DEFAULT_CALENDAR_WHITELIST from opentimestamps.calendar import RemoteCalendar, DEFAULT_CALENDAR_WHITELIST
from opentimestamps.core.timestamp import Timestamp, DetachedTimestampFile from opentimestamps.core.timestamp import Timestamp, DetachedTimestampFile
@@ -14,13 +15,12 @@ from auto_archiver.utils.misc import get_current_timestamp
class OpentimestampsEnricher(Enricher): class OpentimestampsEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("OpenTimestamps timestamping files")
logger.debug(f"OpenTimestamps timestamping files for {url=}")
# Get the media files to timestamp # Get the media files to timestamp
media_files = [m for m in to_enrich.media if m.filename and not m.get("opentimestamps")] media_files = [m for m in to_enrich.media if m.filename and not m.get("opentimestamps")]
if not media_files: if not media_files:
logger.debug(f"No files found to timestamp in {url=}") logger.debug("No files found to timestamp")
return return
timestamp_files = [] timestamp_files = []
@@ -94,7 +94,7 @@ class OpentimestampsEnricher(Enricher):
detached_timestamp.serialize(ctx) detached_timestamp.serialize(ctx)
f.write(ctx.getbytes()) f.write(ctx.getbytes())
except Exception as e: except Exception as e:
logger.warning(f"Failed to serialize timestamp file: {e}") logger.warning(f"Failed to serialize timestamp file: {e} {traceback.format_exc()}")
continue continue
# Create media for the timestamp file # Create media for the timestamp file
@@ -113,16 +113,16 @@ class OpentimestampsEnricher(Enricher):
media.set("opentimestamps", True) media.set("opentimestamps", True)
except Exception as e: except Exception as e:
logger.warning(f"Error while timestamping {media.filename}: {e}") logger.warning(f"Error while timestamping {media.filename}: {e} {traceback.format_exc()}")
# Add timestamp files to the metadata # Add timestamp files to the metadata
if timestamp_files: if timestamp_files:
to_enrich.set("opentimestamped", True) to_enrich.set("opentimestamped", True)
to_enrich.set("opentimestamps_count", len(timestamp_files)) to_enrich.set("opentimestamps_count", len(timestamp_files))
logger.info(f"{len(timestamp_files)} OpenTimestamps proofs created for {url=}") logger.info(f"{len(timestamp_files)} OpenTimestamps proofs created")
else: else:
to_enrich.set("opentimestamped", False) to_enrich.set("opentimestamped", False)
logger.warning(f"No successful timestamps created for {url=}") logger.warning("No successful timestamps created")
def verify_timestamp(self, detached_timestamp): def verify_timestamp(self, detached_timestamp):
""" """


@@ -15,7 +15,7 @@ import traceback
import pdqhash import pdqhash
import numpy as np import numpy as np
from PIL import Image, UnidentifiedImageError from PIL import Image, UnidentifiedImageError
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -28,8 +28,7 @@ class PdqHashEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("Calculating perceptual hashes")
logger.debug(f"calculating perceptual hashes for {url=}")
media_with_hashes = [] media_with_hashes = []
for m in to_enrich.media: for m in to_enrich.media:
@@ -44,7 +43,7 @@ class PdqHashEnricher(Enricher):
media.set("pdq_hash", hd) media.set("pdq_hash", hd)
media_with_hashes.append(media.filename) media_with_hashes.append(media.filename)
logger.debug(f"calculated '{len(media_with_hashes)}' perceptual hashes for {url=}: {media_with_hashes}") logger.debug(f"Calculated '{len(media_with_hashes)}' perceptual hashes: {media_with_hashes}")
def calculate_pdq_hash(self, filename): def calculate_pdq_hash(self, filename):
# returns a hexadecimal string with the perceptual hash for the given filename # returns a hexadecimal string with the perceptual hash for the given filename


@@ -2,7 +2,7 @@ from typing import IO
import boto3 import boto3
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Media from auto_archiver.core import Media
from auto_archiver.core import Storage from auto_archiver.core import Storage
@@ -56,7 +56,7 @@ class S3Storage(Storage):
if existing_key := self.file_in_folder(path): if existing_key := self.file_in_folder(path):
media._key = existing_key media._key = existing_key
media.set("previously archived", True) media.set("previously archived", True)
logger.debug(f"skipping upload of {media.filename} because it already exists in {media.key}") logger.debug(f"Skipping upload of {media.filename} because it already exists in {media.key}")
return False return False
_, ext = os.path.splitext(media.key) _, ext = os.path.splitext(media.key)


@@ -2,7 +2,7 @@ import ssl
import os import os
from slugify import slugify from slugify import slugify
from urllib.parse import urlparse from urllib.parse import urlparse
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -19,10 +19,10 @@ class SSLEnricher(Enricher):
url = to_enrich.get_url() url = to_enrich.get_url()
parsed = urlparse(url) parsed = urlparse(url)
assert parsed.scheme in ["https"], f"Invalid URL scheme {url=}" assert parsed.scheme in ["https"], "Invalid URL scheme"
domain = parsed.netloc domain = parsed.netloc
logger.debug(f"fetching SSL certificate for {domain=} in {url=}") logger.debug(f"Fetching SSL certificate for {domain=}")
cert = ssl.get_server_certificate((domain, 443)) cert = ssl.get_server_certificate((domain, 443))
cert_fn = os.path.join(self.tmp_dir, f"{slugify(domain)}.pem") cert_fn = os.path.join(self.tmp_dir, f"{slugify(domain)}.pem")


@@ -2,7 +2,7 @@ import requests
import re import re
import html import html
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -38,7 +38,7 @@ class TelegramExtractor(Extractor):
video = s.find("video") video = s.find("video")
if video is None: if video is None:
logger.warning("could not find video") logger.warning("Could not find video")
image_tags = s.find_all(class_="tgme_widget_message_photo_wrap") image_tags = s.find_all(class_="tgme_widget_message_photo_wrap")
image_urls = [] image_urls = []
@@ -49,10 +49,18 @@ class TelegramExtractor(Extractor):
if not len(image_urls): if not len(image_urls):
return False return False
for img_url in image_urls: for img_url in image_urls:
result.add_media(Media(self.download_from_url(img_url))) filename = self.download_from_url(img_url)
if not filename:
logger.warning(f"Failed to download image from {img_url}")
continue
result.add_media(Media(filename))
else: else:
video_url = video.get("src") video_url = video.get("src")
m_video = Media(self.download_from_url(video_url)) video_filename = self.download_from_url(video_url)
if not video_filename:
logger.warning(f"Failed to download video from {video_url}")
return False
m_video = Media(video_filename)
# extract duration from HTML # extract duration from HTML
try: try:
duration = s.find_all("time")[0].contents[0] duration = s.find_all("time")[0].contents[0]


@@ -1,3 +1,4 @@
import asyncio
import os import os
import shutil import shutil
import re import re
@@ -5,6 +6,7 @@ import time
from pathlib import Path from pathlib import Path
from datetime import date from datetime import date
from telethon import functions
from telethon.sync import TelegramClient from telethon.sync import TelegramClient
from telethon.errors import ChannelInvalidError from telethon.errors import ChannelInvalidError
from telethon.tl.functions.messages import ImportChatInviteRequest from telethon.tl.functions.messages import ImportChatInviteRequest
@@ -16,7 +18,7 @@ from telethon.errors.rpcerrorlist import (
) )
from tqdm import tqdm from tqdm import tqdm
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -24,7 +26,7 @@ from auto_archiver.utils import random_str
class TelethonExtractor(Extractor): class TelethonExtractor(Extractor):
valid_url = re.compile(r"https:\/\/t\.me(\/c){0,1}\/(.+)\/(\d+)") valid_url = re.compile(r"https:\/\/t\.me(\/c){0,1}\/(.+?)(\/s){0,1}\/(\d+)")
invite_pattern = re.compile(r"t.me(\/joinchat){0,1}\/\+?(.+)") invite_pattern = re.compile(r"t.me(\/joinchat){0,1}\/\+?(.+)")
def setup(self) -> None: def setup(self) -> None:
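
Review note: the new `valid_url` pattern adds an optional `/s` group for story links, which shifts the post id from group 3 to group 4. A small sketch of how the groups resolve, using the pattern from the new side of the diff; the `parse` helper and the example URLs are hypothetical and only mirror the group handling in the extractor.

```python
import re

# Pattern copied from the new side of the diff
valid_url = re.compile(r"https:\/\/t\.me(\/c){0,1}\/(.+?)(\/s){0,1}\/(\d+)")


def parse(url: str):
    """Hypothetical helper: split a t.me URL into the parts the extractor uses."""
    match = valid_url.search(url)
    if not match:
        return None
    is_private = match.group(1) == "/c"
    chat = int(match.group(2)) if is_private else match.group(2)
    is_story = match.group(3) == "/s"
    post_id = int(match.group(4))
    return {"chat": chat, "is_private": is_private, "is_story": is_story, "post_id": post_id}


# Made-up URLs covering the three shapes the pattern distinguishes
for url in (
    "https://t.me/examplechannel/123",
    "https://t.me/examplechannel/s/7",
    "https://t.me/c/1234567/89",
    "https://t.me/examplechannel",
):
    print(url, "->", parse(url))
```

The lazy `(.+?)` matters here: a greedy group would swallow the `/s` segment into the chat name, and story links would be treated as regular posts.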
@@ -52,6 +54,16 @@ class TelethonExtractor(Extractor):
logger.debug(f"Making a copy of the session file {base_session_filepath} to {self.session_file}.session") logger.debug(f"Making a copy of the session file {base_session_filepath} to {self.session_file}.session")
shutil.copy(base_session_filepath, f"{self.session_file}.session") shutil.copy(base_session_filepath, f"{self.session_file}.session")
# ensure a running event loop exists (Needed when used by Celery workers which may close the default one)
try:
loop = asyncio.get_event_loop()
if loop.is_closed():
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
except RuntimeError:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# initiate the client # initiate the client
self.client = TelegramClient(self.session_file, self.api_id, self.api_hash) self.client = TelegramClient(self.session_file, self.api_id, self.api_hash)
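
Review note: the event-loop guard added in this hunk can be exercised without Telethon or Celery. A minimal sketch, assuming the only goal is to have an open loop set for the current thread before a client is created; `ensure_event_loop` is a hypothetical helper name, and closing a loop by hand stands in for what a worker may do.

```python
import asyncio


def ensure_event_loop() -> asyncio.AbstractEventLoop:
    """Return an open event loop for the current thread, creating one if needed."""
    try:
        loop = asyncio.get_event_loop()
        if loop.is_closed():
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
    except RuntimeError:
        # no current loop for this thread
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
    return loop


# Simulate a worker that left a closed loop behind
stale = asyncio.new_event_loop()
asyncio.set_event_loop(stale)
stale.close()

fresh = ensure_event_loop()
fresh.run_until_complete(asyncio.sleep(0))  # the replacement loop is usable
print(fresh is stale, fresh.is_closed())
fresh.close()
```

Both branches are needed: depending on the Python version and thread, a missing loop surfaces as a `RuntimeError`, while a loop that was closed earlier is still returned and has to be detected with `is_closed()`.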
@@ -64,7 +76,7 @@ class TelethonExtractor(Extractor):
# get currently joined channels # get currently joined channels
# https://docs.telethon.dev/en/stable/modules/custom.html#module-telethon.tl.custom.dialog # https://docs.telethon.dev/en/stable/modules/custom.html#module-telethon.tl.custom.dialog
joined_channel_ids = [c.id for c in self.client.get_dialogs() if c.is_channel] joined_channel_ids = [c.id for c in self.client.get_dialogs() if c.is_channel]
logger.info(f"already part of {len(joined_channel_ids)} channels") logger.info(f"Already part of {len(joined_channel_ids)} channels")
i = 0 i = 0
pbar = tqdm(desc=f"joining {len(self.channel_invites)} invite links", total=len(self.channel_invites)) pbar = tqdm(desc=f"joining {len(self.channel_invites)} invite links", total=len(self.channel_invites))
@@ -79,22 +91,22 @@ class TelethonExtractor(Extractor):
else: else:
ent = self.client.get_entity(invite) # fails if not a member ent = self.client.get_entity(invite) # fails if not a member
logger.warning( logger.warning(
f"please add the property id='{ent.id}' to the 'channel_invites' configuration where {invite=}, not doing so can lead to a minutes-long setup time due to telegram's rate limiting." f"Please add the property id='{ent.id}' to the 'channel_invites' configuration where {invite=}, not doing so can lead to a minutes-long setup time due to telegram's rate limiting."
) )
except ValueError: except ValueError:
logger.info(f"joining new channel {invite=}") logger.info(f"Joining new channel {invite=}")
try: try:
self.client(ImportChatInviteRequest(match.group(2))) self.client(ImportChatInviteRequest(match.group(2)))
except UserAlreadyParticipantError: except UserAlreadyParticipantError:
logger.info(f"already joined {invite=}") logger.info(f"Already joined {invite=}")
except InviteRequestSentError: except InviteRequestSentError:
logger.warning(f"already sent a join request with {invite} still no answer") logger.warning(f"Already sent a join request with {invite} still no answer")
except InviteHashExpiredError: except InviteHashExpiredError:
logger.warning(f"{invite=} has expired please find a more recent one") logger.warning(f"{invite=} has expired please find a more recent one")
except Exception as e: except Exception as e:
logger.error(f"could not join channel with {invite=} due to {e}") logger.error(f"Could not join channel with {invite=} due to {e}")
except FloodWaitError as e: except FloodWaitError as e:
logger.warning(f"got a flood error, need to wait {e.seconds} seconds") logger.warning(f"Got a flood error, need to wait {e.seconds} seconds")
time.sleep(e.seconds) time.sleep(e.seconds)
continue continue
else: else:
@@ -116,68 +128,94 @@ class TelethonExtractor(Extractor):
url = item.get_url() url = item.get_url()
# detect URLs that we definitely cannot handle # detect URLs that we definitely cannot handle
match = self.valid_url.search(url) match = self.valid_url.search(url)
logger.debug(f"TELETHON: {match=}") logger.debug(f"Found telethon url {match=}")
if not match: if not match:
return False return False
is_private = match.group(1) == "/c" is_private = match.group(1) == "/c"
chat = int(match.group(2)) if is_private else match.group(2) chat = int(match.group(2)) if is_private else match.group(2)
post_id = int(match.group(3)) is_story = match.group(3) == "/s"
post_id = int(match.group(4))
result = Metadata() result = Metadata()
# NB: not using bot_token since then private channels cannot be archived: self.client.start(bot_token=self.bot_token) # NB: not using bot_token since then private channels cannot be archived: self.client.start(bot_token=self.bot_token)
with self.client.start(): with self.client.start():
# with self.client.start(bot_token=self.bot_token): # with self.client.start(bot_token=self.bot_token):
try: if is_story:
post = self.client.get_messages(chat, ids=post_id) try:
except ValueError as e: stories = self.client(functions.stories.GetStoriesByIDRequest(peer=chat, id=[post_id]))
logger.error(f"Could not fetch telegram {url} possibly it's private: {e}") if not stories.stories:
return False logger.info("No stories found, possibly it's private or the story has expired.")
except ChannelInvalidError as e: return False
logger.error( story = stories.stories[0]
f"Could not fetch telegram {url}. This error may be fixed if you setup a bot_token in addition to api_id and api_hash (but then private channels will not be archived, we need to update this logic to handle both): {e}" logger.debug(f"Got story {story.id=} {story.date=} {story.expire_date=}")
) result.set_timestamp(story.date).set("views", story.views.to_dict()).set(
return False "expire_date", story.expire_date
)
logger.debug(f"TELETHON GOT POST {post=}") # download the story media
if post is None: filename_dest = os.path.join(self.tmp_dir, f"{chat}_{post_id}", str(story.id))
return False if filename := self.client.download_media(story.media, filename_dest):
result.add_media(Media(filename))
except Exception as e:
logger.error(f"Error fetching story {post_id} from {chat}: {e}")
return False
else:
try:
post = self.client.get_messages(chat, ids=post_id)
except ValueError as e:
logger.error(f"Could not fetch telegram URL possibly it's private: {e}")
return False
except ChannelInvalidError as e:
logger.error(
f"Could not fetch telegram URL. This error may be fixed if you setup a bot_token in addition to api_id and api_hash (but then private channels will not be archived, we need to update this logic to handle both): {e}"
)
return False
media_posts = self._get_media_posts_in_group(chat, post) logger.debug(f"Got post {post=}")
logger.debug(f"got {len(media_posts)=} for {url=}") if post is None:
return False
tmp_dir = self.tmp_dir media_posts = self._get_media_posts_in_group(chat, post)
logger.debug(f"Got {len(media_posts)=}")
group_id = post.grouped_id if post.grouped_id is not None else post.id group_id = post.grouped_id if post.grouped_id is not None else post.id
title = post.message title = post.message
for mp in media_posts: for mp in media_posts:
if len(mp.message) > len(title): if len(mp.message) > len(title):
title = mp.message # save the longest text found (usually only 1) title = mp.message # save the longest text found (usually only 1)
# media can also be in entities # media can also be in entities
if mp.entities: if mp.entities:
other_media_urls = [ other_media_urls = [
e.url e.url
for e in mp.entities for e in mp.entities
if hasattr(e, "url") and e.url and self._guess_file_type(e.url) in ["video", "image", "audio"] if hasattr(e, "url")
] and e.url
if len(other_media_urls): and self._guess_file_type(e.url) in ["video", "image", "audio"]
logger.debug(f"Got {len(other_media_urls)} other media urls from {mp.id=}: {other_media_urls}") ]
for i, om_url in enumerate(other_media_urls): if len(other_media_urls):
filename = self.download_from_url(om_url, f"{chat}_{group_id}_{i}") logger.debug(
result.add_media(Media(filename=filename), id=f"{group_id}_{i}") f"Got {len(other_media_urls)} other media urls from {mp.id=}: {other_media_urls}"
)
for i, om_url in enumerate(other_media_urls):
filename = self.download_from_url(om_url, f"{chat}_{group_id}_{i}")
if not filename:
logger.warning(f"Failed to download media from {om_url}")
continue
result.add_media(Media(filename=filename), id=f"{group_id}_{i}")
filename_dest = os.path.join(tmp_dir, f"{chat}_{group_id}", str(mp.id)) filename_dest = os.path.join(self.tmp_dir, f"{chat}_{group_id}", str(mp.id))
filename = self.client.download_media(mp.media, filename_dest) filename = self.client.download_media(mp.media, filename_dest)
if not filename: if not filename:
logger.debug(f"Empty media found, skipping {str(mp)=}") logger.debug(f"Empty media found, skipping {str(mp)=}")
continue continue
result.add_media(Media(filename)) result.add_media(Media(filename))
result.set_title(title).set_timestamp(post.date).set("api_data", post.to_dict()) result.set_title(title).set_timestamp(post.date).set("api_data", post.to_dict())
if post.message != title: if post.message != title:
result.set_content(post.message) result.set_content(post.message)
return result.success("telethon") return result.success("telethon")
def _get_media_posts_in_group(self, chat, original_post, max_amp=10): def _get_media_posts_in_group(self, chat, original_post, max_amp=10):


@@ -9,7 +9,7 @@ and identify important moments without watching the entire video.
import ffmpeg import ffmpeg
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Media, Metadata from auto_archiver.core import Media, Metadata
@@ -27,12 +27,12 @@ class ThumbnailEnricher(Enricher):
Calculates how many thumbnails to generate and at which timestamps based on the video duration, the number of thumbnails per minute and the max number of thumbnails. Calculates how many thumbnails to generate and at which timestamps based on the video duration, the number of thumbnails per minute and the max number of thumbnails.
Thumbnails are equally distributed across the video duration. Thumbnails are equally distributed across the video duration.
""" """
logger.debug(f"generating thumbnails for {to_enrich.get_url()}") logger.debug("Generating thumbnails")
for m_id, m in enumerate(to_enrich.media[::]): for m_id, m in enumerate(to_enrich.media[::]):
if m.is_video(): if m.is_video():
folder = os.path.join(self.tmp_dir, random_str(24)) folder = os.path.join(self.tmp_dir, random_str(24))
os.makedirs(folder, exist_ok=True) os.makedirs(folder, exist_ok=True)
logger.debug(f"generating thumbnails for {m.filename}") logger.debug(f"Generating thumbnails for {m.filename}")
duration = m.get("duration") duration = m.get("duration")
try: try:
@@ -42,10 +42,10 @@ class ThumbnailEnricher(Enricher):
) )
to_enrich.media[m_id].set("duration", duration) to_enrich.media[m_id].set("duration", duration)
except Exception as e: except Exception as e:
logger.warning(f"failed to get duration with FFMPEG from {m.filename}: {e}") logger.warning(f"Failed to get duration with FFMPEG from {m.filename}: {e}")
if not duration or type(duration) not in [float, int] or duration <= 0: if not duration or type(duration) not in [float, int] or duration <= 0:
logger.warning(f"cannot generate thumbnails for {m.filename} without valid duration") logger.warning(f"Cannot generate thumbnails for {m.filename} without valid duration")
continue continue
num_thumbs = int(min(max(1, (duration / 60) * self.thumbnails_per_minute), self.max_thumbnails)) num_thumbs = int(min(max(1, (duration / 60) * self.thumbnails_per_minute), self.max_thumbnails))
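
Review note: the `num_thumbs` expression clamps the thumbnail count between 1 and `max_thumbnails`. A short worked sketch of that formula; the function name and the parameter values below (10 thumbnails per minute, a cap of 16) are assumptions chosen for illustration, not the module's defaults.

```python
def num_thumbnails(duration_s: float, thumbnails_per_minute: float, max_thumbnails: int) -> int:
    # same expression as in the diff, with the instance attributes passed as arguments
    return int(min(max(1, (duration_s / 60) * thumbnails_per_minute), max_thumbnails))


# Hypothetical settings: 10 thumbnails per minute, capped at 16
for duration in (2, 90, 600):
    print(duration, "seconds ->", num_thumbnails(duration, 10, 16), "thumbnails")
```

A 2-second clip still gets one thumbnail because of the `max(1, ...)` floor, a 90-second clip gets 15, and a 10-minute video hits the cap of 16.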


@@ -20,7 +20,7 @@
# "http://tsa.sinpe.fi.cr/tsaHttp/", # self-signed # "http://tsa.sinpe.fi.cr/tsaHttp/", # self-signed
# "http://tsa.cra.ge/signserver/tsa?workerName=qtsa", # self-signed # "http://tsa.cra.ge/signserver/tsa?workerName=qtsa", # self-signed
"http://tss.cnbs.gob.hn/TSS/HttpTspServer", "http://tss.cnbs.gob.hn/TSS/HttpTspServer",
"http://dss.nowina.lu/pki-factory/tsa/good-tsa", # "http://dss.nowina.lu/pki-factory/tsa/good-tsa",
# "https://freetsa.org/tsr", # self-signed # "https://freetsa.org/tsr", # self-signed
], ],
"help": "List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.", "help": "List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.",


@@ -4,12 +4,12 @@ from importlib.metadata import version
import hashlib import hashlib
from slugify import slugify from slugify import slugify
from retrying import retry
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from rfc3161_client import (decode_timestamp_response,TimestampRequestBuilder,TimeStampResponse, VerifierBuilder) from rfc3161_client import (decode_timestamp_response, TimestampRequestBuilder, TimeStampResponse, VerifierBuilder)
from rfc3161_client import VerificationError as Rfc3161VerificationError from rfc3161_client import VerificationError as Rfc3161VerificationError
from rfc3161_client.base import HashAlgorithm
from rfc3161_client.tsp import SignedData from rfc3161_client.tsp import SignedData
from cryptography import x509 from cryptography import x509
from cryptography.hazmat.primitives import serialization from cryptography.hazmat.primitives import serialization
@@ -49,8 +49,7 @@ class TimestampingEnricher(Enricher):
self.session.close() self.session.close()
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("RFC3161 timestamping existing files")
logger.debug(f"RFC3161 timestamping existing files for {url=}")
# create a new text file with the existing media hashes # create a new text file with the existing media hashes
hashes = [ hashes = [
@@ -58,10 +57,9 @@ class TimestampingEnricher(Enricher):
] ]
if not len(hashes): if not len(hashes):
logger.debug(f"No hashes found in {url=}") logger.debug("No hashes found")
return return
hashes_fn = os.path.join(self.tmp_dir, "hashes.txt") hashes_fn = os.path.join(self.tmp_dir, "hashes.txt")
data_to_sign = "\n".join(hashes) data_to_sign = "\n".join(hashes)
@@ -74,9 +72,9 @@ class TimestampingEnricher(Enricher):
try: try:
message = bytes(data_to_sign, encoding='utf8') message = bytes(data_to_sign, encoding='utf8')
logger.debug(f"Timestamping {url=} with {tsa_url=}") logger.debug(f"Timestamping with {tsa_url=}")
signed: TimeStampResponse = self.sign_data(tsa_url, message) signed: TimeStampResponse = self.sign_data(tsa_url, message)
# fail if there's any issue with the certificates, uses certifi list of trusted CAs or the user-defined `cert_authorities` # fail if there's any issue with the certificates, uses certifi list of trusted CAs or the user-defined `cert_authorities`
root_cert = self.verify_signed(signed, message) root_cert = self.verify_signed(signed, message)
@@ -92,7 +90,7 @@ class TimestampingEnricher(Enricher):
timestamp_token_path = self.save_timestamp_token(signed.time_stamp_token(), tsa_url) timestamp_token_path = self.save_timestamp_token(signed.time_stamp_token(), tsa_url)
timestamp_tokens.append(Media(filename=timestamp_token_path).set("tsa", tsa_url).set("cert_chain", cert_chain)) timestamp_tokens.append(Media(filename=timestamp_token_path).set("tsa", tsa_url).set("cert_chain", cert_chain))
except Exception as e: except Exception as e:
logger.warning(f"Error while timestamping {url=} with {tsa_url=}: {e}") logger.warning(f"Error while timestamping with {tsa_url=}: {e}")
if len(timestamp_tokens): if len(timestamp_tokens):
hashes_media.set("timestamp_authority_files", timestamp_tokens) hashes_media.set("timestamp_authority_files", timestamp_tokens)
@@ -101,9 +99,9 @@ class TimestampingEnricher(Enricher):
hashes_media.set("cryptography v", version("cryptography")) hashes_media.set("cryptography v", version("cryptography"))
to_enrich.add_media(hashes_media, id="timestamped_hashes") to_enrich.add_media(hashes_media, id="timestamped_hashes")
to_enrich.set("timestamped", True) to_enrich.set("timestamped", True)
logger.info(f"{len(timestamp_tokens)} timestamp tokens created for {url=}") logger.info(f"{len(timestamp_tokens)} timestamp tokens created")
else: else:
logger.warning(f"No successful timestamps for {url=}") logger.warning("No successful timestamps found")
def save_timestamp_token(self, timestamp_token: bytes, tsa_url: str) -> str: def save_timestamp_token(self, timestamp_token: bytes, tsa_url: str) -> str:
""" """
@@ -114,7 +112,7 @@ class TimestampingEnricher(Enricher):
f.write(timestamp_token) f.write(timestamp_token)
return tst_path return tst_path
def verify_signed(self, timestamp_response: TimeStampResponse, message: bytes) -> x509.Certificate: def verify_signed(self, timestamp_response: TimeStampResponse, message: bytes) -> x509.Certificate:
""" """
Verify a Signed Timestamp Response is trusted by a known Certificate Authority. Verify a Signed Timestamp Response is trusted by a known Certificate Authority.
@@ -137,7 +135,7 @@ class TimestampingEnricher(Enricher):
if not cert_authorities: if not cert_authorities:
raise ValueError(f"No trusted roots found in {trusted_root_path}.") raise ValueError(f"No trusted roots found in {trusted_root_path}.")
timestamp_certs = self.tst_certs(timestamp_response) timestamp_certs = self.tst_certs(timestamp_response)
intermediate_certs = timestamp_certs[1:-1] intermediate_certs = timestamp_certs[1:-1]
@@ -149,7 +147,7 @@ class TimestampingEnricher(Enricher):
message_hash = hashlib.sha256(message).digest() message_hash = hashlib.sha256(message).digest()
else: else:
raise ValueError(f"Unsupported hash algorithm: {hash_algorithm}") raise ValueError(f"Unsupported hash algorithm: {hash_algorithm}")
for certificate in cert_authorities: for certificate in cert_authorities:
builder = VerifierBuilder() builder = VerifierBuilder()
builder.add_root_certificate(certificate) builder.add_root_certificate(certificate)
@@ -159,7 +157,6 @@ class TimestampingEnricher(Enricher):
verifier = builder.build() verifier = builder.build()
try: try:
verifier.verify(timestamp_response, message_hash) verifier.verify(timestamp_response, message_hash)
return certificate return certificate
@@ -172,23 +169,38 @@ class TimestampingEnricher(Enricher):
# see https://github.com/sigstore/sigstore-python/blob/99948d5b80525a5a104e904ffea58169dc6e0629/sigstore/_internal/timestamp.py#L84-L121 # see https://github.com/sigstore/sigstore-python/blob/99948d5b80525a5a104e904ffea58169dc6e0629/sigstore/_internal/timestamp.py#L84-L121
timestamp_request = ( timestamp_request = (
TimestampRequestBuilder().data(bytes_data).nonce(nonce=True).build() TimestampRequestBuilder().data(bytes_data).nonce(nonce=True).build()
) )
try:
@retry(
wait_exponential_multiplier=1,
stop_max_attempt_number=2,
)
def sign_with_retry():
response = self.session.post(tsa_url, data=timestamp_request.as_bytes(), timeout=10) response = self.session.post(tsa_url, data=timestamp_request.as_bytes(), timeout=10)
response.raise_for_status() response.raise_for_status()
return response
try:
response = sign_with_retry()
except requests.RequestException as e: except requests.RequestException as e:
logger.error(f"Error while sending request to {tsa_url=}: {e}") logger.error(f"Error while sending request to {tsa_url=}: {e}")
raise raise
@retry(
wait_exponential_multiplier=1,
stop_max_attempt_number=2,
)
def decode_with_retry(response):
return decode_timestamp_response(response.content)
# Check that we can parse the response but do not *verify* it # Check that we can parse the response but do not *verify* it
try: try:
timestamp_response = decode_timestamp_response(response.content) timestamp_response = decode_with_retry(response)
except ValueError as e: except ValueError as e:
logger.error(f"Invalid timestamp response from server {tsa_url}: {e}") logger.error(f"Invalid timestamp response from server {tsa_url}: {e}")
raise raise
return timestamp_response return timestamp_response
def tst_certs(self, tsp_response: TimeStampResponse): def tst_certs(self, tsp_response: TimeStampResponse):
signed_data: SignedData = tsp_response.signed_data signed_data: SignedData = tsp_response.signed_data
certs = [x509.load_der_x509_certificate(c) for c in signed_data.certificates] certs = [x509.load_der_x509_certificate(c) for c in signed_data.certificates]
@@ -197,7 +209,7 @@ class TimestampingEnricher(Enricher):
if len(certs) == 1: if len(certs) == 1:
return certs return certs
while(len(ordered_certs) < len(certs)): while (len(ordered_certs) < len(certs)):
if len(ordered_certs) == 0: if len(ordered_certs) == 0:
for cert in certs: for cert in certs:
if not [c for c in certs if cert.subject == c.issuer]: if not [c for c in certs if cert.subject == c.issuer]:
@@ -221,7 +233,7 @@ class TimestampingEnricher(Enricher):
cert_chain = [] cert_chain = []
for i, cert in enumerate(certificates): for i, cert in enumerate(certificates):
cert_fn = os.path.join(self.tmp_dir, f"{i+1} {str(cert.serial_number)[:20]}.crt") cert_fn = os.path.join(self.tmp_dir, f"{i + 1} {str(cert.serial_number)[:20]}.crt")
with open(cert_fn, "wb") as f: with open(cert_fn, "wb") as f:
f.write(cert.public_bytes(encoding=serialization.Encoding.PEM)) f.write(cert.public_bytes(encoding=serialization.Encoding.PEM))
cert_chain.append(Media(filename=cert_fn).set("subject", cert.subject.get_attributes_for_oid(x509.NameOID.COMMON_NAME)[0].value)) cert_chain.append(Media(filename=cert_fn).set("subject", cert.subject.get_attributes_for_oid(x509.NameOID.COMMON_NAME)[0].value))
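
Note on the retry wrappers introduced in the timestamping hunk above: the diff wraps both the TSA request and the response decoding in `@retry(...)` with two attempts. The sketch below is a standard-library-only approximation of that pattern, not the decorator the project imports; `max_attempts`, `base_wait`, `exceptions` and `flaky_request` are hypothetical names chosen for illustration. It shows the intended effect for the network call: one transient failure is absorbed, and the last error is re-raised once attempts run out.

```python
import functools
import time


def retry(max_attempts=2, base_wait=0.001, exceptions=(Exception,)):
    """Hypothetical stand-in for the third-party @retry decorator used in the diff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error to the caller
                    # Wait base_wait, then 2x, 4x, ... before the next attempt.
                    time.sleep(base_wait * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = {"count": 0}


@retry(max_attempts=2, exceptions=(ConnectionError,))
def flaky_request():
    # Fails on the first call and succeeds on the second, like a transient outage.
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("temporary failure")
    return "timestamp-response"


result = flaky_request()
```

Retrying helps most where the outcome can change between attempts, as with the HTTP POST. Decoding the same response bytes a second time is deterministic, so the retry around `decode_timestamp_response` probably only delays the `ValueError`; that may be worth confirming with the author.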


@@ -4,7 +4,7 @@ import re
import mimetypes import mimetypes
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from pytwitter import Api from pytwitter import Api
from slugify import slugify from slugify import slugify
@@ -45,10 +45,9 @@ class TwitterApiExtractor(Extractor):
if "https://t.co/" in url: if "https://t.co/" in url:
try: try:
r = requests.get(url, timeout=30) r = requests.get(url, timeout=30)
logger.debug(f"Expanded url {url} to {r.url}")
url = r.url url = r.url
except Exception: except Exception as e:
logger.error(f"Failed to expand url {url}") logger.error(f"Failed to expand Twitter URL: {e}")
return url return url
def download(self, item: Metadata) -> Metadata: def download(self, item: Metadata) -> Metadata:
@@ -67,7 +66,7 @@ class TwitterApiExtractor(Extractor):
return False, False return False, False
username, tweet_id = matches[0] # only one URL supported username, tweet_id = matches[0] # only one URL supported
logger.debug(f"Found {username=} and {tweet_id=} in {url=}") logger.debug(f"Found {username=} and {tweet_id=}")
return username, tweet_id return username, tweet_id
@@ -85,7 +84,7 @@ class TwitterApiExtractor(Extractor):
media_fields=["type", "duration_ms", "url", "variants"], media_fields=["type", "duration_ms", "url", "variants"],
tweet_fields=["attachments", "author_id", "created_at", "entities", "id", "text", "possibly_sensitive"], tweet_fields=["attachments", "author_id", "created_at", "entities", "id", "text", "possibly_sensitive"],
) )
logger.debug(tweet) logger.debug(f"Got {tweet=}")
except Exception as e: except Exception as e:
logger.error(f"Could not get tweet: {e}") logger.error(f"Could not get tweet: {e}")
return False return False
@@ -115,6 +114,9 @@ class TwitterApiExtractor(Extractor):
logger.info(f"Found media {media}") logger.info(f"Found media {media}")
ext = mimetypes.guess_extension(mimetype) ext = mimetypes.guess_extension(mimetype)
media.filename = self.download_from_url(media.get("src"), f"{slugify(url)}_{i}{ext}") media.filename = self.download_from_url(media.get("src"), f"{slugify(url)}_{i}{ext}")
if not media.filename:
logger.warning(f"Failed to download media from {media.get('src')}")
continue
result.add_media(media) result.add_media(media)
result.set_content( result.set_content(


@@ -4,7 +4,7 @@ import os
import shutil import shutil
import subprocess import subprocess
from zipfile import ZipFile from zipfile import ZipFile
from loguru import logger from auto_archiver.utils.custom_logger import logger
from warcio.archiveiterator import ArchiveIterator from warcio.archiveiterator import ArchiveIterator
from auto_archiver.core import Media, Metadata from auto_archiver.core import Media, Metadata
@@ -24,8 +24,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
self.use_docker = os.environ.get("WACZ_ENABLE_DOCKER") or not os.environ.get("RUNNING_IN_DOCKER") self.use_docker = os.environ.get("WACZ_ENABLE_DOCKER") or not os.environ.get("RUNNING_IN_DOCKER")
self.docker_in_docker = os.environ.get("WACZ_ENABLE_DOCKER") and os.environ.get("RUNNING_IN_DOCKER") self.docker_in_docker = os.environ.get("WACZ_ENABLE_DOCKER") and os.environ.get("RUNNING_IN_DOCKER")
self.crawl_id = random_str(8) self.cwd_dind = f"/crawls/crawls{random_str(8)}"
self.cwd_dind = f"/crawls/crawls{self.crawl_id}"
self.browsertrix_home_host = os.environ.get("BROWSERTRIX_HOME_HOST") self.browsertrix_home_host = os.environ.get("BROWSERTRIX_HOME_HOST")
self.browsertrix_home_container = os.environ.get("BROWSERTRIX_HOME_CONTAINER") or self.browsertrix_home_host self.browsertrix_home_container = os.environ.get("BROWSERTRIX_HOME_CONTAINER") or self.browsertrix_home_host
# create crawls folder if not exists, so it can be safely removed in cleanup # create crawls folder if not exists, so it can be safely removed in cleanup
@@ -51,7 +50,8 @@ class WaczExtractorEnricher(Enricher, Extractor):
url = to_enrich.get_url() url = to_enrich.get_url()
collection = self.crawl_id crawl_id = random_str(8)
collection = crawl_id
browsertrix_home_host = self.browsertrix_home_host or os.path.abspath(self.tmp_dir) browsertrix_home_host = self.browsertrix_home_host or os.path.abspath(self.tmp_dir)
browsertrix_home_container = self.browsertrix_home_container or browsertrix_home_host browsertrix_home_container = self.browsertrix_home_container or browsertrix_home_host
@@ -83,8 +83,10 @@ class WaczExtractorEnricher(Enricher, Extractor):
# "--blockAds" # note: this has been known to cause issues on cloudflare protected sites # "--blockAds" # note: this has been known to cause issues on cloudflare protected sites
] ]
crawl_cwd_dind = os.path.join(self.cwd_dind, crawl_id)
if self.docker_in_docker: if self.docker_in_docker:
cmd.extend(["--cwd", self.cwd_dind]) os.makedirs(crawl_cwd_dind, exist_ok=True)
cmd.extend(["--cwd", crawl_cwd_dind])
if self.auth_for_site(url): if self.auth_for_site(url):
# there's an auth for this site, but browsertrix only supports username/password auth # there's an auth for this site, but browsertrix only supports username/password auth
@@ -94,7 +96,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
# call docker if explicitly enabled or we are running on the host (not in docker) # call docker if explicitly enabled or we are running on the host (not in docker)
if self.use_docker: if self.use_docker:
logger.debug(f"generating WACZ in Docker for {url=}") logger.debug("Generating WACZ in Docker")
logger.debug(f"{browsertrix_home_host=} {browsertrix_home_container=}") logger.debug(f"{browsertrix_home_host=} {browsertrix_home_container=}")
if self.docker_commands: if self.docker_commands:
cmd = self.docker_commands + cmd cmd = self.docker_commands + cmd
@@ -109,14 +111,14 @@ class WaczExtractorEnricher(Enricher, Extractor):
] + cmd ] + cmd
if self.profile: if self.profile:
profile_file = f"profile-{self.crawl_id}.tar.gz" profile_file = f"profile-{crawl_id}.tar.gz"
profile_fn = os.path.join(browsertrix_home_container, profile_file) profile_fn = os.path.join(browsertrix_home_container, profile_file)
logger.debug(f"copying {self.profile} to {profile_fn}") logger.debug(f"Copying {self.profile} to {profile_fn}")
shutil.copyfile(self.profile, profile_fn) shutil.copyfile(self.profile, profile_fn)
cmd.extend(["--profile", os.path.join("/crawls", profile_file)]) cmd.extend(["--profile", os.path.join("/crawls", profile_file)])
else: else:
logger.debug(f"generating WACZ without Docker for {url=}") logger.debug("Generating WACZ without Docker")
if self.profile: if self.profile:
cmd.extend(["--profile", os.path.join("/app", str(self.profile))]) cmd.extend(["--profile", os.path.join("/app", str(self.profile))])
@@ -137,7 +139,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
return False return False
if self.docker_in_docker: if self.docker_in_docker:
wacz_fn = os.path.join(self.cwd_dind, "collections", collection, f"{collection}.wacz") wacz_fn = os.path.join(crawl_cwd_dind, "collections", collection, f"{collection}.wacz")
elif self.use_docker: elif self.use_docker:
wacz_fn = os.path.join(browsertrix_home_container, "collections", collection, f"{collection}.wacz") wacz_fn = os.path.join(browsertrix_home_container, "collections", collection, f"{collection}.wacz")
else: else:
@@ -152,7 +154,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
self.extract_media_from_wacz(to_enrich, wacz_fn) self.extract_media_from_wacz(to_enrich, wacz_fn)
if self.docker_in_docker: if self.docker_in_docker:
jsonl_fn = os.path.join(self.cwd_dind, "collections", collection, "pages", "pages.jsonl") jsonl_fn = os.path.join(crawl_cwd_dind, "collections", collection, "pages", "pages.jsonl")
elif self.use_docker: elif self.use_docker:
jsonl_fn = os.path.join(browsertrix_home_container, "collections", collection, "pages", "pages.jsonl") jsonl_fn = os.path.join(browsertrix_home_container, "collections", collection, "pages", "pages.jsonl")
else: else:


@@ -1,8 +1,8 @@
import json import json
from loguru import logger from auto_archiver.utils.custom_logger import logger
import time import time
import requests import requests
from urllib3.exceptions import MaxRetryError
from auto_archiver.core import Extractor, Enricher from auto_archiver.core import Extractor, Enricher
from auto_archiver.utils import url as UrlUtil from auto_archiver.utils import url as UrlUtil
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -31,21 +31,28 @@ class WaybackExtractorEnricher(Enricher, Extractor):
url = to_enrich.get_url() url = to_enrich.get_url()
if UrlUtil.is_auth_wall(url): if UrlUtil.is_auth_wall(url):
logger.debug(f"[SKIP] WAYBACK since url is behind AUTH WALL: {url=}") logger.debug("[SKIP] WAYBACK since url is behind AUTH WALL")
return return
logger.debug(f"calling wayback for {url=}")
if to_enrich.get("wayback"): if to_enrich.get("wayback"):
logger.info(f"Wayback enricher had already been executed: {to_enrich.get('wayback')}") logger.info(f"Wayback enricher had already been executed: {to_enrich.get('wayback')}")
return True return True
logger.debug("Calling Wayback")
ia_headers = {"Accept": "application/json", "Authorization": f"LOW {self.key}:{self.secret}"} ia_headers = {"Accept": "application/json", "Authorization": f"LOW {self.key}:{self.secret}"}
post_data = {"url": url} post_data = {"url": url}
if self.if_not_archived_within: if self.if_not_archived_within:
post_data["if_not_archived_within"] = self.if_not_archived_within post_data["if_not_archived_within"] = self.if_not_archived_within
# see https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA for more options # see https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA for more options
r = requests.post("https://web.archive.org/save/", headers=ia_headers, data=post_data, proxies=proxies) try:
r = requests.post("https://web.archive.org/save/", headers=ia_headers, data=post_data, proxies=proxies)
except MaxRetryError as e:
logger.warning(
f"MaxRetryError during Wayback POST call to /save, this may be due to a high number of calls leading to rate limiting: {e}"
)
to_enrich.set("wayback", "failed: possible rate limit")
return False
if r.status_code != 200: if r.status_code != 200:
logger.error(em := f"Internet archive failed with status of {r.status_code}: {r.json()}") logger.error(em := f"Internet archive failed with status of {r.status_code}: {r.json()}")
@@ -68,7 +75,7 @@ class WaybackExtractorEnricher(Enricher, Extractor):
attempt = 1 attempt = 1
while not wayback_url and time.time() - start_time <= self.timeout: while not wayback_url and time.time() - start_time <= self.timeout:
try: try:
logger.debug(f"GETting status for {job_id=} on {url=} ({attempt=})") logger.debug(f"GETting status for {job_id=} ({attempt=})")
r_status = requests.get( r_status = requests.get(
f"https://web.archive.org/save/status/{job_id}", headers=ia_headers, proxies=proxies f"https://web.archive.org/save/status/{job_id}", headers=ia_headers, proxies=proxies
) )
@@ -76,16 +83,19 @@ class WaybackExtractorEnricher(Enricher, Extractor):
if r_status.status_code == 200 and r_json["status"] == "success": if r_status.status_code == 200 and r_json["status"] == "success":
wayback_url = f"https://web.archive.org/web/{r_json['timestamp']}/{r_json['original_url']}" wayback_url = f"https://web.archive.org/web/{r_json['timestamp']}/{r_json['original_url']}"
elif r_status.status_code != 200 or r_json["status"] != "pending": elif r_status.status_code != 200 or r_json["status"] != "pending":
if r_json.get("status_ext") in ["error:blocked-url", "error:unauthorized"]:
logger.warning("Wayback cannot currently archive the URL, skipping.")
to_enrich.set("wayback", r_json.get("status_ext"))
logger.error(f"Wayback failed with {r_json}") logger.error(f"Wayback failed with {r_json}")
return False return False
except requests.exceptions.RequestException as e: except requests.exceptions.RequestException as e:
logger.warning(f"RequestException: fetching status for {url=} due to: {e}") logger.warning(f"RequestException: fetching status due to: {e}")
break break
except json.decoder.JSONDecodeError: except json.decoder.JSONDecodeError:
logger.error(f"Expected a JSON from Wayback and got {r.text} for {url=}") logger.error(f"Expected a JSON from Wayback and got {r_status.text}")
break break
except Exception as e: except Exception as e:
logger.warning(f"error fetching status for {url=} due to: {e}") logger.warning(f"error fetching status due to: {e}")
if not wayback_url: if not wayback_url:
attempt += 1 attempt += 1
time.sleep(1) # TODO: can be improved with exponential backoff time.sleep(1) # TODO: can be improved with exponential backoff
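
The `TODO` on the last line of this hunk mentions exponential backoff for the status polling loop. A minimal sketch of one way to do that is below. It is not the project's code: `backoff_delays`, `poll_until` and their parameters are hypothetical, and the `sleep` and `clock` arguments exist only so the example can run without real waiting.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=30.0):
    """Yield an unbounded sequence of waits: base, base*factor, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def poll_until(check, timeout, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a truthy value or the time budget runs out."""
    start = clock()
    for delay in backoff_delays():
        result = check()
        if result:
            return result
        if clock() - start + delay > timeout:
            return None  # the next wait would exceed the time budget
        sleep(delay)


# Simulated status endpoint: pending twice, then done.
answers = iter([None, None, "done"])
waits = []  # records the requested sleeps instead of actually sleeping
found = poll_until(lambda: next(answers), timeout=60, sleep=waits.append)
```

In the enricher, `check` would wrap the GET to the status endpoint and return the Wayback URL on success. Capping the delay keeps long-running jobs from being polled too rarely.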


@@ -1,7 +1,7 @@
import traceback import traceback
import requests import requests
import time import time
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -25,7 +25,7 @@ class WhisperEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() url = to_enrich.get_url()
logger.debug(f"WHISPER[{self.action}]: iterating media items for {url=}.") logger.debug(f"WHISPER[{self.action}]: iterating media items")
job_results = {} job_results = {}
for i, m in enumerate(to_enrich.media): for i, m in enumerate(to_enrich.media):
@@ -35,7 +35,7 @@ class WhisperEnricher(Enricher):
try: try:
job_id = self.submit_job(m) job_id = self.submit_job(m)
job_results[job_id] = False job_results[job_id] = False
logger.debug(f"JOB SUBMITTED: {job_id=} for {m.key=}") logger.debug(f"Job submitted: {job_id=} for {m.key=}")
to_enrich.media[i].set("whisper_model", {"job_id": job_id}) to_enrich.media[i].set("whisper_model", {"job_id": job_id})
except Exception as e: except Exception as e:
logger.error( logger.error(
@@ -72,14 +72,14 @@ class WhisperEnricher(Enricher):
"type": self.action, "type": self.action,
# "language": "string" # may be a config # "language": "string" # may be a config
} }
logger.debug(f"calling API with {payload=}") logger.debug(f"Calling API with {payload=}")
response = requests.post( response = requests.post(
f"{self.api_endpoint}/jobs", json=payload, headers={"Authorization": f"Bearer {self.api_key}"} f"{self.api_endpoint}/jobs", json=payload, headers={"Authorization": f"Bearer {self.api_key}"}
) )
assert response.status_code == 201, ( assert response.status_code == 201, (
f"calling the whisper api {self.api_endpoint} returned a non-success code: {response.status_code}" f"calling the whisper api {self.api_endpoint} returned a non-success code: {response.status_code}"
) )
logger.debug(response.json()) logger.debug(f"Response from whisper API: {response.json()}")
return response.json()["id"] return response.json()["id"]
def check_jobs(self, job_results: dict): def check_jobs(self, job_results: dict):
@@ -115,7 +115,7 @@ class WhisperEnricher(Enricher):
assert r_res.status_code == 200, ( assert r_res.status_code == 200, (
f"Job artifacts did not respond with 200, instead with: {r_res.status_code}" f"Job artifacts did not respond with 200, instead with: {r_res.status_code}"
) )
logger.success(r_res.json()) logger.info(f"Job {job_id} completed successfully: {r_res.json()}")
result = {} result = {}
for art_id, artifact in enumerate(r_res.json()): for art_id, artifact in enumerate(r_res.json()):
subtitle = [] subtitle = []


@@ -0,0 +1,66 @@
from loguru import logger
import json
def type_serializer(obj):
"""Fallback function for objects json can't handle."""
if isinstance(obj, type):
return obj.__name__
return str(obj)
def extract_location(record, short=False):
"""Extracts the file name, function name, and line number from the log record."""
if short:
return f"{record['file'].name}:{record['line']}"
return f"{record['file'].name}:{record['function']}:{record['line']}"
def extract_log_data(record):
subset = {
"level": record["level"].name,
"time": record["time"].isoformat(timespec="seconds"),
}
subset["loc"] = extract_location(record)
# This is where logger.contextualize() parameters can be added to the output
for extra_key in ["trace", "url", "worksheet", "row"]:
if extra_val := record.get("extra", {}).get(extra_key):
subset[extra_key] = extra_val
subset["message"] = record["message"]
if exception := record.get("exception"):
subset["exception"] = exception
return subset
def serialize_for_console(record):
subset = extract_log_data(record)
subset.pop("message", None)
subset.pop("level", None)
subset.pop("loc", None)
subset.pop("time", None)
if not subset:
return ""
return json.dumps(subset, ensure_ascii=False, default=type_serializer)
def serialize(record):
return json.dumps(extract_log_data(record), ensure_ascii=False, default=type_serializer)
def patching(record):
record["extra"]["serialized"] = serialize(record)
record["extra"]["serialize_for_console"] = serialize_for_console(record)
def format_for_human_readable_console():
return (
"<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | "
"<level>{level: <8}</level> | "
"<cyan>{file}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> | "
"{extra[serialize_for_console]} <level>{message}</level>"
)
logger = logger.patch(patching)
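
To illustrate what `serialize_for_console` contributes to each console line, here is a small standard-library sketch. It assumes a hand-built dictionary in place of a real loguru record, and `console_context` is a hypothetical helper that mirrors only the whitelisted-extras part of the function above (it leaves out the `exception` field).

```python
import json
from types import SimpleNamespace

# Hand-built stand-in for a loguru record; real records carry more fields.
record = {
    "level": SimpleNamespace(name="INFO"),
    "file": SimpleNamespace(name="wayback.py"),
    "function": "enrich",
    "line": 42,
    "message": "Calling Wayback",
    "extra": {"url": "https://example.com/post/1", "row": 7, "unrelated": "ignored"},
}

# Same keys that extract_log_data copies from record["extra"].
CONTEXT_KEYS = ["trace", "url", "worksheet", "row"]


def console_context(record):
    """Keep only the non-empty whitelisted extras and render them as JSON."""
    context = {k: record["extra"][k] for k in CONTEXT_KEYS if record["extra"].get(k)}
    return json.dumps(context, ensure_ascii=False) if context else ""


prefix = console_context(record)
```

Here `prefix` holds the `url` and `row` values as a JSON object, which the console format places before the message. This is presumably why the other hunks in this comparison drop `{url=}` from individual log messages: the URL can be attached once as context instead of being repeated in every call.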


@@ -0,0 +1,273 @@
"""
Deletion Detection Utilities
Provides a best-effort detection of deleted, missing, or unavailable content
across various social media platforms based on presence of expected keywords.
This module helps identify removed content in order to:
- Document content that existed but was deleted
- Track patterns of content removal
- Preserve metadata about missing content
"""
from typing import Optional, Dict, List
from auto_archiver.utils.custom_logger import logger
from urllib.parse import urlparse
class DeletionIndicators:
"""
Platform-specific indicators that content has been deleted or is unavailable, alongside generic indicators.
"""
# Twitter/X deletion indicators
TWITTER = [
"Hmm...this page doesn't exist",
"Try searching for something else",
"This Tweet is unavailable",
"This account doesn't exist",
"This Tweet has been deleted",
"This account has been suspended",
"Sorry, that page doesn't exist",
"The Tweet you're looking for isn't available",
]
# Facebook deletion indicators
FACEBOOK = [
"This content isn't available",
"Sorry, this content isn't available",
"This content is no longer available",
"The link you followed may be broken",
"Page Not Found",
"Content Not Found",
"This content is no longer on Facebook",
]
# Instagram deletion indicators
INSTAGRAM = [
"Sorry, this page isn't available",
"The link you followed may be broken",
"Media not found or unavailable",
"This post is no longer available",
"This account is private",
]
# TikTok deletion indicators
TIKTOK = [
"Couldn't find this account",
"This video is no longer available",
"This video is currently unavailable",
"Video not found",
"This video may have been deleted",
]
# YouTube deletion indicators
YOUTUBE = [
"This video isn't available anymore",
"Video unavailable",
"This video has been removed",
"This video is no longer available",
"This video is private",
"This video has been removed by the uploader",
"This video has been deleted",
]
# Reddit deletion indicators
REDDIT = [
"this post has been removed",
"this comment has been removed",
"[removed]",
"[deleted]",
"page not found",
"there doesn't seem to be anything here",
]
# VK deletion indicators
VK = [
"Post deleted",
"Page not found",
"Content unavailable",
"Access denied",
]
# Telegram deletion indicators
TELEGRAM = [
"Message not found",
"Deleted message",
"Channel is private",
]
# Generic indicators (work across platforms)
GENERIC = [
"has been removed",
"no longer available",
"content removed",
"access denied",
"page not found",
]
@classmethod
def all_indicators(cls) -> List[str]:
"""Returns all deletion indicators from all platforms."""
return (
cls.TWITTER
+ cls.FACEBOOK
+ cls.INSTAGRAM
+ cls.TIKTOK
+ cls.YOUTUBE
+ cls.REDDIT
+ cls.VK
+ cls.TELEGRAM
+ cls.GENERIC
)
@classmethod
def for_url(cls, url: str) -> List[str]:
"""Returns platform-specific indicators based on URL domain."""
platform = _extract_platform(url)
indicators_map = {
"twitter": cls.TWITTER + cls.GENERIC,
"facebook": cls.FACEBOOK + cls.GENERIC,
"instagram": cls.INSTAGRAM + cls.GENERIC,
"tiktok": cls.TIKTOK + cls.GENERIC,
"youtube": cls.YOUTUBE + cls.GENERIC,
"reddit": cls.REDDIT + cls.GENERIC,
"vk": cls.VK + cls.GENERIC,
"telegram": cls.TELEGRAM + cls.GENERIC,
}
return indicators_map.get(platform, cls.GENERIC)
def detect_deletion(
html_content: str = None,
page_title: str = None,
error_message: str = None,
url: str = None,
video_data: dict = None,
) -> Optional[Dict[str, any]]:
"""
Best-effort deletion detection across multiple signals.
Checks HTML content, page titles, error messages, and video metadata for
indicators that content has been deleted or is unavailable.
Args:
html_content: Raw HTML source of the page
page_title: Browser page title
error_message: Any error message from the extractor
url: The URL being archived (for platform-specific detection)
video_data: Video metadata from yt-dlp or other extractors
Returns:
Dictionary with deletion details if detected, None otherwise.
Format: {
"is_deleted": True,
"indicator": "specific text that was found",
"source": "html_content|page_title|error_message|video_metadata|video_metadata_<key>",
"platform": "twitter|facebook|etc"
}
"""
# Determine indicators to check based on URL
if url:
indicators = DeletionIndicators.for_url(url)
platform = _extract_platform(url)
else:
indicators = DeletionIndicators.all_indicators()
platform = "unknown"
# Check HTML content
if html_content:
for indicator in indicators:
if indicator.lower() in html_content.lower():
logger.info(f"Deletion detected in HTML: '{indicator}' found for {url}")
return {"is_deleted": True, "indicator": indicator, "source": "html_content", "platform": platform}
# Check page title
if page_title:
for indicator in indicators:
if indicator.lower() in page_title.lower():
logger.info(f"Deletion detected in page title: '{indicator}' found for {url}")
return {"is_deleted": True, "indicator": indicator, "source": "page_title", "platform": platform}
# Check error messages
if error_message:
for indicator in indicators:
if indicator.lower() in str(error_message).lower():
logger.info(f"Deletion detected in error: '{indicator}' found for {url}")
return {"is_deleted": True, "indicator": indicator, "source": "error_message", "platform": platform}
# Check video metadata (from yt-dlp)
if video_data:
# Check if yt-dlp flagged it as unavailable
if video_data.get("availability") in ["unavailable", "private", "deleted"]:
logger.info(f"Deletion detected in metadata: availability={video_data.get('availability')}")
return {
"is_deleted": True,
"indicator": f"availability: {video_data.get('availability')}",
"source": "video_metadata",
"platform": platform,
}
# Check description/title for deletion indicators
for key in ["title", "description", "fulltitle"]:
if key in video_data:
for indicator in indicators:
if indicator.lower() in str(video_data[key]).lower():
logger.info(f"Deletion detected in {key}: '{indicator}'")
return {
"is_deleted": True,
"indicator": indicator,
"source": f"video_metadata_{key}",
"platform": platform,
}
return None
def _extract_platform(url: str) -> str:
"""Extracts platform name from URL."""
parsed = urlparse(url)
domain = parsed.netloc
if "twitter.com" in domain or "x.com" in domain:
return "twitter"
elif "facebook.com" in domain or "fb.com" in domain:
return "facebook"
elif "instagram.com" in domain:
return "instagram"
elif "tiktok.com" in domain:
return "tiktok"
elif "youtube.com" in domain or "youtu.be" in domain:
return "youtube"
elif "reddit.com" in domain:
return "reddit"
elif "vk.com" in domain:
return "vk"
elif "t.me" in domain:
return "telegram"
return "unknown"
def flag_as_deleted(metadata, deletion_info: Dict[str, any]) -> None:
"""
Flags metadata object as deleted/unavailable.
Adds tentative deletion information to the metadata object.
Args:
metadata: Metadata object to update
deletion_info: Dictionary from detect_deletion()
"""
metadata.set("deletion_detected", True)
metadata.set("deletion_indicator", deletion_info.get("indicator"))
metadata.set("deletion_source", deletion_info.get("source"))
metadata.set("deletion_platform", deletion_info.get("platform"))
metadata.status = "deleted_or_unavailable"
logger.debug(
f"Content marked as deleted/unavailable: "
f"platform={deletion_info.get('platform')}, "
f"indicator='{deletion_info.get('indicator')}', "
f"source={deletion_info.get('source')}"
)
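
Review note on `_extract_platform`: the checks use substring matching on the host, so `"x.com" in domain` is also true for a host such as `www.netflix.com`, which would then receive the Twitter indicators. The sketch below shows one possible stricter variant that matches the exact domain or one of its subdomains. `PLATFORM_DOMAINS` and `platform_for` are hypothetical names, not part of the module.

```python
from urllib.parse import urlparse

# Same platforms and domains as _extract_platform, kept in one table.
PLATFORM_DOMAINS = {
    "twitter": ("twitter.com", "x.com"),
    "facebook": ("facebook.com", "fb.com"),
    "instagram": ("instagram.com",),
    "tiktok": ("tiktok.com",),
    "youtube": ("youtube.com", "youtu.be"),
    "reddit": ("reddit.com",),
    "vk": ("vk.com",),
    "telegram": ("t.me",),
}


def platform_for(url):
    """Return the platform whose domain equals the host or is a parent of it."""
    # urlparse().hostname is lowercased and excludes any port; None for bad input.
    host = urlparse(url).hostname or ""
    for platform, domains in PLATFORM_DOMAINS.items():
        if any(host == d or host.endswith("." + d) for d in domains):
            return platform
    return "unknown"


print(platform_for("https://x.com/user/status/1"))
print(platform_for("https://www.netflix.com/title/1"))
```

The first call prints `twitter` and the second prints `unknown`, whereas the substring version would report `twitter` for both.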


@@ -6,8 +6,7 @@ import uuid
from datetime import datetime, timezone from datetime import datetime, timezone
from dateutil.parser import parse as parse_dt from dateutil.parser import parse as parse_dt
import requests from auto_archiver.utils.custom_logger import logger
from loguru import logger
def mkdir_if_not_exists(folder): def mkdir_if_not_exists(folder):
@@ -15,18 +14,6 @@ def mkdir_if_not_exists(folder):
os.makedirs(folder) os.makedirs(folder)
def expand_url(url):
# expand short URL links
if "https://t.co/" in url:
try:
r = requests.get(url)
logger.debug(f"Expanded url {url} to {r.url}")
return r.url
except Exception:
logger.error(f"Failed to expand url {url}")
return url
def getattr_or(o: object, prop: str, default=None): def getattr_or(o: object, prop: str, default=None):
try: try:
res = getattr(o, prop) res = getattr(o, prop)


@@ -9,7 +9,7 @@ from tempfile import TemporaryDirectory
from typing import Dict, Tuple from typing import Dict, Tuple
import hashlib import hashlib
from loguru import logger from auto_archiver.utils.custom_logger import logger
import pytest import pytest
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media
from auto_archiver.core.module import ModuleFactory from auto_archiver.core.module import ModuleFactory

1
tests/core/__init__.py Normal file

@@ -0,0 +1 @@
# Core module tests

198
tests/core/test_media.py Normal file

@@ -0,0 +1,198 @@
"""
Tests for the Media class from auto_archiver.core.media
"""
import pytest
from unittest.mock import Mock, patch
from auto_archiver.core.media import Media
class TestMediaBasics:
"""Test basic Media properties and methods."""
def test_media_creation_with_filename(self):
media = Media(filename="test.mp4")
assert media.filename == "test.mp4"
assert media.urls == []
assert media.properties == {}
def test_media_key_property(self):
media = Media(filename="test.mp4", _key="my_key")
assert media.key == "my_key"
def test_media_set_get_properties(self):
media = Media(filename="test.mp4")
result = media.set("author", "John Doe")
assert result is media # returns self for chaining
assert media.get("author") == "John Doe"
assert media.get("nonexistent") is None
assert media.get("nonexistent", "default") == "default"
def test_media_add_url(self):
media = Media(filename="test.mp4")
media.add_url("https://example.com/test.mp4")
assert "https://example.com/test.mp4" in media.urls
media.add_url("https://cdn.example.com/test.mp4")
assert len(media.urls) == 2
class TestMediaMimetype:
"""Test mimetype detection and handling."""
@pytest.mark.parametrize(
"filename,expected_mimetype",
[
("video.mp4", "video/mp4"),
("image.jpg", "image/jpeg"),
("image.png", "image/png"),
("audio.mp3", "audio/mpeg"),
("document.pdf", "application/pdf"),
("text.txt", "text/plain"),
],
)
def test_mimetype_detection(self, filename, expected_mimetype):
media = Media(filename=filename)
assert media.mimetype == expected_mimetype
def test_mimetype_setter(self):
media = Media(filename="file.unknown")
media.mimetype = "custom/type"
assert media.mimetype == "custom/type"
def test_mimetype_empty_filename(self):
media = Media(filename="")
assert media.mimetype == ""
class TestMediaTypeChecks:
"""Test media type checking methods."""
@pytest.mark.parametrize(
"filename,is_video,is_audio,is_image",
[
("video.mp4", True, False, False),
("video.avi", True, False, False),
("audio.mp3", False, True, False),
("audio.wav", False, True, False),
            ("image.jpg", False, False, True),
            ("image.png", False, False, True),
            ("document.pdf", False, False, False),
        ],
    )
    def test_type_checks(self, filename, is_video, is_audio, is_image):
        media = Media(filename=filename)
        assert media.is_video() == is_video
        assert media.is_audio() == is_audio
        assert media.is_image() == is_image


class TestMediaStore:
    """Test media storage functionality."""

    def test_store_with_no_storages(self, caplog):
        media = Media(filename="test.mp4")
        metadata = Mock()
        media.store(metadata, storages=[])
        assert "No storages found" in caplog.text

    def test_store_with_storage(self):
        media = Media(filename="test.mp4")
        metadata = Mock()
        mock_storage = Mock()
        media.store(metadata, url="https://example.com", storages=[mock_storage])
        mock_storage.store.assert_called_once()


class TestMediaInnerMedia:
    """Test nested media retrieval."""

    def test_all_inner_media_no_nested(self):
        media = Media(filename="test.mp4")
        inner = list(media.all_inner_media(include_self=False))
        assert len(inner) == 0
        inner_with_self = list(media.all_inner_media(include_self=True))
        assert len(inner_with_self) == 1
        assert inner_with_self[0] is media

    def test_all_inner_media_with_nested(self):
        parent = Media(filename="parent.mp4")
        child = Media(filename="child.jpg")
        grandchild = Media(filename="grandchild.png")
        child.set("thumbnail", grandchild)
        parent.set("preview", child)
        inner = list(parent.all_inner_media(include_self=False))
        assert len(inner) == 2
        assert child in inner
        assert grandchild in inner

    def test_all_inner_media_with_list_property(self):
        parent = Media(filename="parent.mp4")
        child1 = Media(filename="frame1.jpg")
        child2 = Media(filename="frame2.jpg")
        parent.set("frames", [child1, child2])
        inner = list(parent.all_inner_media(include_self=False))
        assert len(inner) == 2
        assert child1 in inner
        assert child2 in inner


class TestMediaIsStored:
    """Test the is_stored method."""

    def test_is_stored_no_urls(self):
        media = Media(filename="test.mp4")
        storage = Mock()
        storage.config = {"steps": {"storages": ["s3", "local"]}}
        assert media.is_stored(storage) is False

    def test_is_stored_partial_urls(self):
        media = Media(filename="test.mp4")
        media.add_url("https://s3.example.com/test.mp4")
        storage = Mock()
        storage.config = {"steps": {"storages": ["s3", "local"]}}
        assert media.is_stored(storage) is False

    def test_is_stored_full_urls(self):
        media = Media(filename="test.mp4")
        media.add_url("https://s3.example.com/test.mp4")
        media.add_url("file:///local/test.mp4")
        storage = Mock()
        storage.config = {"steps": {"storages": ["s3", "local"]}}
        assert media.is_stored(storage) is True


class TestMediaValidVideo:
    """Test video validation functionality."""

    def test_is_valid_video_with_valid_probe(self):
        media = Media(filename="test.mp4")
        mock_streams = {"streams": [{"duration_ts": 1000}]}
        with patch("ffmpeg.probe", return_value=mock_streams):
            assert media.is_valid_video() is True

    def test_is_valid_video_with_no_duration(self):
        media = Media(filename="test.mp4")
        mock_streams = {"streams": [{"duration_ts": 0}]}
        with patch("ffmpeg.probe", return_value=mock_streams):
            assert media.is_valid_video() is False

    def test_is_valid_video_with_ffmpeg_error(self):
        media = Media(filename="test.mp4")
        with patch("ffmpeg.probe", side_effect=Exception("ffmpeg error")):
            with patch("os.path.getsize", return_value=100):
                # Falls back to file size check, small file
                assert media.is_valid_video() is False
            with patch("os.path.getsize", return_value=30000):
                # Falls back to file size check, larger file
                assert media.is_valid_video() is True
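
Reviewer note: the three `TestMediaValidVideo` cases imply a probe-then-fallback check. The sketch below is one way such a check could look. It is not the project's `Media.is_valid_video` implementation. The injectable `probe` argument and the `min_size_bytes` threshold of 20,000 are assumptions made for illustration; the tests only show that a 100-byte file is rejected and a 30,000-byte file is accepted.

```python
import os


def is_valid_video(path, probe, min_size_bytes=20_000):
    """Return True if `path` looks like a playable video.

    `probe` is a hypothetical callable standing in for a media prober;
    it is expected to return a dict with a "streams" list.
    `min_size_bytes` is an assumed threshold, not taken from the source.
    """
    try:
        streams = probe(path)["streams"]
        # Valid if at least one stream reports a non-zero duration.
        return any(stream.get("duration_ts", 0) > 0 for stream in streams)
    except Exception:
        # Probing failed: fall back to a crude file size heuristic.
        return os.path.getsize(path) > min_size_bytes
```

Passing the prober in as an argument keeps the sketch testable without patching a module-level import, which is the role `patch("ffmpeg.probe", ...)` plays in the tests above.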


@@ -0,0 +1,98 @@
"""
Tests for validators module from auto_archiver.core.validators
"""

import argparse
import json

import pytest

from auto_archiver.core.validators import positive_number, valid_file, json_loader


class TestPositiveNumber:
    """Test the positive_number validator."""

    @pytest.mark.parametrize(
        "value,expected",
        [
            (0, 0),
            (1, 1),
            (100, 100),
            (0.5, 0.5),
            (999999, 999999),
        ],
    )
    def test_positive_values(self, value, expected):
        assert positive_number(value) == expected

    @pytest.mark.parametrize(
        "value",
        [
            -1,
            -100,
            -0.5,
            -999999,
        ],
    )
    def test_negative_values_raise_error(self, value):
        with pytest.raises(argparse.ArgumentTypeError) as exc_info:
            positive_number(value)
        assert "not a positive number" in str(exc_info.value)


class TestValidFile:
    """Test the valid_file validator."""

    def test_valid_file_exists(self, tmp_path):
        test_file = tmp_path / "test.txt"
        test_file.write_text("test content")
        result = valid_file(str(test_file))
        assert result == str(test_file)

    def test_valid_file_not_exists(self):
        with pytest.raises(argparse.ArgumentTypeError) as exc_info:
            valid_file("/nonexistent/path/to/file.txt")
        assert "does not exist" in str(exc_info.value)

    def test_valid_file_directory_not_file(self, tmp_path):
        # A directory is not a file
        with pytest.raises(argparse.ArgumentTypeError) as exc_info:
            valid_file(str(tmp_path))
        assert "does not exist" in str(exc_info.value)


class TestJsonLoader:
    """Test the json_loader validator."""

    @pytest.mark.parametrize(
        "json_str,expected",
        [
            ('{"key": "value"}', {"key": "value"}),
            ('{"number": 123}', {"number": 123}),
            ('{"list": [1, 2, 3]}', {"list": [1, 2, 3]}),
            ('{"nested": {"inner": "value"}}', {"nested": {"inner": "value"}}),
            ("[]", []),
            ("[1, 2, 3]", [1, 2, 3]),
            ('"string"', "string"),
            ("123", 123),
            ("true", True),
            ("false", False),
            ("null", None),
        ],
    )
    def test_valid_json(self, json_str, expected):
        assert json_loader(json_str) == expected

    @pytest.mark.parametrize(
        "invalid_json",
        [
            "{invalid}",
            "{'single': 'quotes'}",
            "{missing: quotes}",
            '{"unclosed": "brace"',
            "",
        ],
    )
    def test_invalid_json_raises_error(self, invalid_json):
        with pytest.raises(json.JSONDecodeError):
            json_loader(invalid_json)
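
Reviewer note: for readers without the module open, the following is a minimal sketch of validators that would satisfy the tests above. It is inferred from the assertions only and is not copied from `auto_archiver.core.validators`. The exact error message wording beyond the asserted substrings is an assumption.

```python
import argparse
import json
import os


def positive_number(value):
    """Accept zero and positive numbers; reject negatives (as the tests do)."""
    if value < 0:
        # Message wording is assumed; the tests only check the substring.
        raise argparse.ArgumentTypeError(f"{value} is not a positive number")
    return value


def valid_file(value):
    """Accept a path only if it points at an existing regular file."""
    if not os.path.isfile(value):
        # Directories also fail this check, matching the directory test case.
        raise argparse.ArgumentTypeError(f"File '{value}' does not exist.")
    return value


def json_loader(cli_value):
    """Parse a JSON string; json.JSONDecodeError propagates on bad input."""
    return json.loads(cli_value)
```

Raising `argparse.ArgumentTypeError` lets such functions be used directly as `type=` callables in `argparse`, which turns the message into a normal CLI usage error.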


@@ -1,6 +1,6 @@
 from auto_archiver.core import Extractor
-from loguru import logger
+from auto_archiver.utils.custom_logger import logger
 class ExampleExtractor(Extractor):


@@ -1,6 +1,6 @@
 from auto_archiver.core import Extractor, Enricher, Feeder, Database, Storage, Formatter, Metadata
-from loguru import logger
+from auto_archiver.utils.custom_logger import logger
 class ExampleModule(Extractor, Enricher, Feeder, Database, Storage, Formatter):


@@ -29,7 +29,7 @@ def test_fetch_fail_status(api_db, metadata, mocker):
     mock_get = mocker.patch("auto_archiver.modules.api_db.api_db.requests.get")
     mock_get.return_value.status_code = 400
     mock_get.return_value.json.return_value = {}
-    mock_error = mocker.patch("loguru.logger.error")
+    mock_error = mocker.patch("auto_archiver.utils.custom_logger.logger.error")
     assert api_db.fetch(metadata) is False
     mock_error.assert_called_once_with("AA API FAIL (400): {}")
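
Reviewer note: the patch target changes because a mock only intercepts calls on the object the code under test actually uses. The sketch below shows the same idea with the standard library only. The `logger` and `fetch` names here are hypothetical stand-ins, not the project's `custom_logger` or `api_db.fetch`.

```python
import logging
from unittest.mock import patch

# Hypothetical stand-in for the logger object the module under test uses.
logger = logging.getLogger("example")


def fetch(status_code, body):
    """Hypothetical fetch: logs an error and returns False on a non-200 status."""
    if status_code != 200:
        logger.error(f"AA API FAIL ({status_code}): {body}")
        return False
    return True


# Patch the `error` method on the exact logger object that fetch() calls.
with patch.object(logger, "error") as mock_error:
    result = fetch(400, {})

mock_error.assert_called_once_with("AA API FAIL (400): {}")
```

Had the mock been attached to a different logger object, `fetch` would still have logged through the real one and the final assertion would fail.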


@@ -0,0 +1,62 @@
"""
Tests for the ConsoleDb module
"""

import pytest


@pytest.fixture
def console_db(setup_module):
    return setup_module("console_db")


class TestConsoleDb:
    """Test the ConsoleDb functionality."""

    def test_started_logs_info(self, console_db, make_item, caplog):
        """Test that started() logs an info message."""
        item = make_item("https://example.com/test")
        with caplog.at_level("INFO"):
            console_db.started(item)
        assert "STARTED" in caplog.text
        assert "example.com" in caplog.text

    def test_failed_logs_error(self, console_db, make_item, caplog):
        """Test that failed() logs an error message with reason."""
        item = make_item("https://example.com/test")
        reason = "Connection timeout"
        with caplog.at_level("ERROR"):
            console_db.failed(item, reason)
        assert "FAILED" in caplog.text
        assert "Connection timeout" in caplog.text

    def test_aborted_logs_warning(self, console_db, make_item, caplog):
        """Test that aborted() logs a warning message."""
        item = make_item("https://example.com/test")
        with caplog.at_level("WARNING"):
            console_db.aborted(item)
        assert "ABORTED" in caplog.text

    def test_done_logs_success(self, console_db, make_item, caplog):
        """Test that done() logs a success message."""
        item = make_item("https://example.com/test")
        with caplog.at_level("INFO"):
            console_db.done(item)
        assert "DONE" in caplog.text

    def test_done_cached(self, console_db, make_item, caplog):
        """Test done() with cached=True (should behave the same)."""
        item = make_item("https://example.com/test")
        with caplog.at_level("INFO"):
            console_db.done(item, cached=True)
        assert "DONE" in caplog.text
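
Reviewer note: the behaviour these tests pin down is one log line per lifecycle event, at a level matching the event. A minimal sketch follows, assuming the standard `logging` module and plain URL strings. The real module takes metadata items and logs through the project's own logger, so `ConsoleDbSketch` and `ListHandler` are illustrative names only.

```python
import logging

logger = logging.getLogger("console_db_sketch")


class ConsoleDbSketch:
    """Hypothetical console database: reports each lifecycle event as a log line."""

    def started(self, url):
        logger.info(f"STARTED {url}")

    def failed(self, url, reason):
        logger.error(f"FAILED {url}: {reason}")

    def aborted(self, url):
        logger.warning(f"ABORTED {url}")

    def done(self, url, cached=False):
        # `cached` is accepted but does not change the output, as in the tests.
        logger.info(f"DONE {url}")


class ListHandler(logging.Handler):
    """Collects formatted messages in memory, similar in spirit to pytest's caplog."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

db = ConsoleDbSketch()
db.started("https://example.com/test")
db.failed("https://example.com/test", "Connection timeout")
db.aborted("https://example.com/test")
db.done("https://example.com/test", cached=True)
```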
