Compare commits

16 Commits

Author SHA1 Message Date
Miguel Sozinho Ramalho
52a7cabaf1 Merge pull request #402 from bellingcat/dev
bug fix: wacz screenshots leak in shared session
2026-02-25 10:39:54 +00:00
msramalho
a739361e12 bug fix: wacz screenshots leak in shared session 2026-02-23 16:26:36 +00:00
Miguel Sozinho Ramalho
9a97fede43 Merge pull request #401 from bellingcat/dev
Dependencies maintenance.
2026-02-23 13:27:51 +00:00
msramalho
2d13077fad bumping ruff version 2026-02-23 12:36:53 +00:00
msramalho
8a4a314cf9 ruff python version to dev version 2026-02-23 12:32:24 +00:00
msramalho
75e8b788ae revert ruff workflow changes 2026-02-23 12:31:20 +00:00
msramalho
defe2315bf docs updates 2026-02-23 12:28:25 +00:00
msramalho
ba0dffdd5e Merge branch 'dev' of github.com:bellingcat/auto-archiver into dev 2026-02-23 12:18:58 +00:00
msramalho
a09927c507 minor docs fix 2026-02-23 12:18:47 +00:00
Miguel Sozinho Ramalho
6c938c489a Merge pull request #392 from bellingcat/dependabot/github_actions/actions-bc0df0c757
Bump the actions group with 5 updates
2026-02-23 11:28:24 +00:00
msramalho
0e39768da9 version bumping settings script 2026-02-23 11:27:12 +00:00
msramalho
1e5d6ec4a6 version bump: minor 2026-02-23 11:23:40 +00:00
msramalho
3385d004cf yt-dlp to latest version 2026-02-23 11:23:26 +00:00
msramalho
7f27f7fce0 closes #383 fixing browsertrix-crawler at 1.11.4 2026-02-23 11:23:06 +00:00
msramalho
a6e3240af1 closes #399 and global dependency updates 2026-02-23 11:13:31 +00:00
dependabot[bot]
bf4c196cc2 Bump the actions group with 5 updates
Bumps the actions group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `4` | `6` |
| [docker/login-action](https://github.com/docker/login-action) | `3.4.0` | `3.7.0` |
| [docker/metadata-action](https://github.com/docker/metadata-action) | `5.7.0` | `5.10.0` |
| [actions/setup-python](https://github.com/actions/setup-python) | `5` | `6` |
| [actions/cache](https://github.com/actions/cache) | `4` | `5` |


Updates `actions/checkout` from 4 to 6
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

Updates `docker/login-action` from 3.4.0 to 3.7.0
- [Release notes](https://github.com/docker/login-action/releases)
- [Commits](74a5d14239...c94ce9fb46)

Updates `docker/metadata-action` from 5.7.0 to 5.10.0
- [Release notes](https://github.com/docker/metadata-action/releases)
- [Commits](902fa8ec7d...c299e40c65)

Updates `actions/setup-python` from 5 to 6
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v5...v6)

Updates `actions/cache` from 4 to 5
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: docker/login-action
  dependency-version: 3.7.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: docker/metadata-action
  dependency-version: 5.10.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
- dependency-name: actions/cache
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-02-01 20:17:43 +00:00
12 changed files with 1185 additions and 1101 deletions

@@ -22,7 +22,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: actions/checkout@v4
uses: actions/checkout@v6
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
@@ -33,14 +33,14 @@ jobs:
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772
uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@902fa8ec7d6ecbf8d84d538b9b233a880e428804
uses: docker/metadata-action@c299e40c65443455700f0fdfc63efafe5b349051
with:
images: bellingcat/auto-archiver

@@ -22,10 +22,10 @@ jobs:
steps:
- name: Checkout Repository
uses: actions/checkout@v4
uses: actions/checkout@v6
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version-file: pyproject.toml

@@ -20,11 +20,11 @@ jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- name: Install Python
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: "3.11"
python-version: "3.12"
- name: Install dependencies
run: |
python -m pip install --upgrade pip

@@ -26,13 +26,13 @@ jobs:
working-directory: ./
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- name: Install ffmpeg
run: sudo apt-get update && sudo apt-get install -y ffmpeg
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}
@@ -40,7 +40,7 @@ jobs:
run: pipx install poetry
- name: Cache Poetry and pip artifacts
uses: actions/cache@v4
uses: actions/cache@v5
with:
path: |
~/.cache/pypoetry

@@ -20,13 +20,13 @@ jobs:
working-directory: ./
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v6
- name: Install ffmpeg
run: sudo apt-get update && sudo apt-get install -y ffmpeg
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
uses: actions/setup-python@v6
with:
python-version: ${{ matrix.python-version }}
@@ -34,7 +34,7 @@ jobs:
run: pipx install poetry
- name: Cache Poetry and pip artifacts
uses: actions/cache@v4
uses: actions/cache@v5
with:
path: |
~/.cache/pypoetry

@@ -1,4 +1,4 @@
- FROM webrecorder/browsertrix-crawler:1.9.2 AS base
+ FROM webrecorder/browsertrix-crawler:1.11.4 AS base
ENV RUNNING_IN_DOCKER=1 \
LANG=C.UTF-8 \

1122
poetry.lock generated

File diff suppressed because it is too large

@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
[project]
name = "auto-archiver"
- version = "1.2.0"
+ version = "1.2.2"
description = "Automatically archive links to videos, images, and social media content from Google Sheets (and more)."
requires-python = ">=3.10,<3.13"
@@ -54,7 +54,7 @@ dependencies = [
"cryptography (>=46.0.3)",
"opentimestamps (>=0.4.5,<0.5.0)",
"bgutil-ytdlp-pot-provider (>=1.0.0)",
- "yt-dlp[curl-cffi,default] (>=2025.5.22,<2026.0.0)",
+ "yt-dlp[curl-cffi,default] (>=2025.5.22)",
"secretstorage (>=3.3.3,<4.0.0)",
"seleniumbase (>=4.36.4,<5.0.0)",
"pyautogui (>=0.9.54,<0.10.0)",
@@ -66,7 +66,7 @@ pytest = "^8.3.4"
autopep8 = "^2.3.1"
pytest-loguru = "^0.4.0"
pytest-mock = "^3.14.0"
- ruff = "^0.9.10"
+ ruff = "^0.15.2"
pre-commit = "^4.1.0"
[tool.poetry.group.docs.dependencies]

File diff suppressed because it is too large

@@ -34,7 +34,7 @@ def _extract_metadata(self, webpage, video_id):
...,
"attachments",
...,
- lambda k, v: (k == "media" and str(v["id"]) == video_id and v["__typename"] == "Video"),
+ lambda k, v: k == "media" and str(v["id"]) == video_id and v["__typename"] == "Video",
),
expected_type=dict,
)

@@ -355,7 +355,7 @@ class GenericExtractor(Extractor):
if not dropin:
# TODO: add a proper link to 'how to create your own dropin'
logger.debug(f"""Could not find valid dropin for {info_extractor.ie_key()}.
- Why not try creating your own, and make sure it has a valid function called 'create_metadata'. Learn more: https://auto-archiver.readthedocs.io/en/latest/user_guidelines.html#""")
+ Why not try creating your own, and make sure it has a valid function called 'create_metadata'. Learn more: https://auto-archiver.readthedocs.io/en/latest/modules/autogen/extractor/generic_extractor.html#dropins""")
return False
post_data = dropin.extract_post(url, ie_instance)

@@ -24,8 +24,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
self.use_docker = os.environ.get("WACZ_ENABLE_DOCKER") or not os.environ.get("RUNNING_IN_DOCKER")
self.docker_in_docker = os.environ.get("WACZ_ENABLE_DOCKER") and os.environ.get("RUNNING_IN_DOCKER")
- self.crawl_id = random_str(8)
- self.cwd_dind = f"/crawls/crawls{self.crawl_id}"
+ self.cwd_dind = f"/crawls/crawls{random_str(8)}"
self.browsertrix_home_host = os.environ.get("BROWSERTRIX_HOME_HOST")
self.browsertrix_home_container = os.environ.get("BROWSERTRIX_HOME_CONTAINER") or self.browsertrix_home_host
# create crawls folder if not exists, so it can be safely removed in cleanup
@@ -51,7 +50,8 @@ class WaczExtractorEnricher(Enricher, Extractor):
url = to_enrich.get_url()
- collection = self.crawl_id
+ crawl_id = random_str(8)
+ collection = crawl_id
browsertrix_home_host = self.browsertrix_home_host or os.path.abspath(self.tmp_dir)
browsertrix_home_container = self.browsertrix_home_container or browsertrix_home_host
@@ -83,8 +83,10 @@ class WaczExtractorEnricher(Enricher, Extractor):
# "--blockAds" # note: this has been known to cause issues on cloudflare protected sites
]
+ crawl_cwd_dind = os.path.join(self.cwd_dind, crawl_id)
if self.docker_in_docker:
- cmd.extend(["--cwd", self.cwd_dind])
+ os.makedirs(crawl_cwd_dind, exist_ok=True)
+ cmd.extend(["--cwd", crawl_cwd_dind])
if self.auth_for_site(url):
# there's an auth for this site, but browsertrix only supports username/password auth
@@ -109,7 +111,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
] + cmd
if self.profile:
- profile_file = f"profile-{self.crawl_id}.tar.gz"
+ profile_file = f"profile-{crawl_id}.tar.gz"
profile_fn = os.path.join(browsertrix_home_container, profile_file)
logger.debug(f"Copying {self.profile} to {profile_fn}")
shutil.copyfile(self.profile, profile_fn)
@@ -137,7 +139,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
return False
if self.docker_in_docker:
- wacz_fn = os.path.join(self.cwd_dind, "collections", collection, f"{collection}.wacz")
+ wacz_fn = os.path.join(crawl_cwd_dind, "collections", collection, f"{collection}.wacz")
elif self.use_docker:
wacz_fn = os.path.join(browsertrix_home_container, "collections", collection, f"{collection}.wacz")
else:
@@ -152,7 +154,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
self.extract_media_from_wacz(to_enrich, wacz_fn)
if self.docker_in_docker:
- jsonl_fn = os.path.join(self.cwd_dind, "collections", collection, "pages", "pages.jsonl")
+ jsonl_fn = os.path.join(crawl_cwd_dind, "collections", collection, "pages", "pages.jsonl")
elif self.use_docker:
jsonl_fn = os.path.join(browsertrix_home_container, "collections", collection, "pages", "pages.jsonl")
else:
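
Note on the WaczExtractorEnricher hunks above: they replace an id stored on the instance (`self.crawl_id`) with an id generated inside each call, and give each call its own working directory under `cwd_dind`. The sketch below is a simplified illustration of that pattern only, not the project's code. The class names `SharedIdCrawler` and `PerCallIdCrawler` and the `output_path` method are hypothetical, `random_str` here is an assumed stand-in for the project's helper, and nothing in it runs browsertrix-crawler or touches the filesystem.

```python
import os
import secrets


def random_str(length: int = 8) -> str:
    # Assumed stand-in for the project's random_str helper.
    return secrets.token_hex(length)[:length]


class SharedIdCrawler:
    """Old pattern (hypothetical class): one id picked at setup, reused by every call."""

    def __init__(self, base: str = "/crawls") -> None:
        self.crawl_id = random_str(8)
        self.cwd = os.path.join(base, f"crawls{self.crawl_id}")

    def output_path(self) -> str:
        # Every call in the same session resolves to the same collection path.
        collection = self.crawl_id
        return os.path.join(self.cwd, "collections", collection, f"{collection}.wacz")


class PerCallIdCrawler:
    """New pattern (hypothetical class): a fresh id and working directory per call."""

    def __init__(self, base: str = "/crawls") -> None:
        self.cwd = os.path.join(base, f"crawls{random_str(8)}")

    def output_path(self) -> str:
        # The id is local to this call, so two calls do not share a collection.
        crawl_id = random_str(8)
        collection = crawl_id
        crawl_cwd = os.path.join(self.cwd, crawl_id)
        return os.path.join(crawl_cwd, "collections", collection, f"{collection}.wacz")
```

With the shared id, two URLs processed by the same instance write to the same collection folder, which is consistent with the "leak in shared session" described in the commit title. With the per-call id, each URL gets its own folder.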