Compare commits

...

26 Commits

Author SHA1 Message Date
Miguel Sozinho Ramalho
0f56a5aae5 Merge pull request #331 from bellingcat/dev
1.1.1 multiple small fixes, and new logging strategy
2025-06-30 02:36:25 +01:00
msramalho
649412053e exclude non-ready code 2025-06-30 02:27:21 +01:00
msramalho
c2c9718f73 make python api tests work on gh when no env is set 2025-06-30 02:20:51 +01:00
msramalho
30ea8a0ba4 bumps dependencies 2025-06-30 02:20:09 +01:00
msramalho
73c8dc583f closes #333 2025-06-30 01:52:22 +01:00
msramalho
b2648fa3cd follow docs advice on exponential backoff of SheetsAPI 2025-06-30 01:47:12 +01:00
msramalho
4ad71b3589 adds retry to worksheet read for slow worksheets 2025-06-30 01:42:34 +01:00
msramalho
7c9475cde2 allow for human readable console logs, but defaults to JSON on file logs. 2025-06-30 00:53:10 +01:00
msramalho
afd9090a4c concludes logging standardization refactor 2025-06-26 17:20:04 +01:00
msramalho
ad29cb4447 adds post_data to metadata for instagram 2025-06-26 15:48:10 +01:00
msramalho
ce4d7ac649 WIP refactor logging 2025-06-21 15:54:51 +01:00
msramalho
ade7feb5a0 version bump 2025-06-18 17:38:17 +01:00
msramalho
12b457706b closes #166 adds story URL feature to telethon extractor 2025-06-18 17:37:44 +01:00
msramalho
592dc30415 closes #330 2025-06-18 16:40:55 +01:00
msramalho
4a36e6f6b0 fix tests 2025-06-18 13:50:21 +01:00
msramalho
d46eeee9b6 docs improved 2025-06-18 13:35:51 +01:00
msramalho
302e6f4258 logs improved 2025-06-18 13:35:43 +01:00
Miguel Sozinho Ramalho
e803c5d0e3 Merge branch 'main' into dev 2025-06-18 13:35:21 +01:00
msramalho
e1d0314a9e Merge branch 'dev' of https://github.com/bellingcat/auto-archiver into dev 2025-06-18 13:26:48 +01:00
Miguel Sozinho Ramalho
5d5119e053 Merge pull request #329 from bellingcat/dev
installs ffmpeg in readthedocs
2025-06-18 00:31:09 +01:00
msramalho
d6c90d87f1 installs ffmpeg in readthedocs 2025-06-18 00:30:45 +01:00
msramalho
212bf67ab1 installs ffmpeg in readthedocs 2025-06-18 00:29:36 +01:00
Miguel Sozinho Ramalho
6abe2edb13 Merge pull request #328 from bellingcat/dev
fix to configuration editor npm versions
2025-06-18 00:22:39 +01:00
msramalho
03c0cf09ae fix issue with grid in scripts/config_editor @mui lib upgrade 2025-06-18 00:20:31 +01:00
Miguel Sozinho Ramalho
0db77c7e68 Merge pull request #326 from bellingcat/dependabot/npm_and_yarn/scripts/settings/actions-27795ad889
Bump @types/react from 19.1.7 to 19.1.8 in /scripts/settings in the actions group across 1 directory
2025-06-18 00:12:51 +01:00
dependabot[bot]
cd6607943d Bump @types/react
Bumps the actions group with 1 update in the /scripts/settings directory: [@types/react](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/react).


Updates `@types/react` from 19.1.7 to 19.1.8
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/react)

---
updated-dependencies:
- dependency-name: "@types/react"
  dependency-version: 19.1.8
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-06-17 22:58:23 +00:00
77 changed files with 678 additions and 558 deletions

@@ -47,4 +47,4 @@ jobs:
  - name: Run Download Tests
  run: poetry run pytest -ra -v -x -m "download"
  env:
- TWITTER_BEARER_TOKEN: ${{ secrets.TWITTER_BEARER_TOKEN }}
+ TWITTER_BEARER_TOKEN: ${{ secrets.TWITTER_BEARER_TOKEN || '' }}

@@ -7,6 +7,8 @@ version: 2
  build:
  os: ubuntu-22.04
+ apt_packages:
+ - ffmpeg
  tools:
  python: "3.10"
  nodejs: "22"

@@ -24,7 +24,7 @@ This will disable all logs from Auto Archiver, but it does not disable logs for
  #### Logging Level
- There are 7 logging levels in total, with 5 of them used in this tool. They are: `DEBUG`, `INFO`, `SUCCESS`, `WARNING` and `ERROR`.
+ There are 7 logging levels in total, with 5 of them used in this tool. They are: `DEBUG`, `INFO`, `SUCCESS`, `WARNING` and `ERROR`. If you select a level, only that and higher (more serious) levels will be included. `DEBUG` is the most verbose, while `ERROR` is the least verbose.
  Change the warning level by setting the value in your orchestration config file:
@@ -42,6 +42,20 @@ For normal usage, it is recommended to use the `INFO` level, or if you prefer qu
  ```{note} To learn about all logging levels, see the [loguru documentation](https://loguru.readthedocs.io/en/stable/api/logger.html)
  ```
+ ### Logging Format
+ By default, the console logs are formatted in a human-readable way and the file logs are formatted in JSON. This is new from version 1.1.1. If you want to change the format of the console logs to JSON too you can set the `format:` option in your logging settings.
+ ```{code} yaml
+ :caption: orchestration.yaml
+ logging:
+ format: json
+ ```
+ When the Auto Archiver is writing logs it will include context about specific tasks, so if you are archiving a URL from a Google Sheet, both the URL (and a unique `trace_id` for that URL's archiving attempt) and the Spreadsheet name and row will be included in the logs. This is useful for debugging and understanding what the Auto Archiver is doing.
+ Using JSON allows you to easily parse the logs and extract specific information, tools like [`jq`](https://jqlang.org/) can be used to filter and search through the logs.
  ### Logging to a file
  As default, auto-archiver will log to the console. But if you wish to store your logs for future reference, or you are running the auto-archiver from within code a implementation, then you may wish to enable file logging. This can be done by setting the `file:` config value in the logging settings.
@@ -84,6 +98,7 @@ The below example logs only `DEBUG` logs to the console and to the file `/my/fil
  logging:
  level: DEBUG
+ format: json
  file: /my/file.log
  rotation: 1 week
  ```
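
The documentation change above says that JSON-formatted logs can be parsed and filtered, and that each archiving attempt carries a `trace_id`. The following is a minimal Python sketch of that idea, not the project's own tooling. It assumes the log file holds one JSON object per line and that `trace_id` appears somewhere inside each entry. The exact field layout is not shown in this diff, so the sample entries and the helper names `iter_json_logs`, `find_key`, and `filter_by_trace_id` are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator


def iter_json_logs(lines: Iterable[str]) -> Iterator[dict[str, Any]]:
    """Yield one dict per line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a human-readable line mixed into the file
        if isinstance(entry, dict):
            yield entry


def find_key(obj: Any, key: str) -> Iterator[Any]:
    """Recursively yield every value stored under `key`, at any nesting depth."""
    if isinstance(obj, dict):
        for name, value in obj.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def filter_by_trace_id(lines: Iterable[str], trace_id: str) -> list[dict[str, Any]]:
    """Return the log entries that mention the given trace_id anywhere."""
    return [
        entry
        for entry in iter_json_logs(lines)
        if trace_id in find_key(entry, "trace_id")
    ]


if __name__ == "__main__":
    # Hypothetical sample entries; the real field names may differ.
    sample = [
        '{"message": "archiving", "extra": {"trace_id": "abc", "url": "https://example.com/1"}}',
        "plain text line that is not JSON",
        '{"message": "done", "extra": {"trace_id": "def"}}',
    ]
    for entry in filter_by_trace_id(sample, "abc"):
        print(entry["message"])
```

Searching for the key recursively avoids hard-coding where `trace_id` sits in each entry, which matters here because the on-disk layout is an assumption. To read a real log file, pass an open file object (for example the `/my/file.log` path from the configuration example) in place of `sample`.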

124
poetry.lock generated

@@ -193,18 +193,18 @@ files = [
[[package]] [[package]]
name = "boto3" name = "boto3"
version = "1.38.37" version = "1.38.46"
description = "The AWS SDK for Python" description = "The AWS SDK for Python"
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "boto3-1.38.37-py3-none-any.whl", hash = "sha256:46a512b1fbc4c51a9abfef8e2130db0806cb00ef137e161f6f751421c78a7c0c"}, {file = "boto3-1.38.46-py3-none-any.whl", hash = "sha256:9c8e88a32a6465e5905308708cff5b17547117f06982908bdfdb0108b4a65079"},
{file = "boto3-1.38.37.tar.gz", hash = "sha256:4ccd700a2a36de0cd63bd8c79cca6164cb684e34fc1126de5c41525e4d0bfaee"}, {file = "boto3-1.38.46.tar.gz", hash = "sha256:d1ca2b53138afd0341e1962bd52be6071ab7a63c5b4f89228c5ef8942c40c852"},
] ]
[package.dependencies] [package.dependencies]
botocore = ">=1.38.37,<1.39.0" botocore = ">=1.38.46,<1.39.0"
jmespath = ">=0.7.1,<2.0.0" jmespath = ">=0.7.1,<2.0.0"
s3transfer = ">=0.13.0,<0.14.0" s3transfer = ">=0.13.0,<0.14.0"
@@ -213,14 +213,14 @@ crt = ["botocore[crt] (>=1.21.0,<2.0a0)"]
[[package]] [[package]]
name = "botocore" name = "botocore"
version = "1.38.37" version = "1.38.46"
description = "Low-level, data-driven core of boto 3." description = "Low-level, data-driven core of boto 3."
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "botocore-1.38.37-py3-none-any.whl", hash = "sha256:f8ad063b7dcdbf12f2c1b5a4405f390ce52beff3b2861af2e5169816ee0146f2"}, {file = "botocore-1.38.46-py3-none-any.whl", hash = "sha256:89ca782ffbf2e8769ca9c89234cfa5ca577f1987d07d913ee3c68c4776b1eb5b"},
{file = "botocore-1.38.37.tar.gz", hash = "sha256:06ce46da5420ea7cf542ece4ff1ec9045922fef977adf4bbec618c96c7a478bf"}, {file = "botocore-1.38.46.tar.gz", hash = "sha256:8798e5a418c27cf93195b077153644aea44cb171fcd56edc1ecebaa1e49e226e"},
] ]
[package.dependencies] [package.dependencies]
@@ -801,20 +801,20 @@ typing-inspect = ">=0.4.0,<1"
[[package]] [[package]]
name = "dateparser" name = "dateparser"
version = "1.2.1" version = "1.2.2"
description = "Date parsing library designed to parse dates from HTML pages" description = "Date parsing library designed to parse dates from HTML pages"
optional = false optional = false
python-versions = ">=3.8" python-versions = ">=3.8"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "dateparser-1.2.1-py3-none-any.whl", hash = "sha256:bdcac262a467e6260030040748ad7c10d6bacd4f3b9cdb4cfd2251939174508c"}, {file = "dateparser-1.2.2-py3-none-any.whl", hash = "sha256:5a5d7211a09013499867547023a2a0c91d5a27d15dd4dbcea676ea9fe66f2482"},
{file = "dateparser-1.2.1.tar.gz", hash = "sha256:7e4919aeb48481dbfc01ac9683c8e20bfe95bb715a38c1e9f6af889f4f30ccc3"}, {file = "dateparser-1.2.2.tar.gz", hash = "sha256:986316f17cb8cdc23ea8ce563027c5ef12fc725b6fb1d137c14ca08777c5ecf7"},
] ]
[package.dependencies] [package.dependencies]
python-dateutil = ">=2.7.0" python-dateutil = ">=2.7.0"
pytz = ">=2024.2" pytz = ">=2024.2"
regex = ">=2015.06.24,<2019.02.19 || >2019.02.19,<2021.8.27 || >2021.8.27" regex = ">=2024.9.11"
tzlocal = ">=0.2" tzlocal = ">=0.2"
[package.extras] [package.extras]
@@ -966,14 +966,14 @@ grpcio-gcp = ["grpcio-gcp (>=0.2.2,<1.0.0)"]
[[package]] [[package]]
name = "google-api-python-client" name = "google-api-python-client"
version = "2.172.0" version = "2.174.0"
description = "Google API Client Library for Python" description = "Google API Client Library for Python"
optional = false optional = false
python-versions = ">=3.7" python-versions = ">=3.7"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "google_api_python_client-2.172.0-py3-none-any.whl", hash = "sha256:9f1b9a268d5dc1228207d246c673d3a09ee211b41a11521d38d9212aeaa43af7"}, {file = "google_api_python_client-2.174.0-py3-none-any.whl", hash = "sha256:f695205ceec97bfaa1590a14282559c4109326c473b07352233a3584cdbf4b89"},
{file = "google_api_python_client-2.172.0.tar.gz", hash = "sha256:dcb3b7e067154b2aa41f1776cf86584a5739c0ac74e6ff46fc665790dca0e6a6"}, {file = "google_api_python_client-2.174.0.tar.gz", hash = "sha256:9eb7616a820b38a9c12c5486f9b9055385c7feb18b20cbafc5c5a688b14f3515"},
] ]
[package.dependencies] [package.dependencies]
@@ -1604,14 +1604,14 @@ six = ">=1.6.1"
[[package]] [[package]]
name = "oauthlib" name = "oauthlib"
version = "3.2.2" version = "3.3.1"
description = "A generic, spec-compliant, thorough implementation of the OAuth request-signing logic" description = "A generic, spec-compliant, thorough implementation of the OAuth request-signing logic"
optional = false optional = false
python-versions = ">=3.6" python-versions = ">=3.8"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "oauthlib-3.2.2-py3-none-any.whl", hash = "sha256:8139f29aac13e25d502680e9e19963e83f16838d48a0d71c287fe40e7067fbca"}, {file = "oauthlib-3.3.1-py3-none-any.whl", hash = "sha256:88119c938d2b8fb88561af5f6ee0eec8cc8d552b7bb1f712743136eb7523b7a1"},
{file = "oauthlib-3.2.2.tar.gz", hash = "sha256:9859c40929662bec5d64f34d01c99e093149682a3f38915dc0655d5a633dd918"}, {file = "oauthlib-3.3.1.tar.gz", hash = "sha256:0f0f8aa759826a193cf66c12ea1af1637f87b9b4622d46e866952bb022e538c9"},
] ]
[package.extras] [package.extras]
@@ -2008,14 +2008,14 @@ pytweening = ">=1.0.4"
[[package]] [[package]]
name = "pycodestyle" name = "pycodestyle"
version = "2.13.0" version = "2.14.0"
description = "Python style guide checker" description = "Python style guide checker"
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["dev"] groups = ["dev"]
files = [ files = [
{file = "pycodestyle-2.13.0-py2.py3-none-any.whl", hash = "sha256:35863c5974a271c7a726ed228a14a4f6daf49df369d8c50cd9a6f58a5e143ba9"}, {file = "pycodestyle-2.14.0-py2.py3-none-any.whl", hash = "sha256:dd6bf7cb4ee77f8e016f9c8e74a35ddd9f67e1d5fd4184d86c3b98e07099f42d"},
{file = "pycodestyle-2.13.0.tar.gz", hash = "sha256:c8415bf09abe81d9c7f872502a6eee881fbe85d8763dd5b9924bb0a01d67efae"}, {file = "pycodestyle-2.14.0.tar.gz", hash = "sha256:c4b5b517d278089ff9d0abdec919cd97262a3367449ea1c8b49b91529167b783"},
] ]
[[package]] [[package]]
@@ -2126,14 +2126,14 @@ pyrect = "*"
[[package]] [[package]]
name = "pygments" name = "pygments"
version = "2.19.1" version = "2.19.2"
description = "Pygments is a syntax highlighting package written in Python." description = "Pygments is a syntax highlighting package written in Python."
optional = false optional = false
python-versions = ">=3.8" python-versions = ">=3.8"
groups = ["main", "dev", "docs"] groups = ["main", "dev", "docs"]
files = [ files = [
{file = "pygments-2.19.1-py3-none-any.whl", hash = "sha256:9ea1544ad55cecf4b8242fab6dd35a93bbce657034b0611ee383099054ab6d8c"}, {file = "pygments-2.19.2-py3-none-any.whl", hash = "sha256:86540386c03d588bb81d44bc3928634ff26449851e99741617ecb9037ee5ec0b"},
{file = "pygments-2.19.1.tar.gz", hash = "sha256:61c16d2a8576dc0649d9f39e089b5f02bcd27fba10d8fb4dcc28173f7a45151f"}, {file = "pygments-2.19.2.tar.gz", hash = "sha256:636cb2477cec7f8952536970bc533bc43743542f70392ae026374600add5b887"},
] ]
[package.extras] [package.extras]
@@ -2815,40 +2815,37 @@ rsa = ["oauthlib[signedtoken] (>=3.0.0)"]
[[package]] [[package]]
name = "retrying" name = "retrying"
version = "1.3.4" version = "1.4.0"
description = "Retrying" description = "Retrying"
optional = false optional = false
python-versions = "*" python-versions = ">=3.6"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "retrying-1.3.4-py3-none-any.whl", hash = "sha256:8cc4d43cb8e1125e0ff3344e9de678fefd85db3b750b81b2240dc0183af37b35"}, {file = "retrying-1.4.0-py3-none-any.whl", hash = "sha256:6509d829c70271937605bce361c8f76e91f9123d355d14df7dc6972b1518064a"},
{file = "retrying-1.3.4.tar.gz", hash = "sha256:345da8c5765bd982b1d1915deb9102fd3d1f7ad16bd84a9700b85f64d24e8f3e"}, {file = "retrying-1.4.0.tar.gz", hash = "sha256:efa99c78bf4fbdbe6f0cba4101470fbc684b93d30ca45ffa1288443a9805172f"},
] ]
[package.dependencies]
six = ">=1.7.0"
[[package]] [[package]]
name = "rfc3161-client" name = "rfc3161-client"
version = "1.0.2" version = "1.0.3"
description = "" description = ""
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "rfc3161_client-1.0.2-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:9cf9a8f813028ef2d5d737f738f27c7abe41a4c5c0570fbc2ddfd5e4d03aee7a"}, {file = "rfc3161_client-1.0.3-cp39-abi3-macosx_10_12_x86_64.whl", hash = "sha256:b3f513adc5d4c1c59aed1f5f89fbe2e560410f461ae163fdca8c130939df79d6"},
{file = "rfc3161_client-1.0.2-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:8db097d98b9e3bca4ca68babbeaed8436c4f8d455623c46821bf0cfd8492533f"}, {file = "rfc3161_client-1.0.3-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:863d97877c3aa7e42682f70da0f3009618bc1e2aa0a7353133b94dd649d3a602"},
{file = "rfc3161_client-1.0.2-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8397241db132602e38bc6c4e416cb47d541528b6665aee9788705949487560f7"}, {file = "rfc3161_client-1.0.3-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:649037dbade2e78bdc1e8d7d917b04f27c245e0d758ab713f2ddeeec0fc6dd52"},
{file = "rfc3161_client-1.0.2-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:8fe3c05f050b18719dac4accce6fdae88e7d5309eb36292eac0cad2f989d159e"}, {file = "rfc3161_client-1.0.3-cp39-abi3-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:c6743aa339c07772a53ffb1accc7def78c11d8ebba57c6d25329c1d412dde4dd"},
{file = "rfc3161_client-1.0.2-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:af30b5e46db8b88c1bf7eae182e1bd4080f5d2475044f6ae04ab545e0faaa217"}, {file = "rfc3161_client-1.0.3-cp39-abi3-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0d40bb252d1a0714f4faa6b538be0bcbe9d13c6a7a37188b26f9f23d34aad7a3"},
{file = "rfc3161_client-1.0.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a93b3b3f79f83fefd5399004d3cd522fe93f49dbbb4865dba2c6ac6d8190ab60"}, {file = "rfc3161_client-1.0.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f76bdf2a9f80ea97a99324fa74695621fddc0e6f5d4a4a4e0ca30e822a37e534"},
{file = "rfc3161_client-1.0.2-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:714b5fd21b56b5d47136e4ca2ad346db26320a47b282b20d14337711e2bdec5b"}, {file = "rfc3161_client-1.0.3-cp39-abi3-musllinux_1_2_aarch64.whl", hash = "sha256:9d4d628e00fee72f07bdc779ce75160036c8cb318cac5336cd12692e2d7153e8"},
{file = "rfc3161_client-1.0.2-cp39-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:19cf1cdfa7a3c189d10e58ffdc9553f78972b45bce9dc713c78752b6dd696b5a"}, {file = "rfc3161_client-1.0.3-cp39-abi3-musllinux_1_2_armv7l.whl", hash = "sha256:e5eeb73862b28e5aacc2951c0aec72ecff5209925a4c5be2753cd30f13c39ae5"},
{file = "rfc3161_client-1.0.2-cp39-abi3-musllinux_1_2_i686.whl", hash = "sha256:24653746e2d3868ac53bb47a46d2b891ffddd7fa939954df47301566919ed7e3"}, {file = "rfc3161_client-1.0.3-cp39-abi3-musllinux_1_2_i686.whl", hash = "sha256:39e188281bc04378130ed52b1b00ee330570f04f0000cc60a0a534803f349482"},
{file = "rfc3161_client-1.0.2-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:b5a2e502d60176c3d376a7c81a3748b96df64c3c7ff46934f8f0e35b72f9922d"}, {file = "rfc3161_client-1.0.3-cp39-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:ea49605cf10558145b075979d8bfc8bff685c44815bf8b66fd580ced642216c9"},
{file = "rfc3161_client-1.0.2-cp39-abi3-win32.whl", hash = "sha256:8cb9d6aa413362b98f40ce4c6667e69ae29a31c91c657547de99203e353ebc43"}, {file = "rfc3161_client-1.0.3-cp39-abi3-win32.whl", hash = "sha256:a231b2d3430216491a4dac0cb04afdad0398bf5ded39138938b6002734abf2b4"},
{file = "rfc3161_client-1.0.2-cp39-abi3-win_amd64.whl", hash = "sha256:03bb5c92a59dd028959142a2dba8edfbf7575d3ccd74ac50eaf2c0ada45e3a40"}, {file = "rfc3161_client-1.0.3-cp39-abi3-win_amd64.whl", hash = "sha256:f2a925e668b7637c0aecd416dd060ec9579a5edd62502bb88efa981791419a44"},
{file = "rfc3161_client-1.0.2.tar.gz", hash = "sha256:37c78277d78aab02baf17393c30f66d1c2ab1a398d3540b0657792c0ceb81858"}, {file = "rfc3161_client-1.0.3.tar.gz", hash = "sha256:e9b614a5a4596ab9aea44d3fe8a4995bd84ac7f20dcbfaa82b115224202d88d8"},
] ]
[package.dependencies] [package.dependencies]
@@ -2856,7 +2853,7 @@ cryptography = ">=43,<46"
[package.extras] [package.extras]
dev = ["maturin (>=1.7,<2.0)", "rfc3161-client[doc,lint,test]"] dev = ["maturin (>=1.7,<2.0)", "rfc3161-client[doc,lint,test]"]
lint = ["interrogate", "ruff (>=0.7,<0.12)"] lint = ["interrogate", "mypy", "ruff (>=0.7,<0.13)", "types-requests"]
test = ["coverage[toml]", "pretend", "pytest", "pytest-cov"] test = ["coverage[toml]", "pretend", "pytest", "pytest-cov"]
[[package]] [[package]]
@@ -2901,7 +2898,7 @@ description = "Manipulate well-formed Roman numerals"
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["docs"] groups = ["docs"]
markers = "python_version != \"3.10\"" markers = "python_version >= \"3.11\""
files = [ files = [
{file = "roman_numerals_py-3.1.0-py3-none-any.whl", hash = "sha256:9da2ad2fb670bcf24e81070ceb3be72f6c11c440d73bd579fbeca1e9f330954c"}, {file = "roman_numerals_py-3.1.0-py3-none-any.whl", hash = "sha256:9da2ad2fb670bcf24e81070ceb3be72f6c11c440d73bd579fbeca1e9f330954c"},
{file = "roman_numerals_py-3.1.0.tar.gz", hash = "sha256:be4bf804f083a4ce001b5eb7e3c0862479d10f94c936f6c4e5f250aa5ff5bd2d"}, {file = "roman_numerals_py-3.1.0.tar.gz", hash = "sha256:be4bf804f083a4ce001b5eb7e3c0862479d10f94c936f6c4e5f250aa5ff5bd2d"},
@@ -3119,21 +3116,21 @@ websocket-client = ">=1.8.0,<1.9.0"
[[package]] [[package]]
name = "seleniumbase" name = "seleniumbase"
version = "4.39.4" version = "4.39.5"
description = "A complete web automation framework for end-to-end testing." description = "A complete web automation framework for end-to-end testing."
optional = false optional = false
python-versions = ">=3.8" python-versions = ">=3.8"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "seleniumbase-4.39.4-py3-none-any.whl", hash = "sha256:15562b2550ce6f6fdcc524ff9bd87a1d7381a558767245f10ff63982f508c281"}, {file = "seleniumbase-4.39.5-py3-none-any.whl", hash = "sha256:bda571f4864bba126442571bb0a3ae8a9bee9253461253ac84affd9a48efdb4d"},
{file = "seleniumbase-4.39.4.tar.gz", hash = "sha256:8880869b88fa5a48c649a776488bafa1ca97d786fb8a25f63e6d5b5b5fc47f44"}, {file = "seleniumbase-4.39.5.tar.gz", hash = "sha256:a6d4930eb894c84d881f0fa596fb357b0fa2bb5a9e89ac3875d9e89eb27054c7"},
] ]
[package.dependencies] [package.dependencies]
attrs = ">=25.3.0" attrs = ">=25.3.0"
beautifulsoup4 = "4.13.4" beautifulsoup4 = "4.13.4"
behave = "1.2.6" behave = "1.2.6"
certifi = ">=2025.4.26" certifi = ">=2025.6.15"
chardet = "5.2.0" chardet = "5.2.0"
charset-normalizer = ">=3.4.2,<4" charset-normalizer = ">=3.4.2,<4"
colorama = ">=0.4.6" colorama = ">=0.4.6"
@@ -3192,7 +3189,7 @@ wsproto = "1.2.0"
[package.extras] [package.extras]
allure = ["allure-behave (>=2.13.5)", "allure-pytest (>=2.13.5)", "allure-python-commons (>=2.13.5)"] allure = ["allure-behave (>=2.13.5)", "allure-pytest (>=2.13.5)", "allure-python-commons (>=2.13.5)"]
coverage = ["coverage (>=7.6.1) ; python_version < \"3.9\"", "coverage (>=7.9.0) ; python_version >= \"3.9\"", "pytest-cov (>=5.0.0) ; python_version < \"3.9\"", "pytest-cov (>=6.2.1) ; python_version >= \"3.9\""] coverage = ["coverage (>=7.6.1) ; python_version < \"3.9\"", "coverage (>=7.9.1) ; python_version >= \"3.9\"", "pytest-cov (>=5.0.0) ; python_version < \"3.9\"", "pytest-cov (>=6.2.1) ; python_version >= \"3.9\""]
flake8 = ["flake8 (==5.0.4) ; python_version < \"3.9\"", "flake8 (==7.2.0) ; python_version >= \"3.9\"", "mccabe (==0.7.0)", "pycodestyle (==2.13.0) ; python_version >= \"3.9\"", "pycodestyle (==2.9.1) ; python_version < \"3.9\"", "pyflakes (==2.5.0) ; python_version < \"3.9\"", "pyflakes (==3.3.2) ; python_version >= \"3.9\""] flake8 = ["flake8 (==5.0.4) ; python_version < \"3.9\"", "flake8 (==7.2.0) ; python_version >= \"3.9\"", "mccabe (==0.7.0)", "pycodestyle (==2.13.0) ; python_version >= \"3.9\"", "pycodestyle (==2.9.1) ; python_version < \"3.9\"", "pyflakes (==2.5.0) ; python_version < \"3.9\"", "pyflakes (==3.3.2) ; python_version >= \"3.9\""]
ipdb = ["ipdb (==0.13.13)", "ipython (==7.34.0)"] ipdb = ["ipdb (==0.13.13)", "ipython (==7.34.0)"]
mss = ["mss (==10.0.0) ; python_version >= \"3.9\"", "mss (==9.0.2) ; python_version < \"3.9\""] mss = ["mss (==10.0.0) ; python_version >= \"3.9\"", "mss (==9.0.2) ; python_version < \"3.9\""]
@@ -3330,7 +3327,7 @@ description = "Python documentation generator"
optional = false optional = false
python-versions = ">=3.11" python-versions = ">=3.11"
groups = ["docs"] groups = ["docs"]
markers = "python_version != \"3.10\"" markers = "python_version >= \"3.11\""
files = [ files = [
{file = "sphinx-8.2.3-py3-none-any.whl", hash = "sha256:4405915165f13521d875a8c29c8970800a0141c14cc5416a38feca4ea5d9b9c3"}, {file = "sphinx-8.2.3-py3-none-any.whl", hash = "sha256:4405915165f13521d875a8c29c8970800a0141c14cc5416a38feca4ea5d9b9c3"},
{file = "sphinx-8.2.3.tar.gz", hash = "sha256:398ad29dee7f63a75888314e9424d40f52ce5a6a87ae88e7071e80af296ec348"}, {file = "sphinx-8.2.3.tar.gz", hash = "sha256:398ad29dee7f63a75888314e9424d40f52ce5a6a87ae88e7071e80af296ec348"},
@@ -3565,18 +3562,19 @@ test = ["pytest"]
[[package]] [[package]]
name = "starlette" name = "starlette"
version = "0.47.0" version = "0.47.1"
description = "The little ASGI library that shines." description = "The little ASGI library that shines."
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["docs"] groups = ["docs"]
files = [ files = [
{file = "starlette-0.47.0-py3-none-any.whl", hash = "sha256:9d052d4933683af40ffd47c7465433570b4949dc937e20ad1d73b34e72f10c37"}, {file = "starlette-0.47.1-py3-none-any.whl", hash = "sha256:5e11c9f5c7c3f24959edbf2dffdc01bba860228acf657129467d8a7468591527"},
{file = "starlette-0.47.0.tar.gz", hash = "sha256:1f64887e94a447fed5f23309fb6890ef23349b7e478faa7b24a851cd4eb844af"}, {file = "starlette-0.47.1.tar.gz", hash = "sha256:aef012dd2b6be325ffa16698f9dc533614fb1cebd593a906b90dc1025529a79b"},
] ]
[package.dependencies] [package.dependencies]
anyio = ">=3.6.2,<5" anyio = ">=3.6.2,<5"
typing-extensions = {version = ">=4.10.0", markers = "python_version < \"3.13\""}
[package.extras] [package.extras]
full = ["httpx (>=0.27.0,<0.29.0)", "itsdangerous", "jinja2", "python-multipart (>=0.0.18)", "pyyaml"] full = ["httpx (>=0.27.0,<0.29.0)", "itsdangerous", "jinja2", "python-multipart (>=0.0.18)", "pyyaml"]
@@ -3841,14 +3839,14 @@ zstd = ["zstandard (>=0.18.0)"]
[[package]] [[package]]
name = "uvicorn" name = "uvicorn"
version = "0.34.3" version = "0.35.0"
description = "The lightning-fast ASGI server." description = "The lightning-fast ASGI server."
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["docs"] groups = ["docs"]
files = [ files = [
{file = "uvicorn-0.34.3-py3-none-any.whl", hash = "sha256:16246631db62bdfbf069b0645177d6e8a77ba950cfedbfd093acef9444e4d885"}, {file = "uvicorn-0.35.0-py3-none-any.whl", hash = "sha256:197535216b25ff9b785e29a0b79199f55222193d47f820816e7da751e9bc8d4a"},
{file = "uvicorn-0.34.3.tar.gz", hash = "sha256:35919a9a979d7a59334b6b10e05d77c1d0d574c50e0fc98b8b1a0f165708b55a"}, {file = "uvicorn-0.35.0.tar.gz", hash = "sha256:bc662f087f7cf2ce11a1d7fd70b90c9f98ef2e2831556dd078d131b96cc94a01"},
] ]
[package.dependencies] [package.dependencies]
@@ -4162,14 +4160,14 @@ h11 = ">=0.9.0,<1"
[[package]] [[package]]
name = "yt-dlp" name = "yt-dlp"
version = "2025.6.9" version = "2025.6.25"
description = "A feature-rich command-line audio/video downloader" description = "A feature-rich command-line audio/video downloader"
optional = false optional = false
python-versions = ">=3.9" python-versions = ">=3.9"
groups = ["main"] groups = ["main"]
files = [ files = [
{file = "yt_dlp-2025.6.9-py3-none-any.whl", hash = "sha256:ebdfda9ffa807f6a26aed7c8f906e5557cd06b4c388dc547df1ec2078631fca8"}, {file = "yt_dlp-2025.6.25-py3-none-any.whl", hash = "sha256:1eb31c9a47d56c7433be23a6ae084c640bd4e14961ad43076927ef05280871ea"},
{file = "yt_dlp-2025.6.9.tar.gz", hash = "sha256:751f53a3b61353522bf805fa30bbcbd16666126537e39706eab4f8c368f111ac"}, {file = "yt_dlp-2025.6.25.tar.gz", hash = "sha256:242b648e1a18ab04bdd4cc175a317fe8ec3ad7d0175eee9f981912624b3d6c8b"},
] ]
[package.dependencies] [package.dependencies]
@@ -4196,4 +4194,4 @@ test = ["pytest (>=8.1,<9.0)", "pytest-rerunfailures (>=14.0,<15.0)"]
[metadata] [metadata]
lock-version = "2.1" lock-version = "2.1"
python-versions = ">=3.10,<3.13" python-versions = ">=3.10,<3.13"
content-hash = "1ab1e4c9b8beb51116052c1e8d180616a0938757f173f05b7355e279902d3350" content-hash = "8f0806dff086087dcf5bbec03902bdd05794dab3d16e7e4b379015db26211c92"

@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"
  [project]
  name = "auto-archiver"
- version = "1.1.0"
+ version = "1.1.1"
  description = "Automatically archive links to videos, images, and social media content from Google Sheets (and more)."
  requires-python = ">=3.10,<3.13"
@@ -50,7 +50,7 @@ dependencies = [
  "retrying (>=0.0.0)",
  "rich-argparse (>=1.6.0,<2.0.0)",
  "ruamel-yaml (>=0.18.10,<0.19.0)",
- "rfc3161-client (>=1.0.1,<2.0.0)",
+ "rfc3161-client (==1.0.3)",
  "cryptography (>44.0.1,<45.0.0)",
  "opentimestamps (>=0.4.5,<0.5.0)",
  "bgutil-ytdlp-pot-provider (>=1.0.0)",

@@ -13,7 +13,7 @@
"@emotion/react": "latest", "@emotion/react": "latest",
"@emotion/styled": "latest", "@emotion/styled": "latest",
"@mui/icons-material": "^7.1.1", "@mui/icons-material": "^7.1.1",
"@mui/material": "latest", "@mui/material": "^7.1.1",
"react": "19.1.0", "react": "19.1.0",
"react-dom": "19.1.0", "react-dom": "19.1.0",
"react-markdown": "^10.0.0", "react-markdown": "^10.0.0",
@@ -1244,9 +1244,9 @@
"license": "MIT" "license": "MIT"
}, },
"node_modules/@rollup/rollup-android-arm-eabi": { "node_modules/@rollup/rollup-android-arm-eabi": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-android-arm-eabi/-/rollup-android-arm-eabi-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-android-arm-eabi/-/rollup-android-arm-eabi-4.43.0.tgz",
"integrity": "sha512-gldmAyS9hpj+H6LpRNlcjQWbuKUtb94lodB9uCz71Jm+7BxK1VIOo7y62tZZwxhA7j1ylv/yQz080L5WkS+LoQ==", "integrity": "sha512-Krjy9awJl6rKbruhQDgivNbD1WuLb8xAclM4IR4cN5pHGAs2oIMMQJEiC3IC/9TZJ+QZkmZhlMO/6MBGxPidpw==",
"cpu": [ "cpu": [
"arm" "arm"
], ],
@@ -1258,9 +1258,9 @@
] ]
}, },
"node_modules/@rollup/rollup-android-arm64": { "node_modules/@rollup/rollup-android-arm64": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-android-arm64/-/rollup-android-arm64-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-android-arm64/-/rollup-android-arm64-4.43.0.tgz",
"integrity": "sha512-bpRipfTgmGFdCZDFLRvIkSNO1/3RGS74aWkJJTFJBH7h3MRV4UijkaEUeOMbi9wxtxYmtAbVcnMtHTPBhLEkaw==", "integrity": "sha512-ss4YJwRt5I63454Rpj+mXCXicakdFmKnUNxr1dLK+5rv5FJgAxnN7s31a5VchRYxCFWdmnDWKd0wbAdTr0J5EA==",
"cpu": [ "cpu": [
"arm64" "arm64"
], ],
@@ -1272,9 +1272,9 @@
] ]
}, },
"node_modules/@rollup/rollup-darwin-arm64": { "node_modules/@rollup/rollup-darwin-arm64": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-darwin-arm64/-/rollup-darwin-arm64-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-darwin-arm64/-/rollup-darwin-arm64-4.43.0.tgz",
"integrity": "sha512-JxHtA081izPBVCHLKnl6GEA0w3920mlJPLh89NojpU2GsBSB6ypu4erFg/Wx1qbpUbepn0jY4dVWMGZM8gplgA==", "integrity": "sha512-eKoL8ykZ7zz8MjgBenEF2OoTNFAPFz1/lyJ5UmmFSz5jW+7XbH1+MAgCVHy72aG59rbuQLcJeiMrP8qP5d/N0A==",
"cpu": [ "cpu": [
"arm64" "arm64"
], ],
@@ -1286,9 +1286,9 @@
] ]
}, },
"node_modules/@rollup/rollup-darwin-x64": { "node_modules/@rollup/rollup-darwin-x64": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-darwin-x64/-/rollup-darwin-x64-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-darwin-x64/-/rollup-darwin-x64-4.43.0.tgz",
"integrity": "sha512-rv5UZaWVIJTDMyQ3dCEK+m0SAn6G7H3PRc2AZmExvbDvtaDc+qXkei0knQWcI3+c9tEs7iL/4I4pTQoPbNL2SA==", "integrity": "sha512-SYwXJgaBYW33Wi/q4ubN+ldWC4DzQY62S4Ll2dgfr/dbPoF50dlQwEaEHSKrQdSjC6oIe1WgzosoaNoHCdNuMg==",
"cpu": [ "cpu": [
"x64" "x64"
], ],
@@ -1300,9 +1300,9 @@
] ]
}, },
"node_modules/@rollup/rollup-freebsd-arm64": { "node_modules/@rollup/rollup-freebsd-arm64": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-freebsd-arm64/-/rollup-freebsd-arm64-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-freebsd-arm64/-/rollup-freebsd-arm64-4.43.0.tgz",
"integrity": "sha512-fJcN4uSGPWdpVmvLuMtALUFwCHgb2XiQjuECkHT3lWLZhSQ3MBQ9pq+WoWeJq2PrNxr9rPM1Qx+IjyGj8/c6zQ==", "integrity": "sha512-SV+U5sSo0yujrjzBF7/YidieK2iF6E7MdF6EbYxNz94lA+R0wKl3SiixGyG/9Klab6uNBIqsN7j4Y/Fya7wAjQ==",
"cpu": [ "cpu": [
"arm64" "arm64"
], ],
@@ -1314,9 +1314,9 @@
] ]
}, },
"node_modules/@rollup/rollup-freebsd-x64": { "node_modules/@rollup/rollup-freebsd-x64": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-freebsd-x64/-/rollup-freebsd-x64-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-freebsd-x64/-/rollup-freebsd-x64-4.43.0.tgz",
"integrity": "sha512-CziHfyzpp8hJpCVE/ZdTizw58gr+m7Y2Xq5VOuCSrZR++th2xWAz4Nqk52MoIIrV3JHtVBhbBsJcAxs6NammOQ==", "integrity": "sha512-J7uCsiV13L/VOeHJBo5SjasKiGxJ0g+nQTrBkAsmQBIdil3KhPnSE9GnRon4ejX1XDdsmK/l30IYLiAaQEO0Cg==",
"cpu": [ "cpu": [
"x64" "x64"
], ],
@@ -1328,9 +1328,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-arm-gnueabihf": { "node_modules/@rollup/rollup-linux-arm-gnueabihf": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm-gnueabihf/-/rollup-linux-arm-gnueabihf-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm-gnueabihf/-/rollup-linux-arm-gnueabihf-4.43.0.tgz",
"integrity": "sha512-UsQD5fyLWm2Fe5CDM7VPYAo+UC7+2Px4Y+N3AcPh/LdZu23YcuGPegQly++XEVaC8XUTFVPscl5y5Cl1twEI4A==", "integrity": "sha512-gTJ/JnnjCMc15uwB10TTATBEhK9meBIY+gXP4s0sHD1zHOaIh4Dmy1X9wup18IiY9tTNk5gJc4yx9ctj/fjrIw==",
"cpu": [ "cpu": [
"arm" "arm"
], ],
@@ -1342,9 +1342,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-arm-musleabihf": { "node_modules/@rollup/rollup-linux-arm-musleabihf": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm-musleabihf/-/rollup-linux-arm-musleabihf-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm-musleabihf/-/rollup-linux-arm-musleabihf-4.43.0.tgz",
"integrity": "sha512-/i8NIrlgc/+4n1lnoWl1zgH7Uo0XK5xK3EDqVTf38KvyYgCU/Rm04+o1VvvzJZnVS5/cWSd07owkzcVasgfIkQ==", "integrity": "sha512-ZJ3gZynL1LDSIvRfz0qXtTNs56n5DI2Mq+WACWZ7yGHFUEirHBRt7fyIk0NsCKhmRhn7WAcjgSkSVVxKlPNFFw==",
"cpu": [ "cpu": [
"arm" "arm"
], ],
@@ -1356,9 +1356,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-arm64-gnu": { "node_modules/@rollup/rollup-linux-arm64-gnu": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm64-gnu/-/rollup-linux-arm64-gnu-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm64-gnu/-/rollup-linux-arm64-gnu-4.43.0.tgz",
"integrity": "sha512-eoujJFOvoIBjZEi9hJnXAbWg+Vo1Ov8n/0IKZZcPZ7JhBzxh2A+2NFyeMZIRkY9iwBvSjloKgcvnjTbGKHE44Q==", "integrity": "sha512-8FnkipasmOOSSlfucGYEu58U8cxEdhziKjPD2FIa0ONVMxvl/hmONtX/7y4vGjdUhjcTHlKlDhw3H9t98fPvyA==",
"cpu": [ "cpu": [
"arm64" "arm64"
], ],
@@ -1370,9 +1370,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-arm64-musl": { "node_modules/@rollup/rollup-linux-arm64-musl": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm64-musl/-/rollup-linux-arm64-musl-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-arm64-musl/-/rollup-linux-arm64-musl-4.43.0.tgz",
"integrity": "sha512-/3NrcOWFSR7RQUQIuZQChLND36aTU9IYE4j+TB40VU78S+RA0IiqHR30oSh6P1S9f9/wVOenHQnacs/Byb824g==", "integrity": "sha512-KPPyAdlcIZ6S9C3S2cndXDkV0Bb1OSMsX0Eelr2Bay4EsF9yi9u9uzc9RniK3mcUGCLhWY9oLr6er80P5DE6XA==",
"cpu": [ "cpu": [
"arm64" "arm64"
], ],
@@ -1384,9 +1384,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-loongarch64-gnu": { "node_modules/@rollup/rollup-linux-loongarch64-gnu": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-loongarch64-gnu/-/rollup-linux-loongarch64-gnu-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-loongarch64-gnu/-/rollup-linux-loongarch64-gnu-4.43.0.tgz",
"integrity": "sha512-O8AplvIeavK5ABmZlKBq9/STdZlnQo7Sle0LLhVA7QT+CiGpNVe197/t8Aph9bhJqbDVGCHpY2i7QyfEDDStDg==", "integrity": "sha512-HPGDIH0/ZzAZjvtlXj6g+KDQ9ZMHfSP553za7o2Odegb/BEfwJcR0Sw0RLNpQ9nC6Gy8s+3mSS9xjZ0n3rhcYg==",
"cpu": [ "cpu": [
"loong64" "loong64"
], ],
@@ -1398,9 +1398,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-powerpc64le-gnu": { "node_modules/@rollup/rollup-linux-powerpc64le-gnu": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-powerpc64le-gnu/-/rollup-linux-powerpc64le-gnu-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-powerpc64le-gnu/-/rollup-linux-powerpc64le-gnu-4.43.0.tgz",
"integrity": "sha512-6Qb66tbKVN7VyQrekhEzbHRxXXFFD8QKiFAwX5v9Xt6FiJ3BnCVBuyBxa2fkFGqxOCSGGYNejxd8ht+q5SnmtA==", "integrity": "sha512-gEmwbOws4U4GLAJDhhtSPWPXUzDfMRedT3hFMyRAvM9Mrnj+dJIFIeL7otsv2WF3D7GrV0GIewW0y28dOYWkmw==",
"cpu": [ "cpu": [
"ppc64" "ppc64"
], ],
@@ -1412,9 +1412,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-riscv64-gnu": { "node_modules/@rollup/rollup-linux-riscv64-gnu": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-riscv64-gnu/-/rollup-linux-riscv64-gnu-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-riscv64-gnu/-/rollup-linux-riscv64-gnu-4.43.0.tgz",
"integrity": "sha512-KQETDSEBamQFvg/d8jajtRwLNBlGc3aKpaGiP/LvEbnmVUKlFta1vqJqTrvPtsYsfbE/DLg5CC9zyXRX3fnBiA==", "integrity": "sha512-XXKvo2e+wFtXZF/9xoWohHg+MuRnvO29TI5Hqe9xwN5uN8NKUYy7tXUG3EZAlfchufNCTHNGjEx7uN78KsBo0g==",
"cpu": [ "cpu": [
"riscv64" "riscv64"
], ],
@@ -1426,9 +1426,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-riscv64-musl": { "node_modules/@rollup/rollup-linux-riscv64-musl": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-riscv64-musl/-/rollup-linux-riscv64-musl-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-riscv64-musl/-/rollup-linux-riscv64-musl-4.43.0.tgz",
"integrity": "sha512-qMvnyjcU37sCo/tuC+JqeDKSuukGAd+pVlRl/oyDbkvPJ3awk6G6ua7tyum02O3lI+fio+eM5wsVd66X0jQtxw==", "integrity": "sha512-ruf3hPWhjw6uDFsOAzmbNIvlXFXlBQ4nk57Sec8E8rUxs/AI4HD6xmiiasOOx/3QxS2f5eQMKTAwk7KHwpzr/Q==",
"cpu": [ "cpu": [
"riscv64" "riscv64"
], ],
@@ -1440,9 +1440,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-s390x-gnu": { "node_modules/@rollup/rollup-linux-s390x-gnu": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-s390x-gnu/-/rollup-linux-s390x-gnu-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-s390x-gnu/-/rollup-linux-s390x-gnu-4.43.0.tgz",
"integrity": "sha512-I2Y1ZUgTgU2RLddUHXTIgyrdOwljjkmcZ/VilvaEumtS3Fkuhbw4p4hgHc39Ypwvo2o7sBFNl2MquNvGCa55Iw==", "integrity": "sha512-QmNIAqDiEMEvFV15rsSnjoSmO0+eJLoKRD9EAa9rrYNwO/XRCtOGM3A5A0X+wmG+XRrw9Fxdsw+LnyYiZWWcVw==",
"cpu": [ "cpu": [
"s390x" "s390x"
], ],
@@ -1454,9 +1454,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-x64-gnu": { "node_modules/@rollup/rollup-linux-x64-gnu": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-x64-gnu/-/rollup-linux-x64-gnu-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-x64-gnu/-/rollup-linux-x64-gnu-4.43.0.tgz",
"integrity": "sha512-Gfm6cV6mj3hCUY8TqWa63DB8Mx3NADoFwiJrMpoZ1uESbK8FQV3LXkhfry+8bOniq9pqY1OdsjFWNsSbfjPugw==", "integrity": "sha512-jAHr/S0iiBtFyzjhOkAics/2SrXE092qyqEg96e90L3t9Op8OTzS6+IX0Fy5wCt2+KqeHAkti+eitV0wvblEoQ==",
"cpu": [ "cpu": [
"x64" "x64"
], ],
@@ -1468,9 +1468,9 @@
] ]
}, },
"node_modules/@rollup/rollup-linux-x64-musl": { "node_modules/@rollup/rollup-linux-x64-musl": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-linux-x64-musl/-/rollup-linux-x64-musl-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-linux-x64-musl/-/rollup-linux-x64-musl-4.43.0.tgz",
"integrity": "sha512-g86PF8YZ9GRqkdi0VoGlcDUb4rYtQKyTD1IVtxxN4Hpe7YqLBShA7oHMKU6oKTCi3uxwW4VkIGnOaH/El8de3w==", "integrity": "sha512-3yATWgdeXyuHtBhrLt98w+5fKurdqvs8B53LaoKD7P7H7FKOONLsBVMNl9ghPQZQuYcceV5CDyPfyfGpMWD9mQ==",
"cpu": [ "cpu": [
"x64" "x64"
], ],
@@ -1482,9 +1482,9 @@
] ]
}, },
"node_modules/@rollup/rollup-win32-arm64-msvc": { "node_modules/@rollup/rollup-win32-arm64-msvc": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-win32-arm64-msvc/-/rollup-win32-arm64-msvc-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-win32-arm64-msvc/-/rollup-win32-arm64-msvc-4.43.0.tgz",
"integrity": "sha512-+axkdyDGSp6hjyzQ5m1pgcvQScfHnMCcsXkx8pTgy/6qBmWVhtRVlgxjWwDp67wEXXUr0x+vD6tp5W4x6V7u1A==", "integrity": "sha512-wVzXp2qDSCOpcBCT5WRWLmpJRIzv23valvcTwMHEobkjippNf+C3ys/+wf07poPkeNix0paTNemB2XrHr2TnGw==",
"cpu": [ "cpu": [
"arm64" "arm64"
], ],
@@ -1496,9 +1496,9 @@
] ]
}, },
"node_modules/@rollup/rollup-win32-ia32-msvc": { "node_modules/@rollup/rollup-win32-ia32-msvc": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-win32-ia32-msvc/-/rollup-win32-ia32-msvc-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-win32-ia32-msvc/-/rollup-win32-ia32-msvc-4.43.0.tgz",
"integrity": "sha512-F+5J9pelstXKwRSDq92J0TEBXn2nfUrQGg+HK1+Tk7VOL09e0gBqUHugZv7SW4MGrYj41oNCUe3IKCDGVlis2g==", "integrity": "sha512-fYCTEyzf8d+7diCw8b+asvWDCLMjsCEA8alvtAutqJOJp/wL5hs1rWSqJ1vkjgW0L2NB4bsYJrpKkiIPRR9dvw==",
"cpu": [ "cpu": [
"ia32" "ia32"
], ],
@@ -1510,9 +1510,9 @@
] ]
}, },
"node_modules/@rollup/rollup-win32-x64-msvc": { "node_modules/@rollup/rollup-win32-x64-msvc": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/@rollup/rollup-win32-x64-msvc/-/rollup-win32-x64-msvc-4.42.0.tgz", "resolved": "https://registry.npmjs.org/@rollup/rollup-win32-x64-msvc/-/rollup-win32-x64-msvc-4.43.0.tgz",
"integrity": "sha512-LpHiJRwkaVz/LqjHjK8LCi8osq7elmpwujwbXKNW88bM8eeGxavJIKKjkjpMHAh/2xfnrt1ZSnhTv41WYUHYmA==", "integrity": "sha512-SnGhLiE5rlK0ofq8kzuDkM0g7FN1s5VYY+YSMTibP7CqShxCQvqtNxTARS4xX4PFJfHjG0ZQYX9iGzI3FQh5Aw==",
"cpu": [ "cpu": [
"x64" "x64"
], ],
@@ -1629,9 +1629,9 @@
"license": "MIT" "license": "MIT"
}, },
"node_modules/@types/react": { "node_modules/@types/react": {
"version": "19.1.7", "version": "19.1.8",
"resolved": "https://registry.npmjs.org/@types/react/-/react-19.1.7.tgz", "resolved": "https://registry.npmjs.org/@types/react/-/react-19.1.8.tgz",
"integrity": "sha512-BnsPLV43ddr05N71gaGzyZ5hzkCmGwhMvYc8zmvI8Ci1bRkkDSzDDVfAXfN2tk748OwI7ediiPX6PfT9p0QGVg==", "integrity": "sha512-AwAfQ2Wa5bCx9WP8nZL2uMZWod7J7/JSplxbTmBQ5ms6QpqNYm672H0Vu9ZVKVngQ+ii4R/byguVEUZQyeg44g==",
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
"csstype": "^3.0.2" "csstype": "^3.0.2"
@@ -1770,9 +1770,9 @@
} }
}, },
"node_modules/caniuse-lite": { "node_modules/caniuse-lite": {
"version": "1.0.30001721", "version": "1.0.30001723",
"resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001721.tgz", "resolved": "https://registry.npmjs.org/caniuse-lite/-/caniuse-lite-1.0.30001723.tgz",
"integrity": "sha512-cOuvmUVtKrtEaoKiO0rSc29jcjwMwX5tOHDy4MgVFEWiUXj4uBMJkwI8MDySkgXidpMiHUcviogAvFi4pA2hDQ==", "integrity": "sha512-1R/elMjtehrFejxwmexeXAtae5UO9iSyFn6G/I806CYC/BLyyBk1EPhrKBkWhy6wM6Xnm47dSJQec+tLJ39WHw==",
"dev": true, "dev": true,
"funding": [ "funding": [
{ {
@@ -1914,9 +1914,9 @@
} }
}, },
"node_modules/decode-named-character-reference": { "node_modules/decode-named-character-reference": {
"version": "1.1.0", "version": "1.2.0",
"resolved": "https://registry.npmjs.org/decode-named-character-reference/-/decode-named-character-reference-1.1.0.tgz", "resolved": "https://registry.npmjs.org/decode-named-character-reference/-/decode-named-character-reference-1.2.0.tgz",
"integrity": "sha512-Wy+JTSbFThEOXQIR2L6mxJvEs+veIzpmqD7ynWxMXGpnk3smkHQOp6forLdHsKpAMW9iJpaBBIxz285t1n1C3w==", "integrity": "sha512-c6fcElNV6ShtZXmsgNgFFV5tVX2PaV4g+MOAkb8eXHvn6sryJBrZa9r0zV6+dtTyoCKxtDy5tyQ5ZwQuidtd+Q==",
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
"character-entities": "^2.0.0" "character-entities": "^2.0.0"
@@ -1959,9 +1959,9 @@
} }
}, },
"node_modules/electron-to-chromium": { "node_modules/electron-to-chromium": {
"version": "1.5.166", "version": "1.5.169",
"resolved": "https://registry.npmjs.org/electron-to-chromium/-/electron-to-chromium-1.5.166.tgz", "resolved": "https://registry.npmjs.org/electron-to-chromium/-/electron-to-chromium-1.5.169.tgz",
"integrity": "sha512-QPWqHL0BglzPYyJJ1zSSmwFFL6MFXhbACOCcsCdUMCkzPdS9/OIBVxg516X/Ado2qwAq8k0nJJ7phQPCqiaFAw==", "integrity": "sha512-q7SQx6mkLy0GTJK9K9OiWeaBMV4XQtBSdf6MJUzDB/H/5tFXfIiX38Lci1Kl6SsgiEhz1SQI1ejEOU5asWEhwQ==",
"dev": true, "dev": true,
"license": "ISC" "license": "ISC"
}, },
@@ -3144,9 +3144,9 @@
} }
}, },
"node_modules/postcss": { "node_modules/postcss": {
"version": "8.5.4", "version": "8.5.6",
"resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.4.tgz", "resolved": "https://registry.npmjs.org/postcss/-/postcss-8.5.6.tgz",
"integrity": "sha512-QSa9EBe+uwlGTFmHsPKokv3B/oEMQZxfqW0QqNCyhpa6mB1afzulwn8hihglqAb2pOw+BJgNlmXQ8la2VeHB7w==", "integrity": "sha512-3Ybi1tAuwAP9s0r1UQ2J4n5Y0G05bJkpUIO0/bI9MhwmD70S5aTWbXGBwxHrelT+XM1k6dM0pk+SwNkpTRN7Pg==",
"dev": true, "dev": true,
"funding": [ "funding": [
{ {
@@ -3342,9 +3342,9 @@
} }
}, },
"node_modules/rollup": { "node_modules/rollup": {
"version": "4.42.0", "version": "4.43.0",
"resolved": "https://registry.npmjs.org/rollup/-/rollup-4.42.0.tgz", "resolved": "https://registry.npmjs.org/rollup/-/rollup-4.43.0.tgz",
"integrity": "sha512-LW+Vse3BJPyGJGAJt1j8pWDKPd73QM8cRXYK1IxOBgL2AGLu7Xd2YOW0M2sLUBCkF5MshXXtMApyEAEzMVMsnw==", "integrity": "sha512-wdN2Kd3Twh8MAEOEJZsuxuLKCsBEo4PVNLK6tQWAn10VhsVewQLzcucMgLolRlhFybGxfclbPeEYBaP6RvUFGg==",
"dev": true, "dev": true,
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
@@ -3358,26 +3358,26 @@
"npm": ">=8.0.0" "npm": ">=8.0.0"
}, },
"optionalDependencies": { "optionalDependencies": {
"@rollup/rollup-android-arm-eabi": "4.42.0", "@rollup/rollup-android-arm-eabi": "4.43.0",
"@rollup/rollup-android-arm64": "4.42.0", "@rollup/rollup-android-arm64": "4.43.0",
"@rollup/rollup-darwin-arm64": "4.42.0", "@rollup/rollup-darwin-arm64": "4.43.0",
"@rollup/rollup-darwin-x64": "4.42.0", "@rollup/rollup-darwin-x64": "4.43.0",
"@rollup/rollup-freebsd-arm64": "4.42.0", "@rollup/rollup-freebsd-arm64": "4.43.0",
"@rollup/rollup-freebsd-x64": "4.42.0", "@rollup/rollup-freebsd-x64": "4.43.0",
"@rollup/rollup-linux-arm-gnueabihf": "4.42.0", "@rollup/rollup-linux-arm-gnueabihf": "4.43.0",
"@rollup/rollup-linux-arm-musleabihf": "4.42.0", "@rollup/rollup-linux-arm-musleabihf": "4.43.0",
"@rollup/rollup-linux-arm64-gnu": "4.42.0", "@rollup/rollup-linux-arm64-gnu": "4.43.0",
"@rollup/rollup-linux-arm64-musl": "4.42.0", "@rollup/rollup-linux-arm64-musl": "4.43.0",
"@rollup/rollup-linux-loongarch64-gnu": "4.42.0", "@rollup/rollup-linux-loongarch64-gnu": "4.43.0",
"@rollup/rollup-linux-powerpc64le-gnu": "4.42.0", "@rollup/rollup-linux-powerpc64le-gnu": "4.43.0",
"@rollup/rollup-linux-riscv64-gnu": "4.42.0", "@rollup/rollup-linux-riscv64-gnu": "4.43.0",
"@rollup/rollup-linux-riscv64-musl": "4.42.0", "@rollup/rollup-linux-riscv64-musl": "4.43.0",
"@rollup/rollup-linux-s390x-gnu": "4.42.0", "@rollup/rollup-linux-s390x-gnu": "4.43.0",
"@rollup/rollup-linux-x64-gnu": "4.42.0", "@rollup/rollup-linux-x64-gnu": "4.43.0",
"@rollup/rollup-linux-x64-musl": "4.42.0", "@rollup/rollup-linux-x64-musl": "4.43.0",
"@rollup/rollup-win32-arm64-msvc": "4.42.0", "@rollup/rollup-win32-arm64-msvc": "4.43.0",
"@rollup/rollup-win32-ia32-msvc": "4.42.0", "@rollup/rollup-win32-ia32-msvc": "4.43.0",
"@rollup/rollup-win32-x64-msvc": "4.42.0", "@rollup/rollup-win32-x64-msvc": "4.43.0",
"fsevents": "~2.3.2" "fsevents": "~2.3.2"
} }
}, },
@@ -3448,18 +3448,18 @@
} }
}, },
"node_modules/style-to-js": { "node_modules/style-to-js": {
"version": "1.1.16", "version": "1.1.17",
"resolved": "https://registry.npmjs.org/style-to-js/-/style-to-js-1.1.16.tgz", "resolved": "https://registry.npmjs.org/style-to-js/-/style-to-js-1.1.17.tgz",
"integrity": "sha512-/Q6ld50hKYPH3d/r6nr117TZkHR0w0kGGIVfpG9N6D8NymRPM9RqCUv4pRpJ62E5DqOYx2AFpbZMyCPnjQCnOw==", "integrity": "sha512-xQcBGDxJb6jjFCTzvQtfiPn6YvvP2O8U1MDIPNfJQlWMYfktPy+iGsHE7cssjs7y84d9fQaK4UF3RIJaAHSoYA==",
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
"style-to-object": "1.0.8" "style-to-object": "1.0.9"
} }
}, },
"node_modules/style-to-object": { "node_modules/style-to-object": {
"version": "1.0.8", "version": "1.0.9",
"resolved": "https://registry.npmjs.org/style-to-object/-/style-to-object-1.0.8.tgz", "resolved": "https://registry.npmjs.org/style-to-object/-/style-to-object-1.0.9.tgz",
"integrity": "sha512-xT47I/Eo0rwJmaXC4oilDGDWLohVhR6o/xAQcPQN8q6QBuZVL8qMYL85kLmST5cPjAorwvqIA4qXTRQoYHaL6g==", "integrity": "sha512-G4qppLgKu/k6FwRpHiGiKPaPTFcG3g4wNVX/Qsfu+RqQM30E7Tyu/TEgxcL9PNLF5pdRLwQdE3YKKf+KF2Dzlw==",
"license": "MIT", "license": "MIT",
"dependencies": { "dependencies": {
"inline-style-parser": "0.2.4" "inline-style-parser": "0.2.4"
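
Each version bump in the lockfile hunks above also changes the package's `integrity` value. Those values follow the Subresource Integrity layout: the hash algorithm name, a dash, then the base64-encoded raw digest of the tarball. A minimal sketch of that layout, using only the Python standard library; the helper names `sri_sha512` and `matches` are illustrative and not part of npm or this repository:

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Return an SRI-style string: 'sha512-' + base64 of the raw digest."""
    digest = hashlib.sha512(data).digest()  # 64 raw bytes
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def matches(data: bytes, integrity: str) -> bool:
    """Check downloaded bytes against an integrity string (hypothetical helper)."""
    return sri_sha512(data) == integrity
```

Because a SHA-512 digest is 64 bytes, its base64 form is always 88 characters ending in `==`, which is why every `integrity` string above has the same length and suffix.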

@@ -14,7 +14,7 @@
"@emotion/react": "latest", "@emotion/react": "latest",
"@emotion/styled": "latest", "@emotion/styled": "latest",
"@mui/icons-material": "^7.1.1", "@mui/icons-material": "^7.1.1",
"@mui/material": "latest", "@mui/material": "^7.1.1",
"react": "19.1.0", "react": "19.1.0",
"react-dom": "19.1.0", "react-dom": "19.1.0",
"react-markdown": "^10.0.0", "react-markdown": "^10.0.0",

@@ -31,7 +31,7 @@ import {
Stack, Stack,
Button, Button,
} from '@mui/material'; } from '@mui/material';
import Grid from '@mui/material/Grid2'; import Grid from '@mui/material/Grid';
import { parseDocument, Document, YAMLSeq, YAMLMap, Scalar } from 'yaml' import { parseDocument, Document, YAMLSeq, YAMLMap, Scalar } from 'yaml'
import StepCard from './StepCard'; import StepCard from './StepCard';

@@ -25,7 +25,7 @@ import {
Typography, Typography,
InputAdornment, InputAdornment,
} from '@mui/material'; } from '@mui/material';
import Grid from '@mui/material/Grid2'; import Grid from '@mui/material/Grid';
import DragIndicatorIcon from '@mui/icons-material/DragIndicator'; import DragIndicatorIcon from '@mui/icons-material/DragIndicator';
import Visibility from '@mui/icons-material/Visibility'; import Visibility from '@mui/icons-material/Visibility';
import VisibilityOff from '@mui/icons-material/VisibilityOff'; import VisibilityOff from '@mui/icons-material/VisibilityOff';

@@ -14,7 +14,7 @@ You will need to provide your phone number and a 2FA code the first time you run
import os import os
from telethon.sync import TelegramClient from telethon.sync import TelegramClient
from loguru import logger from auto_archiver.utils.custom_logger import logger
# Create a # Create a
@@ -24,4 +24,4 @@ SESSION_FILE = "secrets/anon-insta"
os.makedirs("secrets", exist_ok=True) os.makedirs("secrets", exist_ok=True)
with TelegramClient(SESSION_FILE, API_ID, API_HASH) as client: with TelegramClient(SESSION_FILE, API_ID, API_HASH) as client:
logger.success(f"New session file created: {SESSION_FILE}.session") logger.success(f"new session file created: {SESSION_FILE}.session")

@@ -7,7 +7,7 @@ from tempfile import TemporaryDirectory
from auto_archiver.utils import url as UrlUtil from auto_archiver.utils import url as UrlUtil
from auto_archiver.core.consts import MODULE_TYPES as CONF_MODULE_TYPES from auto_archiver.core.consts import MODULE_TYPES as CONF_MODULE_TYPES
from loguru import logger from auto_archiver.utils.custom_logger import logger
if TYPE_CHECKING: if TYPE_CHECKING:
from .module import ModuleFactory from .module import ModuleFactory

@@ -10,7 +10,7 @@ from ruamel.yaml import YAML, CommentedMap
import json import json
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from copy import deepcopy from copy import deepcopy
from auto_archiver.core.consts import MODULE_TYPES from auto_archiver.core.consts import MODULE_TYPES
@@ -118,8 +118,7 @@ class DefaultValidatingParser(argparse.ArgumentParser):
""" """
Override of error to format a nicer looking error message using logger Override of error to format a nicer looking error message using logger
""" """
logger.error("Problem with configuration file (tip: use --help to see the available options):") logger.error(f"Problem with configuration file (tip: use --help to see the available options): \n{message}")
logger.error(message)
self.exit(2) self.exit(2)
def parse_known_args(self, args=None, namespace=None): def parse_known_args(self, args=None, namespace=None):
@@ -136,8 +135,7 @@ class DefaultValidatingParser(argparse.ArgumentParser):
try: try:
self._check_value(action, action.default) self._check_value(action, action.default)
except argparse.ArgumentError as e: except argparse.ArgumentError as e:
logger.error(f"You have an invalid setting in your configuration file ({action.dest}):") logger.error(f"You have an invalid setting in your configuration file ({action.dest}):\n {e}")
logger.error(e)
exit() exit()
return super().parse_known_args(args, namespace) return super().parse_known_args(args, namespace)
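
The two hunks above fold a pair of `logger.error` calls into one call each, so the hint and the detail travel in a single log record. A minimal sketch of the same idea, assuming the standard-library `logging` module as a stand-in for the project's logger; `SingleMessageParser` and `try_parse` are hypothetical names used only for illustration:

```python
import argparse
import logging

logger = logging.getLogger("config_sketch")  # stand-in, not the project's logger


class SingleMessageParser(argparse.ArgumentParser):
    """Hypothetical parser that reports a config problem in one log call."""

    def error(self, message: str) -> None:
        # One call produces one record, so the hint and the detail stay
        # together even if records are later serialized one per line.
        logger.error(
            "Problem with configuration file "
            "(tip: use --help to see the available options):\n%s",
            message,
        )
        self.exit(2)


def try_parse(argv: list[str]) -> int:
    """Return the exit code raised while parsing, or 0 on success."""
    parser = SingleMessageParser(prog="sketch")
    parser.add_argument("--level", choices=["DEBUG", "INFO"], default="INFO")
    try:
        parser.parse_args(argv)
    except SystemExit as exc:
        return int(exc.code or 0)
    return 0
```

Calling `try_parse(["--level", "TRACE"])` goes through the overridden `error` and exits with status 2, while a valid value parses normally.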

@@ -12,7 +12,7 @@ from contextlib import suppress
import mimetypes import mimetypes
import os import os
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from retrying import retry from retrying import retry
import re import re
@@ -94,7 +94,7 @@ class Extractor(BaseModule):
to_filename = to_filename[-64:] to_filename = to_filename[-64:]
to_filename = os.path.join(self.tmp_dir, to_filename) to_filename = os.path.join(self.tmp_dir, to_filename)
if verbose: if verbose:
logger.debug(f"downloading {url[0:50]=} {to_filename=}") logger.debug(f"Downloading {to_filename=}")
headers = { headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
} }
@@ -117,7 +117,7 @@ class Extractor(BaseModule):
return to_filename return to_filename
except requests.RequestException as e: except requests.RequestException as e:
logger.warning(f"Failed to fetch the Media URL: {str(e)[:250]}") logger.warning(f"Failed to fetch the Media URL: {e}")
if try_best_quality: if try_best_quality:
return None, url return None, url

@@ -11,7 +11,7 @@ from dataclasses import dataclass, field
from dataclasses_json import dataclass_json, config from dataclasses_json import dataclass_json, config
import mimetypes import mimetypes
from loguru import logger from auto_archiver.utils.custom_logger import logger
@dataclass_json # annotation order matters @dataclass_json # annotation order matters
@@ -86,7 +86,7 @@ class Media:
@property # getter .mimetype @property # getter .mimetype
def mimetype(self) -> str: def mimetype(self) -> str:
if not self.filename or len(self.filename) == 0: if not self.filename or len(self.filename) == 0:
logger.warning(f"cannot get mimetype from media without filename: {self}") logger.warning(f"Cannot get mimetype from media without filename: {self}")
return "" return ""
if not self._mimetype: if not self._mimetype:
self._mimetype = mimetypes.guess_type(self.filename)[0] self._mimetype = mimetypes.guess_type(self.filename)[0]
@@ -116,13 +116,12 @@ class Media:
# self.is_video() should be used together with this method # self.is_video() should be used together with this method
try: try:
streams = ffmpeg.probe(self.filename, select_streams="v")["streams"] streams = ffmpeg.probe(self.filename, select_streams="v")["streams"]
logger.debug(f"STREAMS FOR {self.filename} {streams}") logger.debug(f"Streams for {self.filename}: {streams}")
return any(s.get("duration_ts", 0) > 0 for s in streams) return any(s.get("duration_ts", 0) > 0 for s in streams)
except Error: except Error:
return False # ffmpeg errors when reading bad files return False # ffmpeg errors when reading bad files
except Exception as e: except Exception as e:
logger.error(e) logger.error(f"{e}: {traceback.format_exc()}")
logger.error(traceback.format_exc())
try: try:
fsize = os.path.getsize(self.filename) fsize = os.path.getsize(self.filename)
return fsize > 20_000 return fsize > 20_000
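
The `mimetype` getter touched above guesses the type from the filename once and caches it. A trimmed-down sketch of that pattern, assuming only the standard library; `MediaSketch` is a hypothetical stand-in for the real `Media` class, and returning an empty string for unknown extensions is an assumption of this sketch rather than something the diff shows:

```python
import mimetypes
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaSketch:
    """Hypothetical, reduced stand-in for the Media dataclass."""

    filename: str = ""
    _mimetype: Optional[str] = None

    @property
    def mimetype(self) -> str:
        if not self.filename:
            return ""  # nothing to guess from
        if not self._mimetype:
            # guess_type returns (type, encoding); type is None when unknown
            self._mimetype = mimetypes.guess_type(self.filename)[0]
        return self._mimetype or ""  # assumption: unknown types map to ""
```

The guess relies purely on the file extension, so it says nothing about whether the file contents are valid; that is what the separate `ffmpeg.probe` check above is for.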

@@ -17,7 +17,7 @@ from dataclasses_json import dataclass_json
import datetime import datetime
from urllib.parse import urlparse from urllib.parse import urlparse
from dateutil.parser import parse as parse_dt from dateutil.parser import parse as parse_dt
from loguru import logger from auto_archiver.utils.custom_logger import logger
from .media import Media from .media import Media

@@ -16,7 +16,7 @@ import sys
from importlib.util import find_spec from importlib.util import find_spec
import os import os
from os.path import join from os.path import join
from loguru import logger from auto_archiver.utils.custom_logger import logger
import auto_archiver import auto_archiver
from auto_archiver.core.consts import DEFAULT_MANIFEST, MANIFEST_FILE, SetupError from auto_archiver.core.consts import DEFAULT_MANIFEST, MANIFEST_FILE, SetupError

@@ -15,9 +15,11 @@ import traceback
from copy import copy from copy import copy
from rich_argparse import RichHelpFormatter from rich_argparse import RichHelpFormatter
from loguru import logger from auto_archiver.utils.custom_logger import format_for_human_readable_console, logger
import requests import requests
from auto_archiver.utils.misc import random_str
from .metadata import Metadata, Media from .metadata import Metadata, Media
from auto_archiver.version import __version__ from auto_archiver.version import __version__
from .config import ( from .config import (
@@ -342,7 +344,14 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
# add other logging info # add other logging info
if self.logger_id is None: # note - need direct comparison to None since need to consider falsy value 0 if self.logger_id is None: # note - need direct comparison to None since need to consider falsy value 0
use_level = logging_config["level"] use_level = logging_config["level"]
self.logger_id = logger.add(sys.stderr, level=use_level) self.logger_id = logger.add(
sys.stderr,
level=use_level,
catch=True,
format="<level>{extra[serialized]}</level>"
if logging_config.get("format", "").lower() == "json"
else format_for_human_readable_console(),
)
rotation = logging_config["rotation"] rotation = logging_config["rotation"]
log_file = logging_config["file"] log_file = logging_config["file"]
@@ -356,9 +365,10 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
f"{log_file}.{i}_{level.lower()}", f"{log_file}.{i}_{level.lower()}",
filter=lambda rec, lvl=level: rec["level"].name == lvl, filter=lambda rec, lvl=level: rec["level"].name == lvl,
rotation=rotation, rotation=rotation,
format="{extra[serialized]}",
) )
elif log_file: elif log_file:
logger.add(log_file, rotation=rotation, level=use_level) logger.add(log_file, rotation=rotation, level=use_level, format="{extra[serialized]}")
def install_modules(self, modules_by_type): def install_modules(self, modules_by_type):
""" """
@@ -466,13 +476,9 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
update_cmd = "`docker pull bellingcat/auto-archiver:latest`" update_cmd = "`docker pull bellingcat/auto-archiver:latest`"
else: else:
update_cmd = "`pip install --upgrade auto-archiver`" update_cmd = "`pip install --upgrade auto-archiver`"
logger.warning("")
logger.warning("********* IMPORTANT: UPDATE AVAILABLE ********")
logger.warning( logger.warning(
f"A new version of auto-archiver is available (v{latest_version}, you have v{current_version})" f"\n********* IMPORTANT: UPDATE AVAILABLE ********\nA new version of auto-archiver is available (v{latest_version}, you have v{current_version})\nMake sure to update to the latest version using: {update_cmd}\n"
) )
logger.warning(f"Make sure to update to the latest version using: {update_cmd}")
logger.warning("")
def setup(self, args: list): def setup(self, args: list):
""" """
@@ -522,7 +528,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
self.setup(args) self.setup(args)
return self.feed() return self.feed()
except Exception as e: except Exception as e:
logger.error(e) logger.error(f"{e}: {traceback.format_exc()}")
exit(1) exit(1)
def cleanup(self) -> None: def cleanup(self) -> None:
@@ -534,8 +540,10 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
url_count = 0 url_count = 0
for feeder in self.feeders: for feeder in self.feeders:
for item in feeder: for item in feeder:
yield self.feed_item(item) with logger.contextualize(url=item.get_url(), trace=random_str(12)):
url_count += 1 logger.info("Started processing")
yield self.feed_item(item)
url_count += 1
logger.info(f"Processed {url_count} URL(s)") logger.info(f"Processed {url_count} URL(s)")
self.cleanup() self.cleanup()
@@ -555,13 +563,13 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
return self.archive(item) return self.archive(item)
except KeyboardInterrupt: except KeyboardInterrupt:
# catches keyboard interruptions to do a clean exit # catches keyboard interruptions to do a clean exit
logger.warning(f"caught interrupt on {item=}") logger.warning("Caught interrupt")
for d in self.databases: for d in self.databases:
d.aborted(item) d.aborted(item)
self.cleanup() self.cleanup()
exit() exit()
except Exception as e: except Exception as e:
logger.error(f"Got unexpected error on item {item}: {e}\n{traceback.format_exc()}") logger.error(f"Got unexpected error: {e}\n{traceback.format_exc()}")
for d in self.databases: for d in self.databases:
if isinstance(e, AssertionError): if isinstance(e, AssertionError):
d.failed(item, str(e)) d.failed(item, str(e))
@@ -589,7 +597,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try: try:
check_url_or_raise(original_url) check_url_or_raise(original_url)
except ValueError as e: except ValueError as e:
logger.error(f"Error archiving URL {original_url}: {e}") logger.error(f"Error archiving: {e}")
raise e raise e
# 1 - sanitize - each archiver is responsible for cleaning/expanding its own URLs # 1 - sanitize - each archiver is responsible for cleaning/expanding its own URLs
@@ -599,7 +607,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
result.set_url(url) result.set_url(url)
if original_url != url: if original_url != url:
logger.debug(f"Sanitized URL from {original_url} to {url}") logger.debug(f"Sanitized URL to {url}")
result.set("original_url", original_url) result.set("original_url", original_url)
# 2 - notify start to DBs, propagate already archived if feature enabled in DBs # 2 - notify start to DBs, propagate already archived if feature enabled in DBs
@@ -614,25 +622,25 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try: try:
d.done(cached_result, cached=True) d.done(cached_result, cached=True)
except Exception as e: except Exception as e:
logger.error(f"ERROR database {d.name}: {e}: {traceback.format_exc()}") logger.error(f"Database {d.name}: {e}: {traceback.format_exc()}")
return cached_result return cached_result
# 3 - call extractors until one succeeds # 3 - call extractors until one succeeds
for a in self.extractors: for a in self.extractors:
logger.info(f"Trying extractor {a.name} for {url}") logger.info(f"Trying extractor {a.name}")
try: try:
result.merge(a.download(result)) result.merge(a.download(result))
if result.is_success(): if result.is_success():
break break
except Exception as e: except Exception as e:
logger.error(f"ERROR archiver {a.name}: {e}: {traceback.format_exc()}") logger.error(f"Extractor {a.name}: {e}: {traceback.format_exc()}")
# 4 - call enrichers to work with archived content # 4 - call enrichers to work with archived content
for e in self.enrichers: for e in self.enrichers:
try: try:
e.enrich(result) e.enrich(result)
except Exception as exc: except Exception as exc:
logger.error(f"ERROR enricher {e.name}: {exc}: {traceback.format_exc()}") logger.error(f"Enricher {e.name}: {exc}: {traceback.format_exc()}")
# 5 - store all downloaded/generated media # 5 - store all downloaded/generated media
result.store(storages=self.storages) result.store(storages=self.storages)
@@ -651,7 +659,7 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try: try:
d.done(result) d.done(result)
except Exception as e: except Exception as e:
logger.error(f"ERROR database {d.name}: {e}: {traceback.format_exc()}") logger.error(f"Database {d.name}: {e}: {traceback.format_exc()}")
return result return result


@@ -24,7 +24,7 @@ from abc import abstractmethod
from typing import IO from typing import IO
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from slugify import slugify from slugify import slugify
from auto_archiver.utils.misc import random_str from auto_archiver.utils.misc import random_str


@@ -7,7 +7,7 @@ from urllib.parse import urljoin
import glob import glob
import importlib.util import importlib.util
from loguru import logger from auto_archiver.utils.custom_logger import logger
import selenium import selenium
from seleniumbase import SB from seleniumbase import SB
@@ -57,7 +57,7 @@ class AntibotExtractorEnricher(Extractor, Enricher):
continue # Skip imported modules/classes/functions continue # Skip imported modules/classes/functions
if isinstance(obj, type) and issubclass(obj, Dropin): if isinstance(obj, type) and issubclass(obj, Dropin):
dropins.append(obj) dropins.append(obj)
logger.debug(f"ANTIBOT loaded drop-in classes: {', '.join([d.__name__ for d in dropins])}") logger.debug(f"Loaded drop-in classes: {', '.join([d.__name__ for d in dropins])}")
return dropins return dropins
def sanitize_url(self, url: str) -> str: def sanitize_url(self, url: str) -> str:
@@ -83,14 +83,13 @@ class AntibotExtractorEnricher(Extractor, Enricher):
def enrich(self, to_enrich: Metadata, custom_data_dir: bool = True) -> bool: def enrich(self, to_enrich: Metadata, custom_data_dir: bool = True) -> bool:
using_user_data_dir = self.user_data_dir if custom_data_dir else None using_user_data_dir = self.user_data_dir if custom_data_dir else None
url = to_enrich.get_url() url = to_enrich.get_url()
url_sample = url[:75]
try: try:
with SB(uc=True, agent=self.agent, headed=None, user_data_dir=using_user_data_dir, proxy=self.proxy) as sb: with SB(uc=True, agent=self.agent, headed=None, user_data_dir=using_user_data_dir, proxy=self.proxy) as sb:
logger.info(f"ANTIBOT selenium browser is up with agent {self.agent}, opening {url_sample}...") logger.info(f"Selenium browser is up with agent {self.agent}, opening url...")
sb.uc_open_with_reconnect(url, 4) sb.uc_open_with_reconnect(url, 4)
logger.debug(f"ANTIBOT handling CAPTCHAs for {url_sample}...") logger.debug("Handling CAPTCHAs for...")
sb.uc_gui_handle_cf() sb.uc_gui_handle_cf()
sb.uc_gui_click_rc() # NB: using handle instead of click breaks some sites like reddit, for now we separate here but can have dropins deciding this in the future sb.uc_gui_click_rc() # NB: using handle instead of click breaks some sites like reddit, for now we separate here but can have dropins deciding this in the future
@@ -98,7 +97,7 @@ class AntibotExtractorEnricher(Extractor, Enricher):
dropin.open_page(url) dropin.open_page(url)
if self.detect_auth_wall and self._hit_auth_wall(sb): if self.detect_auth_wall and self._hit_auth_wall(sb):
logger.warning(f"ANTIBOT SKIP since auth wall or CAPTCHA was detected for {url_sample}") logger.warning("Skipping since auth wall or CAPTCHA was detected")
return False return False
sb.wait_for_ready_state_complete() sb.wait_for_ready_state_complete()
@@ -125,18 +124,18 @@ class AntibotExtractorEnricher(Extractor, Enricher):
js_css_selector=dropin.js_for_video_css_selectors(), js_css_selector=dropin.js_for_video_css_selectors(),
max_media=self.max_download_videos - downloaded_videos, max_media=self.max_download_videos - downloaded_videos,
) )
logger.info(f"ANTIBOT completed for {url_sample}") logger.info("Completed")
return to_enrich return to_enrich
except selenium.common.exceptions.SessionNotCreatedException as e: except selenium.common.exceptions.SessionNotCreatedException as e:
if custom_data_dir: # the retry logic only works once if custom_data_dir: # the retry logic only works once
logger.error( logger.error(
f"ANTIBOT session not created error: {e}. Please remove the user_data_dir {self.user_data_dir} and try again, will retry without user data dir though." f"Session not created error: {e}. Please remove the user_data_dir {self.user_data_dir} and try again, will retry without user data dir though."
) )
return self.enrich(to_enrich, custom_data_dir=False) return self.enrich(to_enrich, custom_data_dir=False)
raise e # re-raise raise e # re-raise
except Exception as e: except Exception as e:
logger.error(f"ANTIBOT runtime error: {e}: {traceback.format_exc()}") logger.error(f"Runtime error: {e}: {traceback.format_exc()}")
return False return False
def _get_suitable_dropin(self, url: str, sb: SB): def _get_suitable_dropin(self, url: str, sb: SB):
@@ -146,7 +145,7 @@ class AntibotExtractorEnricher(Extractor, Enricher):
""" """
for dropin in self.dropins: for dropin in self.dropins:
if dropin.suitable(url): if dropin.suitable(url):
logger.debug(f"ANTIBOT using drop-in {dropin.__name__} for {url}") logger.debug(f"Using drop-in {dropin.__name__}")
return dropin(sb, self) return dropin(sb, self)
return DefaultDropin(sb, self) return DefaultDropin(sb, self)


@@ -1,6 +1,7 @@
import os import os
import traceback
from typing import Mapping from typing import Mapping
from loguru import logger from auto_archiver.utils.custom_logger import logger
from seleniumbase import SB from seleniumbase import SB
import yt_dlp import yt_dlp
@@ -143,7 +144,7 @@ class Dropin:
with yt_dlp.YoutubeDL(validated_options) as ydl: with yt_dlp.YoutubeDL(validated_options) as ydl:
for url in video_urls: for url in video_urls:
try: try:
logger.debug(f"Downloading video from URL: {url}") logger.debug(f"Downloading video from url: {url}")
info = ydl.extract_info(url, download=True) info = ydl.extract_info(url, download=True)
filename = ydl_entry_to_filename(ydl, info) filename = ydl_entry_to_filename(ydl, info)
if not filename: # Failed to download video. if not filename: # Failed to download video.
@@ -155,5 +156,5 @@ class Dropin:
to_enrich.add_media(media) to_enrich.add_media(media)
downloaded += 1 downloaded += 1
except Exception as e: except Exception as e:
logger.error(f"Error downloading {url}: {e}") logger.error(f"Download failed: {e} {traceback.format_exc()}")
return downloaded return downloaded


@@ -1,5 +1,5 @@
from typing import Mapping from typing import Mapping
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
@@ -62,7 +62,7 @@ class LinkedinDropin(Dropin):
self.sb.wait_for_ready_state_complete() self.sb.wait_for_ready_state_complete()
username, password = self._get_username_password("linkedin.com") username, password = self._get_username_password("linkedin.com")
logger.debug("LinkedinDropin Logging in to Linkedin with username: {}", username) logger.debug("Logging in to Linkedin with username: {}", username)
self.sb.type("#username", username) self.sb.type("#username", username)
self.sb.type("#password", password) self.sb.type("#password", password)
self.sb.click_if_visible("#password-visibility-toggle", timeout=0.5) self.sb.click_if_visible("#password-visibility-toggle", timeout=0.5)


@@ -3,7 +3,7 @@ from typing import Mapping
from auto_archiver.core.metadata import Metadata from auto_archiver.core.metadata import Metadata
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
from loguru import logger from auto_archiver.utils.custom_logger import logger
class RedditDropin(Dropin): class RedditDropin(Dropin):
@@ -50,7 +50,7 @@ class RedditDropin(Dropin):
self._close_cookies_banner() self._close_cookies_banner()
username, password = self._get_username_password("reddit.com") username, password = self._get_username_password("reddit.com")
logger.debug("RedditDropin Logging in to Reddit with username: {}", username) logger.debug("Logging in to Reddit with username: {}", username)
self.sb.type("#login-username", username) self.sb.type("#login-username", username)
self.sb.type("#login-password", password) self.sb.type("#login-password", password)
@@ -68,7 +68,7 @@ class RedditDropin(Dropin):
self.sb.click_link_text("Log in") self.sb.click_link_text("Log in")
self.sb.wait_for_ready_state_complete() self.sb.wait_for_ready_state_complete()
if self.sb.is_text_visible("Welcome back"): if self.sb.is_text_visible("Welcome back"):
logger.debug("RedditDropin Login successful") logger.debug("Login successful")
self.sb.click_if_visible("this link") self.sb.click_if_visible("this link")
def _close_cookies_banner(self): def _close_cookies_banner(self):
@@ -88,5 +88,5 @@ class RedditDropin(Dropin):
.map(el => el.src || el.href) .map(el => el.src || el.href)
.filter(url => url && /\.(m3u8|mpd|ism)$/.test(url)); .filter(url => url && /\.(m3u8|mpd|ism)$/.test(url));
""") """)
logger.debug("RedditDropin Found {} video URLs", len(filtered_urls)) logger.debug("Found {} video URLs", len(filtered_urls))
return 0, self._download_videos_with_ytdlp(filtered_urls, to_enrich) return 0, self._download_videos_with_ytdlp(filtered_urls, to_enrich)


@@ -4,7 +4,7 @@ from typing import Mapping
from auto_archiver.core.metadata import Metadata from auto_archiver.core.metadata import Metadata
from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin from auto_archiver.modules.antibot_extractor_enricher.dropin import Dropin
from loguru import logger from auto_archiver.utils.custom_logger import logger
class VkDropin(Dropin): class VkDropin(Dropin):


@@ -2,7 +2,7 @@ from typing import Union
import os import os
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Database from auto_archiver.core import Database
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -36,9 +36,9 @@ class AAApiDb(Database):
if not self.store_results: if not self.store_results:
return return
if cached: if cached:
logger.debug(f"skipping saving archive of {item.get_url()} to the AA API because it was cached") logger.debug("Skipping saving archive to AA API because it was cached")
return return
logger.debug(f"saving archive of {item.get_url()} to the AA API.") logger.debug("Saving archive to the AA API.")
payload = { payload = {
"author_id": self.author_id, "author_id": self.author_id,


@@ -3,7 +3,7 @@ import os
from typing import IO, Iterator, Optional, Union from typing import IO, Iterator, Optional, Union
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Database, Feeder, Media, Metadata, Storage from auto_archiver.core import Database, Feeder, Media, Metadata, Storage
from auto_archiver.utils import calculate_file_hash from auto_archiver.utils import calculate_file_hash
@@ -66,13 +66,13 @@ class AtlosFeederDbStorage(Feeder, Database, Storage):
"""Mark an item as failed in Atlos, if the ID exists.""" """Mark an item as failed in Atlos, if the ID exists."""
atlos_id = item.metadata.get("atlos_id") atlos_id = item.metadata.get("atlos_id")
if not atlos_id: if not atlos_id:
logger.info(f"Item {item.get_url()} has no Atlos ID, skipping") logger.info("No Atlos ID available, skipping")
return return
self._post( self._post(
f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver", f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver",
json={"metadata": {"processed": True, "status": "error", "error": reason}}, json={"metadata": {"processed": True, "status": "error", "error": reason}},
) )
logger.info(f"Stored failure for {item.get_url()} (ID {atlos_id}) on Atlos: {reason}") logger.info(f"Stored failure ID {atlos_id} on Atlos: {reason}")
def fetch(self, item: Metadata) -> Union[Metadata, bool]: def fetch(self, item: Metadata) -> Union[Metadata, bool]:
"""check and fetch if the given item has been archived already, each """check and fetch if the given item has been archived already, each
@@ -88,7 +88,7 @@ class AtlosFeederDbStorage(Feeder, Database, Storage):
"""Mark an item as successfully archived in Atlos.""" """Mark an item as successfully archived in Atlos."""
atlos_id = item.metadata.get("atlos_id") atlos_id = item.metadata.get("atlos_id")
if not atlos_id: if not atlos_id:
logger.info(f"Item {item.get_url()} has no Atlos ID, skipping") logger.info("Item has no Atlos ID, skipping")
return return
self._post( self._post(
f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver", f"/api/v2/source_material/metadata/{atlos_id}/auto_archiver",
@@ -100,7 +100,7 @@ class AtlosFeederDbStorage(Feeder, Database, Storage):
} }
}, },
) )
logger.info(f"Stored success for {item.get_url()} (ID {atlos_id}) on Atlos") logger.info(f"Stored success ID {atlos_id} on Atlos")
# ! Atlos Module - Storage Methods # ! Atlos Module - Storage Methods


@@ -1,5 +1,3 @@
from loguru import logger
from auto_archiver.core.feeder import Feeder from auto_archiver.core.feeder import Feeder
from auto_archiver.core.metadata import Metadata from auto_archiver.core.metadata import Metadata
from auto_archiver.core.consts import SetupError from auto_archiver.core.consts import SetupError
@@ -16,8 +14,5 @@ class CLIFeeder(Feeder):
def __iter__(self) -> Metadata: def __iter__(self) -> Metadata:
urls = self.config["urls"] urls = self.config["urls"]
for url in urls: for url in urls:
logger.debug(f"Processing {url}")
m = Metadata().set_url(url) m = Metadata().set_url(url)
yield m yield m
logger.success(f"Processed {len(urls)} URL(s)")


@@ -1,4 +1,4 @@
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Database from auto_archiver.core import Database
from auto_archiver.core import Metadata from auto_archiver.core import Metadata


@@ -1,5 +1,5 @@
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from csv import DictWriter from csv import DictWriter
from dataclasses import asdict from dataclasses import asdict


@@ -1,4 +1,4 @@
from loguru import logger from auto_archiver.utils.custom_logger import logger
import csv import csv
from auto_archiver.core import Feeder from auto_archiver.core import Feeder
@@ -35,5 +35,4 @@ class CSVFeeder(Feeder):
logger.warning(f"Not a valid URL in row: {row}, skipping") logger.warning(f"Not a valid URL in row: {row}, skipping")
continue continue
url = row[url_column] url = row[url_column]
logger.debug(f"Processing {url}")
yield Metadata().set_url(url) yield Metadata().set_url(url)


@@ -8,7 +8,7 @@ from google.oauth2 import service_account
from google.oauth2.credentials import Credentials from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload from googleapiclient.http import MediaFileUpload
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Media from auto_archiver.core import Media
from auto_archiver.core import Storage from auto_archiver.core import Storage
@@ -62,7 +62,7 @@ class GDriveStorage(Storage):
parent_id, folder_id = self.root_folder_id, None parent_id, folder_id = self.root_folder_id, None
path_parts = media.key.split(os.path.sep) path_parts = media.key.split(os.path.sep)
filename = path_parts[-1] filename = path_parts[-1]
logger.info(f"looking for folders for {path_parts[0:-1]} before getting url for {filename=}") logger.info(f"Looking for folders for {path_parts[0:-1]} before getting url for {filename=}")
for folder in path_parts[0:-1]: for folder in path_parts[0:-1]:
folder_id = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=True) folder_id = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=True)
parent_id = folder_id parent_id = folder_id
@@ -70,7 +70,7 @@ class GDriveStorage(Storage):
file_id = self._get_id_from_parent_and_name(folder_id, filename, raise_on_missing=True) file_id = self._get_id_from_parent_and_name(folder_id, filename, raise_on_missing=True)
if not file_id: if not file_id:
# #
logger.info(f"file {filename} not found in folder {folder_id}") logger.info(f"File {filename} not found in folder {folder_id}")
return None return None
return f"https://drive.google.com/file/d/{file_id}/view?usp=sharing" return f"https://drive.google.com/file/d/{file_id}/view?usp=sharing"
@@ -83,7 +83,7 @@ class GDriveStorage(Storage):
parent_id, upload_to = self.root_folder_id, None parent_id, upload_to = self.root_folder_id, None
path_parts = media.key.split(os.path.sep) path_parts = media.key.split(os.path.sep)
filename = path_parts[-1] filename = path_parts[-1]
logger.info(f"checking folders {path_parts[0:-1]} exist (or creating) before uploading {filename=}") logger.info(f"Checking folders {path_parts[0:-1]} exist (or creating) before uploading {filename=}")
for folder in path_parts[0:-1]: for folder in path_parts[0:-1]:
upload_to = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=False) upload_to = self._get_id_from_parent_and_name(parent_id, folder, use_mime_type=True, raise_on_missing=False)
if upload_to is None: if upload_to is None:
@@ -91,7 +91,7 @@ class GDriveStorage(Storage):
parent_id = upload_to parent_id = upload_to
# upload file to gd # upload file to gd
logger.debug(f"uploading {filename=} to folder id {upload_to}") logger.debug(f"Uploading {filename=} to folder id {upload_to}")
file_metadata = {"name": [filename], "parents": [upload_to]} file_metadata = {"name": [filename], "parents": [upload_to]}
try: try:
media = MediaFileUpload(media.filename, resumable=True) media = MediaFileUpload(media.filename, resumable=True)
@@ -100,11 +100,11 @@ class GDriveStorage(Storage):
.create(supportsAllDrives=True, body=file_metadata, media_body=media, fields="id") .create(supportsAllDrives=True, body=file_metadata, media_body=media, fields="id")
.execute() .execute()
) )
logger.debug(f"uploadf: uploaded file {gd_file['id']} successfully in folder={upload_to}") logger.debug(f"Uploadf: uploaded file {gd_file['id']} successfully in folder={upload_to}")
except FileNotFoundError as e: except FileNotFoundError as e:
logger.error(f"gd uploadf: file not found {media.filename=} - {e}") logger.error(f"GD uploadf: file not found {media.filename=} - {e}")
except Exception as e: except Exception as e:
logger.error(f"gd uploadf: error uploading {media.filename=} to {upload_to} - {e}") logger.error(f"GD uploadf: error uploading {media.filename=} to {upload_to} - {e}")
# must be implemented even if unused # must be implemented even if unused
def uploadf(self, file: IO[bytes], key: str, **kwargs: dict) -> bool: def uploadf(self, file: IO[bytes], key: str, **kwargs: dict) -> bool:
@@ -133,7 +133,7 @@ class GDriveStorage(Storage):
self.api_cache = getattr(self, "api_cache", {}) self.api_cache = getattr(self, "api_cache", {})
cache_key = f"{parent_id}_{name}_{use_mime_type}" cache_key = f"{parent_id}_{name}_{use_mime_type}"
if cache_key in self.api_cache: if cache_key in self.api_cache:
logger.debug(f"cache hit for {cache_key=}") logger.debug(f"Cache hit for {cache_key=}")
return self.api_cache[cache_key] return self.api_cache[cache_key]
# API logic # API logic
@@ -168,7 +168,7 @@ class GDriveStorage(Storage):
else: else:
logger.debug(f"{debug_header} not found, attempt {attempt + 1}/{retries}.") logger.debug(f"{debug_header} not found, attempt {attempt + 1}/{retries}.")
if attempt < retries - 1: if attempt < retries - 1:
logger.debug(f"sleeping for {sleep_seconds} second(s)") logger.debug(f"Sleeping for {sleep_seconds} second(s)")
time.sleep(sleep_seconds) time.sleep(sleep_seconds)
if raise_on_missing: if raise_on_missing:


@@ -58,7 +58,7 @@ If you are having issues with the extractor, you can review the version of `yt-d
}, },
"proxy": { "proxy": {
"default": "", "default": "",
"help": "http/socks (https seems to not work atm) proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port", "help": "http/https/socks proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port",
}, },
"end_means_success": { "end_means_success": {
"default": True, "default": True,


@@ -1,4 +1,4 @@
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core.extractor import Extractor from auto_archiver.core.extractor import Extractor
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media


@@ -14,7 +14,7 @@ from yt_dlp.extractor.common import InfoExtractor
from yt_dlp.utils import MaxDownloadsReached from yt_dlp.utils import MaxDownloadsReached
import pysubs2 import pysubs2
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core.extractor import Extractor from auto_archiver.core.extractor import Extractor
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -63,8 +63,7 @@ class GenericExtractor(Extractor):
if os.environ.get("AUTO_ARCHIVER_ALLOW_RESTART", "1") != "1": if os.environ.get("AUTO_ARCHIVER_ALLOW_RESTART", "1") != "1":
logger.warning("yt-dlp or plugin was updated — please restart auto-archiver manually") logger.warning("yt-dlp or plugin was updated — please restart auto-archiver manually")
else: else:
logger.warning("yt-dlp or plugin was updated — restarting auto-archiver") logger.warning("yt-dlp or plugin was updated — restarting auto-archiver\n ======= RESTARTING ======= ")
logger.warning(" ======= RESTARTING ======= ")
os.execv(sys.executable, [sys.executable] + sys.argv) os.execv(sys.executable, [sys.executable] + sys.argv)
def update_package(self, package_name: str) -> bool: def update_package(self, package_name: str) -> bool:
@@ -80,7 +79,7 @@ class GenericExtractor(Extractor):
return True return True
logger.info(f"{package_name} already up to date") logger.info(f"{package_name} already up to date")
except Exception as e: except Exception as e:
logger.error(f"Error updating {package_name}: {e}") logger.error(f"Failed to update {package_name}: {e}")
return False return False
def setup_po_tokens(self) -> None: def setup_po_tokens(self) -> None:
@@ -206,7 +205,7 @@ class GenericExtractor(Extractor):
media = Media(cover_image_path) media = Media(cover_image_path)
metadata.add_media(media, id="cover") metadata.add_media(media, id="cover")
except Exception as e: except Exception as e:
logger.error(f"Error downloading cover image {thumbnail_url}: {e}") logger.error(f"Could not download cover image {thumbnail_url}: {e}")
dropin = self.dropin_for_name(info_extractor.ie_key()) dropin = self.dropin_for_name(info_extractor.ie_key())
if dropin: if dropin:
@@ -375,7 +374,7 @@ class GenericExtractor(Extractor):
if "entries" in data: if "entries" in data:
entries = data.get("entries", []) entries = data.get("entries", [])
if not len(entries): if not len(entries):
logger.info("YoutubeDLArchiver could not find any video") logger.info("GenericExtractor could not find any video")
return False return False
else: else:
entries = [data] entries = [data]
@@ -560,17 +559,17 @@ class GenericExtractor(Extractor):
# order of importance: username/password -> api_key -> cookie -> cookies_from_browser -> cookies_file # order of importance: username/password -> api_key -> cookie -> cookies_from_browser -> cookies_file
if auth: if auth:
if "username" in auth and "password" in auth: if "username" in auth and "password" in auth:
logger.debug(f"Using provided auth username and password for {url}") logger.debug("Using provided auth username and password")
ydl_options.extend(("--username", auth["username"])) ydl_options.extend(("--username", auth["username"]))
ydl_options.extend(("--password", auth["password"])) ydl_options.extend(("--password", auth["password"]))
elif "cookie" in auth: elif "cookie" in auth:
logger.debug(f"Using provided auth cookie for {url}") logger.debug("Using provided auth cookie")
yt_dlp.utils.std_headers["cookie"] = auth["cookie"] yt_dlp.utils.std_headers["cookie"] = auth["cookie"]
elif "cookies_from_browser" in auth: elif "cookies_from_browser" in auth:
logger.debug(f"Using extracted cookies from browser {auth['cookies_from_browser']} for {url}") logger.debug(f"Using extracted cookies from browser {auth['cookies_from_browser']}")
ydl_options.extend(("--cookies-from-browser", auth["cookies_from_browser"])) ydl_options.extend(("--cookies-from-browser", auth["cookies_from_browser"]))
elif "cookies_file" in auth: elif "cookies_file" in auth:
logger.debug(f"Using cookies from file {auth['cookies_file']} for {url}") logger.debug(f"Using cookies from file {auth['cookies_file']}")
ydl_options.extend(("--cookies", auth["cookies_file"])) ydl_options.extend(("--cookies", auth["cookies_file"]))
# Applying user-defined extractor_args # Applying user-defined extractor_args
@@ -584,7 +583,7 @@ class GenericExtractor(Extractor):
ydl_options.extend(["--extractor-args", f"{key}:{arg_str}"]) ydl_options.extend(["--extractor-args", f"{key}:{arg_str}"])
if self.ytdlp_args: if self.ytdlp_args:
logger.debug("Adding additional ytdlp arguments: {self.ytdlp_args}") logger.debug(f"Adding additional ytdlp arguments: {self.ytdlp_args}")
ydl_options += self.ytdlp_args.split(" ") ydl_options += self.ytdlp_args.split(" ")
*_, validated_options = yt_dlp.parse_options(ydl_options) *_, validated_options = yt_dlp.parse_options(ydl_options)

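Review note on the last hunk above, which adds a missing `f` prefix to the `ytdlp_args` debug message. A minimal sketch of the difference, using plain strings rather than the logger call; the variable `ytdlp_args` and its value are illustrative stand-ins for `self.ytdlp_args`:

```python
# Illustrative value only; stands in for self.ytdlp_args.
ytdlp_args = "--no-playlist --retries 3"

# Without the f prefix the braces are ordinary characters in the string.
plain = "Adding additional ytdlp arguments: {ytdlp_args}"

# With the f prefix the expression is evaluated when the string is built.
interpolated = f"Adding additional ytdlp arguments: {ytdlp_args}"

print(plain)
print(interpolated)
```

The first message would be logged with the placeholder text intact, which is the behaviour the hunk corrects.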

@@ -1,5 +1,5 @@
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from yt_dlp.extractor.tiktok import TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE from yt_dlp.extractor.tiktok import TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE
@@ -22,7 +22,7 @@ class Tiktok(GenericDropin):
return any(extractor().suitable(url) for extractor in (TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE)) return any(extractor().suitable(url) for extractor in (TikTokIE, TikTokLiveIE, TikTokVMIE, TikTokUserIE))
def extract_post(self, url: str, ie_instance): def extract_post(self, url: str, ie_instance):
logger.debug(f"Using Tikwm API to attempt to download tiktok video from {url=}") logger.debug("Using Tikwm API to attempt to download tiktok video")
endpoint = self.TIKWM_ENDPOINT.format(url=url) endpoint = self.TIKWM_ENDPOINT.format(url=url)
@@ -62,7 +62,7 @@ class Tiktok(GenericDropin):
# get the video or fail # get the video or fail
video_downloaded = archiver.download_from_url(video_url, f"vid_{post.get('id', '')}") video_downloaded = archiver.download_from_url(video_url, f"vid_{post.get('id', '')}")
if not video_downloaded: if not video_downloaded:
logger.error(f"failed to download video from {video_url}") logger.error("Failed to download video")
return False return False
video_media = Media(video_downloaded) video_media = Media(video_downloaded)
if duration := post.get("duration", None): if duration := post.get("duration", None):


@@ -1,7 +1,7 @@
import re import re
import mimetypes import mimetypes
from loguru import logger from auto_archiver.utils.custom_logger import logger
from slugify import slugify from slugify import slugify
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media


@@ -10,11 +10,12 @@ The filtered rows are processed into `Metadata` objects.
""" """
import os import os
import traceback
from typing import Tuple, Union, Iterator from typing import Tuple, Union, Iterator
from urllib.parse import quote from urllib.parse import quote
import gspread import gspread
from loguru import logger from auto_archiver.utils.custom_logger import logger
from slugify import slugify from slugify import slugify
from retrying import retry from retrying import retry
@@ -41,18 +42,18 @@ class GsheetsFeederDB(Feeder, Database):
sh = self.open_sheet() sh = self.open_sheet()
for ii, worksheet in enumerate(sh.worksheets()): for ii, worksheet in enumerate(sh.worksheets()):
if not self.should_process_sheet(worksheet.title): if not self.should_process_sheet(worksheet.title):
logger.debug(f"SKIPPED worksheet '{worksheet.title}' due to allow/block rules") logger.debug(f"Skipped worksheet '{worksheet.title}' due to allow/block rules")
continue continue
logger.info(f"Opening worksheet {ii=}: {worksheet.title=} header={self.header}") logger.info(f"Opening worksheet {ii=}: {worksheet.title=} header={self.header}")
gw = GWorksheet(worksheet, header_row=self.header, columns=self.columns) gw = GWorksheet(worksheet, header_row=self.header, columns=self.columns)
if len(missing_cols := self.missing_required_columns(gw)): if len(missing_cols := self.missing_required_columns(gw)):
logger.debug( logger.debug(
f"SKIPPED worksheet '{worksheet.title}' due to missing required column(s) for {missing_cols}" f"Skipped worksheet '{worksheet.title}' due to missing required column(s) for {missing_cols}"
) )
continue continue
with logger.contextualize(worksheet=f"{sh.title}:{worksheet.title}"):
# process and yield metadata here: # process and yield metadata here:
yield from self._process_rows(gw) yield from self._process_rows(gw)
logger.info(f"Finished worksheet {worksheet.title}") logger.info(f"Finished worksheet {worksheet.title}")
def _process_rows(self, gw: GWorksheet): def _process_rows(self, gw: GWorksheet):
@@ -69,7 +70,9 @@ class GsheetsFeederDB(Feeder, Database):
# All checks done - archival process starts here # All checks done - archival process starts here
m = Metadata().set_url(url) m = Metadata().set_url(url)
self._set_context(m, gw, row) self._set_context(m, gw, row)
yield m
with logger.contextualize(row=row):
yield m
def _set_context(self, m: Metadata, gw: GWorksheet, row: int) -> Metadata: def _set_context(self, m: Metadata, gw: GWorksheet, row: int) -> Metadata:
# TODO: Check folder value not being recognised # TODO: Check folder value not being recognised
@@ -99,16 +102,16 @@ class GsheetsFeederDB(Feeder, Database):
return missing return missing
def started(self, item: Metadata) -> None: def started(self, item: Metadata) -> None:
logger.info(f"STARTED {item}") logger.info("STARTED")
gw, row = self._retrieve_gsheet(item) gw, row = self._retrieve_gsheet(item)
gw.set_cell(row, "status", "Archive in progress") gw.set_cell(row, "status", "Archive in progress")
def failed(self, item: Metadata, reason: str) -> None: def failed(self, item: Metadata, reason: str) -> None:
logger.error(f"FAILED {item}") logger.error("FAILED")
self._safe_status_update(item, f"Archive failed {reason}") self._safe_status_update(item, f"Archive failed {reason}")
def aborted(self, item: Metadata) -> None: def aborted(self, item: Metadata) -> None:
logger.warning(f"ABORTED {item}") logger.warning("ABORTED")
self._safe_status_update(item, "") self._safe_status_update(item, "")
def fetch(self, item: Metadata) -> Union[Metadata, bool]: def fetch(self, item: Metadata) -> Union[Metadata, bool]:
@@ -117,13 +120,13 @@ class GsheetsFeederDB(Feeder, Database):
def done(self, item: Metadata, cached: bool = False) -> None: def done(self, item: Metadata, cached: bool = False) -> None:
"""archival result ready - should be saved to DB""" """archival result ready - should be saved to DB"""
logger.success(f"DONE {item.get_url()}")
gw, row = self._retrieve_gsheet(item) gw, row = self._retrieve_gsheet(item)
# self._safe_status_update(item, 'done')
cell_updates = [] cell_updates = []
row_values = gw.get_row(row) row_values = gw.get_row(row)
logger.info("DONE")
def batch_if_valid(col, val, final_value=None): def batch_if_valid(col, val, final_value=None):
final_value = final_value or val final_value = final_value or val
try: try:
@@ -175,9 +178,7 @@ class GsheetsFeederDB(Feeder, Database):
) )
@retry( @retry(
wait_incrementing_start=1000, wait_exponential_multiplier=1,
wait_incrementing_increment=3000,
wait_incrementing_max=20_000,
stop_max_attempt_number=5, stop_max_attempt_number=5,
) )
def batch_set_cell_with_retry(gw, cell_updates: list): def batch_set_cell_with_retry(gw, cell_updates: list):
@@ -190,15 +191,13 @@ class GsheetsFeederDB(Feeder, Database):
gw, row = self._retrieve_gsheet(item) gw, row = self._retrieve_gsheet(item)
gw.set_cell(row, "status", new_status) gw.set_cell(row, "status", new_status)
except Exception as e: except Exception as e:
logger.debug(f"Unable to update sheet: {e}") logger.debug(f"Unable to update sheet: {e}: {traceback.format_exc()}")
def _retrieve_gsheet(self, item: Metadata) -> Tuple[GWorksheet, int]: def _retrieve_gsheet(self, item: Metadata) -> Tuple[GWorksheet, int]:
if gsheet := item.get_context("gsheet"): if gsheet := item.get_context("gsheet"):
gw: GWorksheet = gsheet.get("worksheet") gw: GWorksheet = gsheet.get("worksheet")
row: int = gsheet.get("row") row: int = gsheet.get("row")
elif self.sheet_id: elif self.sheet_id:
logger.error( logger.error("Unable to retrieve Gsheet, GsheetDB must be used alongside GsheetFeeder.")
f"Unable to retrieve Gsheet for {item.get_url()}, GsheetDB must be used alongside GsheetFeeder."
)
return gw, row return gw, row


@@ -1,4 +1,5 @@
from gspread import utils from gspread import utils
from retrying import retry
class GWorksheet: class GWorksheet:
@@ -26,6 +27,10 @@ class GWorksheet:
"replaywebpage": "replaywebpage", "replaywebpage": "replaywebpage",
} }
@retry(
wait_exponential_multiplier=1,
stop_max_attempt_number=6,
)
def __init__(self, worksheet, columns=COLUMN_NAMES, header_row=1): def __init__(self, worksheet, columns=COLUMN_NAMES, header_row=1):
self.wks = worksheet self.wks = worksheet
self.columns = columns self.columns = columns
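Review note on the two `@retry(wait_exponential_multiplier=1, ...)` decorators in this change: as far as I understand the `retrying` package, exponential waits are computed as `2^attempt * multiplier` and interpreted in milliseconds. If that reading is right, a multiplier of 1 backs off by only a few milliseconds in total. The sketch below is a stand-alone approximation of that schedule, not the library's code; `backoff_delays_ms` and its parameters are hypothetical names used only for illustration.

```python
from typing import List, Optional


def backoff_delays_ms(multiplier_ms: int, max_attempts: int, cap_ms: Optional[int] = None) -> List[int]:
    """Approximate the waits between attempts for an exponential backoff.

    Assumption: the wait after attempt n (1-based) is 2**n * multiplier_ms,
    optionally capped at cap_ms. No wait follows the final attempt.
    """
    delays = []
    for attempt in range(1, max_attempts):
        delay = (2 ** attempt) * multiplier_ms
        if cap_ms is not None:
            delay = min(delay, cap_ms)
        delays.append(delay)
    return delays


# Settings as they appear in the diff: multiplier of 1 with 5 or 6 attempts.
sheet_write = backoff_delays_ms(1, 5)
sheet_read = backoff_delays_ms(1, 6)

# For comparison: a multiplier of 1000 with a 10 second cap.
slower = backoff_delays_ms(1000, 5, cap_ms=10_000)

print(sheet_write, sum(sheet_write))
print(sheet_read, sum(sheet_read))
print(slower, sum(slower))
```

Under these assumptions the whole retry sequence for the batch write spends well under a second waiting, which may or may not be what the "slow worksheets" fix intends; it is worth confirming the unit against the `retrying` documentation.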


@@ -9,7 +9,7 @@ making it suitable for handling large files efficiently.
""" """
import hashlib import hashlib
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -22,8 +22,7 @@ class HashEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug(f"Calculating media hashes with algo={self.algorithm}")
logger.debug(f"calculating media hashes for {url=} (using {self.algorithm})")
for i, m in enumerate(to_enrich.media): for i, m in enumerate(to_enrich.media):
if len(hd := self.calculate_hash(m.filename)): if len(hd := self.calculate_hash(m.filename)):


@@ -4,7 +4,7 @@ import os
import pathlib import pathlib
from jinja2 import Environment, FileSystemLoader from jinja2 import Environment, FileSystemLoader
from urllib.parse import quote from urllib.parse import quote
from loguru import logger from auto_archiver.utils.custom_logger import logger
import json import json
import base64 import base64
@@ -35,7 +35,7 @@ class HtmlFormatter(Formatter):
def format(self, item: Metadata) -> Media: def format(self, item: Metadata) -> Media:
url = item.get_url() url = item.get_url()
if item.is_empty(): if item.is_empty():
logger.debug(f"[SKIP] FORMAT there is no media or metadata to format: {url=}") logger.debug("Nothing to format, skipping")
return return
content = self.template.render( content = self.template.render(


@@ -22,7 +22,7 @@
"full_profile_max_posts": { "full_profile_max_posts": {
"default": 0, "default": 0,
"type": "int", "type": "int",
"help": "Use to limit the number of posts to download when full_profile is true. 0 means no limit. limit is applied softly since posts are fetched in batch, once to: posts, tagged posts, and highlights", "help": "Use to limit the number of posts to download when full_profile is true or when a URL for multiple posts is passed (like /stories /highlights ...). 0 means no limit. when full_profile is true the order of downloaded content is stories -> posts -> tagged posts -> highlights, so a value of 10 could download 2 stories, 7 posts, 1 tagged posts, and 0 highlights.",
}, },
"minimize_json_output": { "minimize_json_output": {
"default": True, "default": True,
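Review note on the new `full_profile_max_posts` help text: the shared budget it describes (stories, then posts, then tagged posts, then highlights, with 0 meaning no limit) can be sketched as below. This is a simplified model under stated assumptions, not the extractor's implementation: it ignores paging and per-highlight fetching, and `split_budget` and `available` are hypothetical names introduced for the example.

```python
import math
from typing import Dict

# Order taken from the help text: stories -> posts -> tagged posts -> highlights.
ORDER = ("stories", "posts", "tagged", "highlights")


def split_budget(available: Dict[str, int], max_posts: int = 0) -> Dict[str, int]:
    """Spend one shared download budget across content types in a fixed order.

    available: how many items of each kind exist (hypothetical input).
    max_posts: 0 is treated as "no limit", mirroring the option's default.
    """
    budget = math.inf if max_posts == 0 else max_posts
    taken: Dict[str, int] = {}
    used = 0
    for kind in ORDER:
        remaining = budget - used
        # min() returns the int whenever the int is smaller than math.inf.
        count = min(available.get(kind, 0), remaining)
        taken[kind] = count
        used += count
    return taken


profile = {"stories": 2, "posts": 7, "tagged": 5, "highlights": 3}
print(split_budget(profile, max_posts=10))
print(split_budget(profile, max_posts=0))
```

With a limit of 10 this reproduces the split given in the help text (2 stories, 7 posts, 1 tagged post, 0 highlights), and with 0 everything available is taken.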


@@ -8,11 +8,13 @@ data, reducing JSON output size, and handling large profiles.
""" """
import math
import re import re
from datetime import datetime from datetime import datetime
import traceback
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from retrying import retry from retrying import retry
from tqdm import tqdm from tqdm import tqdm
@@ -35,17 +37,19 @@ class InstagramAPIExtractor(Extractor):
def setup(self) -> None: def setup(self) -> None:
if self.api_endpoint[-1] == "/": if self.api_endpoint[-1] == "/":
self.api_endpoint = self.api_endpoint[:-1] self.api_endpoint = self.api_endpoint[:-1]
self.full_profile_max_posts = int(self.full_profile_max_posts or 0)
if self.full_profile_max_posts == 0:
self.full_profile_max_posts = math.inf
def download(self, item: Metadata) -> Metadata: def download(self, item: Metadata) -> Metadata:
url = item.get_url() url = item.get_url()
url.replace("instagr.com", "instagram.com").replace("instagr.am", "instagram.com") url.replace("instagr.com", "instagram.com").replace("instagr.am", "instagram.com")
insta_matches = self.valid_url.findall(url) insta_matches = self.valid_url.findall(url)
logger.info(f"{insta_matches=}")
if not len(insta_matches) or len(insta_matches[0]) != 3: if not len(insta_matches) or len(insta_matches[0]) != 3:
return return
if len(insta_matches) > 1: if len(insta_matches) > 1:
logger.warning(f"Multiple instagram matches found in {url=}, using the first one") logger.debug("Multiple instagram matches found, using the first one")
return return
g1, g2, g3 = insta_matches[0][0], insta_matches[0][1], insta_matches[0][2] g1, g2, g3 = insta_matches[0][0], insta_matches[0][1], insta_matches[0][2]
if g1 == "": if g1 == "":
@@ -61,13 +65,13 @@ class InstagramAPIExtractor(Extractor):
return self.download_post(item, id=g3, context="story") return self.download_post(item, id=g3, context="story")
return self.download_stories(item, g2) return self.download_stories(item, g2)
else: else:
logger.warning(f"Unknown instagram regex group match {g1=} found in {url=}") logger.warning(f"Unknown instagram regex group match {g1=}")
return return
@retry(wait_random_min=1000, wait_random_max=3000, stop_max_attempt_number=5) @retry(wait_random_min=1000, wait_random_max=3000, stop_max_attempt_number=5)
def call_api(self, path: str, params: dict) -> dict: def call_api(self, path: str, params: dict) -> dict:
headers = {"accept": "application/json", "x-access-key": self.access_token} headers = {"accept": "application/json", "x-access-key": self.access_token}
logger.debug(f"calling {self.api_endpoint}/{path} with {params=}") logger.debug(f"Calling {self.api_endpoint}/{path} with {params=}")
return requests.get(f"{self.api_endpoint}/{path}", headers=headers, params=params).json() return requests.get(f"{self.api_endpoint}/{path}", headers=headers, params=params).json()
def cleanup_dict(self, d: dict | list) -> dict: def cleanup_dict(self, d: dict | list) -> dict:
@@ -97,65 +101,84 @@ class InstagramAPIExtractor(Extractor):
filename = self.download_from_url(pic_url) filename = self.download_from_url(pic_url)
result.add_media(Media(filename=filename), id="profile_picture") result.add_media(Media(filename=filename), id="profile_picture")
count_posts = 0
if self.full_profile: if self.full_profile:
user_id = user.get("pk") user_id = user.get("pk")
# download all stories # download all stories
try: try:
stories = self._download_stories_reusable(result, username) stories = self._download_stories_reusable(
result, username, max_to_download=self.full_profile_max_posts - count_posts
)
count_posts += len(stories)
result.set("#stories", len(stories)) result.set("#stories", len(stories))
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading stories for {username}") result.append("errors", f"Error downloading stories for {username}")
logger.error(f"Error downloading stories for {username}: {e}") logger.error(f"Error downloading stories for {username}: {e} {traceback.format_exc()}")
# download all posts # download all posts
try: try:
self.download_all_posts(result, user_id) if count_posts < self.full_profile_max_posts:
count_posts += self.download_all_posts(
result, user_id, max_to_download=self.full_profile_max_posts - count_posts
)
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading posts for {username}") result.append("errors", f"Error downloading posts for {username}")
logger.error(f"Error downloading posts for {username}: {e}") logger.error(f"Error downloading posts for {username}: {e} {traceback.format_exc()}")
# download all tagged # download all tagged
try: try:
self.download_all_tagged(result, user_id) if count_posts < self.full_profile_max_posts:
count_posts += self.download_all_tagged(
result, user_id, max_to_download=self.full_profile_max_posts - count_posts
)
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading tagged posts for {username}") result.append("errors", f"Error downloading tagged posts for {username}")
logger.error(f"Error downloading tagged posts for {username}: {e}") logger.error(f"Error downloading tagged posts for {username}: {e} {traceback.format_exc()}")
# download all highlights # download all highlights
try: try:
self.download_all_highlights(result, username, user_id) if count_posts < self.full_profile_max_posts:
count_posts += self.download_all_highlights(
result, username, user_id, max_to_download=self.full_profile_max_posts - count_posts
)
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading highlights for {username}") result.append("errors", f"Error downloading highlights for {username}")
logger.error(f"Error downloading highlights for {username}: {e}") logger.error(f"Error downloading highlights for {username}: {e} {traceback.format_exc()}")
result.set_url(url) # reset as scrape_item modifies it result.set_url(url) # reset as scrape_item modifies it
return result.success("insta profile") return result.success("insta profile")
def download_all_highlights(self, result, username, user_id): def download_all_highlights(self, result, username, user_id, max_to_download: int) -> int:
count_highlights = 0 count_highlights = 0
highlights = self.call_api("v1/user/highlights", {"user_id": user_id}) highlights = self.call_api("v1/user/highlights", {"user_id": user_id})
highlights = highlights[: min(max_to_download, len(highlights))] # newest to oldest
for h in highlights: for h in highlights:
try: try:
h_info = self._download_highlights_reusable(result, h.get("pk")) h_info = self._download_highlights_reusable(result, h.get("pk"), max_to_download=max_to_download)
count_highlights += len(h_info.get("items", [])) count_highlights += len(h_info.get("items", []))
except Exception as e: except Exception as e:
result.append( result.append(
"errors", "errors",
f"Error downloading highlight id{h.get('pk')} for {username}", f"Error downloading highlight id{h.get('pk')} for {username}",
) )
logger.error(f"Error downloading highlight id{h.get('pk')} for {username}: {e}") logger.error(
if self.full_profile_max_posts and count_highlights >= self.full_profile_max_posts: f"Error downloading highlight id{h.get('pk')} for {username}: {e} {traceback.format_exc()}"
logger.info(f"HIGHLIGHTS reached full_profile_max_posts={self.full_profile_max_posts}") )
if count_highlights >= max_to_download:
logger.debug(f"HIGHLIGHTS reached max_to_download={self.full_profile_max_posts}")
break break
result.set("#highlights", count_highlights) result.set("#highlights", count_highlights)
return count_highlights
def download_post(self, result: Metadata, code: str = None, id: str = None, context: str = None) -> Metadata: def download_post(self, result: Metadata, code: str = None, id: str = None, context: str = "") -> Metadata:
if id: if id:
post = self.call_api("v1/media/by/id", {"id": id}) post = self.call_api("v1/media/by/id", {"id": id})
else: else:
post = self.call_api("v1/media/by/code", {"code": code}) post = self.call_api("v1/media/by/code", {"code": code})
assert post, f"Post {id or code} not found" assert post, f"Post {id or code} not found"
result.set(f"{context}_data", post)
if caption_text := post.get("caption_text"): if caption_text := post.get("caption_text"):
result.set_title(caption_text) result.set_title(caption_text)
@@ -166,13 +189,13 @@ class InstagramAPIExtractor(Extractor):
return result.success(f"insta {context or 'post'}") return result.success(f"insta {context or 'post'}")
def download_highlights(self, result: Metadata, id: str) -> Metadata: def download_highlights(self, result: Metadata, id: str) -> Metadata:
h_info = self._download_highlights_reusable(result, id) h_info = self._download_highlights_reusable(result, id, self.full_profile_max_posts)
items = len(h_info.get("items", [])) items = len(h_info.get("items", []))
del h_info["items"] del h_info["items"]
result.set_title(h_info.get("title")).set("data", h_info).set("#reels", items) result.set_title(h_info.get("title")).set("data", h_info).set("#reels", items)
return result.success("insta highlights") return result.success("insta highlights")
def _download_highlights_reusable(self, result: Metadata, id: str) -> dict: def _download_highlights_reusable(self, result: Metadata, id: str, max_to_download: int) -> dict:
full_h = self.call_api("v2/highlight/by/id", {"id": id}) full_h = self.call_api("v2/highlight/by/id", {"id": id})
h_info = full_h.get("response", {}).get("reels", {}).get(f"highlight:{id}") h_info = full_h.get("response", {}).get("reels", {}).get(f"highlight:{id}")
assert h_info, f"Highlight {id} not found: {full_h=}" assert h_info, f"Highlight {id} not found: {full_h=}"
@@ -182,38 +205,39 @@ class InstagramAPIExtractor(Extractor):
result.add_media(Media(filename=filename), id=f"cover_media highlight {id}") result.add_media(Media(filename=filename), id=f"cover_media highlight {id}")
items = h_info.get("items", [])[::-1] # newest to oldest items = h_info.get("items", [])[::-1] # newest to oldest
items = items[: min(max_to_download, len(items))]
for h in tqdm(items, desc="downloading highlights", unit="highlight"): for h in tqdm(items, desc="downloading highlights", unit="highlight"):
try: try:
self.scrape_item(result, h, "highlight") self.scrape_item(result, h, "highlight")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading highlight {h.get('id')}") result.append("errors", f"Error downloading highlight {h.get('id')}")
logger.error(f"Error downloading highlight, skipping {h.get('id')}: {e}") logger.error(f"Error downloading highlight, skipping {h.get('id')}: {e} {traceback.format_exc()}")
return h_info return h_info
def download_stories(self, result: Metadata, username: str) -> Metadata: def download_stories(self, result: Metadata, username: str) -> Metadata:
now = datetime.now().strftime("%Y-%m-%d_%H-%M") now = datetime.now().strftime("%Y-%m-%d_%H-%M")
stories = self._download_stories_reusable(result, username) stories = self._download_stories_reusable(result, username, max_to_download=self.full_profile_max_posts)
if stories == []: if stories == []:
return result.success("insta no story") return result.success("insta no story")
result.set_title(f"stories {username} at {now}").set("#stories", len(stories)) result.set_title(f"stories {username} at {now}").set("#stories", len(stories))
return result.success(f"insta stories {now}") return result.success(f"insta stories {now}")
def _download_stories_reusable(self, result: Metadata, username: str) -> list[dict]: def _download_stories_reusable(self, result: Metadata, username: str, max_to_download: int) -> list[dict]:
stories = self.call_api("v1/user/stories/by/username", {"username": username}) stories = self.call_api("v1/user/stories/by/username", {"username": username})
if not stories or not len(stories): if not stories or not len(stories):
return [] return []
stories = stories[::-1] # newest to oldest stories = stories[::-1][: min(max_to_download, len(stories))] # newest to oldest
for s in tqdm(stories, desc="downloading stories", unit="story"): for s in tqdm(stories, desc="downloading stories", unit="story"):
try: try:
self.scrape_item(result, s, "story") self.scrape_item(result, s, "story")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading story {s.get('id')}") result.append("errors", f"Error downloading story {s.get('id')}")
logger.error(f"Error downloading story, skipping {s.get('id')}: {e}") logger.error(f"Error downloading story, skipping {s.get('id')}: {e} {traceback.format_exc()}")
return stories return stories
def download_all_posts(self, result: Metadata, user_id: str): def download_all_posts(self, result: Metadata, user_id: str, max_to_download: int) -> int:
end_cursor = None end_cursor = None
pbar = tqdm(desc="downloading posts") pbar = tqdm(desc="downloading posts")
@@ -223,22 +247,23 @@ class InstagramAPIExtractor(Extractor):
if not posts or not isinstance(posts, list) or len(posts) != 2: if not posts or not isinstance(posts, list) or len(posts) != 2:
break break
posts, end_cursor = posts[0], posts[1] posts, end_cursor = posts[0], posts[1]
logger.info(f"parsing {len(posts)} posts, next {end_cursor=}") posts = posts[: min(max_to_download, len(posts))]
logger.info(f"Parsing {len(posts)} posts, next {end_cursor=} {post_count=} {max_to_download=}")
for p in posts: for p in posts:
try: try:
self.scrape_item(result, p, "post") self.scrape_item(result, p, "post")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading post {p.get('id')}") result.append("errors", f"Error downloading post {p.get('id')}")
logger.error(f"Error downloading post, skipping {p.get('id')}: {e}") logger.error(f"Error downloading post, skipping {p.get('id')}: {e} {traceback.format_exc()}")
pbar.update(1) pbar.update(1)
post_count += 1 post_count += 1
if self.full_profile_max_posts and post_count >= self.full_profile_max_posts: if post_count >= max_to_download:
logger.info(f"POSTS reached full_profile_max_posts={self.full_profile_max_posts}") logger.info(f"POSTS reached max_to_download={self.full_profile_max_posts}")
break break
result.set("#posts", post_count) result.set("#posts", post_count)
return post_count
def download_all_tagged(self, result: Metadata, user_id: str): def download_all_tagged(self, result: Metadata, user_id: str, max_to_download: int) -> int:
next_page_id = "" next_page_id = ""
pbar = tqdm(desc="downloading tagged posts") pbar = tqdm(desc="downloading tagged posts")
@@ -250,22 +275,23 @@ class InstagramAPIExtractor(Extractor):
break break
next_page_id = resp.get("next_page_id") next_page_id = resp.get("next_page_id")
logger.info(f"parsing {len(posts)} tagged posts, next {next_page_id=}") logger.info(f"Parsing {len(posts)} tagged posts, next {next_page_id=}")
posts = posts[: min(max_to_download, len(posts))]
for p in posts: for p in posts:
try: try:
self.scrape_item(result, p, "tagged") self.scrape_item(result, p, "tagged")
except Exception as e: except Exception as e:
result.append("errors", f"Error downloading tagged post {p.get('id')}") result.append("errors", f"Error downloading tagged post {p.get('id')}")
logger.error(f"Error downloading tagged post, skipping {p.get('id')}: {e}") logger.error(f"Error downloading tagged post, skipping {p.get('id')}: {e} {traceback.format_exc()}")
pbar.update(1) pbar.update(1)
tagged_count += 1 tagged_count += 1
if self.full_profile_max_posts and tagged_count >= self.full_profile_max_posts: if tagged_count >= max_to_download:
logger.info(f"TAGS reached full_profile_max_posts={self.full_profile_max_posts}") logger.info(f"TAGS reached max_to_download={self.full_profile_max_posts}")
break break
result.set("#tagged", tagged_count) result.set("#tagged", tagged_count)
return tagged_count
### reusable parsing utils below # reusable parsing utils below
def scrape_item(self, result: Metadata, item: dict, context: str = None) -> dict: def scrape_item(self, result: Metadata, item: dict, context: str = None) -> dict:
""" """


@@ -7,8 +7,9 @@ highlights, and tagged posts. Authentication is required via username/password o
import re import re
import os import os
import shutil import shutil
import traceback
import instaloader import instaloader
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -29,8 +30,9 @@ class InstagramExtractor(Extractor):
# TODO: links to stories # TODO: links to stories
def setup(self) -> None: def setup(self) -> None:
logger.warning("Instagram Extractor is not actively maintained, and may not work as expected.") logger.warning(
logger.warning("Please consider using the Instagram Tbot Extractor or Instagram API Extractor instead.") "Instagram Extractor is not actively maintained, and may not work as expected.\nPlease consider using the Instagram Tbot Extractor or Instagram API Extractor instead."
)
self.insta = instaloader.Instaloader( self.insta = instaloader.Instaloader(
download_geotags=True, download_geotags=True,
@@ -43,8 +45,7 @@ class InstagramExtractor(Extractor):
self.insta.load_session_from_file(self.username, self.session_file) self.insta.load_session_from_file(self.username, self.session_file)
except Exception: except Exception:
try: try:
logger.debug("Session file failed", exc_info=True) logger.info("No valid session file found - Attempting login with username and password.")
logger.info("No valid session file found - Attempting login with use and password.")
self.insta.login(self.username, self.password) self.insta.login(self.username, self.password)
self.insta.save_session_to_file(self.session_file) self.insta.save_session_to_file(self.session_file)
except Exception as e: except Exception as e:
@@ -79,7 +80,7 @@ class InstagramExtractor(Extractor):
return result return result
def download_post(self, url: str, post_id: str) -> Metadata: def download_post(self, url: str, post_id: str) -> Metadata:
logger.debug(f"Instagram {post_id=} detected in {url=}") logger.debug(f"Instagram {post_id=} detected")
post = instaloader.Post.from_shortcode(self.insta.context, post_id) post = instaloader.Post.from_shortcode(self.insta.context, post_id)
if self.insta.download_post(post, target=post.owner_username): if self.insta.download_post(post, target=post.owner_username):
@@ -87,7 +88,7 @@ class InstagramExtractor(Extractor):
def download_profile(self, url: str, username: str) -> Metadata: def download_profile(self, url: str, username: str) -> Metadata:
# gets posts, posts where username is tagged, igtv posts, stories, and highlights # gets posts, posts where username is tagged, igtv posts, stories, and highlights
logger.debug(f"Instagram {username=} detected in {url=}") logger.debug(f"Instagram {username=} detected")
profile = instaloader.Profile.from_username(self.insta.context, username) profile = instaloader.Profile.from_username(self.insta.context, username)
try: try:
@@ -95,27 +96,27 @@ class InstagramExtractor(Extractor):
try: try:
self.insta.download_post(post, target=f"profile_post_{post.owner_username}") self.insta.download_post(post, target=f"profile_post_{post.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download post: {post.shortcode}: {e}") logger.error(f"Failed to download post: {post.shortcode}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed profile.get_posts: {e}") logger.error(f"Failed profile.get_posts: {e}: {traceback.format_exc()}")
try: try:
for post in profile.get_tagged_posts(): for post in profile.get_tagged_posts():
try: try:
self.insta.download_post(post, target=f"tagged_post_{post.owner_username}") self.insta.download_post(post, target=f"tagged_post_{post.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download tagged post: {post.shortcode}: {e}") logger.error(f"Failed to download tagged post: {post.shortcode}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed profile.get_tagged_posts: {e}") logger.error(f"Failed profile.get_tagged_posts: {e} {traceback.format_exc()}")
try: try:
for post in profile.get_igtv_posts(): for post in profile.get_igtv_posts():
try: try:
self.insta.download_post(post, target=f"igtv_post_{post.owner_username}") self.insta.download_post(post, target=f"igtv_post_{post.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download igtv post: {post.shortcode}: {e}") logger.error(f"Failed to download igtv post: {post.shortcode}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed profile.get_igtv_posts: {e}") logger.error(f"Failed profile.get_igtv_posts: {e} {traceback.format_exc()}")
try: try:
for story in self.insta.get_stories([profile.userid]): for story in self.insta.get_stories([profile.userid]):
@@ -123,9 +124,9 @@ class InstagramExtractor(Extractor):
try: try:
self.insta.download_storyitem(item, target=f"story_item_{story.owner_username}") self.insta.download_storyitem(item, target=f"story_item_{story.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download story item: {item}: {e}") logger.error(f"Failed to download story item: {item}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed get_stories: {e}") logger.error(f"Failed get_stories: {e} {traceback.format_exc()}")
try: try:
for highlight in self.insta.get_highlights(profile.userid): for highlight in self.insta.get_highlights(profile.userid):
@@ -133,9 +134,9 @@ class InstagramExtractor(Extractor):
try: try:
self.insta.download_storyitem(item, target=f"highlight_item_{highlight.owner_username}") self.insta.download_storyitem(item, target=f"highlight_item_{highlight.owner_username}")
except Exception as e: except Exception as e:
logger.error(f"Failed to download highlight item: {item}: {e}") logger.error(f"Failed to download highlight item: {item}: {e} {traceback.format_exc()}")
except Exception as e: except Exception as e:
logger.error(f"Failed get_highlights: {e}") logger.error(f"Failed get_highlights: {e} {traceback.format_exc()}")
return self.process_downloads(url, f"@{username}", profile._asdict(), None) return self.process_downloads(url, f"@{username}", profile._asdict(), None)
@@ -158,4 +159,4 @@ class InstagramExtractor(Extractor):
return result.success("instagram") return result.success("instagram")
except Exception as e: except Exception as e:
logger.error(f"Could not fetch instagram post {url} due to: {e}") logger.error(f"Could not fetch instagram post due to: {e} {traceback.format_exc()}")


@@ -12,7 +12,7 @@ import shutil
import time import time
from sqlite3 import OperationalError from sqlite3 import OperationalError
from loguru import logger from auto_archiver.utils.custom_logger import logger
from telethon.sync import TelegramClient from telethon.sync import TelegramClient
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
@@ -32,7 +32,7 @@ class InstagramTbotExtractor(Extractor):
1. makes a copy of session_file that is removed in cleanup 1. makes a copy of session_file that is removed in cleanup
2. checks if the session file is valid 2. checks if the session file is valid
""" """
logger.info(f"SETUP {self.name} checking login...") logger.debug(f"SETUP {self.name} checking login...")
self._prepare_session_file() self._prepare_session_file()
self._initialize_telegram_client() self._initialize_telegram_client()
@@ -58,10 +58,10 @@ class InstagramTbotExtractor(Extractor):
"If you do, disable at least one of the archivers for the first-time setup of the telethon session: {e}" "If you do, disable at least one of the archivers for the first-time setup of the telethon session: {e}"
) )
with self.client.start(): with self.client.start():
logger.info(f"SETUP {self.name} login works.") logger.debug(f"SETUP {self.name} login works.")
def cleanup(self) -> None: def cleanup(self) -> None:
logger.info(f"CLEANUP {self.name}.") logger.debug(f"CLEANUP {self.name}.")
session_file_name = self.session_file + ".session" session_file_name = self.session_file + ".session"
if os.path.exists(session_file_name): if os.path.exists(session_file_name):
os.remove(session_file_name) os.remove(session_file_name)
@@ -79,17 +79,17 @@ class InstagramTbotExtractor(Extractor):
# This may be outdated and replaced by the below message, but keeping until confirmed # This may be outdated and replaced by the below message, but keeping until confirmed
if "You must enter a URL to a post" in message: if "You must enter a URL to a post" in message:
logger.debug(f"invalid link {url=} for {self.name}: {message}") logger.debug(f"Invalid link for {self.name}: {message}")
return False return False
if "Media not found or unavailable" in message: if "Media not found or unavailable" in message:
logger.debug(f"No media found for link {url=} for {self.name}: {message}") logger.debug(f"No media found for {self.name}: {message}")
return False return False
if message: if message:
result.set_content(message).set_title(message[:128]) result.set_content(message).set_title(message[:128])
elif result.is_empty(): elif result.is_empty():
logger.debug(f"No media found for link {url=} for {self.name}: {message}") logger.debug(f"No media found for {self.name}: {message}")
return False return False
return result.success("insta-via-bot") return result.success("insta-via-bot")


@@ -1,5 +1,5 @@
import json import json
from loguru import logger from auto_archiver.utils.custom_logger import logger
import os import os
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
@@ -8,9 +8,7 @@ from auto_archiver.core import Media, Metadata
class JsonEnricher(Enricher): class JsonEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("Enriching as JSON")
logger.debug(f"JSON Enricher for {url=}")
item_path = os.path.join(self.tmp_dir, "metadata.json") item_path = os.path.join(self.tmp_dir, "metadata.json")
with open(item_path, mode="w", encoding="utf-8") as outf: with open(item_path, mode="w", encoding="utf-8") as outf:


@@ -1,7 +1,7 @@
import shutil import shutil
from typing import IO from typing import IO
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Media from auto_archiver.core import Media
from auto_archiver.core import Storage from auto_archiver.core import Storage
@@ -38,8 +38,7 @@ class LocalStorage(Storage):
os.makedirs(os.path.dirname(dest), exist_ok=True) os.makedirs(os.path.dirname(dest), exist_ok=True)
logger.debug(f"[{self.__class__.__name__}] storing file {media.filename} with key {media.key} to {dest}") logger.debug(f"[{self.__class__.__name__}] storing file {media.filename} with key {media.key} to {dest}")
res = shutil.copy2(media.filename, dest) shutil.copy2(media.filename, dest)
logger.info(res)
return True return True
# must be implemented even if unused # must be implemented even if unused
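For context on the removed `logger.info(res)`: `shutil.copy2` returns the destination path, so the dropped line only logged where the file ended up. A minimal sketch of the same makedirs-then-copy pattern, using a temporary directory and made-up file names:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create a small source file to stand in for media.filename.
    src = os.path.join(tmp, "media.bin")
    with open(src, "wb") as f:
        f.write(b"example")

    # The destination lives in folders that do not exist yet.
    dest = os.path.join(tmp, "archive", "nested", "media.bin")
    os.makedirs(os.path.dirname(dest), exist_ok=True)

    returned = shutil.copy2(src, dest)  # copies content and file metadata
    copied_ok = os.path.exists(dest)
    same_path = returned == dest
```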

@@ -1,6 +1,6 @@
import datetime import datetime
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -12,20 +12,17 @@ class MetaEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url()
if to_enrich.is_empty(): if to_enrich.is_empty():
logger.debug(f"[SKIP] META_ENRICHER there is no media or metadata to enrich: {url=}") logger.debug("[SKIP] META_ENRICHER there is no media or metadata to enrich")
return return
logger.debug(f"calculating archive metadata information for {url=}") logger.debug("Calculating archive metadata information")
self.enrich_file_sizes(to_enrich) self.enrich_file_sizes(to_enrich)
self.enrich_archive_duration(to_enrich) self.enrich_archive_duration(to_enrich)
def enrich_file_sizes(self, to_enrich: Metadata): def enrich_file_sizes(self, to_enrich: Metadata):
logger.debug( logger.debug(f"Calculating archive file sizes for {len(to_enrich.media)} media files")
f"calculating archive file sizes for url={to_enrich.get_url()} ({len(to_enrich.media)} media files)"
)
total_size = 0 total_size = 0
for media in to_enrich.get_all_media(): for media in to_enrich.get_all_media():
file_stats = os.stat(media.filename) file_stats = os.stat(media.filename)
@@ -44,7 +41,7 @@ class MetaEnricher(Enricher):
size /= 1024 size /= 1024
def enrich_archive_duration(self, to_enrich): def enrich_archive_duration(self, to_enrich):
logger.debug(f"calculating archive duration for url={to_enrich.get_url()} ") logger.debug("Calculating archive duration")
archive_duration = datetime.datetime.now(datetime.timezone.utc) - to_enrich.get("_processed_at") archive_duration = datetime.datetime.now(datetime.timezone.utc) - to_enrich.get("_processed_at")
to_enrich.set("archive_duration_seconds", archive_duration.seconds) to_enrich.set("archive_duration_seconds", archive_duration.seconds)
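A side note on `archive_duration.seconds`, which this hunk leaves unchanged: `timedelta.seconds` is only the seconds component (0 to 86399) and ignores whole days, while `total_seconds()` returns the full span. For archives that finish within a day the two agree up to the fractional part. The timestamps below are made up for illustration.

```python
import datetime

utc = datetime.timezone.utc
processed_at = datetime.datetime(2025, 6, 29, 12, 0, 0, tzinfo=utc)
finished_at = datetime.datetime(2025, 6, 30, 12, 0, 5, tzinfo=utc)

duration = finished_at - processed_at  # 1 day and 5 seconds

seconds_component = duration.seconds   # days are not included here
full_span = duration.total_seconds()   # days are included here
```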

@@ -1,6 +1,6 @@
import subprocess import subprocess
import traceback import traceback
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -12,8 +12,7 @@ class MetadataEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("Extracting EXIF metadata")
logger.debug(f"extracting EXIF metadata for {url=}")
for i, m in enumerate(to_enrich.media): for i, m in enumerate(to_enrich.media):
if len(md := self.get_metadata(m.filename)): if len(md := self.get_metadata(m.filename)):
@@ -31,8 +30,8 @@ class MetadataEnricher(Enricher):
field, value = line.strip().split(":", 1) field, value = line.strip().split(":", 1)
metadata[field.strip()] = value.strip() metadata[field.strip()] = value.strip()
return metadata return metadata
except FileNotFoundError: except FileNotFoundError as e:
logger.error("[exif_enricher] ExifTool not found. Make sure ExifTool is installed and added to PATH.") logger.error(f"ExifTool not found. Make sure ExifTool is installed and added to PATH. {e}")
except Exception as e: except Exception as e:
logger.error(f"Error occurred: {e}: {traceback.format_exc()}") logger.error(f"Error occurred: {e}: {traceback.format_exc()}")
return {} return {}
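The parsing step visible in this hunk splits each output line on the first colon only, so values that themselves contain colons (such as timestamps) stay intact. A self-contained sketch of that step, assuming `Field : value` lines; `parse_exiftool_output` and the sample text are hypothetical and no ExifTool process is invoked:

```python
def parse_exiftool_output(text: str) -> dict[str, str]:
    """Turn 'Field : value' lines into a dictionary."""
    metadata = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip lines that do not look like 'Field : value'
        # Split on the first colon only, so colons inside the value survive.
        field, value = line.strip().split(":", 1)
        metadata[field.strip()] = value.strip()
    return metadata


sample = (
    "File Name                       : photo.jpg\n"
    "Create Date                     : 2025:06:30 02:36:25\n"
)
parsed = parse_exiftool_output(sample)
```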

@@ -1,6 +1,7 @@
import os import os
import traceback
from loguru import logger from auto_archiver.utils.custom_logger import logger
import opentimestamps import opentimestamps
from opentimestamps.calendar import RemoteCalendar, DEFAULT_CALENDAR_WHITELIST from opentimestamps.calendar import RemoteCalendar, DEFAULT_CALENDAR_WHITELIST
from opentimestamps.core.timestamp import Timestamp, DetachedTimestampFile from opentimestamps.core.timestamp import Timestamp, DetachedTimestampFile
@@ -14,13 +15,12 @@ from auto_archiver.utils.misc import get_current_timestamp
class OpentimestampsEnricher(Enricher): class OpentimestampsEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("OpenTimestamps timestamping files")
logger.debug(f"OpenTimestamps timestamping files for {url=}")
# Get the media files to timestamp # Get the media files to timestamp
media_files = [m for m in to_enrich.media if m.filename and not m.get("opentimestamps")] media_files = [m for m in to_enrich.media if m.filename and not m.get("opentimestamps")]
if not media_files: if not media_files:
logger.debug(f"No files found to timestamp in {url=}") logger.debug("No files found to timestamp")
return return
timestamp_files = [] timestamp_files = []
@@ -94,7 +94,7 @@ class OpentimestampsEnricher(Enricher):
detached_timestamp.serialize(ctx) detached_timestamp.serialize(ctx)
f.write(ctx.getbytes()) f.write(ctx.getbytes())
except Exception as e: except Exception as e:
logger.warning(f"Failed to serialize timestamp file: {e}") logger.warning(f"Failed to serialize timestamp file: {e} {traceback.format_exc()}")
continue continue
# Create media for the timestamp file # Create media for the timestamp file
@@ -113,16 +113,16 @@ class OpentimestampsEnricher(Enricher):
media.set("opentimestamps", True) media.set("opentimestamps", True)
except Exception as e: except Exception as e:
logger.warning(f"Error while timestamping {media.filename}: {e}") logger.warning(f"Error while timestamping {media.filename}: {e} {traceback.format_exc()}")
# Add timestamp files to the metadata # Add timestamp files to the metadata
if timestamp_files: if timestamp_files:
to_enrich.set("opentimestamped", True) to_enrich.set("opentimestamped", True)
to_enrich.set("opentimestamps_count", len(timestamp_files)) to_enrich.set("opentimestamps_count", len(timestamp_files))
logger.info(f"{len(timestamp_files)} OpenTimestamps proofs created for {url=}") logger.info(f"{len(timestamp_files)} OpenTimestamps proofs created")
else: else:
to_enrich.set("opentimestamped", False) to_enrich.set("opentimestamped", False)
logger.warning(f"No successful timestamps created for {url=}") logger.warning("No successful timestamps created")
def verify_timestamp(self, detached_timestamp): def verify_timestamp(self, detached_timestamp):
""" """

@@ -15,7 +15,7 @@ import traceback
import pdqhash import pdqhash
import numpy as np import numpy as np
from PIL import Image, UnidentifiedImageError from PIL import Image, UnidentifiedImageError
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata from auto_archiver.core import Metadata
@@ -28,8 +28,7 @@ class PdqHashEnricher(Enricher):
""" """
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug("Calculating perceptual hashes")
logger.debug(f"calculating perceptual hashes for {url=}")
media_with_hashes = [] media_with_hashes = []
for m in to_enrich.media: for m in to_enrich.media:
@@ -44,7 +43,7 @@ class PdqHashEnricher(Enricher):
media.set("pdq_hash", hd) media.set("pdq_hash", hd)
media_with_hashes.append(media.filename) media_with_hashes.append(media.filename)
logger.debug(f"calculated '{len(media_with_hashes)}' perceptual hashes for {url=}: {media_with_hashes}") logger.debug(f"Calculated '{len(media_with_hashes)}' perceptual hashes: {media_with_hashes}")
def calculate_pdq_hash(self, filename): def calculate_pdq_hash(self, filename):
# returns a hexadecimal string with the perceptual hash for the given filename # returns a hexadecimal string with the perceptual hash for the given filename

@@ -2,7 +2,7 @@ from typing import IO
import boto3 import boto3
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Media from auto_archiver.core import Media
from auto_archiver.core import Storage from auto_archiver.core import Storage
@@ -56,7 +56,7 @@ class S3Storage(Storage):
if existing_key := self.file_in_folder(path): if existing_key := self.file_in_folder(path):
media._key = existing_key media._key = existing_key
media.set("previously archived", True) media.set("previously archived", True)
logger.debug(f"skipping upload of {media.filename} because it already exists in {media.key}") logger.debug(f"Skipping upload of {media.filename} because it already exists in {media.key}")
return False return False
_, ext = os.path.splitext(media.key) _, ext = os.path.splitext(media.key)

@@ -2,7 +2,7 @@ import ssl
import os import os
from slugify import slugify from slugify import slugify
from urllib.parse import urlparse from urllib.parse import urlparse
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -19,10 +19,10 @@ class SSLEnricher(Enricher):
url = to_enrich.get_url() url = to_enrich.get_url()
parsed = urlparse(url) parsed = urlparse(url)
assert parsed.scheme in ["https"], f"Invalid URL scheme {url=}" assert parsed.scheme in ["https"], "Invalid URL scheme"
domain = parsed.netloc domain = parsed.netloc
logger.debug(f"fetching SSL certificate for {domain=} in {url=}") logger.debug(f"Fetching SSL certificate for {domain=}")
cert = ssl.get_server_certificate((domain, 443)) cert = ssl.get_server_certificate((domain, 443))
cert_fn = os.path.join(self.tmp_dir, f"{slugify(domain)}.pem") cert_fn = os.path.join(self.tmp_dir, f"{slugify(domain)}.pem")
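The scheme check in this hunk uses `assert`, which Python skips when run with `-O`. A possible alternative, shown only as a sketch, is to raise an explicit exception; `extract_https_domain` is a hypothetical helper and the certificate fetch itself is left out so the example needs no network access.

```python
from urllib.parse import urlparse


def extract_https_domain(url: str) -> str:
    """Return the network location of an https URL, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("Invalid URL scheme")
    return parsed.netloc


domain = extract_https_domain("https://example.com/some/page")

try:
    extract_https_domain("http://example.com/")
    rejected_http = False
except ValueError:
    rejected_http = True
```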

@@ -2,7 +2,7 @@ import requests
import re import re
import html import html
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -38,7 +38,7 @@ class TelegramExtractor(Extractor):
video = s.find("video") video = s.find("video")
if video is None: if video is None:
logger.warning("could not find video") logger.warning("Could not find video")
image_tags = s.find_all(class_="tgme_widget_message_photo_wrap") image_tags = s.find_all(class_="tgme_widget_message_photo_wrap")
image_urls = [] image_urls = []

@@ -5,6 +5,7 @@ import time
from pathlib import Path from pathlib import Path
from datetime import date from datetime import date
from telethon import functions
from telethon.sync import TelegramClient from telethon.sync import TelegramClient
from telethon.errors import ChannelInvalidError from telethon.errors import ChannelInvalidError
from telethon.tl.functions.messages import ImportChatInviteRequest from telethon.tl.functions.messages import ImportChatInviteRequest
@@ -16,7 +17,7 @@ from telethon.errors.rpcerrorlist import (
) )
from tqdm import tqdm from tqdm import tqdm
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -24,7 +25,7 @@ from auto_archiver.utils import random_str
class TelethonExtractor(Extractor): class TelethonExtractor(Extractor):
valid_url = re.compile(r"https:\/\/t\.me(\/c){0,1}\/(.+)\/(\d+)") valid_url = re.compile(r"https:\/\/t\.me(\/c){0,1}\/(.+?)(\/s){0,1}\/(\d+)")
invite_pattern = re.compile(r"t.me(\/joinchat){0,1}\/\+?(.+)") invite_pattern = re.compile(r"t.me(\/joinchat){0,1}\/\+?(.+)")
def setup(self) -> None: def setup(self) -> None:
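The updated `valid_url` pattern adds an optional `/s` group for story URLs, which shifts the post id from group 3 to group 4. The sketch below reuses the new pattern and the group handling shown later in this diff; `parse_telegram_url` is a hypothetical helper, and the channel names and ids are made up.

```python
import re

# New pattern from the diff: optional "/c" (private), lazy chat name, optional "/s" (story).
valid_url = re.compile(r"https:\/\/t\.me(\/c){0,1}\/(.+?)(\/s){0,1}\/(\d+)")


def parse_telegram_url(url: str):
    """Return (is_private, chat, is_story, post_id), or None if the URL does not match."""
    match = valid_url.search(url)
    if not match:
        return None
    is_private = match.group(1) == "/c"
    chat = int(match.group(2)) if is_private else match.group(2)
    is_story = match.group(3) == "/s"
    post_id = int(match.group(4))
    return is_private, chat, is_story, post_id


public_post = parse_telegram_url("https://t.me/somechannel/123")
story = parse_telegram_url("https://t.me/somechannel/s/45")
private_post = parse_telegram_url("https://t.me/c/1234567890/42")
not_telegram = parse_telegram_url("https://example.com/somechannel/123")
```

The chat group is lazy (`.+?`) so that a trailing `/s` is captured by its own group instead of being absorbed into the chat name.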
@@ -64,7 +65,7 @@ class TelethonExtractor(Extractor):
# get currently joined channels # get currently joined channels
# https://docs.telethon.dev/en/stable/modules/custom.html#module-telethon.tl.custom.dialog # https://docs.telethon.dev/en/stable/modules/custom.html#module-telethon.tl.custom.dialog
joined_channel_ids = [c.id for c in self.client.get_dialogs() if c.is_channel] joined_channel_ids = [c.id for c in self.client.get_dialogs() if c.is_channel]
logger.info(f"already part of {len(joined_channel_ids)} channels") logger.info(f"Already part of {len(joined_channel_ids)} channels")
i = 0 i = 0
pbar = tqdm(desc=f"joining {len(self.channel_invites)} invite links", total=len(self.channel_invites)) pbar = tqdm(desc=f"joining {len(self.channel_invites)} invite links", total=len(self.channel_invites))
@@ -79,22 +80,22 @@ class TelethonExtractor(Extractor):
else: else:
ent = self.client.get_entity(invite) # fails if not a member ent = self.client.get_entity(invite) # fails if not a member
logger.warning( logger.warning(
f"please add the property id='{ent.id}' to the 'channel_invites' configuration where {invite=}, not doing so can lead to a minutes-long setup time due to telegram's rate limiting." f"Please add the property id='{ent.id}' to the 'channel_invites' configuration where {invite=}, not doing so can lead to a minutes-long setup time due to telegram's rate limiting."
) )
except ValueError: except ValueError:
logger.info(f"joining new channel {invite=}") logger.info(f"Joining new channel {invite=}")
try: try:
self.client(ImportChatInviteRequest(match.group(2))) self.client(ImportChatInviteRequest(match.group(2)))
except UserAlreadyParticipantError: except UserAlreadyParticipantError:
logger.info(f"already joined {invite=}") logger.info(f"Already joined {invite=}")
except InviteRequestSentError: except InviteRequestSentError:
logger.warning(f"already sent a join request with {invite} still no answer") logger.warning(f"Already sent a join request with {invite} still no answer")
except InviteHashExpiredError: except InviteHashExpiredError:
logger.warning(f"{invite=} has expired please find a more recent one") logger.warning(f"{invite=} has expired please find a more recent one")
except Exception as e: except Exception as e:
logger.error(f"could not join channel with {invite=} due to {e}") logger.error(f"Could not join channel with {invite=} due to {e}")
except FloodWaitError as e: except FloodWaitError as e:
logger.warning(f"got a flood error, need to wait {e.seconds} seconds") logger.warning(f"Got a flood error, need to wait {e.seconds} seconds")
time.sleep(e.seconds) time.sleep(e.seconds)
continue continue
else: else:
@@ -116,68 +117,91 @@ class TelethonExtractor(Extractor):
url = item.get_url() url = item.get_url()
# detect URLs that we definitely cannot handle # detect URLs that we definitely cannot handle
match = self.valid_url.search(url) match = self.valid_url.search(url)
logger.debug(f"TELETHON: {match=}") logger.debug(f"Found telethon url {match=}")
if not match: if not match:
return False return False
is_private = match.group(1) == "/c" is_private = match.group(1) == "/c"
chat = int(match.group(2)) if is_private else match.group(2) chat = int(match.group(2)) if is_private else match.group(2)
post_id = int(match.group(3)) is_story = match.group(3) == "/s"
post_id = int(match.group(4))
result = Metadata() result = Metadata()
# NB: not using bot_token since then private channels cannot be archived: self.client.start(bot_token=self.bot_token) # NB: not using bot_token since then private channels cannot be archived: self.client.start(bot_token=self.bot_token)
with self.client.start(): with self.client.start():
# with self.client.start(bot_token=self.bot_token): # with self.client.start(bot_token=self.bot_token):
try: if is_story:
post = self.client.get_messages(chat, ids=post_id) try:
except ValueError as e: stories = self.client(functions.stories.GetStoriesByIDRequest(peer=chat, id=[post_id]))
logger.error(f"Could not fetch telegram {url} possibly it's private: {e}") if not stories.stories:
return False logger.info("No stories found, possibly it's private or the story has expired.")
except ChannelInvalidError as e: return False
logger.error( story = stories.stories[0]
f"Could not fetch telegram {url}. This error may be fixed if you setup a bot_token in addition to api_id and api_hash (but then private channels will not be archived, we need to update this logic to handle both): {e}" logger.debug(f"Got story {story.id=} {story.date=} {story.expire_date=}")
) result.set_timestamp(story.date).set("views", story.views.to_dict()).set(
return False "expire_date", story.expire_date
)
logger.debug(f"TELETHON GOT POST {post=}") # download the story media
if post is None: filename_dest = os.path.join(self.tmp_dir, f"{chat}_{post_id}", str(story.id))
return False if filename := self.client.download_media(story.media, filename_dest):
result.add_media(Media(filename))
except Exception as e:
logger.error(f"Error fetching story {post_id} from {chat}: {e}")
return False
else:
try:
post = self.client.get_messages(chat, ids=post_id)
except ValueError as e:
logger.error(f"Could not fetch telegram URL possibly it's private: {e}")
return False
except ChannelInvalidError as e:
logger.error(
f"Could not fetch telegram URL. This error may be fixed if you setup a bot_token in addition to api_id and api_hash (but then private channels will not be archived, we need to update this logic to handle both): {e}"
)
return False
media_posts = self._get_media_posts_in_group(chat, post) logger.debug(f"Got post {post=}")
logger.debug(f"got {len(media_posts)=} for {url=}") if post is None:
return False
tmp_dir = self.tmp_dir media_posts = self._get_media_posts_in_group(chat, post)
logger.debug(f"Got {len(media_posts)=}")
group_id = post.grouped_id if post.grouped_id is not None else post.id group_id = post.grouped_id if post.grouped_id is not None else post.id
title = post.message title = post.message
for mp in media_posts: for mp in media_posts:
if len(mp.message) > len(title): if len(mp.message) > len(title):
title = mp.message # save the longest text found (usually only 1) title = mp.message # save the longest text found (usually only 1)
# media can also be in entities # media can also be in entities
if mp.entities: if mp.entities:
other_media_urls = [ other_media_urls = [
e.url e.url
for e in mp.entities for e in mp.entities
if hasattr(e, "url") and e.url and self._guess_file_type(e.url) in ["video", "image", "audio"] if hasattr(e, "url")
] and e.url
if len(other_media_urls): and self._guess_file_type(e.url) in ["video", "image", "audio"]
logger.debug(f"Got {len(other_media_urls)} other media urls from {mp.id=}: {other_media_urls}") ]
for i, om_url in enumerate(other_media_urls): if len(other_media_urls):
filename = self.download_from_url(om_url, f"{chat}_{group_id}_{i}") logger.debug(
result.add_media(Media(filename=filename), id=f"{group_id}_{i}") f"Got {len(other_media_urls)} other media urls from {mp.id=}: {other_media_urls}"
)
for i, om_url in enumerate(other_media_urls):
filename = self.download_from_url(om_url, f"{chat}_{group_id}_{i}")
result.add_media(Media(filename=filename), id=f"{group_id}_{i}")
filename_dest = os.path.join(tmp_dir, f"{chat}_{group_id}", str(mp.id)) filename_dest = os.path.join(self.tmp_dir, f"{chat}_{group_id}", str(mp.id))
filename = self.client.download_media(mp.media, filename_dest) filename = self.client.download_media(mp.media, filename_dest)
if not filename: if not filename:
logger.debug(f"Empty media found, skipping {str(mp)=}") logger.debug(f"Empty media found, skipping {str(mp)=}")
continue continue
result.add_media(Media(filename)) result.add_media(Media(filename))
result.set_title(title).set_timestamp(post.date).set("api_data", post.to_dict()) result.set_title(title).set_timestamp(post.date).set("api_data", post.to_dict())
if post.message != title: if post.message != title:
result.set_content(post.message) result.set_content(post.message)
return result.success("telethon") return result.success("telethon")
def _get_media_posts_in_group(self, chat, original_post, max_amp=10): def _get_media_posts_in_group(self, chat, original_post, max_amp=10):

@@ -9,7 +9,7 @@ and identify important moments without watching the entire video.
import ffmpeg import ffmpeg
import os import os
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Media, Metadata from auto_archiver.core import Media, Metadata
@@ -27,12 +27,12 @@ class ThumbnailEnricher(Enricher):
Calculates how many thumbnails to generate and at which timestamps based on the video duration, the number of thumbnails per minute and the max number of thumbnails. Calculates how many thumbnails to generate and at which timestamps based on the video duration, the number of thumbnails per minute and the max number of thumbnails.
Thumbnails are equally distributed across the video duration. Thumbnails are equally distributed across the video duration.
""" """
logger.debug(f"generating thumbnails for {to_enrich.get_url()}") logger.debug("Generating thumbnails")
for m_id, m in enumerate(to_enrich.media[::]): for m_id, m in enumerate(to_enrich.media[::]):
if m.is_video(): if m.is_video():
folder = os.path.join(self.tmp_dir, random_str(24)) folder = os.path.join(self.tmp_dir, random_str(24))
os.makedirs(folder, exist_ok=True) os.makedirs(folder, exist_ok=True)
logger.debug(f"generating thumbnails for {m.filename}") logger.debug(f"Generating thumbnails for {m.filename}")
duration = m.get("duration") duration = m.get("duration")
try: try:
@@ -42,10 +42,10 @@ class ThumbnailEnricher(Enricher):
) )
to_enrich.media[m_id].set("duration", duration) to_enrich.media[m_id].set("duration", duration)
except Exception as e: except Exception as e:
logger.warning(f"failed to get duration with FFMPEG from {m.filename}: {e}") logger.warning(f"Failed to get duration with FFMPEG from {m.filename}: {e}")
if not duration or type(duration) not in [float, int] or duration <= 0: if not duration or type(duration) not in [float, int] or duration <= 0:
logger.warning(f"cannot generate thumbnails for {m.filename} without valid duration") logger.warning(f"Cannot generate thumbnails for {m.filename} without valid duration")
continue continue
num_thumbs = int(min(max(1, (duration / 60) * self.thumbnails_per_minute), self.max_thumbnails)) num_thumbs = int(min(max(1, (duration / 60) * self.thumbnails_per_minute), self.max_thumbnails))
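The `num_thumbs` expression scales the thumbnail count with the video length, never goes below one, and caps the result at the configured maximum. A small sketch of the same arithmetic as a standalone function; `count_thumbnails` is a hypothetical name and the configuration values are made up:

```python
def count_thumbnails(duration: float, thumbnails_per_minute: float, max_thumbnails: int) -> int:
    """Same formula as in the hunk: at least 1, at most max_thumbnails."""
    return int(min(max(1, (duration / 60) * thumbnails_per_minute), max_thumbnails))


short_clip = count_thumbnails(duration=10, thumbnails_per_minute=2, max_thumbnails=16)
five_minutes = count_thumbnails(duration=300, thumbnails_per_minute=2, max_thumbnails=16)
long_video = count_thumbnails(duration=3600, thumbnails_per_minute=2, max_thumbnails=16)
```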

@@ -5,7 +5,7 @@ import hashlib
from slugify import slugify from slugify import slugify
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from rfc3161_client import (decode_timestamp_response,TimestampRequestBuilder,TimeStampResponse, VerifierBuilder) from rfc3161_client import (decode_timestamp_response,TimestampRequestBuilder,TimeStampResponse, VerifierBuilder)
from rfc3161_client import VerificationError as Rfc3161VerificationError from rfc3161_client import VerificationError as Rfc3161VerificationError
@@ -49,8 +49,7 @@ class TimestampingEnricher(Enricher):
self.session.close() self.session.close()
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() logger.debug(f"RFC3161 timestamping existing files")
logger.debug(f"RFC3161 timestamping existing files for {url=}")
# create a new text file with the existing media hashes # create a new text file with the existing media hashes
hashes = [ hashes = [
@@ -58,7 +57,7 @@ class TimestampingEnricher(Enricher):
] ]
if not len(hashes): if not len(hashes):
logger.debug(f"No hashes found in {url=}") logger.debug(f"No hashes found")
return return
@@ -74,7 +73,7 @@ class TimestampingEnricher(Enricher):
try: try:
message = bytes(data_to_sign, encoding='utf8') message = bytes(data_to_sign, encoding='utf8')
logger.debug(f"Timestamping {url=} with {tsa_url=}") logger.debug(f"Timestamping with {tsa_url=}")
signed: TimeStampResponse = self.sign_data(tsa_url, message) signed: TimeStampResponse = self.sign_data(tsa_url, message)
# fail if there's any issue with the certificates, uses certifi list of trusted CAs or the user-defined `cert_authorities` # fail if there's any issue with the certificates, uses certifi list of trusted CAs or the user-defined `cert_authorities`
@@ -92,7 +91,7 @@ class TimestampingEnricher(Enricher):
timestamp_token_path = self.save_timestamp_token(signed.time_stamp_token(), tsa_url) timestamp_token_path = self.save_timestamp_token(signed.time_stamp_token(), tsa_url)
timestamp_tokens.append(Media(filename=timestamp_token_path).set("tsa", tsa_url).set("cert_chain", cert_chain)) timestamp_tokens.append(Media(filename=timestamp_token_path).set("tsa", tsa_url).set("cert_chain", cert_chain))
except Exception as e: except Exception as e:
logger.warning(f"Error while timestamping {url=} with {tsa_url=}: {e}") logger.warning(f"Error while timestamping with {tsa_url=}: {e}")
if len(timestamp_tokens): if len(timestamp_tokens):
hashes_media.set("timestamp_authority_files", timestamp_tokens) hashes_media.set("timestamp_authority_files", timestamp_tokens)
@@ -101,9 +100,9 @@ class TimestampingEnricher(Enricher):
hashes_media.set("cryptography v", version("cryptography")) hashes_media.set("cryptography v", version("cryptography"))
to_enrich.add_media(hashes_media, id="timestamped_hashes") to_enrich.add_media(hashes_media, id="timestamped_hashes")
to_enrich.set("timestamped", True) to_enrich.set("timestamped", True)
logger.info(f"{len(timestamp_tokens)} timestamp tokens created for {url=}") logger.info(f"{len(timestamp_tokens)} timestamp tokens created")
else: else:
logger.warning(f"No successful timestamps for {url=}") logger.warning(f"No successful timestamps found")
def save_timestamp_token(self, timestamp_token: bytes, tsa_url: str) -> str: def save_timestamp_token(self, timestamp_token: bytes, tsa_url: str) -> str:
""" """

@@ -4,7 +4,7 @@ import re
import mimetypes import mimetypes
import requests import requests
from loguru import logger from auto_archiver.utils.custom_logger import logger
from pytwitter import Api from pytwitter import Api
from slugify import slugify from slugify import slugify
@@ -45,10 +45,9 @@ class TwitterApiExtractor(Extractor):
if "https://t.co/" in url: if "https://t.co/" in url:
try: try:
r = requests.get(url, timeout=30) r = requests.get(url, timeout=30)
logger.debug(f"Expanded url {url} to {r.url}")
url = r.url url = r.url
except Exception: except Exception as e:
logger.error(f"Failed to expand url {url}") logger.error(f"Failed to expand Twitter URL: {e}")
return url return url
def download(self, item: Metadata) -> Metadata: def download(self, item: Metadata) -> Metadata:
@@ -67,7 +66,7 @@ class TwitterApiExtractor(Extractor):
return False, False return False, False
username, tweet_id = matches[0] # only one URL supported username, tweet_id = matches[0] # only one URL supported
logger.debug(f"Found {username=} and {tweet_id=} in {url=}") logger.debug(f"Found {username=} and {tweet_id=}")
return username, tweet_id return username, tweet_id
@@ -85,7 +84,7 @@ class TwitterApiExtractor(Extractor):
media_fields=["type", "duration_ms", "url", "variants"], media_fields=["type", "duration_ms", "url", "variants"],
tweet_fields=["attachments", "author_id", "created_at", "entities", "id", "text", "possibly_sensitive"], tweet_fields=["attachments", "author_id", "created_at", "entities", "id", "text", "possibly_sensitive"],
) )
logger.debug(tweet) logger.debug(f"Got {tweet=}")
except Exception as e: except Exception as e:
logger.error(f"Could not get tweet: {e}") logger.error(f"Could not get tweet: {e}")
return False return False

@@ -4,7 +4,7 @@ import os
import shutil import shutil
import subprocess import subprocess
from zipfile import ZipFile from zipfile import ZipFile
from loguru import logger from auto_archiver.utils.custom_logger import logger
from warcio.archiveiterator import ArchiveIterator from warcio.archiveiterator import ArchiveIterator
from auto_archiver.core import Media, Metadata from auto_archiver.core import Media, Metadata
@@ -94,7 +94,7 @@ class WaczExtractorEnricher(Enricher, Extractor):
# call docker if explicitly enabled or we are running on the host (not in docker) # call docker if explicitly enabled or we are running on the host (not in docker)
if self.use_docker: if self.use_docker:
logger.debug(f"generating WACZ in Docker for {url=}") logger.debug("Generating WACZ in Docker")
logger.debug(f"{browsertrix_home_host=} {browsertrix_home_container=}") logger.debug(f"{browsertrix_home_host=} {browsertrix_home_container=}")
if self.docker_commands: if self.docker_commands:
cmd = self.docker_commands + cmd cmd = self.docker_commands + cmd
@@ -111,12 +111,12 @@ class WaczExtractorEnricher(Enricher, Extractor):
if self.profile: if self.profile:
profile_file = f"profile-{self.crawl_id}.tar.gz" profile_file = f"profile-{self.crawl_id}.tar.gz"
profile_fn = os.path.join(browsertrix_home_container, profile_file) profile_fn = os.path.join(browsertrix_home_container, profile_file)
logger.debug(f"copying {self.profile} to {profile_fn}") logger.debug(f"Copying {self.profile} to {profile_fn}")
shutil.copyfile(self.profile, profile_fn) shutil.copyfile(self.profile, profile_fn)
cmd.extend(["--profile", os.path.join("/crawls", profile_file)]) cmd.extend(["--profile", os.path.join("/crawls", profile_file)])
else: else:
logger.debug(f"generating WACZ without Docker for {url=}") logger.debug("Generating WACZ without Docker")
if self.profile: if self.profile:
cmd.extend(["--profile", os.path.join("/app", str(self.profile))]) cmd.extend(["--profile", os.path.join("/app", str(self.profile))])

@@ -1,5 +1,5 @@
import json import json
from loguru import logger from auto_archiver.utils.custom_logger import logger
import time import time
import requests import requests
@@ -31,15 +31,15 @@ class WaybackExtractorEnricher(Enricher, Extractor):
url = to_enrich.get_url() url = to_enrich.get_url()
if UrlUtil.is_auth_wall(url): if UrlUtil.is_auth_wall(url):
logger.debug(f"[SKIP] WAYBACK since url is behind AUTH WALL: {url=}") logger.debug("[SKIP] WAYBACK since url is behind AUTH WALL")
return return
logger.debug(f"calling wayback for {url=}")
if to_enrich.get("wayback"): if to_enrich.get("wayback"):
logger.info(f"Wayback enricher had already been executed: {to_enrich.get('wayback')}") logger.info(f"Wayback enricher had already been executed: {to_enrich.get('wayback')}")
return True return True
logger.debug("Calling Wayback")
ia_headers = {"Accept": "application/json", "Authorization": f"LOW {self.key}:{self.secret}"} ia_headers = {"Accept": "application/json", "Authorization": f"LOW {self.key}:{self.secret}"}
post_data = {"url": url} post_data = {"url": url}
if self.if_not_archived_within: if self.if_not_archived_within:
@@ -68,7 +68,7 @@ class WaybackExtractorEnricher(Enricher, Extractor):
attempt = 1 attempt = 1
while not wayback_url and time.time() - start_time <= self.timeout: while not wayback_url and time.time() - start_time <= self.timeout:
try: try:
logger.debug(f"GETting status for {job_id=} on {url=} ({attempt=})") logger.debug(f"GETting status for {job_id=} ({attempt=})")
r_status = requests.get( r_status = requests.get(
f"https://web.archive.org/save/status/{job_id}", headers=ia_headers, proxies=proxies f"https://web.archive.org/save/status/{job_id}", headers=ia_headers, proxies=proxies
) )
@@ -79,13 +79,13 @@ class WaybackExtractorEnricher(Enricher, Extractor):
logger.error(f"Wayback failed with {r_json}") logger.error(f"Wayback failed with {r_json}")
return False return False
except requests.exceptions.RequestException as e: except requests.exceptions.RequestException as e:
logger.warning(f"RequestException: fetching status for {url=} due to: {e}") logger.warning(f"RequestException: fetching status due to: {e}")
break break
except json.decoder.JSONDecodeError: except json.decoder.JSONDecodeError:
logger.error(f"Expected a JSON from Wayback and got {r.text} for {url=}") logger.error(f"Expected a JSON from Wayback and got {r.text}")
break break
except Exception as e: except Exception as e:
logger.warning(f"error fetching status for {url=} due to: {e}") logger.warning(f"error fetching status due to: {e}")
if not wayback_url: if not wayback_url:
attempt += 1 attempt += 1
time.sleep(1) # TODO: can be improved with exponential backoff time.sleep(1) # TODO: can be improved with exponential backoff
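The TODO on the `time.sleep(1)` line mentions exponential backoff. A minimal sketch of what that could look like follows; it is not the project's implementation. The names `poll_with_backoff`, `check`, `base_delay` and `max_delay` are hypothetical, and the `sleep`/`clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def poll_with_backoff(check, timeout=15.0, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call check(attempt) until it returns a truthy value or timeout elapses.

    Hypothetical helper: the delay doubles after every failed attempt and is
    capped at max_delay. Returns the truthy result, or None on timeout.
    """
    start = clock()
    attempt = 1
    delay = base_delay
    while clock() - start <= timeout:
        result = check(attempt)
        if result:
            return result
        sleep(delay)
        delay = min(delay * 2, max_delay)  # 1s, 2s, 4s, ... up to max_delay
        attempt += 1
    return None
```

In the status loop above, `check` would wrap the `requests.get(...)` call and return the archived URL once the job has finished; the timeout check plays the role of the existing `self.timeout` comparison.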

View File

@@ -1,7 +1,7 @@
import traceback import traceback
import requests import requests
import time import time
from loguru import logger from auto_archiver.utils.custom_logger import logger
from auto_archiver.core import Enricher from auto_archiver.core import Enricher
from auto_archiver.core import Metadata, Media from auto_archiver.core import Metadata, Media
@@ -25,7 +25,7 @@ class WhisperEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() url = to_enrich.get_url()
logger.debug(f"WHISPER[{self.action}]: iterating media items for {url=}.") logger.debug(f"WHISPER[{self.action}]: iterating media items")
job_results = {} job_results = {}
for i, m in enumerate(to_enrich.media): for i, m in enumerate(to_enrich.media):
@@ -35,7 +35,7 @@ class WhisperEnricher(Enricher):
try: try:
job_id = self.submit_job(m) job_id = self.submit_job(m)
job_results[job_id] = False job_results[job_id] = False
logger.debug(f"JOB SUBMITTED: {job_id=} for {m.key=}") logger.debug(f"Job submitted: {job_id=} for {m.key=}")
to_enrich.media[i].set("whisper_model", {"job_id": job_id}) to_enrich.media[i].set("whisper_model", {"job_id": job_id})
except Exception as e: except Exception as e:
logger.error( logger.error(
@@ -72,14 +72,14 @@ class WhisperEnricher(Enricher):
"type": self.action, "type": self.action,
# "language": "string" # may be a config # "language": "string" # may be a config
} }
logger.debug(f"calling API with {payload=}") logger.debug(f"Calling API with {payload=}")
response = requests.post( response = requests.post(
f"{self.api_endpoint}/jobs", json=payload, headers={"Authorization": f"Bearer {self.api_key}"} f"{self.api_endpoint}/jobs", json=payload, headers={"Authorization": f"Bearer {self.api_key}"}
) )
assert response.status_code == 201, ( assert response.status_code == 201, (
f"calling the whisper api {self.api_endpoint} returned a non-success code: {response.status_code}" f"calling the whisper api {self.api_endpoint} returned a non-success code: {response.status_code}"
) )
logger.debug(response.json()) logger.debug(f"Response from whisper API: {response.json()}")
return response.json()["id"] return response.json()["id"]
def check_jobs(self, job_results: dict): def check_jobs(self, job_results: dict):
@@ -115,7 +115,7 @@ class WhisperEnricher(Enricher):
assert r_res.status_code == 200, ( assert r_res.status_code == 200, (
f"Job artifacts did not respond with 200, instead with: {r_res.status_code}" f"Job artifacts did not respond with 200, instead with: {r_res.status_code}"
) )
logger.success(r_res.json()) logger.info(f"Job {job_id} completed successfully:{r_res.json()}")
result = {} result = {}
for art_id, artifact in enumerate(r_res.json()): for art_id, artifact in enumerate(r_res.json()):
subtitle = [] subtitle = []

View File

@@ -0,0 +1,59 @@
from loguru import logger
import json


def extract_location(record, short=False):
    """Extracts the file name, function name, and line number from the log record."""
    if short:
        return f"{record['file'].name}:{record['line']}"
    return f"{record['file'].name}:{record['function']}:{record['line']}"


def extract_log_data(record):
    subset = {
        "level": record["level"].name,
        "time": record["time"].isoformat(timespec="seconds"),
    }
    subset["loc"] = extract_location(record)

    # This is where logger.contextualize() parameters can be added to the output
    for extra_key in ["trace", "url", "worksheet", "row"]:
        if extra_val := record.get("extra", {}).get(extra_key):
            subset[extra_key] = extra_val

    subset["message"] = record["message"]
    if exception := record.get("exception"):
        subset["exception"] = exception
    return subset


def serialize_for_console(record):
    subset = extract_log_data(record)
    subset.pop("message", None)
    subset.pop("level", None)
    subset.pop("loc", None)
    subset.pop("time", None)
    if not subset:
        return ""
    return json.dumps(subset, ensure_ascii=False)


def serialize(record):
    return json.dumps(extract_log_data(record), ensure_ascii=False)


def patching(record):
    record["extra"]["serialized"] = serialize(record)
    record["extra"]["serialize_for_console"] = serialize_for_console(record)


def format_for_human_readable_console():
    return (
        "<green>{time:YYYY-MM-DD HH:mm:ss.SSS}</green> | "
        "<level>{level: <8}</level> | "
        "<cyan>{file}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> | "
        "{extra[serialize_for_console]} <level>{message}</level>"
    )


logger = logger.patch(patching)
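To make the two serializers easier to follow, here is a small sketch that runs them on a hand-built record. The three functions are copied from the file above; `fake_record` is an assumption that only mimics the record fields those functions read, and it is not produced by loguru. The example uses the standard library only.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def extract_location(record, short=False):
    if short:
        return f"{record['file'].name}:{record['line']}"
    return f"{record['file'].name}:{record['function']}:{record['line']}"


def extract_log_data(record):
    subset = {
        "level": record["level"].name,
        "time": record["time"].isoformat(timespec="seconds"),
    }
    subset["loc"] = extract_location(record)
    # Only these context keys are copied; any other "extra" entry is ignored.
    for extra_key in ["trace", "url", "worksheet", "row"]:
        if extra_val := record.get("extra", {}).get(extra_key):
            subset[extra_key] = extra_val
    subset["message"] = record["message"]
    if exception := record.get("exception"):
        subset["exception"] = exception
    return subset


def serialize_for_console(record):
    subset = extract_log_data(record)
    for key in ("message", "level", "loc", "time"):
        subset.pop(key, None)
    return json.dumps(subset, ensure_ascii=False) if subset else ""


# Hypothetical record: stands in for what the logging library would pass in.
fake_record = {
    "level": SimpleNamespace(name="INFO"),
    "time": datetime(2025, 6, 30, 0, 53, 10, tzinfo=timezone.utc),
    "file": SimpleNamespace(name="wayback.py"),
    "function": "enrich",
    "line": 42,
    "message": "Calling Wayback",
    "extra": {"url": "https://example.com", "row": 7, "unrelated": "dropped"},
    "exception": None,
}

file_line = json.dumps(extract_log_data(fake_record), ensure_ascii=False)
console_extras = serialize_for_console(fake_record)

print(file_line)
# {"level": "INFO", "time": "2025-06-30T00:53:10+00:00", "loc": "wayback.py:enrich:42", "url": "https://example.com", "row": 7, "message": "Calling Wayback"}
print(console_extras)
# {"url": "https://example.com", "row": 7}
```

This likely explains why several messages in this diff drop `{url=}`: once the URL is carried as context (the source comment points at `logger.contextualize()`), it shows up in the `url` field of every line instead of being repeated in the message text.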

View File

@@ -6,8 +6,7 @@ import uuid
from datetime import datetime, timezone from datetime import datetime, timezone
from dateutil.parser import parse as parse_dt from dateutil.parser import parse as parse_dt
import requests from auto_archiver.utils.custom_logger import logger
from loguru import logger
def mkdir_if_not_exists(folder): def mkdir_if_not_exists(folder):
@@ -15,18 +14,6 @@ def mkdir_if_not_exists(folder):
os.makedirs(folder) os.makedirs(folder)
def expand_url(url):
# expand short URL links
if "https://t.co/" in url:
try:
r = requests.get(url)
logger.debug(f"Expanded url {url} to {r.url}")
return r.url
except Exception:
logger.error(f"Failed to expand url {url}")
return url
def getattr_or(o: object, prop: str, default=None): def getattr_or(o: object, prop: str, default=None):
try: try:
res = getattr(o, prop) res = getattr(o, prop)

View File

@@ -9,7 +9,7 @@ from tempfile import TemporaryDirectory
from typing import Dict, Tuple from typing import Dict, Tuple
import hashlib import hashlib
from loguru import logger from auto_archiver.utils.custom_logger import logger
import pytest import pytest
from auto_archiver.core.metadata import Metadata, Media from auto_archiver.core.metadata import Metadata, Media
from auto_archiver.core.module import ModuleFactory from auto_archiver.core.module import ModuleFactory

View File

@@ -1,6 +1,6 @@
from auto_archiver.core import Extractor from auto_archiver.core import Extractor
from loguru import logger from auto_archiver.utils.custom_logger import logger
class ExampleExtractor(Extractor): class ExampleExtractor(Extractor):

View File

@@ -1,6 +1,6 @@
from auto_archiver.core import Extractor, Enricher, Feeder, Database, Storage, Formatter, Metadata from auto_archiver.core import Extractor, Enricher, Feeder, Database, Storage, Formatter, Metadata
from loguru import logger from auto_archiver.utils.custom_logger import logger
class ExampleModule(Extractor, Enricher, Feeder, Database, Storage, Formatter): class ExampleModule(Extractor, Enricher, Feeder, Database, Storage, Formatter):

View File

@@ -29,7 +29,7 @@ def test_fetch_fail_status(api_db, metadata, mocker):
mock_get = mocker.patch("auto_archiver.modules.api_db.api_db.requests.get") mock_get = mocker.patch("auto_archiver.modules.api_db.api_db.requests.get")
mock_get.return_value.status_code = 400 mock_get.return_value.status_code = 400
mock_get.return_value.json.return_value = {} mock_get.return_value.json.return_value = {}
mock_error = mocker.patch("loguru.logger.error") mock_error = mocker.patch("auto_archiver.utils.custom_logger.logger.error")
assert api_db.fetch(metadata) is False assert api_db.fetch(metadata) is False
mock_error.assert_called_once_with("AA API FAIL (400): {}") mock_error.assert_called_once_with("AA API FAIL (400): {}")

View File

@@ -10,7 +10,10 @@ def mock_gworksheet(mocker):
mock_gworksheet = mocker.MagicMock(spec=GWorksheet) mock_gworksheet = mocker.MagicMock(spec=GWorksheet)
mock_gworksheet.col_exists.return_value = True mock_gworksheet.col_exists.return_value = True
mock_gworksheet.get_cell.return_value = "" mock_gworksheet.get_cell.return_value = ""
mock_gworksheet.get_row.return_value = {} mock_gworksheet.wks = mocker.MagicMock()
mock_gworksheet.wks.spreadsheet = mocker.MagicMock()
mock_gworksheet.wks.spreadsheet.title = "Test Spreadsheet"
mock_gworksheet.title = "Test Worksheet"
return mock_gworksheet return mock_gworksheet

View File

@@ -33,7 +33,6 @@ def test_enrich_skips_empty_metadata(meta_enricher, mock_metadata):
"""Test that enrich() does nothing when Metadata is empty.""" """Test that enrich() does nothing when Metadata is empty."""
mock_metadata.is_empty.return_value = True mock_metadata.is_empty.return_value = True
meta_enricher.enrich(mock_metadata) meta_enricher.enrich(mock_metadata)
mock_metadata.get_url.assert_called_once()
def test_enrich_file_sizes(meta_enricher, metadata, tmp_path): def test_enrich_file_sizes(meta_enricher, metadata, tmp_path):

View File

@@ -65,7 +65,7 @@ def test_enrich_empty_media(enricher, mocker):
def test_get_metadata_error_handling(enricher, mocker): def test_get_metadata_error_handling(enricher, mocker):
mocker.patch("subprocess.run", side_effect=Exception("Test error")) mocker.patch("subprocess.run", side_effect=Exception("Test error"))
mock_log = mocker.patch("loguru.logger.error") mock_log = mocker.patch("auto_archiver.utils.custom_logger.logger.error")
result = enricher.get_metadata("test.jpg") result = enricher.get_metadata("test.jpg")
assert result == {} assert result == {}
assert "Error occurred: " in mock_log.call_args[0][0] assert "Error occurred: " in mock_log.call_args[0][0]

View File

@@ -43,7 +43,7 @@ def test_enrich_skip_non_image(metadata_with_images, mocker):
def test_enrich_handles_corrupted_image(metadata_with_images, mocker): def test_enrich_handles_corrupted_image(metadata_with_images, mocker):
mocker.patch("PIL.Image.open", side_effect=UnidentifiedImageError("Corrupted image")) mocker.patch("PIL.Image.open", side_effect=UnidentifiedImageError("Corrupted image"))
mock_pdq = mocker.patch("pdqhash.compute") mock_pdq = mocker.patch("pdqhash.compute")
mock_logger = mocker.patch("loguru.logger.error") mock_logger = mocker.patch("auto_archiver.utils.custom_logger.logger.error")
enricher = PdqHashEnricher() enricher = PdqHashEnricher()
enricher.enrich(metadata_with_images) enricher.enrich(metadata_with_images)

View File

@@ -75,12 +75,12 @@ def test_enrich_thumbnail_limits(
def test_enrich_handles_probe_failure(thumbnail_enricher, metadata_with_video, mocker): def test_enrich_handles_probe_failure(thumbnail_enricher, metadata_with_video, mocker):
mocker.patch("ffmpeg.probe", side_effect=Exception("Probe error")) mocker.patch("ffmpeg.probe", side_effect=Exception("Probe error"))
mocker.patch("os.makedirs") mocker.patch("os.makedirs")
mock_logger = mocker.patch("loguru.logger.warning") mock_logger = mocker.patch("auto_archiver.utils.custom_logger.logger.warning")
mocker.patch.object(Media, "is_video", return_value=True) mocker.patch.object(Media, "is_video", return_value=True)
thumbnail_enricher.enrich(metadata_with_video) thumbnail_enricher.enrich(metadata_with_video)
# Ensure error was logged # Ensure error was logged
mock_logger.assert_called_with("cannot generate thumbnails for video.mp4 without valid duration") mock_logger.assert_called_with("Cannot generate thumbnails for video.mp4 without valid duration")
# Ensure no thumbnails were created # Ensure no thumbnails were created
thumbnails = metadata_with_video.media[0].get("thumbnails") thumbnails = metadata_with_video.media[0].get("thumbnails")
assert thumbnails is None assert thumbnails is None
@@ -128,12 +128,12 @@ def test_enrich_handles_short_video(
def test_uses_existing_duration_on_exception(thumbnail_enricher, metadata_with_video, mock_ffmpeg_environment, mocker): def test_uses_existing_duration_on_exception(thumbnail_enricher, metadata_with_video, mock_ffmpeg_environment, mocker):
mock_logger = mocker.patch("loguru.logger.warning") mock_logger = mocker.patch("auto_archiver.utils.custom_logger.logger.warning")
mock_probe = mocker.patch("ffmpeg.probe", side_effect=Exception("New probe error")) mock_probe = mocker.patch("ffmpeg.probe", side_effect=Exception("New probe error"))
metadata_with_video.media[0].set("duration", 3) metadata_with_video.media[0].set("duration", 3)
thumbnail_enricher.enrich(metadata_with_video) thumbnail_enricher.enrich(metadata_with_video)
mock_probe.assert_called_once() mock_probe.assert_called_once()
mock_logger.assert_called_with("failed to get duration with FFMPEG from video.mp4: New probe error") mock_logger.assert_called_with("Failed to get duration with FFMPEG from video.mp4: New probe error")
assert mock_ffmpeg_environment["mock_output"].run.call_count == 3 assert mock_ffmpeg_environment["mock_output"].run.call_count == 3

View File

@@ -46,7 +46,7 @@ def test_setup_with_docker(wacz_enricher, mocker):
def test_already_ran(wacz_enricher, metadata, mocker): def test_already_ran(wacz_enricher, metadata, mocker):
metadata.add_media(Media("test.wacz"), id="browsertrix") metadata.add_media(Media("test.wacz"), id="browsertrix")
mock_log = mocker.patch("loguru.logger.info") mock_log = mocker.patch("auto_archiver.utils.custom_logger.logger.info")
assert wacz_enricher.enrich(metadata) is True assert wacz_enricher.enrich(metadata) is True
assert "WACZ enricher had already been executed" in mock_log.call_args[0][0] assert "WACZ enricher had already been executed" in mock_log.call_args[0][0]
@@ -73,7 +73,7 @@ def test_download_success(wacz_enricher, mocker) -> None:
def test_enrich_already_executed(wacz_enricher, mocker) -> None: def test_enrich_already_executed(wacz_enricher, mocker) -> None:
"""Test enrich if already executed.""" """Test enrich if already executed."""
mock_log = mocker.patch("loguru.logger.info") mock_log = mocker.patch("auto_archiver.utils.custom_logger.logger.info")
metadata = Metadata().set_url("https://example.com") metadata = Metadata().set_url("https://example.com")
media = Media(filename="some_file.wacz") media = Media(filename="some_file.wacz")
metadata.add_media(media, id="browsertrix") metadata.add_media(media, id="browsertrix")

View File

@@ -1,4 +1,5 @@
from datetime import datetime from datetime import datetime
import math
import pytest import pytest
@@ -147,14 +148,14 @@ class TestInstagramAPIExtractor(TestExtractorBase):
self.extractor.full_profile = True self.extractor.full_profile = True
mock_call.side_effect = [mock_user_response, mock_story_response] mock_call.side_effect = [mock_user_response, mock_story_response]
mock_highlights.return_value = None mock_highlights.return_value = 1
mock_stories.return_value = mock_story_response mock_stories.return_value = mock_story_response
mock_posts.return_value = None mock_posts.return_value = 2
mock_tagged.return_value = None mock_tagged.return_value = 3
result = self.extractor.download_profile(metadata, "test_user") result = self.extractor.download_profile(metadata, "test_user")
assert result.get("#stories") == len(mock_story_response) assert result.get("#stories") == len(mock_story_response)
mock_posts.assert_called_once_with(result, "123") mock_posts.assert_called_once_with(result, "123", max_to_download=math.inf)
assert "errors" not in result.metadata assert "errors" not in result.metadata
def test_download_profile_not_found(self, metadata, mocker): def test_download_profile_not_found(self, metadata, mocker):
@@ -175,10 +176,10 @@ class TestInstagramAPIExtractor(TestExtractorBase):
self.extractor.full_profile = True self.extractor.full_profile = True
mock_call.side_effect = [mock_user_response, Exception("Stories API failed"), Exception("Posts API failed")] mock_call.side_effect = [mock_user_response, Exception("Stories API failed"), Exception("Posts API failed")]
mock_highlights.return_value = None mock_highlights.return_value = 1
mock_tagged.return_value = None mock_tagged.return_value = 2
stories_tagged.return_value = None stories_tagged.return_value = None
mock_posts.return_value = None mock_posts.return_value = 4
result = self.extractor.download_profile(metadata, "test_user") result = self.extractor.download_profile(metadata, "test_user")
assert result.is_success() assert result.is_success()

View File

@@ -3,6 +3,8 @@ from datetime import date
import pytest import pytest
from auto_archiver.modules.telethon_extractor.telethon_extractor import TelethonExtractor
@pytest.fixture(autouse=True) @pytest.fixture(autouse=True)
def mock_client_setup(mocker): def mock_client_setup(mocker):
@@ -24,3 +26,37 @@ def test_setup_fails_clear_session_file(get_lazy_module, tmp_path, mocker):
assert session_file.exists() assert session_file.exists()
assert f"telethon-{date.today().strftime('%Y-%m-%d')}" in lazy_module._instance.session_file assert f"telethon-{date.today().strftime('%Y-%m-%d')}" in lazy_module._instance.session_file
assert os.path.exists(lazy_module._instance.session_file + ".session") assert os.path.exists(lazy_module._instance.session_file + ".session")
@pytest.mark.parametrize(
    "url,expected",
    [
        ("https://t.me/channel/123", True),
        ("https://t.me/c/123/456", True),
        ("https://t.me/channel/s/789", True),
        ("https://t.me/c/123/s/456", True),
        ("https://t.me/with_single/1234567?single", True),
        ("https://t.me/invalid", False),
        ("https://example.com/nottelegram/123", False),
    ],
)
def test_valid_url_regex(url, expected, get_lazy_module):
    match = TelethonExtractor.valid_url.search(url)
    assert bool(match) == expected


@pytest.mark.parametrize(
    "invite,expected",
    [
        ("t.me/joinchat/AAAAAE", True),
        ("t.me/+AAAAAE", True),
        ("t.me/AAAAAE", True),
        ("https://t.me/joinchat/AAAAAE", True),
        ("https://t.me/+AAAAAE", True),
        ("https://t.me/AAAAAE", True),
        ("https://example.com/AAAAAE", False),
    ],
)
def test_invite_pattern_regex(invite, expected, get_lazy_module):
    match = TelethonExtractor.invite_pattern.search(invite)
    assert bool(match) == expected

View File

@@ -13,7 +13,7 @@ class TestTwitterApiExtractor(TestExtractorBase):
config = { config = {
"bearer_tokens": [], "bearer_tokens": [],
"bearer_token": os.environ.get("TWITTER_BEARER_TOKEN", "TEST_KEY"), "bearer_token": os.environ.get("TWITTER_BEARER_TOKEN") or "TEST_KEY",
"consumer_key": os.environ.get("TWITTER_CONSUMER_KEY"), "consumer_key": os.environ.get("TWITTER_CONSUMER_KEY"),
"consumer_secret": os.environ.get("TWITTER_CONSUMER_SECRET"), "consumer_secret": os.environ.get("TWITTER_CONSUMER_SECRET"),
"access_token": os.environ.get("TWITTER_ACCESS_TOKEN"), "access_token": os.environ.get("TWITTER_ACCESS_TOKEN"),

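The `bearer_token` change above swaps a `.get()` default for an `or` fallback. The two differ only when the variable exists but is empty, which is presumably the case this commit targets ("make python api tests work on gh when no env is set"). A short illustration, using a hypothetical variable name `DEMO_TOKEN`:

```python
import os

# Simulate a variable that is defined but empty (hypothetical name).
os.environ["DEMO_TOKEN"] = ""

# The default only applies when the key is missing, so the empty string wins.
with_default = os.environ.get("DEMO_TOKEN", "TEST_KEY")

# `or` also replaces falsy values such as the empty string.
with_or = os.environ.get("DEMO_TOKEN") or "TEST_KEY"

print(repr(with_default))  # ''
print(repr(with_or))       # 'TEST_KEY'
```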
View File

@@ -25,7 +25,7 @@ def orchestration_file(orchestration_file_path):
def autoarchiver(tmp_path, monkeypatch, request): def autoarchiver(tmp_path, monkeypatch, request):
def _autoarchiver(args=[]): def _autoarchiver(args=[]):
def cleanup(): def cleanup():
from loguru import logger from auto_archiver.utils.custom_logger import logger
if not logger._core.handlers.get(0): if not logger._core.handlers.get(0):
logger._core.handlers_count = 0 logger._core.handlers_count = 0

View File

@@ -118,8 +118,7 @@ def test_check_required_values(orchestrator, caplog, test_args):
with pytest.raises(SystemExit): with pytest.raises(SystemExit):
orchestrator.setup_config(test_args) orchestrator.setup_config(test_args)
assert "the following arguments are required: --example_module.required_field" in caplog.records[0].message
assert caplog.records[1].message == "the following arguments are required: --example_module.required_field"
def test_get_required_values_from_config(orchestrator, test_args, tmp_path): def test_get_required_values_from_config(orchestrator, test_args, tmp_path):

View File

@@ -6,7 +6,6 @@ import pytest
from auto_archiver.utils.misc import ( from auto_archiver.utils.misc import (
mkdir_if_not_exists, mkdir_if_not_exists,
expand_url,
getattr_or, getattr_or,
DateTimeEncoder, DateTimeEncoder,
dump_payload, dump_payload,
@@ -39,26 +38,6 @@ class TestDirectoryUtils:
assert existing_dir.exists() assert existing_dir.exists()
class TestURLExpansion:
@pytest.mark.parametrize(
"input_url,expected",
[("https://example.com", "https://example.com"), ("https://t.co/test", "https://expanded.url")],
)
def test_expand_url(self, input_url, expected, mocker):
mock_response = mocker.Mock()
mock_response.url = "https://expanded.url"
mocker.patch("requests.get", return_value=mock_response)
result = expand_url(input_url)
assert result == expected
def test_expand_url_handles_errors(self, caplog, mocker):
mocker.patch("requests.get", side_effect=Exception("Connection error"))
url = "https://t.co/error"
result = expand_url(url)
assert result == url
assert f"Failed to expand url {url}" in caplog.text
class TestAttributeHandling: class TestAttributeHandling:
class Sample: class Sample:
exists = "value" exists = "value"