Compare commits
48 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | a2c6cdc111 | | |
| | 8bb7883eeb | | |
| | a0971fc601 | | |
| | 0cba2c25c6 | | |
| | 7c0b05b276 | | |
| | 3bbfdf6eba | | |
| | a7a6bda1c2 | | |
| | d80145002d | | |
| | b4f86d0e8d | | |
| | 6cf3e109ed | | |
| | d4f983e575 | | |
| | 88b07d777b | | |
| | 222e6ddb28 | | |
| | 3e340b2580 | | |
| | 9fc09c724b | | |
| | f6e5a14d75 | | |
| | 0e9c765b96 | | |
| | 87f553661b | | |
| | cc66ee3fd4 | | |
| | b3b727b005 | | |
| | ee37b20e6c | | |
| | a184bf7b97 | | |
| | e535f44a88 | | |
| | 0f28bf0e35 | | |
| | 18a8636552 | | |
| | 81be65c828 | | |
| | 0a91863212 | | |
| | 3ad8349e3f | | |
| | 2768225cd1 | | |
| | 3e44b9b577 | | |
| | 1a5797d0f8 | | |
| | 768b8fce9f | | |
| | 613b1f1e50 | | |
| | 919c37bfb6 | | |
| | a655b3c987 | | |
| | d645b840ee | | |
| | 3da9c9cf8f | | |
| | 987bbcaad0 | | |
| | 68e9d2a2ce | | |
| | 76be271c18 | | |
| | 074f132ad9 | | |
| | c47da0a46f | | |
| | eb82936a04 | | |
| | cc03ad7c49 | | |
| | 6d2aa3dd7a | | |
| | f2e580de4e | | |
| | 3f48d75d8f | | |
| | 80ea912d0e | | |
2
.github/workflows/docker-publish.yaml
vendored
@@ -9,7 +9,7 @@ on:
   release:
     types: [published]
   push:
-    branches: [ "dockerize" ]
+    # branches: [ "main" ]
     tags: [ "v*.*.*" ]
 
 env:
|
|||||||
2
.github/workflows/python-publish.yaml
vendored
@@ -12,7 +12,7 @@ on:
   release:
     types: [published]
   push:
-    branches: [ "dockerize" ]
+    # branches: [ "main" ]
     tags: [ "v*.*.*" ]
 
 permissions:
|
|||||||
4
Pipfile
@@ -19,6 +19,8 @@ google-api-python-client = "*"
|
|||||||
google-auth-httplib2 = "*"
|
google-auth-httplib2 = "*"
|
||||||
google-auth-oauthlib = "*"
|
google-auth-oauthlib = "*"
|
||||||
oauth2client = "*"
|
oauth2client = "*"
|
||||||
|
pdqhash = "*"
|
||||||
|
pillow = "*"
|
||||||
python-slugify = "*"
|
python-slugify = "*"
|
||||||
pyyaml = "*"
|
pyyaml = "*"
|
||||||
dateparser = "*"
|
dateparser = "*"
|
||||||
@@ -33,7 +35,7 @@ vk-url-scraper = "*"
|
|||||||
uwsgi = "*"
|
uwsgi = "*"
|
||||||
requests = {extras = ["socks"], version = "*"}
|
requests = {extras = ["socks"], version = "*"}
|
||||||
# wacz = "==0.4.8"
|
# wacz = "==0.4.8"
|
||||||
pywb = ">=2.7.3"
|
numpy = "*"
|
||||||
|
|
||||||
[requires]
|
[requires]
|
||||||
python_version = "3.10"
|
python_version = "3.10"
|
||||||
|
|||||||
1066
Pipfile.lock
generated
77
README.md
@@ -1,7 +1,7 @@
|
|||||||
<h1 align="center">Auto Archiver</h1>
|
<h1 align="center">Auto Archiver</h1>
|
||||||
|
|
||||||
[](https://badge.fury.io/py/auto-archiver)
|
[](https://badge.fury.io/py/auto-archiver)
|
||||||
[](https://pypi.org/project/auto-archiver/)
|
[](https://hub.docker.com/r/bellingcat/auto-archiver)
|
||||||
<!--  -->
|
<!--  -->
|
||||||
<!-- [](https://pypi.python.org/pypi/auto-archiver/) -->
|
<!-- [](https://pypi.python.org/pypi/auto-archiver/) -->
|
||||||
<!-- [](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest) -->
|
<!-- [](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest) -->
|
||||||
@@ -20,12 +20,10 @@ There are 3 ways to use the auto-archiver:
|
|||||||
But **you always need a configuration/orchestration file**, which is where you'll configure where/what/how to archive. Make sure you read [orchestration](#orchestration).
|
But **you always need a configuration/orchestration file**, which is where you'll configure where/what/how to archive. Make sure you read [orchestration](#orchestration).
|
||||||
|
|
||||||
|
|
||||||
## How to run the auto-archiver
|
## How to install and run the auto-archiver
|
||||||
|
|
||||||
### Option 1 - docker
|
### Option 1 - docker
|
||||||
|
|
||||||
<details><summary><code>Docker instructions</code></summary>
|
|
||||||
|
|
||||||
[](https://hub.docker.com/r/bellingcat/auto-archiver)
|
[](https://hub.docker.com/r/bellingcat/auto-archiver)
|
||||||
|
|
||||||
Docker works like a virtual machine running inside your computer, it isolates everything and makes installation simple. Since it is an isolated environment when you need to pass it your orchestration file or get downloaded media out of docker you will need to connect folders on your machine with folders inside docker with the `-v` volume flag.
|
Docker works like a virtual machine running inside your computer, it isolates everything and makes installation simple. Since it is an isolated environment when you need to pass it your orchestration file or get downloaded media out of docker you will need to connect folders on your machine with folders inside docker with the `-v` volume flag.
|
||||||
@@ -45,8 +43,6 @@ Docker works like a virtual machine running inside your computer, it isolates ev
|
|||||||
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
|
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
|
||||||
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
|
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
|
||||||
|
|
||||||
</details>
|
|
||||||
|
|
||||||
### Option 2 - python package
|
### Option 2 - python package
|
||||||
|
|
||||||
<details><summary><code>Python package instructions</code></summary>
|
<details><summary><code>Python package instructions</code></summary>
|
||||||
@@ -54,8 +50,9 @@ Docker works like a virtual machine running inside your computer, it isolates ev
|
|||||||
1. make sure you have python 3.8 or higher installed
|
1. make sure you have python 3.8 or higher installed
|
||||||
2. install the package `pip/pipenv/conda install auto-archiver`
|
2. install the package `pip/pipenv/conda install auto-archiver`
|
||||||
3. test it's installed with `auto-archiver --help`
|
3. test it's installed with `auto-archiver --help`
|
||||||
4. run it with your orchestration file and pass any flags you want in the command line `auto-archiver --config secrets/orchestration.yaml`
|
4. run it with your orchestration file and pass any flags you want in the command line `auto-archiver --config secrets/orchestration.yaml` if your orchestration file is inside a `secrets/`, which we advise
|
||||||
1. if your orchestration file is inside a `secrets/` which we advise
|
|
||||||
|
You will also need [ffmpeg](https://www.ffmpeg.org/), [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases), and optionally [fonts-noto](https://fonts.google.com/noto). Similar to the local installation.
|
||||||
|
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
@@ -69,7 +66,7 @@ This can also be used for development.
|
|||||||
Install the following locally:
|
Install the following locally:
|
||||||
1. [ffmpeg](https://www.ffmpeg.org/) must also be installed locally for this tool to work.
|
1. [ffmpeg](https://www.ffmpeg.org/) must also be installed locally for this tool to work.
|
||||||
2. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) on a path folder like `/usr/local/bin`.
|
2. [firefox](https://www.mozilla.org/en-US/firefox/new/) and [geckodriver](https://github.com/mozilla/geckodriver/releases) on a path folder like `/usr/local/bin`.
|
||||||
3. [fonts-noto](https://fonts.google.com/noto) to deal with multiple unicode characters during selenium/geckodriver's screenshots: `sudo apt install fonts-noto -y`.
|
3. (optional) [fonts-noto](https://fonts.google.com/noto) to deal with multiple unicode characters during selenium/geckodriver's screenshots: `sudo apt install fonts-noto -y`.
|
||||||
|
|
||||||
Clone and run:
|
Clone and run:
|
||||||
1. `git clone https://github.com/bellingcat/auto-archiver`
|
1. `git clone https://github.com/bellingcat/auto-archiver`
|
||||||
@@ -87,7 +84,7 @@ The archiver work is orchestrated by the following workflow (we call each a **st
|
|||||||
4. **Formatter** creates a report from all the archived content (HTML, PDF, ...)
|
4. **Formatter** creates a report from all the archived content (HTML, PDF, ...)
|
||||||
5. **Database** knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)
|
5. **Database** knows what's been archived and also stores the archive result (spreadsheet, CSV, or just the console)
|
||||||
|
|
||||||
To setup an auto-archiver instance, instance, create an `orchestration.yaml` which contains the workflow you would like. We advise you put this file into a `secrets/` folder and do not share it with others because it will contain passwords and other secrets.
|
To setup an auto-archiver instance create an `orchestration.yaml` which contains the workflow you would like. We advise you put this file into a `secrets/` folder and do not share it with others because it will contain passwords and other secrets.
|
||||||
|
|
||||||
The structure of orchestration file is split into 2 parts: `steps` (what **steps** to use) and `configurations` (how those steps should behave), here's a simplification:
|
The structure of orchestration file is split into 2 parts: `steps` (what **steps** to use) and `configurations` (how those steps should behave), here's a simplification:
|
||||||
```yaml
|
```yaml
|
||||||
@@ -147,19 +144,30 @@ Use this to make sure you help making sure you did all the required steps:
|
|||||||
* [ ] (optional for browsertrix) `profile.tar.gz` file
|
* [ ] (optional for browsertrix) `profile.tar.gz` file
|
||||||
|
|
||||||
#### Example invocations
|
#### Example invocations
|
||||||
These assume you've installed with pipenv, see docker section above for how to run through docker
|
The recommended way to run the auto-archiver is through Docker. The invocations below will run the auto-archiver Docker image using a configuration file that you have specified
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# all the configurations come from ./secrets/orchestration.yaml
|
||||||
|
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml
|
||||||
|
# uses the same configurations but for another google docs sheet
|
||||||
|
# with a header on row 2 and with some different column names
|
||||||
|
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
|
||||||
|
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
|
||||||
|
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
|
||||||
|
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1
|
||||||
|
```
|
||||||
|
|
||||||
|
The auto-archiver can also be run locally, if pre-requisites are correctly configured. Equivalent invocations are below.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# all the configurations come from ./orchestration.yaml
|
|
||||||
auto-archiver
|
|
||||||
# all the configurations come from ./secrets/orchestration.yaml
|
# all the configurations come from ./secrets/orchestration.yaml
|
||||||
auto-archiver --config secrets/orchestration.yaml
|
auto-archiver --config secrets/orchestration.yaml
|
||||||
# uses the same configurations but for another google docs sheet
|
# uses the same configurations but for another google docs sheet
|
||||||
# with a header on row 2 and with some different column names
|
# with a header on row 2 and with some different column names
|
||||||
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
|
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
|
||||||
auto-archiver --config orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
|
auto-archiver --config secrets/orchestration.yaml --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
|
||||||
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
|
# all the configurations come from orchestration.yaml and specifies that s3 files should be private
|
||||||
auto-archiver --s3_storage.private=1
|
auto-archiver --config secrets/orchestration.yaml --s3_storage.private=1
|
||||||
```
|
```
|
||||||
|
|
||||||
### Extra notes on configuration
|
### Extra notes on configuration
|
||||||
@@ -173,18 +181,45 @@ The first time you run, you will be prompted to do a authentication with the pho
|
|||||||
## Running on Google Sheets Feeder (gsheet_feeder)
|
## Running on Google Sheets Feeder (gsheet_feeder)
|
||||||
The `--gseets_feeder.sheet` property is the name of the Google Sheet to check for URLs.
|
The `--gseets_feeder.sheet` property is the name of the Google Sheet to check for URLs.
|
||||||
This sheet must have been shared with the Google Service account used by `gspread`.
|
This sheet must have been shared with the Google Service account used by `gspread`.
|
||||||
This sheet must also have specific columns (case-insensitive) in the `header` row - see [Gsheet.configs](src/auto_archiver/utils/gsheet.py) for all their names.
|
This sheet must also have specific columns (case-insensitive) in the `header` as specified in [Gsheet.configs](src/auto_archiver/utils/gsheet.py). The default names of these columns and their purpose is:
|
||||||
|
|
||||||
For example, for use with this spreadsheet:
|
Inputs:
|
||||||
|
|
||||||

|
* **Link** *(required)*: the URL of the post to archive
|
||||||
|
* **Destination folder**: custom folder for archived file (regardless of storage)
|
||||||
|
|
||||||
|
Outputs:
|
||||||
|
* **Archive status** *(required)*: Status of archive operation
|
||||||
|
* **Archive location**: URL of archived post
|
||||||
|
* **Archive date**: Date archived
|
||||||
|
* **Thumbnail**: Embeds a thumbnail for the post in the spreadsheet
|
||||||
|
* **Timestamp**: Timestamp of original post
|
||||||
|
* **Title**: Post title
|
||||||
|
* **Text**: Post text
|
||||||
|
* **Screenshot**: Link to screenshot of post
|
||||||
|
* **Hash**: Hash of archived HTML file (which contains hashes of post media)
|
||||||
|
* **WACZ**: Link to a WACZ web archive of post
|
||||||
|
* **ReplayWebpage**: Link to a ReplayWebpage viewer of the WACZ archive
|
||||||
|
|
||||||
|
For example, this is a spreadsheet configured with all of the columns for the auto archiver and a few URLs to archive. (Note that the column names are not case sensitive.)
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Now the auto archiver can be invoked, with this command in this example: `docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --config secrets/orchestration-global.yaml --gsheet_feeder.sheet "Auto archive test 2023-2"`. Note that the sheet name has been overridden/specified in the command line invocation.
|
||||||
|
|
||||||
When the auto archiver starts running, it updates the "Archive status" column.
|
When the auto archiver starts running, it updates the "Archive status" column.
|
||||||

|
|
||||||
|

|
||||||
|
|
||||||
The links are downloaded and archived, and the spreadsheet is updated to the following:
|
The links are downloaded and archived, and the spreadsheet is updated to the following:
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
Note that the first row is skipped, as it is assumed to be a header row (`--gsheet_feeder.header=1` and you can change it if you use more rows above). Rows with an empty URL column, or a non-empty archive column are also skipped. All sheets in the document will be checked.
|
Note that the first row is skipped, as it is assumed to be a header row (`--gsheet_feeder.header=1` and you can change it if you use more rows above). Rows with an empty URL column, or a non-empty archive column are also skipped. All sheets in the document will be checked.
|
||||||
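The skipping rules described in the README text above can be sketched in Python. This is only an illustration under stated assumptions: `should_archive`, `rows_to_archive`, the dictionary-shaped rows, and the `url`/`status` keys are hypothetical names, not the project's actual feeder code.

```python
from typing import Dict, List


def should_archive(row: Dict[str, str]) -> bool:
    """Hypothetical helper: True when a row still needs to be archived."""
    url = (row.get("url") or "").strip()
    status = (row.get("status") or "").strip()
    # Skip rows with an empty URL and rows whose archive status is already set.
    return bool(url) and not status


def rows_to_archive(rows: List[Dict[str, str]], header: int = 1) -> List[Dict[str, str]]:
    """Drop the first `header` rows (1-indexed header row), then filter the rest."""
    return [row for row in rows[header:] if should_archive(row)]


sheet_rows = [
    {"url": "Link", "status": "Archive status"},           # header row
    {"url": "https://example.com/post/1", "status": ""},    # pending
    {"url": "", "status": ""},                              # no URL: skipped
    {"url": "https://example.com/post/2", "status": "ok"},  # already archived: skipped
]
pending = rows_to_archive(sheet_rows, header=1)
```

With `header=2`, the first two rows would be treated as non-data rows, which mirrors the role the text gives to `--gsheet_feeder.header`.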
|
|
||||||
|
The "archive location" link contains the path of the archived file, in local storage, S3, or in Google Drive.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
---
|
---
|
||||||
## Development
|
## Development
|
||||||
@@ -193,7 +228,7 @@ Use `python -m src.auto_archiver --config secrets/orchestration.yaml` to run fro
|
|||||||
#### Docker development
|
#### Docker development
|
||||||
working with docker locally:
|
working with docker locally:
|
||||||
* `docker build . -t auto-archiver` to build a local image
|
* `docker build . -t auto-archiver` to build a local image
|
||||||
* `docker run --rm -v $PWD/secrets:/app/secrets aa pipenv run python3 -m auto_archiver --config secrets/orchestration.yaml`
|
* `docker run --rm -v $PWD/secrets:/app/secrets auto-archiver pipenv run python3 -m auto_archiver --config secrets/orchestration.yaml`
|
||||||
* to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
|
* to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
|
||||||
|
|
||||||
|
|
||||||
@@ -205,4 +240,4 @@ release to docker hub
|
|||||||
* update version in [version.py](src/auto_archiver/version.py)
|
* update version in [version.py](src/auto_archiver/version.py)
|
||||||
* run `bash ./scripts/release.sh` and confirm
|
* run `bash ./scripts/release.sh` and confirm
|
||||||
* package is automatically updated in pypi
|
* package is automatically updated in pypi
|
||||||
* docker image is automatically pushed to dockerhup
|
* docker image is automatically pushed to dockerhup
|
||||||
|
|||||||
|
Before Width: | Height: | Size: 183 KiB |
|
Before Width: | Height: | Size: 486 KiB After Width: | Height: | Size: 1.5 MiB |
BIN
docs/demo-archive.png
Normal file
|
After Width: | Height: | Size: 819 KiB |
|
Before Width: | Height: | Size: 223 KiB After Width: | Height: | Size: 664 KiB |
|
Before Width: | Height: | Size: 241 KiB After Width: | Height: | Size: 698 KiB |
@@ -11,14 +11,14 @@ steps:
|
|||||||
# - instagram_archiver
|
# - instagram_archiver
|
||||||
# - tiktok_archiver
|
# - tiktok_archiver
|
||||||
- youtubedl_archiver
|
- youtubedl_archiver
|
||||||
- wayback_archiver_enricher
|
# - wayback_archiver_enricher
|
||||||
enrichers:
|
enrichers:
|
||||||
- hash_enricher
|
- hash_enricher
|
||||||
# - screenshot_enricher
|
# - screenshot_enricher
|
||||||
# - thumbnail_enricher
|
# - thumbnail_enricher
|
||||||
# - wayback_archiver_enricher
|
# - wayback_archiver_enricher
|
||||||
# - wacz_enricher
|
# - wacz_enricher
|
||||||
|
# - pdq_hash_enricher
|
||||||
formatter: html_formatter # defaults to mute_formatter
|
formatter: html_formatter # defaults to mute_formatter
|
||||||
storages:
|
storages:
|
||||||
- local_storage
|
- local_storage
|
||||||
@@ -50,6 +50,7 @@ configurations:
|
|||||||
text: textual content
|
text: textual content
|
||||||
screenshot: screenshot
|
screenshot: screenshot
|
||||||
hash: hash
|
hash: hash
|
||||||
|
pdq_hash: perceptual hashes
|
||||||
wacz: wacz
|
wacz: wacz
|
||||||
replaywebpage: replaywebpage
|
replaywebpage: replaywebpage
|
||||||
instagram_tbot_archiver:
|
instagram_tbot_archiver:
|
||||||
@@ -112,10 +113,11 @@ configurations:
|
|||||||
private: false
|
private: false
|
||||||
# with 'random' you can generate a random UUID for the URL instead of a predictable path, useful to still have public but unlisted files, alternative is 'default' or not omitted from config
|
# with 'random' you can generate a random UUID for the URL instead of a predictable path, useful to still have public but unlisted files, alternative is 'default' or not omitted from config
|
||||||
key_path: random
|
key_path: random
|
||||||
|
|
||||||
gdrive_storage:
|
gdrive_storage:
|
||||||
path_generator: url
|
path_generator: url
|
||||||
filename_generator: random
|
filename_generator: random
|
||||||
root_folder_id: folder_id_from_url
|
root_folder_id: folder_id_from_url
|
||||||
oauth_token: secrets/gd-token.json # needs to be generated with scripts/create_update_gdrive_oauth_token.py
|
oauth_token: secrets/gd-token.json # needs to be generated with scripts/create_update_gdrive_oauth_token.py
|
||||||
service_account: "secrets/service_account.json"
|
service_account: "secrets/service_account.json"
|
||||||
|
csv_db:
|
||||||
|
csv_file: "./local_archive/db.csv"
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ def main():
     config = Config()
     config.parse()
     orchestrator = ArchivingOrchestrator(config)
-    orchestrator.feed()
+    for r in orchestrator.feed(): pass
 
 
 if __name__ == "__main__":
|
|||||||
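Because `feed()` becomes a generator in this change set, calling it no longer does any work until it is iterated, which is why the entry point now loops over it. A minimal sketch of that behaviour follows; `TinyOrchestrator` and its string items are hypothetical stand-ins, not the real `ArchivingOrchestrator`.

```python
from typing import Iterator, List


class TinyOrchestrator:
    """Hypothetical stand-in used only to illustrate generator draining."""

    def __init__(self, feeder: List[str]) -> None:
        self.feeder = feeder
        self.processed: List[str] = []

    def feed_item(self, item: str) -> str:
        self.processed.append(item)  # the side effect we care about
        return item.upper()

    def feed(self) -> Iterator[str]:
        for item in self.feeder:
            yield self.feed_item(item)


lazy = TinyOrchestrator(["a", "b"])
lazy.feed()  # only creates the generator object; nothing is processed

drained = TinyOrchestrator(["a", "b"])
for _ in drained.feed():  # iterating is what triggers feed_item
    pass

collected = list(TinyOrchestrator(["a", "b"]).feed())  # library-style usage
```

Yielding each result lets a caller that uses the package as a library consume the results one by one, while the command-line entry point simply drains the generator.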
@@ -13,6 +13,7 @@ from ..formatters import Formatter
 from ..storages import Storage
 from ..enrichers import Enricher
 from . import Step
+from ..utils import update_nested_dict
 
 
 @dataclass
@@ -38,10 +39,11 @@ class Config:
         self.cli_ops = {}
         self.config = {}
 
-    def parse(self, use_cli=True, yaml_config_filename: str = None):
+    def parse(self, use_cli=True, yaml_config_filename: str = None, overwrite_configs: str = {}):
         """
         if yaml_config_filename is provided, the --config argument is ignored,
         useful for library usage when the config values are preloaded
+        overwrite_configs is a dict that overwrites the yaml file contents
         """
         # 1. parse CLI values
         if use_cli:
@@ -80,6 +82,7 @@ class Config:
 
         # 2. read YAML config file (or use provided value)
         self.yaml_config = self.read_yaml(yaml_config_filename)
+        update_nested_dict(self.yaml_config, overwrite_configs)
 
         # 3. CONFIGS: decide value with priority: CLI >> config.yaml >> default
         self.config = defaultdict(dict)
|
|||||||
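The body of `update_nested_dict` is not part of this diff, so the following is only one plausible reading of it: a recursive, in-place merge in which values from the second dictionary win. The name `update_nested_dict_sketch` and the sample configuration keys are assumptions made for illustration.

```python
def update_nested_dict_sketch(target: dict, overrides: dict) -> None:
    """Hypothetical recursive merge: update `target` in place with `overrides`."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            # Both sides are dicts: merge one level deeper instead of replacing.
            update_nested_dict_sketch(target[key], value)
        else:
            target[key] = value


yaml_config = {
    "steps": {"feeder": "gsheet_feeder"},
    "configurations": {"gsheet_feeder": {"sheet": "my sheet", "header": 1}},
}
overwrite_configs = {"configurations": {"gsheet_feeder": {"header": 2}}}
update_nested_dict_sketch(yaml_config, overwrite_configs)
```

Under this reading only `header` changes while `sheet` and `steps` keep their values from the YAML file, which would match the docstring's description of overwriting the YAML contents.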
@@ -1,7 +1,6 @@
|
|||||||
|
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
from ast import List
|
from typing import Any, List
|
||||||
from typing import Any
|
|
||||||
from dataclasses import dataclass, field
|
from dataclasses import dataclass, field
|
||||||
from dataclasses_json import dataclass_json, config
|
from dataclasses_json import dataclass_json, config
|
||||||
import mimetypes
|
import mimetypes
|
||||||
@@ -31,15 +30,19 @@ class Media:
             return
 
         for s in storages:
-            s.store(self, url)
-            # Media can be inside media properties, examples include transformations on original media
-            for prop in self.properties.values():
-                if isinstance(prop, Media):
-                    s.store(prop, url)
-                if isinstance(prop, list):
-                    for prop_media in prop:
-                        if isinstance(prop_media, Media):
-                            s.store(prop_media, url)
+            for any_media in self.all_inner_media(include_self=True):
+                s.store(any_media, url)
+
+    def all_inner_media(self, include_self=False):
+        """ Media can be inside media properties, examples include transformations on original media.
+        This function returns a generator for all the inner media.
+        """
+        if include_self: yield self
+        for prop in self.properties.values():
+            if isinstance(prop, Media): yield prop
+            if isinstance(prop, list):
+                for prop_media in prop:
+                    if isinstance(prop_media, Media): yield prop_media
 
     def is_stored(self) -> bool:
         return len(self.urls) > 0 and len(self.urls) == len(ArchivingContext.get("storages"))
@@ -71,3 +74,6 @@ class Media:
|
|||||||
|
|
||||||
def is_audio(self) -> bool:
|
def is_audio(self) -> bool:
|
||||||
return self.mimetype.startswith("audio")
|
return self.mimetype.startswith("audio")
|
||||||
|
|
||||||
|
def is_image(self) -> bool:
|
||||||
|
return self.mimetype.startswith("image")
|
||||||
|
|||||||
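The traversal that `all_inner_media` introduces can be tried in isolation. The sketch below copies the generator logic from the diff onto a reduced `Media` dataclass; the `filename` field and the sample property names are assumptions, and the real class has more fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator


@dataclass
class Media:
    """Reduced stand-in for the project's Media class (illustration only)."""
    filename: str
    properties: Dict[str, Any] = field(default_factory=dict)

    def all_inner_media(self, include_self: bool = False) -> Iterator["Media"]:
        if include_self:
            yield self
        for prop in self.properties.values():
            if isinstance(prop, Media):
                yield prop
            if isinstance(prop, list):
                for prop_media in prop:
                    if isinstance(prop_media, Media):
                        yield prop_media


video = Media("video.mp4", {
    "thumbnails": [Media("t1.jpg"), Media("t2.jpg")],
    "duration": 12,                 # non-Media values are ignored
    "screenshot": Media("shot.png"),
})
names = [m.filename for m in video.all_inner_media(include_self=True)]
```

Note that the generator descends one level only: media nested inside the properties of an inner media are not visited.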
@@ -1,7 +1,6 @@
|
|||||||
|
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
from ast import List, Set
|
from typing import Any, List, Union, Dict
|
||||||
from typing import Any, Union, Dict
|
|
||||||
from dataclasses import dataclass, field
|
from dataclasses import dataclass, field
|
||||||
from dataclasses_json import dataclass_json, config
|
from dataclasses_json import dataclass_json, config
|
||||||
import datetime
|
import datetime
|
||||||
@@ -137,6 +136,10 @@ class Metadata:
|
|||||||
def get_final_media(self) -> Media:
|
def get_final_media(self) -> Media:
|
||||||
_default = self.media[0] if len(self.media) else None
|
_default = self.media[0] if len(self.media) else None
|
||||||
return self.get_media_by_id("_final_media", _default)
|
return self.get_media_by_id("_final_media", _default)
|
||||||
|
|
||||||
|
def get_all_media(self) -> List[Media]:
|
||||||
|
# returns a list with all the media and inner media
|
||||||
|
return [inner for m in self.media for inner in m.all_inner_media(True)]
|
||||||
|
|
||||||
def __str__(self) -> str:
|
def __str__(self) -> str:
|
||||||
return self.__repr__()
|
return self.__repr__()
|
||||||
|
|||||||
@@ -1,6 +1,5 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
from ast import List
|
from typing import Generator, Union, List
|
||||||
from typing import Union
|
|
||||||
|
|
||||||
from .context import ArchivingContext
|
from .context import ArchivingContext
|
||||||
|
|
||||||
@@ -10,7 +9,6 @@ from ..formatters import Formatter
|
|||||||
from ..storages import Storage
|
from ..storages import Storage
|
||||||
from ..enrichers import Enricher
|
from ..enrichers import Enricher
|
||||||
from ..databases import Database
|
from ..databases import Database
|
||||||
from .media import Media
|
|
||||||
from .metadata import Metadata
|
from .metadata import Metadata
|
||||||
|
|
||||||
import tempfile, traceback
|
import tempfile, traceback
|
||||||
@@ -29,9 +27,9 @@ class ArchivingOrchestrator:
 
         for a in self.archivers: a.setup()
 
-    def feed(self) -> None:
+    def feed(self) -> Generator[Metadata]:
         for item in self.feeder:
-            self.feed_item(item)
+            yield self.feed_item(item)
 
     def feed_item(self, item: Metadata) -> Metadata:
         try:
|
|||||||
@@ -21,7 +21,7 @@ class Step(ABC):
 
     def init(name: str, config: dict, child: Type[Step]) -> Step:
         """
-        looks into direct subclasses of child for name and returns such ab object
+        looks into direct subclasses of child for name and returns such an object
         TODO: cannot find subclasses of child.subclasses
         """
         for sub in child.__subclasses__():
|
|||||||
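The lookup described in the `init` docstring, and the limitation noted in its TODO, can be reproduced with a few toy classes. Everything below is a hedged sketch: `init_step`, `all_subclasses`, `FancyCSVDb`, and the `ValueError` on a missing name are assumptions, since the rest of the real method is not shown in this diff.

```python
from typing import Iterator, Type


class Step:
    name = None

    def __init__(self, config: dict) -> None:
        self.config = config


class Database(Step): ...
class ConsoleDb(Database): name = "console_db"
class CSVDb(Database): name = "csv_db"
class FancyCSVDb(CSVDb): name = "fancy_csv_db"  # subclass of a subclass


def init_step(name: str, config: dict, child: Type[Step]) -> Step:
    """Look only at direct subclasses of `child`, as the docstring describes."""
    for sub in child.__subclasses__():
        if sub.name == name:
            return sub(config)
    raise ValueError(f"no direct subclass of {child.__name__} named {name!r}")


def all_subclasses(cls: type) -> Iterator[type]:
    """One possible way to address the TODO: walk the subclass tree recursively."""
    for sub in cls.__subclasses__():
        yield sub
        yield from all_subclasses(sub)


csv_db = init_step("csv_db", {}, Database)
deep_names = [sub.name for sub in all_subclasses(Database)]
```

`__subclasses__()` returns only immediate subclasses, so `fancy_csv_db` is invisible to `init_step` but appears in `deep_names`.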
@@ -1,4 +1,5 @@
|
|||||||
from .database import Database
|
from .database import Database
|
||||||
from .gsheet_db import GsheetsDb
|
from .gsheet_db import GsheetsDb
|
||||||
from .console_db import ConsoleDb
|
from .console_db import ConsoleDb
|
||||||
from .csv_db import CSVDb
|
from .csv_db import CSVDb
|
||||||
|
from .api_db import AAApiDb
|
||||||
41
src/auto_archiver/databases/api_db.py
Normal file
@@ -0,0 +1,41 @@
+import requests, os
+from loguru import logger
+
+from . import Database
+from ..core import Metadata
+
+
+class AAApiDb(Database):
+    """
+        Connects to auto-archiver-api instance
+    """
+    name = "auto_archiver_api_db"
+
+    def __init__(self, config: dict) -> None:
+        # without this STEP.__init__ is not called
+        super().__init__(config)
+        self.assert_valid_string("api_endpoint")
+        self.assert_valid_string("api_secret")
+
+    @staticmethod
+    def configs() -> dict:
+        return {
+            "api_endpoint": {"default": None, "help": "API endpoint where calls are made to"},
+            "api_secret": {"default": None, "help": "API authentication secret"},
+            "public": {"default": False, "help": "whether the URL should be publicly available via the API"},
+            "author_id": {"default": None, "help": "which email to assign as author"},
+            "group_id": {"default": None, "help": "which group of users have access to the archive in case public=false as author"},
+            "tags": {"default": [], "help": "what tags to add to the archived URL", "cli_set": lambda cli_val, cur_val: set(cli_val.split(","))},
+        }
+
+    def done(self, item: Metadata) -> None:
+        """archival result ready - should be saved to DB"""
+        logger.info(f"saving archive of {item.get_url()} to the AA API.")
+
+        payload = {'result': item.to_json(), 'public': self.public, 'author_id': self.author_id, 'group_id': self.group_id, 'tags': list(self.tags)}
+        response = requests.post(os.path.join(self.api_endpoint, "submit-archive"), json=payload, auth=("abc", self.api_secret))
+
+        if response.status_code == 200:
+            logger.success(f"AA API: {response.json()}")
+        else:
+            logger.error(f"AA API FAIL ({response.status_code}): {response.json()}")
|
||||||
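Two small pieces of `AAApiDb` are easy to examine on their own: how the submit URL is assembled and how the `tags` CLI value becomes a set. The diff builds the URL with `os.path.join`, which joins with the operating system's path separator and is therefore only guaranteed to produce `/` on POSIX systems. The sketch below shows an explicit alternative; `join_url`, `parse_tags`, and the `api.example.com` endpoint are hypothetical and not part of the diff.

```python
def join_url(base: str, path: str) -> str:
    """Hypothetical helper: join a base URL and a path with exactly one slash."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def parse_tags(cli_val: str) -> set:
    """Mirrors the `cli_set` lambda in the diff: comma-separated string to set."""
    return set(cli_val.split(","))


submit_url = join_url("https://api.example.com/", "submit-archive")
tags = parse_tags("osint,video,osint")

# The payload sent to the API turns the set back into a list, as in `done()`;
# it is sorted here only to make the example deterministic.
payload_tags = sorted(tags)
```

Using a set removes duplicate tags given on the command line, at the cost of losing their original order.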
@@ -52,8 +52,11 @@ class GsheetsDb(Database):
 
         def batch_if_valid(col, val, final_value=None):
             final_value = final_value or val
-            if val and gw.col_exists(col) and gw.get_cell(row_values, col) == '':
-                cell_updates.append((row, col, final_value))
+            try:
+                if val and gw.col_exists(col) and gw.get_cell(row_values, col) == '':
+                    cell_updates.append((row, col, final_value))
+            except Exception as e:
+                logger.error(f"Unable to batch {col}={final_value} due to {e}")
 
         cell_updates.append((row, 'status', item.status))
 
@@ -65,6 +68,16 @@ class GsheetsDb(Database):
         batch_if_valid('text', item.get("content", ""))
         batch_if_valid('timestamp', item.get_timestamp())
         batch_if_valid('hash', media.get("hash", "not-calculated"))
+
+        # merge all pdq hashes into a single string, if present
+        pdq_hashes = []
+        all_media = item.get_all_media()
+        for m in all_media:
+            if pdq := m.get("pdq_hash"):
+                pdq_hashes.append(pdq)
+        if len(pdq_hashes):
+            batch_if_valid('pdq_hash', ",".join(pdq_hashes))
+
         if (screenshot := item.get_media_by_id("screenshot")) and hasattr(screenshot, "urls"):
             batch_if_valid('screenshot', "\n".join(screenshot.urls))
 
|
|||||||
@@ -4,4 +4,5 @@ from .wayback_enricher import WaybackArchiverEnricher
 from .hash_enricher import HashEnricher
 from .thumbnail_enricher import ThumbnailEnricher
 from .wacz_enricher import WaczEnricher
 from .whisper_enricher import WhisperEnricher
+from .pdq_hash_enricher import PdqHashEnricher
42
src/auto_archiver/enrichers/pdq_hash_enricher.py
Normal file
@@ -0,0 +1,42 @@
+import pdqhash
+import numpy as np
+from PIL import Image
+from loguru import logger
+
+from . import Enricher
+from ..core import Metadata
+
+
+class PdqHashEnricher(Enricher):
+    """
+    Calculates perceptual hashes for Media instances using PDQ, allowing for (near-)duplicate detection.
+    Ideally this enrichment is orchestrated to run after the thumbnail_enricher.
+    """
+    name = "pdq_hash_enricher"
+
+    def __init__(self, config: dict) -> None:
+        # Without this STEP.__init__ is not called
+        super().__init__(config)
+
+    @staticmethod
+    def configs() -> dict:
+        return {}
+
+    def enrich(self, to_enrich: Metadata) -> None:
+        url = to_enrich.get_url()
+        logger.debug(f"calculating perceptual hashes for {url=}")
+
+        for m in to_enrich.media:
+            for media in m.all_inner_media(True):
+                if media.is_image() and media.get("id") != "screenshot" and len(hd := self.calculate_pdq_hash(media.filename)):
+                    media.set("pdq_hash", hd)
+
+    def calculate_pdq_hash(self, filename):
+        # returns a hexadecimal string with the perceptual hash for the given filename
+        with Image.open(filename) as img:
+            # convert the image to RGB
+            image_rgb = np.array(img.convert("RGB"))
+            # compute the 256-bit PDQ hash (we do not store the quality score)
+            hash_array, _ = pdqhash.compute(image_rgb)
+            hash = "".join(str(b) for b in hash_array)
+        return hex(int(hash, 2))[2:]
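Note on `calculate_pdq_hash` in the hunk above: the bit string is converted through `int`, so leading zero bits are dropped and the hex string can come out shorter than 64 characters. The following is a minimal sketch, not part of this commit, of a fixed-width encoding and of comparing two hex-encoded hashes by Hamming distance. It assumes the hash is a sequence of 256 values that are each 0 or 1; `bits_to_hex` and `hamming_distance` are hypothetical helper names that exist neither in the repository nor in `pdqhash`.

```python
def bits_to_hex(bits):
    """Encode a sequence of 0/1 values as a zero-padded hexadecimal string."""
    bit_string = "".join(str(int(b)) for b in bits)
    width = (len(bit_string) + 3) // 4  # hex digits needed to hold every bit
    return format(int(bit_string, 2), f"0{width}x")


def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two hex-encoded hashes."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# A hash whose first 252 bits are zero still keeps its full width.
mostly_zero = [0] * 252 + [1, 1, 1, 1]
all_zero = [0] * 256
print(len(bits_to_hex(mostly_zero)))  # 64
print(hamming_distance(bits_to_hex(mostly_zero), bits_to_hex(all_zero)))  # 4
```

Because `hamming_distance` parses both strings back into integers, it also works on the unpadded strings produced by the code in the diff; the padding only matters to consumers that expect a fixed 64-character value.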
@@ -25,13 +25,7 @@ class WaczEnricher(Enricher):
         }
 
     def enrich(self, to_enrich: Metadata) -> bool:
-        # TODO: figure out support for browsertrix in docker
-
         url = to_enrich.get_url()
 
-        if UrlUtil.is_auth_wall(url):
-            logger.debug(f"[SKIP] SCREENSHOT since url is behind AUTH WALL: {url=}")
-            return
-
         collection = str(uuid.uuid4())[0:8]
         browsertrix_home = os.path.abspath(ArchivingContext.get_tmp_dir())
@@ -50,9 +44,10 @@ class WaczEnricher(Enricher):
                 "--saveState", "never",
                 "--behaviors", "autoscroll,autoplay,autofetch,siteSpecific",
                 "--behaviorTimeout", str(self.timeout),
-                "--timeout", str(self.timeout),
-                "--profile", str(self.profile)
-            ]
+                "--timeout", str(self.timeout)]
+
+            if self.profile:
+                cmd.extend(["--profile", os.path.join("/app", str(self.profile))])
         else:
             logger.debug(f"generating WACZ in Docker for {url=}")
 
@@ -75,9 +70,7 @@ class WaczEnricher(Enricher):
             if self.profile:
                 profile_fn = os.path.join(browsertrix_home, "profile.tar.gz")
                 shutil.copyfile(self.profile, profile_fn)
-                # TODO: test which is right
-                cmd.extend(["--profile", profile_fn])
-                # cmd.extend(["--profile", "/crawls/profile.tar.gz"])
+                cmd.extend(["--profile", os.path.join("/crawls", "profile.tar.gz")])
 
         try:
             logger.info(f"Running browsertrix-crawler: {' '.join(cmd)}")
@@ -39,7 +39,7 @@ class GsheetsFeeder(Gsheets, Feeder):
             })
 
     def __iter__(self) -> Metadata:
-        sh = self.gsheets_client.open(self.sheet)
+        sh = self.open_sheet()
         for ii, wks in enumerate(sh.worksheets()):
             if not self.should_process_sheet(wks.title):
                 logger.debug(f"SKIPPED worksheet '{wks.title}' due to allow/block rules")
@@ -64,7 +64,10 @@ class GsheetsFeeder(Gsheets, Feeder):
                 # All checks done - archival process starts here
                 m = Metadata().set_url(url)
                 ArchivingContext.set("gsheet", {"row": row, "worksheet": gw}, keep_on_reset=True)
-                folder = slugify(gw.get_cell(row, 'folder').strip())
+                if gw.get_cell_or_default(row, 'folder', "") is None:
+                    folder = ''
+                else:
+                    folder = slugify(gw.get_cell_or_default(row, 'folder', "").strip())
                 if len(folder):
                     if self.use_sheet_names_in_stored_paths:
                         ArchivingContext.set("folder", os.path.join(folder, slugify(self.sheet), slugify(wks.title)), True)
@@ -125,7 +125,14 @@
                 <div class="collapsible-content">
                     {% for subprop in m.properties[prop] %}
                     {% if subprop | is_media %}
-                    {{ macros.display_media(subprop, false, url) }}
+                    {{ macros.display_media(subprop, true, url) }}
+
+                    <ul>
+                        {% for subprop_prop in subprop.properties %}
+                        <li><b>{{ subprop_prop }}:</b> {{ macros.copy_urlize(subprop.properties[subprop_prop]) }}</li>
+                        {% endfor %}
+                    </ul>
+
                     {% else %}
                     {{ subprop }}
                     {% endif %}
@@ -162,7 +169,8 @@
         {% endfor %}
     </table>
 
-    <p style="text-align:center;">Made with <a href="https://github.com/bellingcat/auto-archiver">bellingcat/auto-archiver</a> v{{ version }}</p>
+    <p style="text-align:center;">Made with <a
+            href="https://github.com/bellingcat/auto-archiver">bellingcat/auto-archiver</a> v{{ version }}</p>
 </body>
 <script defer>
     // notification logic
@@ -201,7 +209,7 @@
     let i;
 
     for (i = 0; i < coll.length; i++) {
-        coll[i].addEventListener("click", function() {
+        coll[i].addEventListener("click", function () {
             this.classList.toggle("active");
             // let content = this.nextElementSibling;
             let content = this.parentElement.querySelector(".collapsible-content");
@@ -10,16 +10,17 @@ class Gsheets(Step):
         # without this STEP.__init__ is not called
         super().__init__(config)
         self.gsheets_client = gspread.service_account(filename=self.service_account)
-        #TODO: config should be responsible for conversions
+        # TODO: config should be responsible for conversions
         try: self.header = int(self.header)
         except: pass
         assert type(self.header) == int, f"header ({self.header}) value must be an integer not {type(self.header)}"
-        assert self.sheet is not None, "You need to define a sheet name in your orchestration file when using gsheets."
+        assert self.sheet is not None or self.sheet_id is not None, "You need to define either a 'sheet' name or a 'sheet_id' in your orchestration file when using gsheets."
 
     @staticmethod
     def configs() -> dict:
         return {
             "sheet": {"default": None, "help": "name of the sheet to archive"},
+            "sheet_id": {"default": None, "help": "(alternative to sheet name) the id of the sheet to archive"},
             "header": {"default": 1, "help": "index of the header row (starts at 1)"},
             "service_account": {"default": "secrets/service_account.json", "help": "service account JSON file path"},
             "columns": {
@@ -35,10 +36,17 @@ class Gsheets(Step):
                     'text': 'text content',
                     'screenshot': 'screenshot',
                     'hash': 'hash',
+                    'pdq_hash': 'perceptual hashes',
                     'wacz': 'wacz',
                     'replaywebpage': 'replaywebpage',
                 },
                 "help": "names of columns in the google sheet (stringified JSON object)",
                 "cli_set": lambda cli_val, cur_val: dict(cur_val, **json.loads(cli_val))
             },
         }
+
+    def open_sheet(self):
+        if self.sheet:
+            return self.gsheets_client.open(self.sheet)
+        else:  # self.sheet_id
+            return self.gsheets_client.open_by_key(self.sheet_id)
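The name-or-id lookup that `open_sheet` introduces above can be exercised without Google credentials. The sketch below is an illustration only, under stated assumptions: `FakeClient` is a hypothetical stand-in that mimics just the two gspread client methods used in the diff (`open` and `open_by_key`), and the free function `open_sheet` mirrors the method's branching. Unlike the assert in the diff, it treats an empty string as missing.

```python
class FakeClient:
    """Hypothetical stand-in for a gspread client; reports which lookup ran."""

    def open(self, title):
        return ("by-title", title)

    def open_by_key(self, key):
        return ("by-key", key)


def open_sheet(client, sheet=None, sheet_id=None):
    """Open by title when a sheet name is set, otherwise fall back to the id."""
    if not sheet and not sheet_id:
        # stricter than the diff's assert: empty strings count as missing
        raise ValueError("define either a 'sheet' name or a 'sheet_id'")
    if sheet:
        return client.open(sheet)
    return client.open_by_key(sheet_id)


client = FakeClient()
print(open_sheet(client, sheet="My archive sheet"))  # ('by-title', 'My archive sheet')
print(open_sheet(client, sheet_id="abc123"))         # ('by-key', 'abc123')
```

When both values are configured the name wins, which matches the order of the branches in the diff.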
@@ -19,6 +19,7 @@ class GWorksheet:
         'title': 'upload title',
         'screenshot': 'screenshot',
         'hash': 'hash',
+        'pdq_hash': 'perceptual hashes',
         'wacz': 'wacz',
         'replaywebpage': 'replaywebpage',
     }
@@ -40,3 +40,12 @@ class DateTimeEncoder(json.JSONEncoder):
 
 def dump_payload(p):
     return json.dumps(p, ensure_ascii=False, indent=4, cls=DateTimeEncoder)
+
+
+def update_nested_dict(dictionary, update_dict):
+    # takes 2 dicts and overwrites the first with the second only on the changed values
+    for key, value in update_dict.items():
+        if key in dictionary and isinstance(value, dict) and isinstance(dictionary[key], dict):
+            update_nested_dict(dictionary[key], value)
+        else:
+            dictionary[key] = value
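To illustrate what `update_nested_dict` in the hunk above does differently from a plain `dict.update`, here is a small sketch. The function body repeats the logic from the diff; the `config` and `overrides` values are made up for the example and are not taken from the repository.

```python
def update_nested_dict(dictionary, update_dict):
    # same logic as in the diff: recurse into nested dicts, overwrite otherwise
    for key, value in update_dict.items():
        if key in dictionary and isinstance(value, dict) and isinstance(dictionary[key], dict):
            update_nested_dict(dictionary[key], value)
        else:
            dictionary[key] = value


# Illustrative data only.
config = {"gsheets": {"header": 1, "columns": {"url": "link", "hash": "hash"}}}
overrides = {"gsheets": {"columns": {"hash": "sha"}}, "extra": True}

update_nested_dict(config, overrides)
print(config)
# {'gsheets': {'header': 1, 'columns': {'url': 'link', 'hash': 'sha'}}, 'extra': True}
```

A plain `config.update(overrides)` would have replaced the whole `"gsheets"` entry and lost `header` and the `url` column; the recursive merge only touches the keys present in `overrides`. The update happens in place and the function returns `None`.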
@@ -3,7 +3,7 @@ _MAJOR = "0"
 _MINOR = "5"
 # On main and in a nightly release the patch should be one ahead of the last
 # released build.
-_PATCH = "12"
+_PATCH = "24"
 # This is mainly for nightly builds which have the suffix ".dev$DATE". See
 # https://semver.org/#is-v123-a-semantic-version for the semantics.
 _SUFFIX = ""