Compare commits

...

12 Commits

Author SHA1 Message Date
Patrick Robertson
ab03e48708 Add info on building RTD versions + automated building of tagged versions 2025-03-17 12:52:04 +00:00
Patrick Robertson
f56cd6891b Finish incomplete sentence 2025-03-17 10:33:50 +00:00
Patrick Robertson
9e03d745d8 Add '-it' to the list of docker flags, so that docker gives a colour log output 2025-03-17 09:45:12 +00:00
Patrick Robertson
7badf89c28 Create the 'secrets' folder if it doesn't exist on first run
Easier setup for users
2025-03-17 09:40:46 +00:00
Patrick Robertson
d59530c8e7 Fix if logic bug 2025-03-17 09:40:27 +00:00
Patrick Robertson
0ec5451f66 Nicer error log when no URLs provided for CLI feeder - don't need the stacktrace 2025-03-17 09:34:33 +00:00
Patrick Robertson
99e9ac2465 Fix 'Syntax Error' warning in python3.12+ 2025-03-17 09:29:51 +00:00
Patrick Robertson
42162c5e3f Various docs improvements based on Friday Office Hours discussion 2025-03-17 09:23:43 +00:00
Patrick Robertson
3afe519176 Fix link to module types in config editor 2025-03-17 09:17:17 +00:00
Patrick Robertson
f13349bacf Fix incorrect path in cp 2025-03-16 10:33:52 +00:00
Patrick Robertson
92c79ed994 Remove schema.json file from git - is auto-generated on release 2025-03-16 10:27:08 +00:00
Patrick Robertson
2643b8e717 Update material version, minify code 2025-03-16 10:22:54 +00:00
21 changed files with 359 additions and 50609 deletions

1
.gitignore vendored

@@ -34,4 +34,5 @@ docs/_build/
docs/source/autoapi/
docs/source/modules/autogen/
scripts/settings_page.html
scripts/settings/src/schema.json
.vite


@@ -21,7 +21,7 @@ build:
# generate the config editor page. Schema then HTML
- VIRTUAL_ENV=$READTHEDOCS_VIRTUALENV_PATH poetry run python scripts/generate_settings_schema.py
# install node dependencies and build the settings
- cd scripts/settings && npm install && npm run build && yes | cp dist/index.html ../../docs/source/installation/settings_base.html && cd ../..
- cd scripts/settings && npm install && npm run build && yes | cp -v dist/index.html ../../docs/source/installation/settings.html && cd ../..
sphinx:


@@ -29,7 +29,7 @@ View the [Installation Guide](https://auto-archiver.readthedocs.io/en/latest/ins
To get started quickly using Docker:
`docker pull bellingcat/auto-archiver && docker run --rm -v secrets:/app/secrets bellingcat/auto-archiver --config secrets/orchestration.yaml`
`docker pull bellingcat/auto-archiver && docker run -it --rm -v secrets:/app/secrets bellingcat/auto-archiver --config secrets/orchestration.yaml`
Or pip:


@@ -36,3 +36,12 @@ open docs/_build/html/index.html
sphinx-autobuild docs/source docs/_build/html
```
### Managing Readthedocs (RTD) Versions
Version management is done at [https://app.readthedocs.org/projects/auto-archiver/](https://app.readthedocs.org/projects/auto-archiver/)
(login required). Once logged in, you can create new versions, delete old versions or change visibility of versions. More info on
[RTD](https://docs.readthedocs.com/platform/stable/versions.html).
Currently, the Auto Archiver project is set up to automatically create a new docs version for each `vX.Y.Z` release. For more on this,
see the RTD [instructions on automation](https://docs.readthedocs.com/platform/stable/guides/automation-rules.html) or edit the existing automation rule in the project settings.


@@ -86,7 +86,7 @@ gsheet_feeder_db:
You can also pass these settings directly on the command line without having to edit the file. Here's an example of how to do that (using docker):
`docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --gsheet_feeder_db.sheet "My Awesome Sheet 2"`.
`docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --gsheet_feeder_db.sheet "My Awesome Sheet 2"`.
Here, the sheet name has been overridden/specified in the command line invocation.


@@ -0,0 +1,60 @@
# Frequently Asked Questions
### Q: What websites does the Auto Archiver support?
**A:** The Auto Archiver works for a large variety of sites. Firstly, the Auto Archiver can download
and archive any video website supported by YT-DLP, a powerful video-downloading tool ([full list of
sites here](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)). Aside from these sites,
there are various different 'Extractors' for specific websites. See the full list of extractors that
are available on the [extractors](../modules/extractor.md) page. Some sites supported include:
* Twitter
* Instagram
* Telegram
* VKontakte
* TikTok
* Bluesky
```{note} What websites the Auto Archiver can archive depends on what extractors you have enabled in
your configuration. See [configuration](./configurations.md) for more info.
```
### Q: Does the Auto Archiver only work for social media posts?
**A:** No, the Auto Archiver can archive any web page on the internet, not just social media posts.
However, for social media posts Auto Archiver can extract more relevant/useful information (such as
post comments, likes, author etc.) which may not be available for a generic website. If you are looking
to more generally archive webpages, then you should make sure to enable the [](../modules/autogen/extractor/wacz_extractor_enricher.md)
and the [](../modules/autogen/extractor/wayback_extractor_enricher.md).
### Q: What kind of data is stored for each webpage that's archived?
**A:** This depends on the website archived, but more generally, for social media posts any videos and photos in
the post will be archived. For video sites, the video will be downloaded separately. For most of these sites, additional
metadata such as published date, uploader/author and ratings/comments will also be saved. Additionally, further data can be
saved depending on the enrichers that you have enabled. Some other types of data saved are timestamps if you have the
[](../modules/autogen/enricher/timestamping_enricher.md) or [](../modules/autogen/enricher/opentimestamps_enricher.md) enabled,
screenshots of the web page with the [](../modules/autogen/enricher/screenshot_enricher.md), and for videos, thumbnails of the
video with the [](../modules/autogen/enricher/thumbnail_enricher.md). You can also store things like hashes (SHA256, or pdq hashes)
with the various hash enrichers.
### Q: Where is my data stored?
**A:** With the default configuration, data is stored on your local computer in the `local_storage` folder. You can adjust these settings by
changing the [storage modules](../modules/storage.md) you have enabled. For example, you could choose to store your data in an S3 bucket or
on Google Drive.
```{note}
You can choose to store your data in multiple places, for example your local drive **and** an S3 bucket for redundancy.
```
### Q: What should I do if something doesn't work?
**A:** First, read through the log files to see if you can find a specific reason why something isn't working. Learn more about logging
and how to enable debug logging in the [Logging Howto](../how_to/logging.md).
If you cannot find an answer in the logs, then try searching this documentation or existing / closed issues on the [Github Issue Tracker](https://github.com/bellingcat/auto-archiver/issues?q=is%3Aissue%20). If you still cannot find an answer, then consider opening an issue on the Github Issue Tracker or asking in the Bellingcat Discord
'Auto Archiver' group.
#### Common reasons why archiving might not work:
* The website may have temporarily adjusted its settings - sometimes sites like Telegram or Twitter adjust their scraping protection settings. Often,
waiting a day or two and then trying again can work.
* The site requires you to be logged in - you could try using cookies or authentication to bypass any blocks. See [](../installation/authentication.md) for more information.
* The website you're trying to archive has changed its settings/structure. Make sure you're using the latest version of Auto Archiver and try again.


@@ -1,5 +1,11 @@
# Installation
```{toctree}
:maxdepth: 1
upgrading.md
```
There are 3 main ways to use the auto-archiver. We recommend the 'docker' method for most uses. This installs all the requirements in one command.
1. Easiest (recommended): [via docker](#installing-with-docker)

File diff suppressed because one or more lines are too long


@@ -1,7 +1,6 @@
# Getting Started
```{toctree}
:maxdepth: 1
:hidden:
installation.md
@@ -9,6 +8,7 @@ configurations.md
config_editor.md
authentication.md
requirements.md
faq.md
config_cheatsheet.md
```
@@ -27,17 +27,18 @@ The way you run the Auto Archiver depends on how you installed it (docker instal
If you installed Auto Archiver using docker, open up your terminal, and copy-paste / type the following command:
```bash
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
```
Breaking this command down:
1. `docker run` tells docker to start a new container (an instance of the image)
2. `--rm` makes sure this container is removed after execution (less garbage locally)
3. `-v $PWD/secrets:/app/secrets` - your secrets folder with settings
2. `-it` tells docker to run in 'interactive mode' so that we get nice colour logs
3. `--rm` makes sure this container is removed after execution (less garbage locally)
4. `-v $PWD/secrets:/app/secrets` - your secrets folder with settings
1. `-v` is a volume flag which means a folder that you have on your computer will be connected to a folder inside the docker container
2. `$PWD/secrets` points to a `secrets/` folder in your current working directory (where your console points to), we use this folder as a best practice to hold all the secrets/tokens/passwords/... you use
3. `/app/secrets` points to the path inside the docker container where this folder can be found
4. `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
5. `-v $PWD/local_archive:/app/local_archive` - (optional) if you use local_storage
1. `-v` same as above, this is a volume instruction
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
@@ -48,14 +49,14 @@ The invocations below will run the auto-archiver Docker image using a configurat
```bash
# Have auto-archiver run with the default settings, generating a settings file in ./secrets/orchestration.yaml
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver
# uses the same configuration, but with the `gsheet_feeder`, a header on row 2 and with some different column names
# Note this expects you to have followed the [Google Sheets setup](how_to/google_sheets.md) and added your service_account.json to the `secrets/` folder
# notice that columns is a dictionary so you need to pass it as JSON and it will override only the values provided
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --feeders=gsheet_feeder --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --feeders=gsheet_feeder --gsheet_feeder.sheet="use it on another sheets doc" --gsheet_feeder.header=2 --gsheet_feeder.columns='{"url": "link"}'
# Runs auto-archiver for the first time, but in 'full' mode, enabling all modules to get a full settings file
docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --mode full
docker run -it --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver --mode full
```
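The comment in the block above notes that `columns` is a dictionary passed as JSON, and that only the values provided are overridden. A minimal Python sketch of that kind of merge follows. The default column names and the `apply_override` helper are hypothetical and used for illustration only; the real defaults live in the `gsheet_feeder` module and may differ.

```python
import json

# Hypothetical defaults, for illustration only (not taken from the project).
default_columns = {"url": "url", "status": "status", "folder": "folder"}


def apply_override(defaults: dict, override_json: str) -> dict:
    """Return a copy of `defaults` with the keys from the JSON override replaced."""
    override = json.loads(override_json)
    merged = dict(defaults)  # copy, so the defaults stay untouched
    merged.update(override)  # only the keys present in the override change
    return merged


# Same JSON string as in --gsheet_feeder.columns='{"url": "link"}'
columns = apply_override(default_columns, '{"url": "link"}')
print(columns)
```

Only the `url` entry changes here; keys that are absent from the JSON keep their default values.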
------------


@@ -0,0 +1,30 @@
# Upgrading
If an update is available, then you will see a message in the logs when you
run Auto Archiver. Here's what those logs look like:
```{code} bash
********* IMPORTANT: UPDATE AVAILABLE ********
A new version of auto-archiver is available (v0.13.6, you have 0.13.4)
Make sure to update to the latest version using: `pip install --upgrade auto-archiver`
```
Upgrading Auto Archiver depends on the way you installed it.
## Docker
To upgrade using docker, update the docker image with:
```
docker pull bellingcat/auto-archiver:latest
```
## Pip
To upgrade the pip package, use:
```
pip install --upgrade auto-archiver
```


@@ -59,4 +59,5 @@ output_schema = {
current_file_dir = os.path.dirname(os.path.abspath(__file__))
output_file = os.path.join(current_file_dir, "settings/src/schema.json")
with open(output_file, "w") as file:
print(f"Writing schema to {output_file}")
json.dump(output_schema, file, indent=4, cls=SchemaEncoder)


@@ -12,7 +12,7 @@
"@dnd-kit/sortable": "^10.0.0",
"@emotion/react": "latest",
"@emotion/styled": "latest",
"@mui/icons-material": "latest",
"@mui/icons-material": "^6.4.7",
"@mui/material": "latest",
"react": "19.0.0",
"react-dom": "19.0.0",
@@ -997,9 +997,9 @@
}
},
"node_modules/@mui/core-downloads-tracker": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/core-downloads-tracker/-/core-downloads-tracker-6.4.6.tgz",
"integrity": "sha512-rho5Q4IscbrVmK9rCrLTJmjLjfH6m/NcqKr/mchvck0EIXlyYUB9+Z0oVmkt/+Mben43LMRYBH8q/Uzxj/c4Vw==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/core-downloads-tracker/-/core-downloads-tracker-6.4.7.tgz",
"integrity": "sha512-XjJrKFNt9zAKvcnoIIBquXyFyhfrHYuttqMsoDS7lM7VwufYG4fAPw4kINjBFg++fqXM2BNAuWR9J7XVIuKIKg==",
"license": "MIT",
"funding": {
"type": "opencollective",
@@ -1007,9 +1007,9 @@
}
},
"node_modules/@mui/icons-material": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/icons-material/-/icons-material-6.4.6.tgz",
"integrity": "sha512-rGJBvIQQbQAlyKYljHQ8wAQS/K2/uYwvemcpygnAmCizmCI4zSF9HQPuiG8Ql4YLZ6V/uKjA3WHIYmF/8sV+pQ==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/icons-material/-/icons-material-6.4.7.tgz",
"integrity": "sha512-Rk8cs9ufQoLBw582Rdqq7fnSXXZTqhYRbpe1Y5SAz9lJKZP3CIdrj0PfG8HJLGw1hrsHFN/rkkm70IDzhJsG1g==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0"
@@ -1022,7 +1022,7 @@
"url": "https://opencollective.com/mui-org"
},
"peerDependencies": {
"@mui/material": "^6.4.6",
"@mui/material": "^6.4.7",
"@types/react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react": "^17.0.0 || ^18.0.0 || ^19.0.0"
},
@@ -1033,14 +1033,14 @@
}
},
"node_modules/@mui/material": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/material/-/material-6.4.6.tgz",
"integrity": "sha512-6UyAju+DBOdMogfYmLiT3Nu7RgliorimNBny1pN/acOjc+THNFVE7hlxLyn3RDONoZJNDi/8vO4AQQr6dLAXqA==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/material/-/material-6.4.7.tgz",
"integrity": "sha512-K65StXUeGAtFJ4ikvHKtmDCO5Ab7g0FZUu2J5VpoKD+O6Y3CjLYzRi+TMlI3kaL4CL158+FccMoOd/eaddmeRQ==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0",
"@mui/core-downloads-tracker": "^6.4.6",
"@mui/system": "^6.4.6",
"@mui/core-downloads-tracker": "^6.4.7",
"@mui/system": "^6.4.7",
"@mui/types": "^7.2.21",
"@mui/utils": "^6.4.6",
"@popperjs/core": "^2.11.8",
@@ -1061,7 +1061,7 @@
"peerDependencies": {
"@emotion/react": "^11.5.0",
"@emotion/styled": "^11.3.0",
"@mui/material-pigment-css": "^6.4.6",
"@mui/material-pigment-css": "^6.4.7",
"@types/react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react": "^17.0.0 || ^18.0.0 || ^19.0.0",
"react-dom": "^17.0.0 || ^18.0.0 || ^19.0.0"
@@ -1143,9 +1143,9 @@
}
},
"node_modules/@mui/system": {
"version": "6.4.6",
"resolved": "https://registry.npmjs.org/@mui/system/-/system-6.4.6.tgz",
"integrity": "sha512-FQjWwPec7pMTtB/jw5f9eyLynKFZ6/Ej9vhm5kGdtmts1z5b7Vyn3Rz6kasfYm1j2TfrfGnSXRvvtwVWxjpz6g==",
"version": "6.4.7",
"resolved": "https://registry.npmjs.org/@mui/system/-/system-6.4.7.tgz",
"integrity": "sha512-7wwc4++Ak6tGIooEVA9AY7FhH2p9fvBMORT4vNLMAysH3Yus/9B9RYMbrn3ANgsOyvT3Z7nE+SP8/+3FimQmcg==",
"license": "MIT",
"dependencies": {
"@babel/runtime": "^7.26.0",


@@ -13,7 +13,7 @@
"@dnd-kit/sortable": "^10.0.0",
"@emotion/react": "latest",
"@emotion/styled": "latest",
"@mui/icons-material": "latest",
"@mui/icons-material": "^6.4.7",
"@mui/material": "latest",
"react": "19.0.0",
"react-dom": "19.0.0",


@@ -4,7 +4,7 @@ import Container from '@mui/material/Container';
import Typography from '@mui/material/Typography';
import Box from '@mui/material/Box';
import FileUploadIcon from '@mui/icons-material/FileUpload';
//
import {
DndContext,
closestCenter,
@@ -204,7 +204,7 @@ function ModuleTypes({ stepType, setEnabledModules, enabledModules, configValues
{stepType}
</Typography>
<Typography variant="body1" >
Select the <a href="<a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`}" target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
Select the <a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`} target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
</Typography>
</Box>
{showError ? <Typography variant="body1" color="error" >Only one {stepType.slice(0,-1)} can be enabled at a time.</Typography> : null}

File diff suppressed because it is too large Load Diff


@@ -6,7 +6,7 @@ import { viteSingleFile } from "vite-plugin-singlefile"
export default defineConfig({
plugins: [react(), viteSingleFile()],
build: {
minify: false,
sourcemap: true,
// minify: false,
// sourcemap: true,
}
});


@@ -8,6 +8,7 @@ flexible setup in various environments.
import argparse
from ruamel.yaml import YAML, CommentedMap
import json
import os
from loguru import logger
@@ -230,6 +231,10 @@ def read_yaml(yaml_filename: str) -> CommentedMap:
def store_yaml(config: CommentedMap, yaml_filename: str) -> None:
config_to_save = deepcopy(config)
## if the save path is the default location (secrets) then create the 'secrets' folder
if os.path.dirname(yaml_filename) == "secrets":
os.makedirs("secrets", exist_ok=True)
auth_dict = config_to_save.get("authentication", {})
if auth_dict and auth_dict.get("load_from_file"):
# remove all other values from the config, don't want to store it in the config file
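Note that the `os.path.dirname(yaml_filename) == "secrets"` check in `store_yaml` only matches a path written as `secrets/...`; for `./secrets/orchestration.yaml` the dirname is `./secrets`. A more general approach is to create whatever parent directory the save path has. This is a sketch under that assumption, not the project's implementation, and `ensure_parent_dir` is a hypothetical helper name.

```python
import os
import tempfile


def ensure_parent_dir(filename: str) -> None:
    """Create the parent directory of `filename` if it has one and it is missing."""
    parent = os.path.dirname(filename)
    if parent:  # an empty string means the file sits in the current directory
        os.makedirs(parent, exist_ok=True)


# Usage: run inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "secrets", "orchestration.yaml")
    ensure_parent_dir(target)
    created = os.path.isdir(os.path.join(tmp, "secrets"))

print(created)
```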


@@ -112,7 +112,7 @@ class ArchivingOrchestrator:
def check_steps(self, config):
for module_type in MODULE_TYPES:
if not config["steps"].get(f"{module_type}s", []):
if module_type == "feeder" or module_type == "formatter" and config["steps"].get(f"{module_type}"):
if (module_type == "feeder" or module_type == "formatter") and config["steps"].get(f"{module_type}"):
raise SetupError(
f"It appears you have '{module_type}' set under 'steps' in your configuration file, but as of version 0.13.0 of Auto Archiver, you must use '{module_type}s'. Change this in your configuration file and try again. \
Here's how that would look: \n\nsteps:\n {module_type}s:\n - [your_{module_type}_name_here]\n {'extractors:...' if module_type == 'feeder' else '...'}\n"
@@ -377,7 +377,8 @@ Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_
try:
loaded_module: BaseModule = self.module_factory.get_module(module, self.config)
except (KeyboardInterrupt, Exception) as e:
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
if not isinstance(e, KeyboardInterrupt) and not isinstance(e, SetupError):
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
if loaded_module and module_type == "extractor":
loaded_module.cleanup()
raise e
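The one-line fix in the `check_steps` hunk above comes down to Python's operator precedence: `and` binds more tightly than `or`, so the original condition was parsed as `A or (B and C)`. A small standalone sketch with a stand-in `steps` dictionary (the values are illustrative and not from the project):

```python
# Stand-in for config["steps"]: the user has set neither the old singular
# key ('feeder') nor the new plural key ('feeders').
steps = {}
module_type = "feeder"

# Original condition, parsed as: A or (B and C)
buggy = module_type == "feeder" or module_type == "formatter" and steps.get(module_type)

# Fixed condition: (A or B) and C
fixed = (module_type == "feeder" or module_type == "formatter") and steps.get(module_type)

print(bool(buggy))  # True: the error branch is taken although 'feeder' is not set
print(bool(fixed))  # False: no old-style key is present, so no error is raised
```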


@@ -2,13 +2,14 @@ from loguru import logger
from auto_archiver.core.feeder import Feeder
from auto_archiver.core.metadata import Metadata
from auto_archiver.core.consts import SetupError
class CLIFeeder(Feeder):
def setup(self) -> None:
self.urls = self.config["urls"]
if not self.urls:
raise ValueError(
raise SetupError(
"No URLs provided. Please provide at least one URL via the command line, or set up an alternative feeder. Use --help for more information."
)


@@ -15,6 +15,9 @@ supported by `yt-dlp`, such as YouTube, Facebook, and others. It provides functi
for retrieving videos, subtitles, comments, and other metadata, and it integrates with
the broader archiving framework.
For a full list of video platforms supported by `yt-dlp`, see the
[official documentation](https://github.com/yt-dlp/yt-dlp/blob/master/supportedsites.md)
### Features
- Supports downloading videos and playlists.
- Retrieves metadata like titles, descriptions, upload dates, and durations.


@@ -49,7 +49,7 @@ class CookieSettingDriver(webdriver.Firefox):
self.driver.add_cookie({"name": name, "value": value})
elif self.cookiejar:
domain = urlparse(url).netloc
regex = re.compile(f"(www)?\.?{domain}$")
regex = re.compile(f"(www)?.?{domain}$")
for cookie in self.cookiejar:
if regex.match(cookie.domain):
try:
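The second `regex` line in the hunk above silences the Python 3.12+ `SyntaxWarning` for the invalid escape `\.` by dropping the backslash. That also changes the pattern, because an unescaped `.` matches any character. One alternative that keeps the original meaning is a raw f-string. This is a minimal sketch: `re.escape` is an extra precaution that is not in the original code, and `cookie_domain_regex` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse


def cookie_domain_regex(url: str) -> re.Pattern:
    """Build a pattern matching the URL's host, with an optional 'www' and/or leading dot."""
    domain = urlparse(url).netloc
    # Raw f-string: the backslash reaches the regex engine, so '.' stays literal.
    return re.compile(rf"(www)?\.?{re.escape(domain)}$")


regex = cookie_domain_regex("https://example.com/page")
print(bool(regex.match(".example.com")))     # True: leading dot, as stored in cookie jars
print(bool(regex.match("www.example.com")))  # True
print(bool(regex.match("xexample.com")))     # False: '.' is literal here
```

With the pattern as written in the diff (`(www)?.?{domain}$`), the last case would match, since `.?` accepts the leading `x`.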