auto-archiver/docs/source/_auto/configs.rst
2025-01-21 09:48:46 +00:00

Configs
-------
This section documents all configuration options available for various components.
InstagramAPIArchiver
--------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - access_token
- None
- a valid instagrapi-api token
* - api_endpoint
- None
- API endpoint to use
* - full_profile
- False
- if true, will download all posts, tagged posts, stories, and highlights for a profile; if false, will only download the profile picture and information.
* - full_profile_max_posts
- 0
- Limits the number of posts to download when full_profile is true; 0 means no limit. The limit is soft because posts are fetched in batches, and it is applied separately to posts, tagged posts, and highlights.
* - minimize_json_output
- True
- if true, will remove empty values from the json output
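
The soft limit described for ``full_profile_max_posts`` can be illustrated with a minimal sketch. It assumes posts arrive in fixed-size batches and that the limit is only checked between batches; the function names and the batch size of 10 are hypothetical and are not taken from the archiver's code.

```python
from typing import Callable, List


def fetch_with_soft_limit(fetch_batch: Callable[[int], List[str]], max_posts: int) -> List[str]:
    """Collect posts batch by batch; max_posts == 0 means no limit.

    The limit is only checked after a whole batch has been added, so the
    result can contain more than max_posts items.
    """
    collected: List[str] = []
    page = 0
    while True:
        batch = fetch_batch(page)  # hypothetical paged fetcher
        if not batch:
            break  # nothing left to fetch
        collected.extend(batch)
        page += 1
        if max_posts > 0 and len(collected) >= max_posts:
            break  # limit reached, possibly exceeded
    return collected


# Fake data source: 25 posts served in batches of 10.
POSTS = [f"post-{i}" for i in range(25)]


def fake_fetch(page: int) -> List[str]:
    return POSTS[page * 10:(page + 1) * 10]


print(len(fetch_with_soft_limit(fake_fetch, 15)))  # 20: two whole batches
print(len(fetch_with_soft_limit(fake_fetch, 0)))   # 25: no limit
```

Under these assumptions a limit of 15 yields 20 posts, which is why the limit should be read as approximate.
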
InstagramArchiver
-----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - username
- None
- a valid Instagram username
* - password
- None
- the corresponding Instagram account password
* - download_folder
- instaloader
- name of a folder to temporarily download content to
* - session_file
- secrets/instaloader.session
- path to the Instagram session file, which stores session credentials
InstagramTbotArchiver
---------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_id
- None
- telegram API_ID value, go to https://my.telegram.org/apps
* - api_hash
- None
- telegram API_HASH value, go to https://my.telegram.org/apps
* - session_file
- secrets/anon-insta
- optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
* - timeout
- 45
- timeout to fetch the instagram content in seconds.
TelethonArchiver
----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_id
- None
- telegram API_ID value, go to https://my.telegram.org/apps
* - api_hash
- None
- telegram API_HASH value, go to https://my.telegram.org/apps
* - bot_token
- None
- optional, but allows access to more content such as large videos, talk to @botfather
* - session_file
- secrets/anon
- optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
* - join_channels
- True
- set to False to disable the initial setup that joins the channels listed in channel_invites; useful if you have many invites and startup gets stuck
* - channel_invites
- {}
- (JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup
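
As a rough illustration of the ``channel_invites`` value, the sketch below builds a JSON string and extracts the HASH from the two invite-link formats mentioned above. The list-of-objects shape and the ``invite`` and ``id`` key names are assumptions made for this example and should be checked against the archiver's own examples.

```python
import json
import re
from typing import Optional

# Matches the two documented invite formats: t.me/joinchat/HASH and t.me/+HASH.
INVITE_RE = re.compile(r"t\.me/(?:joinchat/|\+)([\w-]+)")


def invite_hash(link: str) -> Optional[str]:
    """Return the HASH part of an invite link, or None if it does not match."""
    match = INVITE_RE.search(link)
    return match.group(1) if match else None


# Hypothetical entries: the "invite" and "id" key names are assumptions.
invites = [
    {"invite": "https://t.me/joinchat/AAAAAexample", "id": 1234567890},
    {"invite": "t.me/+BBBBBexample"},  # id omitted: optional, but startup may be slower
]

# The option is documented as a JSON string, so serialise before passing it on.
channel_invites = json.dumps(invites)

for entry in json.loads(channel_invites):
    print(invite_hash(entry["invite"]), entry.get("id"))
```

A post URL such as ``https://t.me/c/CHANNEL_ID/1`` is not an invite link, so ``invite_hash`` returns ``None`` for it.
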
TwitterApiArchiver
------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - bearer_token
- None
- [deprecated: see bearer_tokens] twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret
* - bearer_tokens
- []
- a list of Twitter API bearer tokens, which are enough for archiving. If not provided, you will need consumer_key, consumer_secret, access_token, and access_secret; if provided, you can still add those for better rate limits. Pass as a CSV of bearer tokens when provided via the command line.
* - consumer_key
- None
- twitter API consumer_key
* - consumer_secret
- None
- twitter API consumer_secret
* - access_token
- None
- twitter API access_token
* - access_secret
- None
- twitter API access_secret
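
The relationship between ``bearer_tokens`` and the four OAuth values can be sketched as follows. This is an illustrative reading of the table above, not the archiver's implementation; both helper functions are hypothetical.

```python
from typing import Dict, List, Optional

OAUTH_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")


def parse_bearer_tokens(value: str) -> List[str]:
    """Split a comma-separated command-line value into a list of tokens."""
    return [token.strip() for token in value.split(",") if token.strip()]


def choose_auth(bearer_tokens: List[str], oauth: Dict[str, Optional[str]]) -> str:
    """Describe which credentials are usable; raise if none are complete."""
    has_oauth = all(oauth.get(key) for key in OAUTH_KEYS)
    if bearer_tokens and has_oauth:
        return "bearer+oauth"  # both given: OAuth values may help with rate limits
    if bearer_tokens:
        return "bearer"
    if has_oauth:
        return "oauth"
    raise ValueError("provide bearer_tokens or all four OAuth values")


tokens = parse_bearer_tokens("tokenA, tokenB")
print(tokens)                   # ['tokenA', 'tokenB']
print(choose_auth(tokens, {}))  # bearer
```
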
VkArchiver
----------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - username
- None
- valid VKontakte username
* - password
- None
- valid VKontakte password
* - session_file
- secrets/vk_config.v2.json
- path to the file where the VKontakte session is stored
YoutubeDLArchiver
-----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - facebook_cookie
- None
- optional facebook cookie to have more access to content, from browser, looks like 'cookie: datr= xxxx'
* - subtitles
- True
- download subtitles if available
* - comments
- False
- download all comments if available, may lead to large metadata
* - livestreams
- False
- if set, will download live streams, otherwise will skip them; see --max-filesize for more control
* - live_from_start
- False
- if set, will download live streams from their earliest available moment, otherwise starts now.
* - proxy
-
- http/socks (https seems to not work atm) proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port
* - end_means_success
- True
- if True, any archived content will mean a 'success', if False this archiver will not return a 'success' stage; this is useful for cases when the yt-dlp will archive a video but ignore other types of content like images or text only pages that the subsequent archivers can retrieve.
* - allow_playlist
- False
- If True will also download playlists, set to False if the expectation is to download a single video.
* - max_downloads
- inf
- Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.
* - cookies_from_browser
- None
- optional browser for ytdl to extract cookies from, can be one of: brave, chrome, chromium, edge, firefox, opera, safari, vivaldi, whale
* - cookie_file
- None
- optional cookie file to use for Youtube, see instructions here on how to export from your browser: https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp
AAApiDb
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_endpoint
- None
- API endpoint where calls are made to
* - api_token
- None
- API Bearer token.
* - public
- False
- whether the URL should be publicly available via the API
* - author_id
- None
- which email to assign as author
* - group_id
- None
- which group of users have access to the archive in case public=false as author
* - allow_rearchive
- True
- if False then the API database will be queried prior to any archiving operations and stop if the link has already been archived
* - store_results
- True
- when set, will send the results to the API database.
* - tags
- []
- what tags to add to the archived URL
AtlosDb
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_token
- None
- An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
* - atlos_url
- https://platform.atlos.org
- The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.
CSVDb
-----
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - csv_file
- db.csv
- CSV file name
HashEnricher
------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - algorithm
- SHA-256
- hash algorithm to use
* - chunksize
- 16000000
- number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB
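
The effect of ``chunksize`` can be seen in a minimal sketch of chunked hashing with Python's standard ``hashlib``. The function name is hypothetical and only SHA-256 is shown; the point is that memory use is bounded by the chunk size while the resulting digest is unchanged.

```python
import hashlib
import os
import tempfile


def hash_file(path: str, chunksize: int = 16_000_000) -> str:
    """Return the SHA-256 hex digest of a file, reading chunksize bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunksize)  # at most chunksize bytes held in memory
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file: the chunk size does not change the digest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"auto-archiver" * 1000)
    demo_path = tmp.name

small_chunks = hash_file(demo_path, chunksize=64)
default_chunks = hash_file(demo_path)
os.remove(demo_path)
print(small_chunks == default_chunks)  # True
```
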
ScreenshotEnricher
------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - width
- 1280
- width of the screenshots
* - height
- 720
- height of the screenshots
* - timeout
- 60
- timeout for taking the screenshot
* - sleep_before_screenshot
- 4
- seconds to wait for the pages to load before taking screenshot
* - http_proxy
-
- http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port
* - save_to_pdf
- False
- save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter
* - print_options
- {}
- options to pass to the pdf printer
SSLEnricher
-----------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - skip_when_nothing_archived
- True
- if true, will skip enriching when no media is archived
ThumbnailEnricher
-----------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - thumbnails_per_minute
- 60
- how many thumbnails to generate per minute of video, can be limited by max_thumbnails
* - max_thumbnails
- 16
- limit the number of thumbnails to generate per video, 0 means no limit
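
One plausible reading of how ``thumbnails_per_minute`` and ``max_thumbnails`` interact is sketched below. The rounding (floor) and the minimum of one thumbnail are assumptions made for the example, not documented behaviour.

```python
import math


def thumbnail_count(duration_seconds: float,
                    thumbnails_per_minute: int = 60,
                    max_thumbnails: int = 16) -> int:
    """Number of thumbnails for a video; max_thumbnails == 0 means no cap."""
    # Rate-based count, rounded down, with at least one thumbnail (assumption).
    wanted = max(1, math.floor(duration_seconds * thumbnails_per_minute / 60))
    if max_thumbnails > 0:
        return min(wanted, max_thumbnails)
    return wanted


print(thumbnail_count(10))                            # 10
print(thumbnail_count(120))                           # 16 (capped)
print(thumbnail_count(120, max_thumbnails=0))         # 120 (no cap)
print(thumbnail_count(90, thumbnails_per_minute=2))   # 3
```

With the defaults, any video longer than 16 seconds hits the cap, so ``max_thumbnails`` is usually the value that decides the outcome.
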
TimestampingEnricher
--------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - tsa_urls
- ['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa']
- List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.
WaczArchiverEnricher
--------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - profile
- None
- browsertrix-profile (for profile generation see https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles).
* - docker_commands
- None
- if a custom docker invocation is needed
* - timeout
- 120
- timeout for WACZ generation in seconds
* - extract_media
- False
- If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
* - extract_screenshot
- True
- If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
* - socks_proxy_host
- None
- SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host
* - socks_proxy_port
- None
- SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234
* - proxy_server
- None
- SOCKS server proxy URL, in development
WaybackArchiverEnricher
-----------------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - timeout
- 15
- seconds to wait for successful archive confirmation from wayback; if this time elapses without confirmation, the result contains the job_id so the status can be checked manually later.
* - if_not_archived_within
- None
- only tell wayback to archive if no archive is available before the number of seconds specified, use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA
* - key
- None
- wayback API key. to get credentials visit https://archive.org/account/s3.php
* - secret
- None
- wayback API secret. to get credentials visit https://archive.org/account/s3.php
* - proxy_http
- None
- http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port
* - proxy_https
- None
- https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port
WhisperEnricher
---------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_endpoint
- None
- WhisperApi api endpoint, eg: https://whisperbox-api.com/api/v1, a deployment of https://github.com/bellingcat/whisperbox-transcribe.
* - api_key
- None
- WhisperApi api key for authentication
* - include_srt
- False
- Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players).
* - timeout
- 90
- How many seconds to wait at most for a successful job completion.
* - action
- translate
- which Whisper operation to execute
AtlosFeeder
-----------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - api_token
- None
- An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
* - atlos_url
- https://platform.atlos.org
- The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.
CLIFeeder
---------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - urls
- None
- URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml
GsheetsFeeder
-------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - sheet
- None
- name of the sheet to archive
* - sheet_id
- None
- (alternative to sheet name) the id of the sheet to archive
* - header
- 1
- index of the header row (starts at 1)
* - service_account
- secrets/service_account.json
- service account JSON file path
* - columns
- {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
- names of columns in the google sheet (stringified JSON object)
* - allow_worksheets
- set()
- (CSV) only worksheets whose name is listed here are processed (overrides block_worksheets); leave empty to allow all worksheets
* - block_worksheets
- set()
- (CSV) explicitly block some worksheets from being processed
* - use_sheet_names_in_stored_paths
- True
- if True the stored files path will include 'workbook_name/worksheet_name/...'
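
The precedence between ``allow_worksheets`` and ``block_worksheets`` can be sketched as below. Exact, case-sensitive name matching is an assumption of this example, and both helpers are hypothetical.

```python
from typing import Set


def parse_csv_set(value: str) -> Set[str]:
    """Turn a comma-separated string into a set of worksheet names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def should_process(worksheet: str, allow: Set[str], block: Set[str]) -> bool:
    """A non-empty allow set wins over the block set; otherwise only block applies."""
    if allow:
        return worksheet in allow
    return worksheet not in block


allow = parse_csv_set("Sheet1, Sheet2")
block = parse_csv_set("Sheet2, Drafts")
print(should_process("Sheet2", allow, block))   # True: allow overrides block
print(should_process("Drafts", set(), block))   # False: blocked
print(should_process("Other", set(), block))    # True: nothing forbids it
```
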
HtmlFormatter
-------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - detect_thumbnails
- True
- if true, will group thumbnails generated by the thumbnail enricher, identified by ids such as 'thumbnail_00'
AtlosStorage
------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - api_token
- None
- An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
* - atlos_url
- https://platform.atlos.org
- The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.
GDriveStorage
-------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - root_folder_id
- None
- root google drive folder ID to use as storage, found in URL: 'https://drive.google.com/drive/folders/FOLDER_ID'
* - oauth_token
- None
- JSON filename with Google Drive OAuth token: check the auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards the owner of the GDrive folder, therefore it is best to use oauth_token over service_account.
* - service_account
- secrets/service_account.json
- service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account.
LocalStorage
------------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - save_to
- ./archived
- folder where to save archived content
* - save_absolute
- False
- whether the path to the stored file is absolute or relative in the output result, including formatters (WARN: leaks the file structure)
S3Storage
---------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
* - bucket
- None
- S3 bucket name
* - region
- None
- S3 region name
* - key
- None
- S3 API key
* - secret
- None
- S3 API secret
* - random_no_duplicate
- False
- if set, it will override `path_generator`, `filename_generator` and `folder`. It will check if the file already exists and if so it will not upload it again. Creates a new root folder path `no-dups/`
* - endpoint_url
- https://{region}.digitaloceanspaces.com
- S3 bucket endpoint, {region} is inserted at runtime
* - cdn_url
- https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}
- S3 CDN url, {bucket}, {region} and {key} are inserted at runtime
* - private
- False
- if true S3 files will not be readable online
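
The ``{region}``, ``{bucket}`` and ``{key}`` placeholders in ``endpoint_url`` and ``cdn_url`` can be filled with ordinary Python string formatting, as in this minimal sketch. It assumes that ``{key}`` stands for the stored object's key (its path inside the bucket) rather than the ``key`` credential; the bucket, region, and object names are made up.

```python
from typing import Tuple

# Default templates as listed in the table above.
ENDPOINT_TEMPLATE = "https://{region}.digitaloceanspaces.com"
CDN_TEMPLATE = "https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}"


def s3_urls(bucket: str, region: str, object_key: str) -> Tuple[str, str]:
    """Return (endpoint URL, CDN URL) with the placeholders filled in."""
    endpoint = ENDPOINT_TEMPLATE.format(region=region)
    cdn = CDN_TEMPLATE.format(bucket=bucket, region=region, key=object_key)
    return endpoint, cdn


endpoint, cdn = s3_urls("my-bucket", "fra1", "folder/file.png")
print(endpoint)  # https://fra1.digitaloceanspaces.com
print(cdn)       # https://my-bucket.fra1.cdn.digitaloceanspaces.com/folder/file.png
```

For a provider other than DigitalOcean Spaces, both templates would need to be replaced with that provider's URL patterns.
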
Storage
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - path_generator
- url
- how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
* - filename_generator
- random
- how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
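
The ``path_generator`` and ``filename_generator`` values shared by the storage components can be pictured with the sketch below. The details are assumptions chosen for illustration: how a URL is turned into a directory name, the length of random strings, and the use of a truncated SHA-256 digest for 'static' are not taken from the source.

```python
import hashlib
import posixpath
import re
import secrets


def stored_path(url: str, content: bytes, ext: str,
                path_generator: str = "url",
                filename_generator: str = "random") -> str:
    """Build a storage path from the two generator settings."""
    if path_generator == "flat":
        folder = ""  # store at the root
    elif path_generator == "url":
        folder = re.sub(r"[^a-zA-Z0-9]+", "-", url).strip("-")  # assumed slug rule
    elif path_generator == "random":
        folder = secrets.token_hex(8)
    else:
        raise ValueError(f"unknown path_generator: {path_generator}")

    if filename_generator == "random":
        name = secrets.token_hex(8)
    elif filename_generator == "static":
        name = hashlib.sha256(content).hexdigest()[:24]  # same content, same name
    else:
        raise ValueError(f"unknown filename_generator: {filename_generator}")

    return posixpath.join(folder, name + ext)


print(stored_path("https://example.com/a", b"data", ".png", "url", "static"))
print(stored_path("https://example.com/a", b"data", ".png", "flat", "random"))
```

A 'static' filename makes repeated runs over the same content land on the same path, whereas 'random' yields a new path each time.
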
Gsheets
-------
The following table lists all configuration options for this component:
.. list-table:: Configuration Options
:header-rows: 1
:widths: 25 20 55
* - **Key**
- **Default**
- **Description**
* - sheet
- None
- name of the sheet to archive
* - sheet_id
- None
- (alternative to sheet name) the id of the sheet to archive
* - header
- 1
- index of the header row (starts at 1)
* - service_account
- secrets/service_account.json
- service account JSON file path
* - columns
- {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
- names of columns in the google sheet (stringified JSON object)
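
Because ``columns`` is described as a stringified JSON object, an override has to be valid JSON with double quotes, unlike the Python-style default shown above. The sketch below parses such an override and looks up a column position from a header row. Merging the override into the defaults and matching headers case-insensitively are assumptions of this example.

```python
import json
from typing import Dict, List

# Excerpt of the documented defaults (the full mapping has more entries).
DEFAULT_COLUMNS: Dict[str, str] = {
    "url": "link",
    "status": "archive status",
    "folder": "destination folder",
}


def resolve_columns(override_json: str = "") -> Dict[str, str]:
    """Merge a stringified JSON override into the defaults (merging is an assumption)."""
    override = json.loads(override_json) if override_json else {}
    return {**DEFAULT_COLUMNS, **override}


def column_position(header_row: List[str], columns: Dict[str, str], field: str) -> int:
    """Return the 1-based position of the column whose header matches the field's name."""
    wanted = columns[field].strip().lower()
    headers = [cell.strip().lower() for cell in header_row]
    return headers.index(wanted) + 1  # raises ValueError if the header is missing


columns = resolve_columns('{"url": "source url"}')
header = ["Source URL", "Archive Status", "Destination Folder"]
print(column_position(header, columns, "url"))     # 1
print(column_position(header, columns, "status"))  # 2
```
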