From a6aacfa3fbfb9722479383b992909fd03e091f37 Mon Sep 17 00:00:00 2001 From: erinhmclark Date: Thu, 16 Jan 2025 09:31:50 +0000 Subject: [PATCH] Add example pre-generated configs.rst --- docs/source/configs.rst | 741 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 741 insertions(+) create mode 100644 docs/source/configs.rst diff --git a/docs/source/configs.rst b/docs/source/configs.rst new file mode 100644 index 0000000..9f793e1 --- /dev/null +++ b/docs/source/configs.rst @@ -0,0 +1,741 @@ +Configs +======= + +This section documents all configuration options available for various components. + +InstagramAPIArchiver +-------------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - access_token + - None + - a valid instagrapi-api token + * - api_endpoint + - None + - API endpoint to use + * - full_profile + - False + - if true, will download all posts, tagged posts, stories, and highlights for a profile, if false, will only download the profile pic and information. + * - full_profile_max_posts + - 0 + - Use to limit the number of posts to download when full_profile is true. 0 means no limit. limit is applied softly since posts are fetched in batch, once to: posts, tagged posts, and highlights + * - minimize_json_output + - True + - if true, will remove empty values from the json output + +InstagramArchiver +----------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - username + - None + - a valid Instagram username + * - password + - None + - the corresponding Instagram account password + * - download_folder + - instaloader + - name of a folder to temporarily download content to + * - session_file + - secrets/instaloader.session + - path to the instagram session which saves session credentials + +InstagramTbotArchiver +--------------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - api_id + - None + - telegram API_ID value, go to https://my.telegram.org/apps + * - api_hash + - None + - telegram API_HASH value, go to https://my.telegram.org/apps + * - session_file + - secrets/anon-insta + - optional, records the telegram login session for future usage, '.session' will be appended to the provided value. + * - timeout + - 45 + - timeout to fetch the instagram content in seconds. + +TelethonArchiver +---------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - api_id + - None + - telegram API_ID value, go to https://my.telegram.org/apps + * - api_hash + - None + - telegram API_HASH value, go to https://my.telegram.org/apps + * - bot_token + - None + - optional, but allows access to more content such as large videos, talk to @botfather + * - session_file + - secrets/anon + - optional, records the telegram login session for future usage, '.session' will be appended to the provided value. + * - join_channels + - True + - disables the initial setup with channel_invites config, useful if you have a lot and get stuck + * - channel_invites + - {} + - (JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup + +TwitterApiArchiver +------------------ + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - bearer_token + - None + - [deprecated: see bearer_tokens] twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret + * - bearer_tokens + - [] + - a list of twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret, if provided you can still add those for better rate limits. CSV of bearer tokens if provided via the command line + * - consumer_key + - None + - twitter API consumer_key + * - consumer_secret + - None + - twitter API consumer_secret + * - access_token + - None + - twitter API access_token + * - access_secret + - None + - twitter API access_secret + +VkArchiver +---------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - username + - None + - valid VKontakte username + * - password + - None + - valid VKontakte password + * - session_file + - secrets/vk_config.v2.json + - valid VKontakte password + +YoutubeDLArchiver +----------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - facebook_cookie + - None + - optional facebook cookie to have more access to content, from browser, looks like 'cookie: datr= xxxx' + * - subtitles + - True + - download subtitles if available + * - comments + - False + - download all comments if available, may lead to large metadata + * - livestreams + - False + - if set, will download live streams, otherwise will skip them; see --max-filesize for more control + * - live_from_start + - False + - if set, will download live streams from their earliest available moment, otherwise starts now. + * - proxy + - + - http/socks (https seems to not work atm) proxy to use for the webdriver, eg https://proxy- user:password@proxy-ip:port + * - end_means_success + - True + - if True, any archived content will mean a 'success', if False this archiver will not return a 'success' stage; this is useful for cases when the yt-dlp will archive a video but ignore other types of content like images or text only pages that the subsequent archivers can retrieve. + * - allow_playlist + - False + - If True will also download playlists, set to False if the expectation is to download a single video. + * - max_downloads + - inf + - Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit. + * - cookies_from_browser + - None + - optional browser for ytdl to extract cookies from, can be one of: brave, chrome, chromium, edge, firefox, opera, safari, vivaldi, whale + * - cookie_file + - None + - optional cookie file to use for Youtube, see instructions here on how to export from your browser: https://github.com/yt-dlp/yt- dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp + +AAApiDb +------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - api_endpoint + - None + - API endpoint where calls are made to + * - api_token + - None + - API Bearer token. + * - public + - False + - whether the URL should be publicly available via the API + * - author_id + - None + - which email to assign as author + * - group_id + - None + - which group of users have access to the archive in case public=false as author + * - allow_rearchive + - True + - if False then the API database will be queried prior to any archiving operations and stop if the link has already been archived + * - store_results + - True + - when set, will send the results to the API database. + * - tags + - [] + - what tags to add to the archived URL + +AtlosDb +------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - api_token + - None + - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/ + * - atlos_url + - https://platform.atlos.org + - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash. + +CSVDb +----- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - csv_file + - db.csv + - CSV file name + +HashEnricher +------------ + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - algorithm + - SHA-256 + - hash algorithm to use + * - chunksize + - 16000000 + - number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB + +ScreenshotEnricher +------------------ + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - width + - 1280 + - width of the screenshots + * - height + - 720 + - height of the screenshots + * - timeout + - 60 + - timeout for taking the screenshot + * - sleep_before_screenshot + - 4 + - seconds to wait for the pages to load before taking screenshot + * - http_proxy + - + - http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port + * - save_to_pdf + - False + - save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter + * - print_options + - {} + - options to pass to the pdf printer + +SSLEnricher +----------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - skip_when_nothing_archived + - True + - if true, will skip enriching when no media is archived + +ThumbnailEnricher +----------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - thumbnails_per_minute + - 60 + - how many thumbnails to generate per minute of video, can be limited by max_thumbnails + * - max_thumbnails + - 16 + - limit the number of thumbnails to generate per video, 0 means no limit + +TimestampingEnricher +-------------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - tsa_urls + - ['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa'] + - List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line. + +WaczArchiverEnricher +-------------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - profile + - None + - browsertrix-profile (for profile generation see https://github.com/webrecorder/browsertrix- crawler#creating-and-using-browser-profiles). + * - docker_commands + - None + - if a custom docker invocation is needed + * - timeout + - 120 + - timeout for WACZ generation in seconds + * - extract_media + - False + - If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched. + * - extract_screenshot + - True + - If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched. + * - socks_proxy_host + - None + - SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host + * - socks_proxy_port + - None + - SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234 + * - proxy_server + - None + - SOCKS server proxy URL, in development + +WaybackArchiverEnricher +----------------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - timeout + - 15 + - seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually. + * - if_not_archived_within + - None + - only tell wayback to archive if no archive is available before the number of seconds specified, use None to ignore this option. For more information: https://docs.google.com/document/d/1N sv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA + * - key + - None + - wayback API key. to get credentials visit https://archive.org/account/s3.php + * - secret + - None + - wayback API secret. to get credentials visit https://archive.org/account/s3.php + * - proxy_http + - None + - http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port + * - proxy_https + - None + - https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port + +WhisperEnricher +--------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - api_endpoint + - None + - WhisperApi api endpoint, eg: https://whisperbox- api.com/api/v1, a deployment of https://github.com/bellingcat/whisperbox- transcribe. + * - api_key + - None + - WhisperApi api key for authentication + * - include_srt + - False + - Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players). + * - timeout + - 90 + - How many seconds to wait at most for a successful job completion. + * - action + - translate + - which Whisper operation to execute + +AtlosFeeder +----------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - api_token + - None + - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/ + * - atlos_url + - https://platform.atlos.org + - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash. + +CLIFeeder +--------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - urls + - None + - URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml + +GsheetsFeeder +------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - sheet + - None + - name of the sheet to archive + * - sheet_id + - None + - (alternative to sheet name) the id of the sheet to archive + * - header + - 1 + - index of the header row (starts at 1) + * - service_account + - secrets/service_account.json + - service account JSON file path + * - columns + - {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'} + - names of columns in the google sheet (stringified JSON object) + * - allow_worksheets + - set() + - (CSV) only worksheets whose name is included in allow are included (overrides worksheet_block), leave empty so all are allowed + * - block_worksheets + - set() + - (CSV) explicitly block some worksheets from being processed + * - use_sheet_names_in_stored_paths + - True + - if True the stored files path will include 'workbook_name/worksheet_name/...' + +HtmlFormatter +------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - detect_thumbnails + - True + - if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00' + +AtlosStorage +------------ + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - path_generator + - url + - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory. + * - filename_generator + - random + - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash. + * - api_token + - None + - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/ + * - atlos_url + - https://platform.atlos.org + - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash. + +GDriveStorage +------------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - path_generator + - url + - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory. + * - filename_generator + - random + - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash. + * - root_folder_id + - None + - root google drive folder ID to use as storage, found in URL: 'https://drive.google.com/drive/folders/FOLDER_ID' + * - oauth_token + - None + - JSON filename with Google Drive OAuth token: check auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards owner of GDrive folder, therefore it is best to use oauth_token_filename over service_account. + * - service_account + - secrets/service_account.json + - service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account. + +LocalStorage +------------ + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - path_generator + - url + - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory. + * - filename_generator + - random + - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash. + * - save_to + - ./archived + - folder where to save archived content + * - save_absolute + - False + - whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure) + +S3Storage +--------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - path_generator + - url + - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory. + * - filename_generator + - random + - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash. + * - bucket + - None + - S3 bucket name + * - region + - None + - S3 region name + * - key + - None + - S3 API key + * - secret + - None + - S3 API secret + * - random_no_duplicate + - False + - if set, it will override `path_generator`, `filename_generator` and `folder`. It will check if the file already exists and if so it will not upload it again. Creates a new root folder path `no-dups/` + * - endpoint_url + - https://{region}.digitaloceanspaces.com + - S3 bucket endpoint, {region} are inserted at runtime + * - cdn_url + - https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key} + - S3 CDN url, {bucket}, {region} and {key} are inserted at runtime + * - private + - False + - if true S3 files will not be readable online + +Storage +------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - path_generator + - url + - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory. + * - filename_generator + - random + - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash. + +Gsheets +------- + +The following table lists all configuration options for this component: + +.. list-table:: Configuration Options + :header-rows: 1 + :widths: 25 20 55 + + * - **Key** + - **Default** + - **Description** + * - sheet + - None + - name of the sheet to archive + * - sheet_id + - None + - (alternative to sheet name) the id of the sheet to archive + * - header + - 1 + - index of the header row (starts at 1) + * - service_account + - secrets/service_account.json + - service account JSON file path + * - columns + - {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'} + - names of columns in the google sheet (stringified JSON object) +