Configs
-------

This section documents all configuration options available for various components.

InstagramAPIArchiver
--------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - access_token
     - None
     - a valid instagrapi-api token
   * - api_endpoint
     - None
     - API endpoint to use
   * - full_profile
     - False
     - if true, will download all posts, tagged posts, stories, and highlights for a profile; if false, will only download the profile pic and information.
   * - full_profile_max_posts
     - 0
     - Use to limit the number of posts to download when full_profile is true. 0 means no limit. The limit is applied softly since posts are fetched in batches, and it is applied once to each of: posts, tagged posts, and highlights
   * - minimize_json_output
     - True
     - if true, will remove empty values from the json output

InstagramArchiver
-----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - username
     - None
     - a valid Instagram username
   * - password
     - None
     - the corresponding Instagram account password
   * - download_folder
     - instaloader
     - name of a folder to temporarily download content to
   * - session_file
     - secrets/instaloader.session
     - path to the Instagram session file, which saves session credentials

InstagramTbotArchiver
---------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_id
     - None
     - telegram API_ID value, go to https://my.telegram.org/apps
   * - api_hash
     - None
     - telegram API_HASH value, go to https://my.telegram.org/apps
   * - session_file
     - secrets/anon-insta
     - optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
   * - timeout
     - 45
     - timeout to fetch the instagram content in seconds.

TelethonArchiver
----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_id
     - None
     - telegram API_ID value, go to https://my.telegram.org/apps
   * - api_hash
     - None
     - telegram API_HASH value, go to https://my.telegram.org/apps
   * - bot_token
     - None
     - optional, but allows access to more content such as large videos, talk to @botfather
   * - session_file
     - secrets/anon
     - optional, records the telegram login session for future usage, '.session' will be appended to the provided value.
   * - join_channels
     - True
     - set to False to disable the initial setup with the channel_invites config, useful if you have a lot of invites and get stuck
   * - channel_invites
     - {}
     - (JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup

TwitterApiArchiver
------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - bearer_token
     - None
     - [deprecated: see bearer_tokens] twitter API bearer_token which is enough for archiving, if not provided you will need consumer_key, consumer_secret, access_token, access_secret
   * - bearer_tokens
     - []
     - a list of twitter API bearer tokens, which is enough for archiving; if not provided you will need consumer_key, consumer_secret, access_token, access_secret; if provided you can still add those for better rate limits. CSV of bearer tokens if provided via the command line
   * - consumer_key
     - None
     - twitter API consumer_key
   * - consumer_secret
     - None
     - twitter API consumer_secret
   * - access_token
     - None
     - twitter API access_token
   * - access_secret
     - None
     - twitter API access_secret

VkArchiver
----------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - username
     - None
     - valid VKontakte username
   * - password
     - None
     - valid VKontakte password
   * - session_file
     - secrets/vk_config.v2.json
     - path to the VKontakte session file

YoutubeDLArchiver
-----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - facebook_cookie
     - None
     - optional facebook cookie to have more access to content, from browser, looks like 'cookie: datr= xxxx'
   * - subtitles
     - True
     - download subtitles if available
   * - comments
     - False
     - download all comments if available, may lead to large metadata
   * - livestreams
     - False
     - if set, will download live streams, otherwise will skip them; see --max-filesize for more control
   * - live_from_start
     - False
     - if set, will download live streams from their earliest available moment, otherwise starts now.
   * - proxy
     -
     - http/socks (https seems to not work atm) proxy to use for the webdriver, eg https://proxy-user:password@proxy-ip:port
   * - end_means_success
     - True
     - if True, any archived content will mean a 'success', if False this archiver will not return a 'success' stage; this is useful for cases when the yt-dlp will archive a video but ignore other types of content like images or text only pages that the subsequent archivers can retrieve.
   * - allow_playlist
     - False
     - If True will also download playlists, set to False if the expectation is to download a single video.
   * - max_downloads
     - inf
     - Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.
   * - cookies_from_browser
     - None
     - optional browser for ytdl to extract cookies from, can be one of: brave, chrome, chromium, edge, firefox, opera, safari, vivaldi, whale
   * - cookie_file
     - None
     - optional cookie file to use for Youtube, see instructions here on how to export from your browser: https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp

AAApiDb
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_endpoint
     - None
     - API endpoint where calls are made to
   * - api_token
     - None
     - API Bearer token.
   * - public
     - False
     - whether the URL should be publicly available via the API
   * - author_id
     - None
     - which email to assign as author
   * - group_id
     - None
     - which group of users have access to the archive in case public=false as author
   * - allow_rearchive
     - True
     - if False then the API database will be queried prior to any archiving operations and stop if the link has already been archived
   * - store_results
     - True
     - when set, will send the results to the API database.
   * - tags
     - []
     - what tags to add to the archived URL

AtlosDb
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_token
     - None
     - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
   * - atlos_url
     - https://platform.atlos.org
     - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

CSVDb
-----

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - csv_file
     - db.csv
     - CSV file name

HashEnricher
------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - algorithm
     - SHA-256
     - hash algorithm to use
   * - chunksize
     - 16000000
     - number of bytes to use when reading files in chunks (if this value is too large you will run out of RAM), default is 16MB

ScreenshotEnricher
------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - width
     - 1280
     - width of the screenshots
   * - height
     - 720
     - height of the screenshots
   * - timeout
     - 60
     - timeout for taking the screenshot
   * - sleep_before_screenshot
     - 4
     - seconds to wait for the pages to load before taking screenshot
   * - http_proxy
     -
     - http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port
   * - save_to_pdf
     - False
     - save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter
   * - print_options
     - {}
     - options to pass to the pdf printer

SSLEnricher
-----------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - skip_when_nothing_archived
     - True
     - if true, will skip enriching when no media is archived

ThumbnailEnricher
-----------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - thumbnails_per_minute
     - 60
     - how many thumbnails to generate per minute of video, can be limited by max_thumbnails
   * - max_thumbnails
     - 16
     - limit the number of thumbnails to generate per video, 0 means no limit

TimestampingEnricher
--------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - tsa_urls
     - ['http://timestamp.digicert.com', 'http://timestamp.identrust.com', 'http://timestamp.globalsign.com/tsa/r6advanced1', 'http://tss.accv.es:8318/tsa']
     - List of RFC3161 Time Stamp Authorities to use, separate with commas if passed via the command line.

WaczArchiverEnricher
--------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - profile
     - None
     - browsertrix-profile (for profile generation see https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles).
   * - docker_commands
     - None
     - if a custom docker invocation is needed
   * - timeout
     - 120
     - timeout for WACZ generation in seconds
   * - extract_media
     - False
     - If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
   * - extract_screenshot
     - True
     - If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched.
   * - socks_proxy_host
     - None
     - SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host
   * - socks_proxy_port
     - None
     - SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234
   * - proxy_server
     - None
     - SOCKS server proxy URL, in development

WaybackArchiverEnricher
-----------------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - timeout
     - 15
     - seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually.
   * - if_not_archived_within
     - None
     - only tell wayback to archive if no archive is available before the number of seconds specified, use None to ignore this option. For more information: https://docs.google.com/document/d/1Nsv52MvSjbLb2PCpHlat0gkzw0EvtSgpKHu4mk0MnrA
   * - key
     - None
     - wayback API key. to get credentials visit https://archive.org/account/s3.php
   * - secret
     - None
     - wayback API secret. to get credentials visit https://archive.org/account/s3.php
   * - proxy_http
     - None
     - http proxy to use for wayback requests, eg http://proxy-user:password@proxy-ip:port
   * - proxy_https
     - None
     - https proxy to use for wayback requests, eg https://proxy-user:password@proxy-ip:port

WhisperEnricher
---------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_endpoint
     - None
     - WhisperApi api endpoint, eg: https://whisperbox-api.com/api/v1, a deployment of https://github.com/bellingcat/whisperbox-transcribe.
   * - api_key
     - None
     - WhisperApi api key for authentication
   * - include_srt
     - False
     - Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players).
   * - timeout
     - 90
     - How many seconds to wait at most for a successful job completion.
   * - action
     - translate
     - which Whisper operation to execute

AtlosFeeder
-----------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - api_token
     - None
     - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
   * - atlos_url
     - https://platform.atlos.org
     - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

CLIFeeder
---------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - urls
     - None
     - URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml

GsheetsFeeder
-------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - sheet
     - None
     - name of the sheet to archive
   * - sheet_id
     - None
     - (alternative to sheet name) the id of the sheet to archive
   * - header
     - 1
     - index of the header row (starts at 1)
   * - service_account
     - secrets/service_account.json
     - service account JSON file path
   * - columns
     - {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
     - names of columns in the google sheet (stringified JSON object)
   * - allow_worksheets
     - set()
     - (CSV) only worksheets whose name is included in allow_worksheets are processed (overrides block_worksheets), leave empty so all are allowed
   * - block_worksheets
     - set()
     - (CSV) explicitly block some worksheets from being processed
   * - use_sheet_names_in_stored_paths
     - True
     - if True the stored files path will include 'workbook_name/worksheet_name/...'

HtmlFormatter
-------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - detect_thumbnails
     - True
     - if true, will group the thumbnails generated by the thumbnail enricher by their id, eg 'thumbnail_00'

AtlosStorage
------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - api_token
     - None
     - An Atlos API token. For more information, see https://docs.atlos.org/technical/api/
   * - atlos_url
     - https://platform.atlos.org
     - The URL of your Atlos instance (e.g., https://platform.atlos.org), without a trailing slash.

GDriveStorage
-------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - root_folder_id
     - None
     - root google drive folder ID to use as storage, found in URL: 'https://drive.google.com/drive/folders/FOLDER_ID'
   * - oauth_token
     - None
     - JSON filename with Google Drive OAuth token: check auto-archiver repository scripts folder for create_update_gdrive_oauth_token.py. NOTE: storage used will count towards owner of GDrive folder, therefore it is best to use oauth_token over service_account.
   * - service_account
     - secrets/service_account.json
     - service account JSON file path, same as used for Google Sheets. NOTE: storage used will count towards the developer account.

LocalStorage
------------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - save_to
     - ./archived
     - folder where to save archived content
   * - save_absolute
     - False
     - whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)

S3Storage
---------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.
   * - bucket
     - None
     - S3 bucket name
   * - region
     - None
     - S3 region name
   * - key
     - None
     - S3 API key
   * - secret
     - None
     - S3 API secret
   * - random_no_duplicate
     - False
     - if set, it will override ``path_generator``, ``filename_generator`` and ``folder``. It will check if the file already exists and if so it will not upload it again. Creates a new root folder path ``no-dups/``
   * - endpoint_url
     - https://{region}.digitaloceanspaces.com
     - S3 bucket endpoint, {region} is inserted at runtime
   * - cdn_url
     - https://{bucket}.{region}.cdn.digitaloceanspaces.com/{key}
     - S3 CDN url, {bucket}, {region} and {key} are inserted at runtime
   * - private
     - False
     - if true S3 files will not be readable online

Storage
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - path_generator
     - url
     - how to store the file in terms of directory structure: 'flat' sets to root; 'url' creates a directory based on the provided URL; 'random' creates a random directory.
   * - filename_generator
     - random
     - how to name stored files: 'random' creates a random string; 'static' uses a replicable strategy such as a hash.

Gsheets
-------

The following table lists all configuration options for this component:

.. list-table:: Configuration Options
   :header-rows: 1
   :widths: 25 20 55

   * - **Key**
     - **Default**
     - **Description**
   * - sheet
     - None
     - name of the sheet to archive
   * - sheet_id
     - None
     - (alternative to sheet name) the id of the sheet to archive
   * - header
     - 1
     - index of the header row (starts at 1)
   * - service_account
     - secrets/service_account.json
     - service account JSON file path
   * - columns
     - {'url': 'link', 'status': 'archive status', 'folder': 'destination folder', 'archive': 'archive location', 'date': 'archive date', 'thumbnail': 'thumbnail', 'timestamp': 'upload timestamp', 'title': 'upload title', 'text': 'text content', 'screenshot': 'screenshot', 'hash': 'hash', 'pdq_hash': 'perceptual hashes', 'wacz': 'wacz', 'replaywebpage': 'replaywebpage'}
     - names of columns in the google sheet (stringified JSON object)
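
The ``columns`` and ``header`` options of the Gsheets component can be illustrated with a small sketch. It is not the component's actual implementation: ``resolve_columns`` and ``locate_columns`` are hypothetical helpers, and both the overlay of user values on top of the defaults and the case-insensitive header matching are assumptions made for the example. Only the default mapping and the 1-based ``header`` index come from the table above.

```python
import json

# Default mapping copied from the ``columns`` row: internal name -> header text in the sheet.
DEFAULT_COLUMNS = {
    "url": "link",
    "status": "archive status",
    "folder": "destination folder",
    "archive": "archive location",
    "date": "archive date",
    "thumbnail": "thumbnail",
    "timestamp": "upload timestamp",
    "title": "upload title",
    "text": "text content",
    "screenshot": "screenshot",
    "hash": "hash",
    "pdq_hash": "perceptual hashes",
    "wacz": "wacz",
    "replaywebpage": "replaywebpage",
}


def resolve_columns(columns_json=None):
    """Hypothetical helper: parse a stringified JSON object and overlay it on the defaults.

    Assumption: keys that are not overridden keep their default header text.
    """
    overrides = json.loads(columns_json) if columns_json else {}
    if not isinstance(overrides, dict):
        raise ValueError("columns must be a JSON object")
    return {**DEFAULT_COLUMNS, **overrides}


def locate_columns(rows, columns, header=1):
    """Hypothetical helper: map each internal name to a 0-based column position.

    ``header`` is 1-based, as documented; case-insensitive matching is an assumption.
    Names whose header text does not appear in the sheet are left out.
    """
    header_cells = [cell.strip().lower() for cell in rows[header - 1]]
    return {
        name: header_cells.index(title.lower())
        for name, title in columns.items()
        if title.lower() in header_cells
    }


# Override only the header text of the URL column; everything else stays at its default.
columns = resolve_columns('{"url": "source link"}')
rows = [["Source Link", "Archive Status", "Archive Date"]]
positions = locate_columns(rows, columns, header=1)
print(positions)
```

Note that the override is written as strict JSON with double quotes, whereas the default shown in the table is printed with single quotes.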