mirror of
https://github.com/bellingcat/auto-archiver.git
synced 2026-06-12 05:08:28 +03:00
WIP: Docs tidyups+add howto on logging and authentication
(Authentication is WIP)
@@ -1,57 +0,0 @@
# Authentication

The Authentication framework for auto-archiver allows you to add login details for various websites in a flexible way, directly from the configuration file.

There are two main use cases for authentication:
* Some websites require authentication in order to view their content. Examples include Facebook and Telegram.
* Some websites use anti-bot systems to block bot-like tools from accessing the website. Adding real login information to auto-archiver can sometimes bypass this.

## The Authentication Config

You can save your authentication information directly inside your orchestration config file, or in a separate file (for security/multi-deploy purposes). The configuration format is the same in both cases.

```{code} yaml
authentication:
  # optional file to load authentication information from, for security or multi-system deploy purposes
  load_from_file: path/to/authentication/file.txt
  # optional setting to load cookies from the named browser on the system.
  cookies_from_browser: firefox
  # optional setting to load cookies from a cookies.txt/cookies.jar file. See note below on extracting these
  cookies_file: path/to/cookies.jar

  twitter.com,x.com:
    username: myusername
    password: 123

  facebook.com:
    cookie: single_cookie

  othersite.com:
    api_key: 123
    api_secret: 1234

# All available options:
# - username: str - the username to use for login
# - password: str - the password to use for login
# - api_key: str - the API key to use for login
# - api_secret: str - the API secret to use for login
# - cookie: str - a cookie string to use for login (specific to this site)
```
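
To illustrate how the site keys in this config can be read, here is a minimal Python sketch that looks up the credentials for a URL. It is not auto-archiver's implementation (see `auth_for_site()` in BaseModule for that): the helper name `find_site_auth`, the in-memory dict layout and the subdomain-matching rule are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical in-memory copy of the site entries in the `authentication` section.
AUTH_CONFIG = {
    "twitter.com,x.com": {"username": "myusername", "password": "123"},
    "facebook.com": {"cookie": "single_cookie"},
    "othersite.com": {"api_key": "123", "api_secret": "1234"},
}

# Top-level options that are settings rather than site names.
GLOBAL_OPTIONS = {"load_from_file", "cookies_from_browser", "cookies_file"}


def find_site_auth(config: dict, url: str) -> dict:
    """Return the credentials whose site key matches the URL's host (assumed rule)."""
    host = (urlparse(url).hostname or "").lower()
    for key, credentials in config.items():
        if key in GLOBAL_OPTIONS:
            continue
        # A key may list several sites separated by commas, e.g. "twitter.com,x.com".
        for site in (s.strip().lower() for s in key.split(",")):
            # Assumption: subdomains such as www.facebook.com share the site's credentials.
            if host == site or host.endswith("." + site):
                return credentials
    return {}


print(find_site_auth(AUTH_CONFIG, "https://x.com/some_user/status/1"))
print(find_site_auth(AUTH_CONFIG, "https://www.facebook.com/some_page"))
print(find_site_auth(AUTH_CONFIG, "https://example.org/"))
```

Grouping `twitter.com,x.com` under one key means a single set of credentials serves both domains, which is why the sketch splits each key on commas before matching.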

### Recommendations for authentication

1. **Store authentication information separately:**
   The authentication part of your configuration contains sensitive information, so take care not to share it with others. For extra security, use the `load_from_file` option to keep your authentication settings out of your configuration file, ideally in a different folder.

2. **Don't use your own personal credentials:**
   Depending on the website you are extracting information from, there may be rules (Terms of Service) that prohibit scraping or extracting information using a bot. If you use your own personal account, it might get blocked or disabled. It's recommended to set up a separate, 'throwaway' account, so that if it gets blocked you can easily create another one and continue archiving.


### How to create a cookies.jar or pass cookies directly to auto-archiver

auto-archiver uses yt-dlp's powerful cookies features under the hood. For instructions on how to extract a cookies.jar (or cookies.txt) file directly from your browser, see the FAQ in the [yt-dlp documentation](https://github.com/yt-dlp/yt-dlp/wiki/FAQ#how-do-i-pass-cookies-to-yt-dlp).
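
Before pointing `cookies_file` at an exported file, it can help to check that the file parses and covers the sites you expect. The sketch below uses only Python's standard library and assumes the file is in the Netscape/Mozilla cookies format that yt-dlp expects. The function name `cookie_domains` is hypothetical and is not part of auto-archiver or yt-dlp.

```python
from http.cookiejar import MozillaCookieJar


def cookie_domains(cookies_path: str) -> list[str]:
    """List the domains that have cookies in a Netscape-format cookies file."""
    jar = MozillaCookieJar(cookies_path)
    # Keep session and expired cookies too, so the listing reflects the whole file.
    jar.load(ignore_discard=True, ignore_expires=True)
    # Domains are often stored with a leading dot (".x.com"); strip it for display.
    return sorted({cookie.domain.lstrip(".") for cookie in jar})


if __name__ == "__main__":
    import sys

    # Usage: python check_cookies.py path/to/cookies.txt
    if len(sys.argv) == 2:
        print(cookie_domains(sys.argv[1]))
```

If the file is not in the expected format, `load()` raises `http.cookiejar.LoadError`, which is a quick signal that the export needs to be redone.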

```{note} For developers:

For information on how to access and use authentication settings from within your module, see the `{generic_extractor}` for an example, or view the [`auth_for_site()` function in BaseModule](../autoapi/core/base_module/index.rst)
```
6
docs/source/how_to/authentication_how_to.md
Normal file
@@ -0,0 +1,6 @@
# How to log in (authenticate) to websites

This how-to guide shows you how to add authentication to Auto Archiver for a site you are trying to archive. In this example, we will authenticate on Twitter/X.com using cookies, and on XXXX using username/password.

```{note} This page is still under construction 🚧
```
44
docs/source/how_to/gsheets_setup.md
Normal file
@@ -0,0 +1,44 @@
# Using Google Sheets

The `--gsheet_feeder.sheet` property is the name of the Google Sheet to check for URLs.
This sheet must have been shared with the Google Service account used by `gspread`.
This sheet must also have specific columns (case-insensitive) in the `header` - see the [Gsheet Feeder Docs](modules/autogen/feeder/gsheet_feeder.md) for more info. The default names of these columns and their purposes are:

Inputs:

* **Link** *(required)*: the URL of the post to archive
* **Destination folder**: custom folder for archived file (regardless of storage)

Outputs:
* **Archive status** *(required)*: Status of archive operation
* **Archive location**: URL of archived post
* **Archive date**: Date archived
* **Thumbnail**: Embeds a thumbnail for the post in the spreadsheet
* **Timestamp**: Timestamp of original post
* **Title**: Post title
* **Text**: Post text
* **Screenshot**: Link to screenshot of post
* **Hash**: Hash of archived HTML file (which contains hashes of post media) - for checksums/verification
* **Perceptual Hash**: Perceptual hashes of found images - these can be used for de-duplication of content
* **WACZ**: Link to a WACZ web archive of post
* **ReplayWebpage**: Link to a ReplayWebpage viewer of the WACZ archive

For example, this is a spreadsheet configured with all of the columns for the auto archiver and a few URLs to archive. (Note that the column names are not case sensitive.)



The auto archiver can now be invoked. In this example, the command is: `docker run --rm -v $PWD/secrets:/app/secrets -v $PWD/local_archive:/app/local_archive bellingcat/auto-archiver:dockerize --config secrets/orchestration-global.yaml --gsheet_feeder.sheet "Auto archive test 2023-2"`. Note that the sheet name has been overridden/specified in the command line invocation.

When the auto archiver starts running, it updates the "Archive status" column.



The links are downloaded and archived, and the spreadsheet is updated to the following:



Note that the first row is skipped, as it is assumed to be a header row (`--gsheet_feeder.header=1`; you can change this value if you have more rows above the header). Rows with an empty URL column, or a non-empty archive column, are also skipped. All sheets in the document will be checked.
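
The skipping rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the feeder's actual code: the function name `rows_to_archive`, the fixed column positions, and the reading of "archive column" as the "Archive status" column are assumptions.

```python
def rows_to_archive(rows, header=1, url_col=0, status_col=1):
    """Return (row_number, url) pairs for rows that still need archiving.

    `rows` is a list of lists, one per sheet row. The column positions are
    hypothetical; a real feeder would locate them from the header names.
    """
    pending = []
    # Rows up to and including the header row are skipped.
    for row_number, row in enumerate(rows[header:], start=header + 1):
        url = row[url_col].strip() if len(row) > url_col else ""
        status = row[status_col].strip() if len(row) > status_col else ""
        if not url:
            continue  # empty URL column: nothing to archive
        if status:
            continue  # non-empty archive status: already processed
        pending.append((row_number, url))
    return pending


sheet = [
    ["Link", "Archive status"],                 # row 1: header
    ["https://example.com/post/1", ""],         # row 2: pending
    ["", ""],                                   # row 3: no URL
    ["https://example.com/post/2", "success"],  # row 4: already archived
]
print(rows_to_archive(sheet))
```

Because rows with a non-empty status are skipped, re-running the archiver on the same sheet only picks up rows added since the last run.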

The "archive location" link contains the path of the archived file, in local storage, S3, or in Google Drive.


55
docs/source/how_to/logging.md
Normal file
@@ -0,0 +1,55 @@
# Logging

Auto Archiver's logs can be helpful for debugging problematic archiving processes. This guide shows you how to set up logging, choose a logging level, and log to a file.

## Setting up logging

Logging settings can be set on the command line or using the orchestration config file ([learn more](../installation/configuration)). A special `logging` section defines the logging options.

### Logging Level

There are 7 logging levels in total, of which 4 are commonly used: `DEBUG`, `INFO`, `WARNING` and `ERROR`.

Change the logging level by setting the value in your orchestration config file:

```{code} yaml
:caption: orchestration.yaml

...
logging:
  level: DEBUG # or INFO / WARNING / ERROR
...
```

For normal usage, the `INFO` level is recommended. If you prefer quieter logs with less information, use the `WARNING` level. If you encounter issues with archiving, enable the `DEBUG` level.

```{note} To learn about all logging levels, see the [loguru documentation](https://loguru.readthedocs.io/en/stable/api/logger.html)
```
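
To make the effect of the `level` setting concrete, the sketch below lists loguru's seven built-in levels with their severity numbers and shows which of them remain visible at a given setting. It assumes auto-archiver hands the configured level to loguru as a minimum severity; the helper `emitted_levels` is hypothetical and exists only for this illustration.

```python
# Severity numbers of loguru's seven built-in levels.
LOGURU_LEVELS = {
    "TRACE": 5,
    "DEBUG": 10,
    "INFO": 20,
    "SUCCESS": 25,
    "WARNING": 30,
    "ERROR": 40,
    "CRITICAL": 50,
}


def emitted_levels(configured_level: str) -> list[str]:
    """Return the levels still logged when `level` is set to `configured_level`."""
    threshold = LOGURU_LEVELS[configured_level.upper()]
    return [name for name, severity in LOGURU_LEVELS.items() if severity >= threshold]


for level in ("DEBUG", "INFO", "WARNING", "ERROR"):
    print(f"{level:8} -> {', '.join(emitted_levels(level))}")
```

A lower setting such as `DEBUG` therefore includes everything that `INFO` or `WARNING` would show, plus the more detailed messages.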

### Logging to a file

By default, auto-archiver logs to the console. If you wish to store your logs for future reference, or you are running auto-archiver from within a code implementation, you may wish to enable file logging. This can be done by setting the `file:` config value in the logging settings.

**Rotation:** For file logging, you can choose to 'rotate' your log files (creating new log files) so they do not get too large. Change this by setting the `rotation` option in your logging settings. For a full list of rotation options, see the [loguru docs](https://loguru.readthedocs.io/en/stable/overview.html#easier-file-logging-with-rotation-retention-compression).

```{code} yaml
:caption: orchestration.yaml

logging:
  ...
  file: /my/log/file.log
  rotation: 1 day
```

### Full logging example

The example below logs only messages at `WARNING` level and above, to both the console and the file `/my/file.log`, rotating that file once per week:

```{code} yaml
:caption: orchestration.yaml

logging:
  level: WARNING
  file: /my/file.log
  rotation: 1 week
```