mirror of
https://github.com/bellingcat/auto-archiver.git
synced 2026-06-12 13:18:28 +03:00
58 lines
3.5 KiB
Markdown
58 lines
3.5 KiB
Markdown
# auto-archiver
|
|
|
|
This Python script will look for links to Youtube, Twitter, etc,. in a specified column of a Google Sheet, uses YoutubeDL to download the media, stores the result in a Digital Ocean space, and updates the Google Sheet with the archive location, status, and date. It can be run manually or on an automated basis.
|
|
|
|
## Setup
|
|
|
|
If you are using `pipenv` (recommended), `pipenv install` is sufficient to install Python prerequisites.
|
|
|
|
[A Google Service account is necessary for use with `gspread`.](https://gspread.readthedocs.io/en/latest/oauth2.html#for-bots-using-service-account) Credentials for this account should be stored in `~/.config/gspread/service_account.json`.
|
|
|
|
A `.env` file is required for saving content to a Digital Ocean space. This file should contain the following variables:
|
|
|
|
```
|
|
DO_SPACES_REGION=
|
|
DO_BUCKET=
|
|
DO_SPACES_KEY=
|
|
DO_SPACES_SECRET=
|
|
```
|
|
|
|
## Running
|
|
|
|
There is just one necessary command line flag, `--sheet name` which the name of the Google Sheet to check for URLs. This sheet must have been shared with the Google Service account used by `gspread`. This sheet must also have specific columns in the first row:
|
|
* `Media URL` (required): the location of the media to be archived. This is the only column that should be supplied with data initially
|
|
* `Archive status` (required): the status of the auto archiver script. Any row with text in this column will be skipped automatically.
|
|
* `Archive location` (required): the location of the archived version. For files that were not able to be auto archived, this can be manually updated.
|
|
* `Archive date`: the date that the auto archiver script ran for this file
|
|
* `Upload timestamp`: the timestamp extracted from the video. (For YouTube, this unfortunately does not currently include the time)
|
|
* `Duration`: the duration of the video
|
|
* `Upload title`: the "title" of the video from the original source
|
|
* `Thumbnail`: an image thumbnail of the video (resize row height to make this more visible)
|
|
* `Thumbnail index`: a link to a page that shows many thumbnails for the video, useful for quickly seeing video content
|
|
|
|
For example, for use with this spreadsheet:
|
|
|
|

|
|
|
|
```pipenv run python auto-archive.py --sheet archiver-test```
|
|
|
|
When the auto archiver starts running, it updates the "Archive status" column.
|
|
|
|

|
|
|
|
The links are downloaded and archived, and the spreadsheet is updated to the following:
|
|
|
|

|
|
|
|
Live streaming content is recorded in a separate thread.
|
|
|
|
Note that the first row is skipped, as it is assumed to be a header row. Rows with an empty URL column, or a non-empty archive column are also skipped. All sheets in the document will be checked.
|
|
|
|
## Automating
|
|
|
|
The auto-archiver can be run automatically via cron. An example crontab entry that runs the archiver every minute is as follows.
|
|
|
|
```* * * * * python auto-archive.py --sheet archiver-test```
|
|
|
|
With this configuration, the archiver should archive and store all media added to the Google Sheet every 60 seconds. Of course, additional logging information, etc. might be required.
|