Auto Archiver API
An API that uses celery workers to process URL archive requests via bellingcat/auto-archiver. It allows authentication via Google OAuth Apps and enables CORS. Everything runs on docker, but development can be done without docker (except for redis).
User, Domains, Groups, and permissions management
There are 2 ways to access the API:
- via an API token which has full control/privileges to archive/search
- via a Google Auth token which goes through the user access model
User access model
The permissions are defined solely via the user-groups.yaml file
- users belong to groups which determine their access level/quotas/orchestration setup
- users are assigned to groups explicitly (via email)
- users are assigned to groups implicitly (via email domains)
- domains are associated to groups
- users that are not explicitly or implicitly in the system belong to the default group; restrict their permissions if you do not wish them to be able to search/archive
- if a user is assigned to a group which is not explicitly defined, a warning will be thrown; it may be necessary to do that if you discontinue a given group but the database still has entries for it
- groups determine
- which orchestrator to use for single URL archives and for spreadsheet archives
- a set of permissions
- read: can be [all], [] or a comma separated list of group names, meaning people in this group can access either all archives, none, or those belonging to the explicitly listed groups. The group itself must be included in the list, otherwise the user cannot search archives of that group
- archive_url: a boolean that enables the user to archive links in this group
- archive_sheet: a boolean that enables the user to archive spreadsheets
- sheet_frequency: a list of options for the sheet archiving frequency, currently the maximum permission is ["hourly", "daily"]
- max_sheets: defines the maximum amount of spreadsheets someone can have in total (-1 means no limit)
- max_archive_lifespan_months: defines the lifespan of an archive before being deleted from S3, users will be notified 1 month in advance with instructions to download (TODO)
- monthly_urls: how many total URLs someone can archive per month (-1 means no limit)
- monthly_mbs: how many MBs of data someone can archive per month (-1 means no limit)
- priority: one of high or low, this will be used to give archiving priority
- group names are all lower-case
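The membership and permission rules above can be sketched in Python as follows. This is a minimal illustration and not the project's implementation: the function names resolve_groups, can_read and within_quota, and the dictionary shapes used for users and domains, are assumptions made for the example.

```python
def resolve_groups(email, users, domains):
    """Return the groups an email belongs to.

    Explicit assignments (by email) come first, then implicit ones
    (by email domain). Anyone matching neither falls into "default".
    `users` and `domains` are assumed to map a key to a list of group names.
    """
    email = email.lower()
    domain = email.rsplit("@", 1)[-1]
    groups = []
    for name in list(users.get(email) or []) + list(domains.get(domain) or []):
        name = name.lower()  # group names are all lower-case
        if name not in groups:
            groups.append(name)
    return groups or ["default"]


def can_read(read, target_group):
    """`read` is assumed to be parsed into a list: ["all"], [] or group names."""
    return "all" in read or target_group in read


def within_quota(limit, used):
    """A limit of -1 means no limit (max_sheets, monthly_urls, monthly_mbs)."""
    return limit == -1 or used < limit


# Hypothetical sample data, loosely following the user-groups file layout.
users = {"email1@example.com": ["group1", "group2"]}
domains = {"example.org": ["group3"]}

print(resolve_groups("email1@example.com", users, domains))  # ['group1', 'group2']
print(resolve_groups("someone@example.org", users, domains))  # ['group3']
print(resolve_groups("stranger@nowhere.net", users, domains))  # ['default']
```

Note that can_read only returns True for a group's own archives when that group is listed in its read permission, matching the rule that the group itself must be included in the list.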
To figure out:
- workshop participants should be able to test this. public
- how can people bring their own storage/API keys?
- how to implement lifespan of archives? 6 months lifespan example. they should expect a way to download all archives locally.
- how to deactivate unused sheets and notify?
- how to mark URLs for deletion, and then do a hard delete?
- what actions can people take:
- URL (P=needs permission, O=open)
- P archive
- P search
- O find own links
- DISABLED find by id
- P delete archive (soft)
- Sheets
- P create a new sheet
- O get my sheets
- O delete a sheet
- P archive a sheet now
Development
TODO: update .env file instructions, should use .env.prod and .env.dev and only use .env for always overwriting dev/prod settings.
requires src/.env
cd /src
- console 1 - docker compose up redis (optionally add web if not running uvicorn locally)
- console 2 - pipenv shell, then one of:
  celery worker --app=worker.celery --loglevel=info --logfile=logs/celery_dev.log
  celery --app=worker.celery worker --loglevel=info --logfile=logs/celery_dev.log (celery 5)
- or with watchdog for dev auto-reload:
  watchmedo auto-restart -d ./ -- celery --app=worker.celery worker --loglevel=info --logfile=logs/celery_dev.log
- console 3 - pipenv shell, then uvicorn main:app --host 0.0.0.0 --reload (orchestration must be from the console(?))
- turn off VPNs if connection to docker is not working
User management
TODO: update description and example
- users/domains/groups
Copy example.user-groups.yaml into a new file and set the environment variable USER_GROUPS_FILENAME to that filename (defaults to user-groups.yaml).
This file contains 2 parts. The first part is the user-groups specification. Each user can archive URLs publicly, privately, or privately for a group so long as they are declared as part of that group. In the example below email1 has 2 groups while email3 has none.
users:
  email1@example.com:
    - group1
    - group2
  email2@example.com:
    - group2
  email3@example-no-group.com:
The second part configures the auto-archiver orchestrator files. For each archiving task an orchestrator is chosen: either from a specified group (if group-level visibility), or the first group the user is assigned to in the above file, or the default orchestration file, which is a required config.
orchestrators:
  group1: secrets/orchestration-group1.yaml
  group2: secrets/orchestration-group2.yaml
  default: secrets/orchestration-default:orchestration.yaml
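The selection order described above can be sketched as a small Python function. This is only an illustration under stated assumptions: the name choose_orchestrator is hypothetical, the membership check on the requested group and the fallback to default when the first group has no orchestrator entry are guesses about reasonable behaviour, and the real API may resolve this differently.

```python
def choose_orchestrator(orchestrators, user_groups, requested_group=None):
    """Pick an orchestration file for one archiving task.

    Order: the requested group (group-level visibility), then the first
    group the user is assigned to, then the required "default" entry.
    """
    if "default" not in orchestrators:
        raise ValueError("the 'default' orchestrator is a required config")
    # Assumption: a requested group is only honoured if the user belongs to it.
    if requested_group in user_groups and requested_group in orchestrators:
        return orchestrators[requested_group]
    # Assumption: fall back to default if the first group has no entry.
    if user_groups and user_groups[0] in orchestrators:
        return orchestrators[user_groups[0]]
    return orchestrators["default"]


# Values copied from the example configuration above.
orchestrators = {
    "group1": "secrets/orchestration-group1.yaml",
    "group2": "secrets/orchestration-group2.yaml",
    "default": "secrets/orchestration-default:orchestration.yaml",
}

print(choose_orchestrator(orchestrators, ["group1", "group2"], "group2"))
# secrets/orchestration-group2.yaml
print(choose_orchestrator(orchestrators, ["group1", "group2"]))
# secrets/orchestration-group1.yaml
print(choose_orchestrator(orchestrators, []))
# secrets/orchestration-default:orchestration.yaml
```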
Database migrations
check https://alembic.sqlalchemy.org/en/latest/tutorial.html#the-migration-environment
- create migrations with alembic revision -m "create account table"
- if running in the normal pipenv environment use PIPENV_DOTENV_LOCATION=.env.alembic pipenv run followed by:
- migrate to most recent with alembic upgrade head
- downgrade with alembic downgrade -1
Release
Update main.py:VERSION.
Copy .env and src/.env to deployment, along with the contents of secrets/ including secrets/orchestration.yaml.
Then make prod.
updating packages/app/access
If pipenv packages are updated: make prod to build images with new packages.
New users should be added to the src/.env file ALLOWED_EMAILS prop.
Run pipenv update auto-archiver inside src to update the auto-archiver version being used, then test with make dev.
# CALL /sheet POST endpoint
curl -XPOST -H "Authorization: Bearer GOOGLE_OAUTH_TOKEN" -H "Content-type: application/json" -d '{"sheet_id": "SHEET_ID", "header": 1}' 'http://localhost:8004/sheet'
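For scripting, the same request can be built with Python's standard library. This is a sketch mirroring the curl call above; the helper name build_sheet_request is hypothetical, and the token and sheet id are the same placeholders used in the curl example.

```python
import json
import urllib.request


def build_sheet_request(token, sheet_id, header=1, base_url="http://localhost:8004"):
    """Build (but do not send) the POST /sheet request from the curl example."""
    body = json.dumps({"sheet_id": sheet_id, "header": header}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/sheet",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_sheet_request("GOOGLE_OAUTH_TOKEN", "SHEET_ID")

# To actually send it (requires the API to be running locally):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```

Building the request separately from sending it makes it easy to inspect the URL, headers and body before hitting a running server.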
Testing
# can be done from top level but let's do it from the src folder for consistency with CI etc
cd src
# run tests and generate coverage
PYTHONPATH=. PIPENV_DOTENV_LOCATION=.env.test pipenv run coverage run -m pytest -vv --disable-warnings --color=yes tests/ && pipenv run coverage html
# get coverage report in command line
pipenv run coverage report
# get coverage HTML
pipenv run coverage html
# > open/run server on htmlcov/index.html to navigate through line coverage