mirror of
https://github.com/bellingcat/auto-archiver.git
synced 2026-06-11 04:38:29 +03:00
Merge branch 'main' into merge_modules
6
.github/workflows/docker-publish.yaml
vendored
@@ -11,7 +11,7 @@ on:
|
||||
|
||||
env:
|
||||
# Use docker.io for Docker Hub if empty
|
||||
REGISTRY: ghcr.io
|
||||
REGISTRY: docker.io
|
||||
# github.repository as <account>/<repo>
|
||||
IMAGE_NAME: ${{ github.repository }}
|
||||
|
||||
@@ -45,10 +45,12 @@ jobs:
|
||||
images: bellingcat/auto-archiver
|
||||
|
||||
- name: Build and push Docker image
|
||||
uses: docker/build-push-action@v2
|
||||
uses: docker/build-push-action@v6
|
||||
with:
|
||||
context: .
|
||||
platforms: linux/amd64,linux/arm64
|
||||
push: ${{ github.event_name != 'pull_request' }}
|
||||
tags: ${{ steps.meta.outputs.tags }}
|
||||
labels: ${{ steps.meta.outputs.labels }}
|
||||
cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache
|
||||
cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:cache,mode=max
|
||||
|
||||
2
.gitignore
vendored
@@ -33,3 +33,5 @@ dist*
|
||||
docs/_build/
|
||||
docs/source/autoapi/
|
||||
docs/source/modules/autogen/
|
||||
scripts/settings_page.html
|
||||
.vite
|
||||
|
||||
@@ -9,6 +9,7 @@ build:
|
||||
os: ubuntu-22.04
|
||||
tools:
|
||||
python: "3.10"
|
||||
nodejs: "22"
|
||||
jobs:
|
||||
post_install:
|
||||
- pip install poetry
|
||||
@@ -17,6 +18,11 @@ build:
|
||||
# See https://github.com/readthedocs/readthedocs.org/pull/11152/
|
||||
- VIRTUAL_ENV=$READTHEDOCS_VIRTUALENV_PATH poetry install --with docs
|
||||
|
||||
# generate the config editor page. Schema then HTML
|
||||
- VIRTUAL_ENV=$READTHEDOCS_VIRTUALENV_PATH poetry run python scripts/generate_settings_schema.py
|
||||
# install node dependencies and build the settings
|
||||
- cd scripts/settings && npm install && npm run build && yes | cp dist/index.html ../../docs/source/installation/settings_base.html && cd ../..
|
||||
|
||||
|
||||
sphinx:
|
||||
configuration: docs/source/conf.py
|
||||
|
||||
15
Dockerfile
@@ -7,13 +7,24 @@ ENV RUNNING_IN_DOCKER=1 \
|
||||
PYTHONFAULTHANDLER=1 \
|
||||
PATH="/root/.local/bin:$PATH"
|
||||
|
||||
|
||||
ARG TARGETARCH
|
||||
|
||||
# Installing system dependencies
|
||||
RUN add-apt-repository ppa:mozillateam/ppa && \
|
||||
apt-get update && \
|
||||
apt-get install -y --no-install-recommends gcc ffmpeg fonts-noto exiftool && \
|
||||
apt-get install -y --no-install-recommends firefox-esr && \
|
||||
ln -s /usr/bin/firefox-esr /usr/bin/firefox && \
|
||||
wget https://github.com/mozilla/geckodriver/releases/download/v0.35.0/geckodriver-v0.35.0-linux64.tar.gz && \
|
||||
ln -s /usr/bin/firefox-esr /usr/bin/firefox
|
||||
|
||||
ARG GECKODRIVER_VERSION=0.36.0
|
||||
|
||||
RUN if [ "$(uname -m)" = "aarch64" ]; then \
|
||||
GECKODRIVER_ARCH=linux-aarch64; \
|
||||
else \
|
||||
GECKODRIVER_ARCH=linux64; \
|
||||
fi && \
|
||||
wget https://github.com/mozilla/geckodriver/releases/download/v${GECKODRIVER_VERSION}/geckodriver-v${GECKODRIVER_VERSION}-${GECKODRIVER_ARCH}.tar.gz && \
|
||||
tar -xvzf geckodriver* -C /usr/local/bin && \
|
||||
chmod +x /usr/local/bin/geckodriver && \
|
||||
rm geckodriver-v* && \
|
||||
|
||||
@@ -31,4 +31,5 @@ docker_development
|
||||
testing
|
||||
docs
|
||||
release
|
||||
settings_page
|
||||
```
|
||||
@@ -13,3 +13,8 @@
|
||||
manual release to docker hub
|
||||
* `docker image tag auto-archiver bellingcat/auto-archiver:latest`
|
||||
* `docker push bellingcat/auto-archiver`
|
||||
|
||||
|
||||
### Building the Settings Page
|
||||
|
||||
The Settings page is built as part of the python-publish workflow and packaged within the app.
|
||||
31
docs/source/development/settings_page.md
Normal file
@@ -0,0 +1,31 @@
# Configuration Editor

The [configuration editor](../installation/config_editor.md) is an easy-to-use UI for users to edit their auto-archiver settings.

The single-file app is built using React and Vite. To get started developing the package, follow these steps:

1. Make sure you have Node v22 installed.

```{note} Tip: if you don't have node installed:

Use `nvm` to manage your node installations. Use:
`curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash` to install `nvm` and then `nvm install 22` to install Node v22.
```

2. Generate the `schema.json` file for the currently installed modules using `python scripts/generate_settings_schema.py`.
3. Go to the settings folder with `cd scripts/settings/` and install the dependencies with `npm i`.
4. Run a development version of the page with `npm run dev` and then open localhost:5173.
5. Build a release version of the page with `npm run build`.

A release build creates a single-file app called `dist/index.html`. This file should be copied to `docs/source/installation/settings_base.html` so that it can be integrated into the Sphinx docs.

```{note}

The single-file app `dist/index.html` does not include any `<html>` or `<head>` tags, as it is designed to be built into an RTD docs page. Edit `index.html` in the settings folder if you wish to modify the built page.
```

## Readthedocs Integration

The configuration editor is built as part of the RTD deployment (see the `.readthedocs.yaml` file). The following command runs on every RTD build:

`cd scripts/settings && npm install && npm run build && yes | cp dist/index.html ../../docs/source/installation/settings_base.html && cd ../..`
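The build steps listed in this file (generate the schema, install the Node dependencies, build, copy the result into the docs) can also be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names `settings_build_commands`, `settings_copy_paths` and `build_settings_page` are hypothetical and not part of the repository, and the sketch assumes `python` and `npm` are on the `PATH`.

```python
import shutil
import subprocess
from pathlib import Path


def settings_build_commands(repo_root: Path) -> list[tuple[list[str], Path]]:
    """Return (command, working directory) pairs, in the order the docs list them."""
    settings_dir = repo_root / "scripts" / "settings"
    return [
        (["python", "scripts/generate_settings_schema.py"], repo_root),
        (["npm", "install"], settings_dir),
        (["npm", "run", "build"], settings_dir),
    ]


def settings_copy_paths(repo_root: Path) -> tuple[Path, Path]:
    """Return (built single-file app, destination inside the Sphinx sources)."""
    source = repo_root / "scripts" / "settings" / "dist" / "index.html"
    target = repo_root / "docs" / "source" / "installation" / "settings_base.html"
    return source, target


def build_settings_page(repo_root: Path) -> Path:
    """Run every step (hypothetical helper) and return the copied HTML path."""
    for command, cwd in settings_build_commands(repo_root):
        # check=True stops at the first failing step, like `&&` in the shell command
        subprocess.run(command, cwd=cwd, check=True)
    source, target = settings_copy_paths(repo_root)
    shutil.copyfile(source, target)  # overwrites an existing file without prompting
    return target
```

Calling `build_settings_page(Path("."))` from the repository root would run the same sequence as the shell one-liner; keeping the command list separate from its execution makes the steps easy to inspect without running them.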
|
||||
@@ -1,11 +1,11 @@
|
||||
# Upgrading to v0.13
|
||||
# Upgrading from v0.12
|
||||
|
||||
```{note} This how-to is only relevant for people who used Auto Archiver before February 2025 (versions prior to 0.13).
|
||||
|
||||
If you are new to Auto Archiver, then you are already using the latest configuration format and this how-to is not relevant for you.
|
||||
```
|
||||
|
||||
Version 0.13 of Auto Archiver has breaking changes in the configuration format, which means earlier configuration formats will not work without slight modifications.
|
||||
Versions 0.13+ of Auto Archiver have breaking changes in the configuration format, which means earlier configuration formats will not work without slight modifications.
|
||||
|
||||
## How do I know if I need to update my configuration format?
|
||||
|
||||
|
||||
5
docs/source/installation/config_editor.md
Normal file
@@ -0,0 +1,5 @@
|
||||
# Configuration Editor
|
||||
|
||||
```{raw} html
|
||||
:file: settings.html
|
||||
```
|
||||
48685
docs/source/installation/settings.html
Normal file
File diff suppressed because one or more lines are too long
@@ -6,6 +6,7 @@
|
||||
|
||||
installation.md
|
||||
configurations.md
|
||||
config_editor.md
|
||||
authentication.md
|
||||
requirements.md
|
||||
config_cheatsheet.md
|
||||
|
||||
52
scripts/generate_settings_schema.py
Normal file
@@ -0,0 +1,52 @@
import json
import os
import io

from ruamel.yaml import YAML

from auto_archiver.core.module import ModuleFactory
from auto_archiver.core.consts import MODULE_TYPES
from auto_archiver.core.config import EMPTY_CONFIG


class SchemaEncoder(json.JSONEncoder):
    """JSON encoder that serialises sets (used in module manifests) as lists."""

    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        return json.JSONEncoder.default(self, obj)


# Get available modules
module_factory = ModuleFactory()
available_modules = module_factory.available_modules()

modules_by_type = {}
# Categorize modules by type (a module may declare more than one type)
for module in available_modules:
    for module_type in module.manifest.get('type', []):
        modules_by_type.setdefault(module_type, []).append(module)

# Order by module type first, then modules that require setup before the rest
all_modules_ordered_by_type = sorted(available_modules, key=lambda x: (MODULE_TYPES.index(x.type[0]), not x.requires_setup))

yaml: YAML = YAML()

# Dump the empty config to a YAML string so the editor can start from scratch
config_string = io.BytesIO()
yaml.dump(EMPTY_CONFIG, config_string)
config_string = config_string.getvalue().decode('utf-8')
output_schema = {
    'modules': {
        module.name: {
            'name': module.name,
            'display_name': module.display_name,
            'manifest': module.manifest,
            'configs': module.configs or None,
        }
        for module in all_modules_ordered_by_type
    },
    'steps': {
        f"{module_type}s": [module.name for module in modules_by_type[module_type]]
        for module_type in MODULE_TYPES
    },
    'configs': [m.name for m in all_modules_ordered_by_type if m.configs],
    'module_types': MODULE_TYPES,
    'empty_config': config_string,
}

current_file_dir = os.path.dirname(os.path.abspath(__file__))
output_file = os.path.join(current_file_dir, 'settings/src/schema.json')
with open(output_file, 'w', encoding='utf-8') as file:
    json.dump(output_schema, file, indent=4, cls=SchemaEncoder)
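The `SchemaEncoder` class in this script exists because `json` cannot serialise Python sets by default. A minimal standalone sketch of the same technique follows. The manifest below is made-up illustration data, and the sketch uses `sorted()` where the script uses `list()`, purely so the output is deterministic (this assumes the set elements are mutually comparable).

```python
import json


class SetEncoder(json.JSONEncoder):
    """Illustrative encoder: turn sets into sorted lists before serialising."""

    def default(self, obj):
        if isinstance(obj, (set, frozenset)):
            # sorted() gives a stable order; the original script uses list(obj)
            return sorted(obj)
        # Defer to the base class, which raises TypeError for unknown types
        return super().default(obj)


# Hypothetical manifest containing sets at two nesting levels
manifest = {
    "name": "example_module",
    "type": {"feeder"},
    "dependencies": {"python": {"requests", "loguru"}},
}

encoded = json.dumps(manifest, cls=SetEncoder)
decoded = json.loads(encoded)
print(decoded["dependencies"]["python"])
```

`default()` is only called for objects the encoder cannot handle natively, so nested sets are converted wherever they appear without walking the structure by hand.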
|
||||
24
scripts/settings/.gitignore
vendored
Normal file
@@ -0,0 +1,24 @@
|
||||
# Logs
|
||||
logs
|
||||
*.log
|
||||
npm-debug.log*
|
||||
yarn-debug.log*
|
||||
yarn-error.log*
|
||||
pnpm-debug.log*
|
||||
lerna-debug.log*
|
||||
|
||||
node_modules
|
||||
dist
|
||||
dist-ssr
|
||||
*.local
|
||||
|
||||
# Editor directories and files
|
||||
.vscode/*
|
||||
!.vscode/extensions.json
|
||||
.idea
|
||||
.DS_Store
|
||||
*.suo
|
||||
*.ntvs*
|
||||
*.njsproj
|
||||
*.sln
|
||||
*.sw?
|
||||
3
scripts/settings/index.html
Normal file
@@ -0,0 +1,3 @@
|
||||
|
||||
<div id="root"></div>
|
||||
<script type="module" src="/src/main.tsx"></script>
|
||||
3743
scripts/settings/package-lock.json
generated
Normal file
File diff suppressed because it is too large
31
scripts/settings/package.json
Normal file
@@ -0,0 +1,31 @@
|
||||
{
|
||||
"name": "material-ui-vite-ts",
|
||||
"private": true,
|
||||
"version": "5.0.0",
|
||||
"type": "module",
|
||||
"scripts": {
|
||||
"dev": "vite",
|
||||
"build": "vite build",
|
||||
"preview": "vite preview"
|
||||
},
|
||||
"dependencies": {
|
||||
"@dnd-kit/core": "^6.3.1",
|
||||
"@dnd-kit/sortable": "^10.0.0",
|
||||
"@emotion/react": "latest",
|
||||
"@emotion/styled": "latest",
|
||||
"@mui/icons-material": "latest",
|
||||
"@mui/material": "latest",
|
||||
"react": "19.0.0",
|
||||
"react-dom": "19.0.0",
|
||||
"react-markdown": "^10.0.0",
|
||||
"yaml": "^2.7.0"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/react": "latest",
|
||||
"@types/react-dom": "latest",
|
||||
"@vitejs/plugin-react": "latest",
|
||||
"typescript": "latest",
|
||||
"vite": "latest",
|
||||
"vite-plugin-singlefile": "^2.1.0"
|
||||
}
|
||||
}
|
||||
450
scripts/settings/src/App.tsx
Normal file
@@ -0,0 +1,450 @@
|
||||
import * as React from 'react';
|
||||
import { useEffect, useState, useRef } from 'react';
|
||||
import Container from '@mui/material/Container';
|
||||
import Typography from '@mui/material/Typography';
|
||||
import Box from '@mui/material/Box';
|
||||
import FileUploadIcon from '@mui/icons-material/FileUpload';
|
||||
//
|
||||
import {
|
||||
DndContext,
|
||||
closestCenter,
|
||||
KeyboardSensor,
|
||||
PointerSensor,
|
||||
useSensor,
|
||||
useSensors,
|
||||
DragOverlay
|
||||
} from "@dnd-kit/core";
|
||||
import {
|
||||
arrayMove,
|
||||
SortableContext,
|
||||
sortableKeyboardCoordinates,
|
||||
rectSortingStrategy
|
||||
} from "@dnd-kit/sortable";
|
||||
|
||||
import type { DragStartEvent, DragEndEvent, UniqueIdentifier } from "@dnd-kit/core";
|
||||
|
||||
|
||||
import { Module } from './types';
|
||||
|
||||
import { modules, steps, module_types, empty_config } from './schema.json';
|
||||
import {
|
||||
Stack,
|
||||
Button,
|
||||
} from '@mui/material';
|
||||
import Grid from '@mui/material/Grid2';
|
||||
|
||||
import { parseDocument, Document, YAMLSeq, YAMLMap, Scalar } from 'yaml'
|
||||
import StepCard from './StepCard';
|
||||
|
||||
|
||||
function FileDrop({ setYamlFile }: { setYamlFile: React.Dispatch<React.SetStateAction<Document>> }) {
|
||||
|
||||
const [showError, setShowError] = useState(false);
|
||||
const [label, setLabel] = useState(<>Drag and drop your orchestration.yaml file here, or click to select a file.</>);
|
||||
const wrapperRef = useRef(null);
|
||||
|
||||
function openYAMLFile(event: any) {
|
||||
let file = event.target.files[0];
|
||||
if (file.type.indexOf('yaml') === -1) {
|
||||
setShowError(true);
|
||||
setLabel(<>Invalid type, only YAML files are accepted.</>)
|
||||
return;
|
||||
}
|
||||
let reader = new FileReader();
|
||||
reader.onload = function (e) {
|
||||
let contents = e.target ? e.target.result : '';
|
||||
try {
|
||||
let document = parseDocument(contents as string);
|
||||
if (document.errors.length > 0) {
|
||||
// not a valid yaml file
|
||||
setShowError(true);
|
||||
setLabel(<>Invalid file. Make sure your Orchestration is a valid YAML file with a 'steps' section in it.</>)
|
||||
return;
|
||||
} else {
|
||||
setShowError(false);
|
||||
setLabel(<>File loaded successfully.</>)
|
||||
}
|
||||
// do some basic validation of 'steps'
|
||||
let steps = document.get('steps');
|
||||
if (!steps) {
|
||||
setShowError(true);
|
||||
setLabel(<>Invalid file. Your orchestration file must have a 'steps' section in it.</>)
|
||||
return;
|
||||
}
|
||||
const replacements = {
|
||||
feeder: 'feeders',
|
||||
formatter: 'formatters',
|
||||
archivers: 'extractors',
|
||||
};
|
||||
|
||||
let error = false;
|
||||
for (let stepType of Object.keys(replacements)) {
|
||||
if (steps.get(stepType) !== undefined) {
|
||||
setShowError(true);
|
||||
setLabel(<>Invalid file. Your orchestration file appears to be in the old (v0.12) format with a '{stepType}' section.<br/>You should manually update your orchestration file first (hint: {stepType} → {replacements[stepType]})</>);
|
||||
error = true;
|
||||
return;
|
||||
}
|
||||
};
|
||||
setYamlFile(document);
|
||||
} catch (e) {
|
||||
console.error(e);
|
||||
}
|
||||
}
|
||||
reader.readAsText(file);
|
||||
}
|
||||
return (
|
||||
<>
|
||||
<div
|
||||
style={{
|
||||
position: 'relative',
|
||||
width: '100%',
|
||||
border: 'dashed',
|
||||
borderRadius:'5px',
|
||||
textAlign: 'center',
|
||||
borderWidth: '1px',
|
||||
padding: '20px' }}
|
||||
onDragEnter={(e) => {
|
||||
e.currentTarget.style.backgroundColor = 'var(--mui-palette-LinearProgress-infoBg)';
|
||||
}}
|
||||
onDragLeave={(e) => {
|
||||
e.currentTarget.style.backgroundColor = '';
|
||||
}}
|
||||
onDrop={(e) => {
|
||||
e.currentTarget.style.backgroundColor = '';
|
||||
}}
|
||||
>
|
||||
<FileUploadIcon style={{ fontSize: 50 }} />
|
||||
<input style={{
|
||||
opacity: 0,
|
||||
position: 'absolute',
|
||||
top: 0,
|
||||
left: 0,
|
||||
width: '100%',
|
||||
height: '100%',
|
||||
cursor: 'pointer',
|
||||
}}
|
||||
type="file" id="file"
|
||||
accept=".yaml"
|
||||
onChange={openYAMLFile} />
|
||||
<Typography variant="body1" color={showError ? 'error' : ''} >
|
||||
{label}
|
||||
</Typography>
|
||||
</div>
|
||||
</>
|
||||
);
|
||||
}
|
||||
|
||||
function ModuleTypes({ stepType, setEnabledModules, enabledModules, configValues }: { stepType: string, setEnabledModules: any, enabledModules: any, configValues: any }) {
|
||||
const [showError, setShowError] = useState<boolean>(false);
|
||||
const [activeId, setActiveId] = useState<UniqueIdentifier>();
|
||||
const [items, setItems] = useState<string[]>([]);
|
||||
|
||||
useEffect(() => {
|
||||
setItems(enabledModules[stepType].map(([name, enabled]: [string, boolean]) => name));
|
||||
}
|
||||
, [enabledModules]);
|
||||
|
||||
const toggleModule = (event: any) => {
|
||||
// make sure that 'feeder' and 'formatter' types only have one value
|
||||
let name = event.target.id;
|
||||
let checked = event.target.checked;
|
||||
if (stepType === 'feeders' || stepType === 'formatters') {
|
||||
// check how many modules of this type are enabled
|
||||
const checkedModules = enabledModules[stepType].filter(([m, enabled]: [string, boolean]) => {
|
||||
return (m !== name && enabled) || (checked && m === name)
|
||||
});
|
||||
if (checkedModules.length > 1) {
|
||||
setShowError(true);
|
||||
} else {
|
||||
setShowError(false);
|
||||
}
|
||||
} else {
|
||||
setShowError(false);
|
||||
}
|
||||
let newEnabledModules = { ...enabledModules };
|
||||
newEnabledModules[stepType] = enabledModules[stepType].map(([m, enabled]: [string, boolean]) => {
|
||||
return (m === name) ? [m, checked] : [m, enabled];
|
||||
});
|
||||
setEnabledModules(newEnabledModules);
|
||||
}
|
||||
|
||||
const sensors = useSensors(
|
||||
useSensor(PointerSensor),
|
||||
useSensor(KeyboardSensor, {
|
||||
coordinateGetter: sortableKeyboardCoordinates
|
||||
})
|
||||
);
|
||||
|
||||
const handleDragStart = (event: DragStartEvent) => {
|
||||
setActiveId(event.active.id);
|
||||
};
|
||||
|
||||
const handleDragEnd = (event: DragEndEvent) => {
|
||||
setActiveId(undefined);
|
||||
const { active, over } = event;
|
||||
|
||||
if (active.id !== over?.id) {
|
||||
const oldIndex = items.indexOf(active.id as string);
|
||||
const newIndex = items.indexOf(over?.id as string);
|
||||
|
||||
let newArray = arrayMove(items, oldIndex, newIndex);
|
||||
// set it also on steps
|
||||
let newEnabledModules = { ...enabledModules };
|
||||
newEnabledModules[stepType] = [...enabledModules[stepType]].sort((a, b) => {
|
||||
return newArray.indexOf(a[0]) - newArray.indexOf(b[0]);
|
||||
})
|
||||
setEnabledModules(newEnabledModules);
|
||||
}
|
||||
};
|
||||
return (
|
||||
<>
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Typography id={stepType} variant="h6" style={{ textTransform: 'capitalize' }} >
|
||||
{stepType}
|
||||
</Typography>
|
||||
<Typography variant="body1" >
|
||||
Select the <a href={`https://auto-archiver.readthedocs.io/en/latest/modules/${stepType.slice(0,-1)}.html`} target="_blank">{stepType}</a> you wish to enable. Drag to reorder.
|
||||
</Typography>
|
||||
</Box>
|
||||
{showError ? <Typography variant="body1" color="error" >Only one {stepType.slice(0,-1)} can be enabled at a time.</Typography> : null}
|
||||
|
||||
<DndContext
|
||||
sensors={sensors}
|
||||
collisionDetection={closestCenter}
|
||||
onDragEnd={handleDragEnd}
|
||||
onDragStart={handleDragStart}
|
||||
>
|
||||
<Grid container spacing={1} key={stepType}>
|
||||
<SortableContext items={items} strategy={rectSortingStrategy}>
|
||||
{items.map((name: string) => {
|
||||
let m: Module = modules[name];
|
||||
return (
|
||||
<StepCard key={name} type={stepType} module={m} toggleModule={toggleModule} enabledModules={enabledModules} configValues={configValues} />
|
||||
);
|
||||
})}
|
||||
<DragOverlay>
|
||||
{activeId ? (
|
||||
<div
|
||||
style={{
|
||||
width: "100%",
|
||||
height: "100%",
|
||||
backgroundColor: "grey",
|
||||
opacity: 0.1,
|
||||
}}
|
||||
></div>
|
||||
|
||||
) : null}
|
||||
</DragOverlay>
|
||||
</SortableContext>
|
||||
</Grid>
|
||||
</DndContext>
|
||||
</>
|
||||
);
|
||||
}
|
||||
|
||||
|
||||
export default function App() {
|
||||
const [yamlFile, setYamlFile] = useState<Document>(new Document());
|
||||
const [enabledModules, setEnabledModules] = useState<{}>(Object.fromEntries(Object.keys(steps).map(type => [type, steps[type].map((name: string) => [name, false])])));
|
||||
const [configValues, setConfigValues] = useState<{
|
||||
[key: string]: {
|
||||
[key: string
|
||||
]: any
|
||||
}
|
||||
}>(
|
||||
Object.keys(modules).reduce((acc, module) => {
|
||||
acc[module] = {};
|
||||
return acc;
|
||||
}, {})
|
||||
);
|
||||
|
||||
const saveSettings = function (copy: boolean = false) {
|
||||
// edit the yamlFile
|
||||
|
||||
// generate the steps config
|
||||
let stepsConfig = enabledModules;
|
||||
|
||||
let finalYamlFile: Document = null;
|
||||
if (!yamlFile || yamlFile.contents == null) {
|
||||
// create the yaml file from
|
||||
finalYamlFile = parseDocument(empty_config as string);
|
||||
} else {
|
||||
finalYamlFile = yamlFile;
|
||||
}
|
||||
|
||||
// set the steps
|
||||
module_types.forEach((type: string) => {
|
||||
let stepType = type + 's';
|
||||
let existingSteps = finalYamlFile.getIn(['steps', stepType]) as YAMLSeq;
|
||||
stepsConfig[stepType].forEach(([name, enabled]: [string, boolean]) => {
|
||||
let index = existingSteps.items.findIndex((item) => {
|
||||
return (item.value || item) === name
|
||||
});
|
||||
let stepItem = finalYamlFile.getIn(['steps', stepType], true) as YAMLSeq;
|
||||
|
||||
if (enabled && index === -1) {
|
||||
finalYamlFile.addIn(['steps', stepType], name);
|
||||
stepItem.commentBefore = stepItem.commentBefore?.replace("\n - " + name, '');
|
||||
stepItem.comment = stepItem.comment?.replace("\n - " + name, '');
|
||||
} else if (!enabled && index !== -1) {
|
||||
// set the value to empty and add a comment before with the commented value
|
||||
finalYamlFile.deleteIn(['steps', stepType, index]);
|
||||
stepItem.commentBefore += "\n - " + name;
|
||||
finalYamlFile.setIn(['steps', stepType], stepItem);
|
||||
}
|
||||
});
|
||||
// sort the items
|
||||
existingSteps.items.sort((a: Scalar | string, b: Scalar | string) => {
|
||||
return (stepsConfig[stepType].findIndex((val: [string, boolean]) => {return val[0] === (a.value || a)}) -
|
||||
stepsConfig[stepType].findIndex((val: [string, boolean]) => {return val[0] === (b.value || b)}))
|
||||
});
|
||||
existingSteps.flow = existingSteps.items.length ? false : true;
|
||||
});
|
||||
|
||||
// set all other settings
|
||||
// loop through each item that isn't 'steps' in the finalYamlFile and check if it exists in configValues
|
||||
|
||||
Object.keys(configValues).forEach((module_name: string) => {
|
||||
// get an existing key
|
||||
let existingConfig = finalYamlFile.get(module_name, true) as YAMLMap;
|
||||
if (existingConfig) {
|
||||
Object.keys(configValues[module_name]).forEach((config_name: string) => {
|
||||
let existingConfigYAML = existingConfig.get(config_name, true) as Scalar;
|
||||
if (existingConfigYAML) {
|
||||
existingConfigYAML.value = configValues[module_name][config_name];
|
||||
existingConfig.set(config_name, existingConfigYAML);
|
||||
} else {
|
||||
existingConfig.set(config_name, configValues[module_name][config_name]);
|
||||
}
|
||||
});
|
||||
finalYamlFile.set(module_name, existingConfig);
|
||||
} else {
|
||||
if (configValues[module_name] && Object.keys(configValues[module_name]).length > 0) {
|
||||
finalYamlFile.set(module_name, configValues[module_name]);
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
if (copy) {
|
||||
navigator.clipboard.writeText(String(finalYamlFile)).then(() => {
|
||||
alert("Settings copied to clipboard.");
|
||||
});
|
||||
} else {
|
||||
// offer the file for download
|
||||
const blob = new Blob([String(finalYamlFile)], { type: 'application/x-yaml' });
|
||||
const url = URL.createObjectURL(blob);
|
||||
const a = document.createElement('a');
|
||||
a.href = url;
|
||||
a.download = 'orchestration.yaml';
|
||||
a.click();
|
||||
}
|
||||
}
|
||||
|
||||
useEffect(() => {
|
||||
// load the configs, and set the default values if they exist
|
||||
let newConfigValues = {};
|
||||
Object.keys(modules).map((module: string) => {
|
||||
let m = modules[module];
|
||||
let configs = m.configs;
|
||||
if (!configs) {
|
||||
return;
|
||||
}
|
||||
newConfigValues[module] = {};
|
||||
Object.keys(configs).map((config: string) => {
|
||||
let config_args = configs[config];
|
||||
if (config_args.default !== undefined) {
|
||||
newConfigValues[module][config] = config_args.default;
|
||||
}
|
||||
});
|
||||
})
|
||||
setConfigValues(newConfigValues);
|
||||
}, []);
|
||||
|
||||
useEffect(() => {
|
||||
if (!yamlFile || yamlFile.contents == null) {
|
||||
return;
|
||||
}
|
||||
|
||||
let settings = yamlFile.toJS();
|
||||
// make a deep copy of settings
|
||||
let stepSettings = settings['steps'];
|
||||
|
||||
let newEnabledModules = Object.fromEntries(Object.keys(steps).map((type: string) => {
|
||||
return [type, steps[type].map((name: string) => {
|
||||
return [name, stepSettings[type].indexOf(name) !== -1];
|
||||
}).sort((a, b) => {
|
||||
let aIndex = stepSettings[type].indexOf(a[0]);
|
||||
let bIndex = stepSettings[type].indexOf(b[0]);
|
||||
if (aIndex === -1 && bIndex === -1) {
|
||||
return 0;
|
||||
}
|
||||
if (bIndex === -1) {
|
||||
return -1;
|
||||
}
|
||||
if (aIndex === -1) {
|
||||
return 1;
|
||||
}
|
||||
return aIndex - bIndex;
|
||||
})];
|
||||
}).sort((a, b) => {
|
||||
return module_types.indexOf(a[0]) - module_types.indexOf(b[0]);
|
||||
}));
|
||||
setEnabledModules(newEnabledModules);
|
||||
|
||||
// set the config values
|
||||
let newConfigValues = settings;
|
||||
delete newConfigValues['steps'];
|
||||
|
||||
|
||||
setConfigValues(Object.keys(modules).reduce((acc, module) => {
|
||||
acc[module] = newConfigValues[module] || {};
|
||||
return acc;
|
||||
}, {}));
|
||||
}, [yamlFile]);
|
||||
|
||||
|
||||
|
||||
return (
|
||||
<Container maxWidth="lg">
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Typography variant="h5" >
|
||||
1. Select your orchestration.yaml settings file.
|
||||
</Typography>
|
||||
<Typography variant="body1">Or skip this step to start from scratch</Typography>
|
||||
<FileDrop setYamlFile={setYamlFile} />
|
||||
</Box>
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Typography variant="h5" >
|
||||
2. Choose the Modules you wish to enable/disable
|
||||
</Typography>
|
||||
{Object.keys(steps).map((stepType: string) => {
|
||||
return (
|
||||
<Box key={stepType} sx={{ my: 4 }}>
|
||||
<ModuleTypes stepType={stepType} setEnabledModules={setEnabledModules} enabledModules={enabledModules} configValues={configValues} />
|
||||
</Box>
|
||||
);
|
||||
})}
|
||||
</Box>
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Typography variant="h5" >
|
||||
3. Configure your Enabled Modules
|
||||
</Typography>
|
||||
<Typography variant="body1" >
|
||||
Next to each module you've enabled, you can click 'Configure' to set the module's settings.
|
||||
</Typography>
|
||||
</Box>
|
||||
<Box sx={{ my: 4 }}>
|
||||
<Typography variant="h5" >
|
||||
4. Save your settings
|
||||
</Typography>
|
||||
<Stack direction="row" spacing={2} sx={{ my: 2 }}>
|
||||
<Button variant="contained" color="primary" onClick={() => saveSettings(true)}>Copy Settings to Clipboard</Button>
|
||||
<Button variant="contained" color="primary" onClick={() => saveSettings()}>Save Settings to File</Button>
|
||||
</Stack>
|
||||
</Box>
|
||||
</Box>
|
||||
</Container>
|
||||
);
|
||||
}
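When a YAML file is loaded, the effect in `App.tsx` orders each step type's modules so that the ones listed in the file come first, in file order, followed by the remaining modules. The same ordering rule can be sketched in Python. The function name `order_modules` and the module names in the usage line are illustrative assumptions, not taken from the repository.

```python
def order_modules(available: list[str], enabled_in_yaml: list[str]) -> list[tuple[str, bool]]:
    """Pair each available module with its enabled flag, enabled ones first.

    Enabled modules follow the order of the YAML 'steps' list; disabled
    modules keep their original relative order because sorted() is stable.
    """
    position = {name: index for index, name in enumerate(enabled_in_yaml)}
    pairs = [(name, name in position) for name in available]
    # Disabled modules all share the same key, one past the last enabled index
    return sorted(pairs, key=lambda pair: position.get(pair[0], len(position)))


ordered = order_modules(["a", "b", "c", "d"], ["c", "a"])
print(ordered)
```

Using a single sort key per item avoids the three-way comparator branches and relies on sort stability to leave the disabled modules untouched.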
|
||||
258
scripts/settings/src/StepCard.tsx
Normal file
@@ -0,0 +1,258 @@
|
||||
import { useState } from "react";
|
||||
import { useSortable } from "@dnd-kit/sortable";
|
||||
import ReactMarkdown from 'react-markdown';
|
||||
|
||||
import { CSS } from "@dnd-kit/utilities";
|
||||
|
||||
import {
|
||||
Card,
|
||||
CardActions,
|
||||
CardHeader,
|
||||
Button,
|
||||
Dialog,
|
||||
DialogTitle,
|
||||
DialogContent,
|
||||
Box,
|
||||
IconButton,
|
||||
Checkbox,
|
||||
Select,
|
||||
MenuItem,
|
||||
FormControl,
|
||||
FormControlLabel,
|
||||
FormHelperText,
|
||||
TextField,
|
||||
Stack,
|
||||
Typography,
|
||||
InputAdornment,
|
||||
} from '@mui/material';
|
||||
import Grid from '@mui/material/Grid2';
|
||||
import DragIndicatorIcon from '@mui/icons-material/DragIndicator';
|
||||
import Visibility from '@mui/icons-material/Visibility';
|
||||
import VisibilityOff from '@mui/icons-material/VisibilityOff';
|
||||
import HelpIconOutlined from '@mui/icons-material/HelpOutline';
|
||||
import { Module, Config } from "./types";
|
||||
|
||||
|
||||
// adds 'capitalize' method to String prototype
|
||||
declare global {
|
||||
interface String {
|
||||
capitalize(): string;
|
||||
}
|
||||
}
|
||||
String.prototype.capitalize = function (this: string) {
|
||||
return this.charAt(0).toUpperCase() + this.slice(1);
|
||||
};
|
||||
|
||||
const StepCard = ({
|
||||
type,
|
||||
module,
|
||||
toggleModule,
|
||||
enabledModules,
|
||||
configValues
|
||||
}: {
|
||||
type: string,
|
||||
module: Module,
|
||||
toggleModule: any,
|
||||
enabledModules: any,
|
||||
configValues: any
|
||||
}) => {
|
||||
const {
|
||||
attributes,
|
||||
listeners,
|
||||
setNodeRef,
|
||||
transform,
|
||||
transition,
|
||||
isDragging
|
||||
} = useSortable({ id: module.name });
|
||||
|
||||
|
||||
const style = {
|
||||
...Card.style,
|
||||
transform: CSS.Transform.toString(transform),
|
||||
transition,
|
||||
zIndex: isDragging ? "100" : "auto",
|
||||
opacity: isDragging ? 0.3 : 1
|
||||
};
|
||||
|
||||
let name = module.name;
|
||||
const [helpOpen, setHelpOpen] = useState(false);
|
||||
const [configOpen, setConfigOpen] = useState(false);
|
||||
const enabled = enabledModules[type].find((m: any) => m[0] === name)[1];
|
||||
|
||||
return (
|
||||
<Grid ref={setNodeRef} size={{ xs: 6, sm: 4, md: 3 }} style={style}>
|
||||
<Card >
|
||||
<CardHeader
|
||||
title={
|
||||
<FormControlLabel
|
||||
style={{paddingRight: '0 !important'}}
|
||||
control={<Checkbox title="Check to enable this module" sx={{paddingTop:0, paddingBottom:0}} id={name} onClick={toggleModule} checked={enabled} />}
|
||||
label={module.display_name} />
|
||||
}
|
||||
/>
|
||||
<CardActions>
|
||||
<Box sx={{ justifyContent: 'space-between', display: 'flex', width: '100%' }}>
|
||||
<Box>
|
||||
<IconButton title="Module information" size="small" onClick={() => setHelpOpen(true)}>
|
||||
<HelpIconOutlined />
|
||||
</IconButton>
|
||||
{enabled && module.configs && name != 'cli_feeder' ? (
|
||||
<Button size="small" onClick={() => setConfigOpen(true)}>Configure</Button>
|
||||
) : null}
|
||||
</Box>
|
||||
<IconButton size="small" title="Drag to reorder" sx={{ cursor: 'grab' }} {...listeners} {...attributes}>
|
||||
<DragIndicatorIcon/>
|
||||
</IconButton>
|
||||
</Box>
|
||||
</CardActions>
|
||||
</Card>
|
||||
<Dialog
|
||||
open={helpOpen}
|
||||
onClose={() => setHelpOpen(false)}
|
||||
maxWidth="lg"
|
||||
>
|
||||
<DialogTitle>
|
||||
{module.display_name}
|
||||
</DialogTitle>
|
||||
<DialogContent>
|
||||
<ReactMarkdown>
|
||||
{module.manifest.description.split("\n").map((line: string) => line.trim()).join("\n")}
|
||||
</ReactMarkdown>
|
||||
</DialogContent>
|
||||
</Dialog>
|
||||
{module.configs && name != 'cli_feeder' && <ConfigPanel module={module} open={configOpen} setOpen={setConfigOpen} configValues={configValues} />}
|
||||
</Grid>
|
||||
)
|
||||
}
|
||||
|
||||
function ConfigField({ config_value, module, configValues }: { config_value: any, module: Module, configValues: any }) {
|
||||
const [showPassword, setShowPassword] = useState(false);
|
||||
const handleClickShowPassword = () => setShowPassword((show) => !show);
|
||||
|
||||
const handleMouseDownPassword = (event: React.MouseEvent<HTMLButtonElement>) => {
|
||||
event.preventDefault();
|
||||
};
|
||||
|
||||
const handleMouseUpPassword = (event: React.MouseEvent<HTMLButtonElement>) => {
|
||||
event.preventDefault();
|
||||
};
|
||||
|
||||
function setConfigValue(config: any, value: any) {
|
||||
configValues[module.name][config] = value;
|
||||
}
|
||||
const config_args: Config = module.configs[config_value];
|
||||
const config_name: string = config_value.replace(/_/g, " ");
|
||||
const config_display_name = config_name.capitalize();
|
||||
const value = configValues[module.name][config_value] || config_args.default;
|
||||
|
||||
|
||||
const config_value_lower = config_value.toLowerCase();
|
||||
const is_password = config_value_lower.includes('password') ||
|
||||
config_value_lower.includes('secret') ||
|
||||
config_value_lower.includes('token') ||
|
||||
config_value_lower.includes('key') ||
|
||||
config_value_lower.includes('api_hash') ||
|
||||
config_args.type === 'password';
|
||||
|
||||
const text_input_type = is_password ? 'password' : (config_args.type === 'int' ? 'number' : 'text');
|
||||
|
||||
return (
|
||||
<Box>
|
||||
<Typography variant='body1' style={{ fontWeight: 'bold' }}>{config_display_name} {config_args.required && (`(required)`)} </Typography>
|
||||
<FormControl size="small">
|
||||
{config_args.type === 'bool' ?
|
||||
<FormControlLabel control={
|
||||
<Checkbox defaultChecked={value} size="small" id={`${module.name}.${config_value}`}
|
||||
onChange={(e) => {
|
||||
setConfigValue(config_value, e.target.checked);
|
||||
}}
|
||||
/>} label={config_args.help.capitalize()}
|
||||
/>
|
||||
:
|
||||
(
|
||||
config_args.choices !== undefined ?
|
||||
<Select size="small" id={`${module.name}.${config_value}`}
|
||||
defaultValue={value}
|
||||
onChange={(e) => {
|
||||
setConfigValue(config_value, e.target.value);
|
||||
}}
|
||||
>
|
||||
{config_args.choices.map((choice: any) => {
|
||||
return (
|
||||
<MenuItem key={`${module.name}.${config_value}.${choice}`}
|
||||
value={choice}>{choice}</MenuItem>
|
||||
);
|
||||
})}
|
||||
</Select>
|
||||
:
|
||||
(config_args.type === 'json_loader' ?
|
||||
<TextField multiline size="small" id={`${module.name}.${config_value}`} defaultValue={JSON.stringify(value, null, 2)} rows={6} onChange={
|
||||
(e) => {
|
||||
try {
|
||||
let val = JSON.parse(e.target.value);
|
||||
setConfigValue(config_value, val);
|
||||
} catch (e) {
|
||||
console.log(e);
|
||||
}
|
||||
}
|
||||
} />
|
||||
:
|
||||
<TextField size="small" id={`${module.name}.${config_value}`} defaultValue={value} type={showPassword ? 'text' : text_input_type}
|
||||
onChange={(e) => {
|
||||
setConfigValue(config_value, e.target.value);
|
||||
}}
|
||||
required={config_args.required}
|
||||
slotProps={ is_password ? {
|
||||
input: { endAdornment: (
|
||||
<InputAdornment position="end">
|
||||
<IconButton
|
||||
aria-label="toggle password visibility"
|
||||
onClick={handleClickShowPassword}
|
||||
onMouseDown={handleMouseDownPassword}
|
||||
onMouseUp={handleMouseUpPassword}
|
||||
>
|
||||
{showPassword ? <VisibilityOff /> : <Visibility />}
|
||||
</IconButton>
|
||||
</InputAdornment>
|
||||
)}
|
||||
} : {}}
|
||||
/>
|
||||
)
|
||||
)
|
||||
}
|
||||
{config_args.type !== 'bool' && (
|
||||
<FormHelperText >{config_args.help.capitalize()}</FormHelperText>
|
||||
)}
|
||||
</FormControl>
|
||||
</Box>
|
||||
)
|
||||
}
|
||||
|
||||
function ConfigPanel({ module, open, setOpen, configValues }: { module: Module, open: boolean, setOpen: any, configValues: any }) {
|
||||
|
||||
return (
|
||||
<>
|
||||
<Dialog
|
||||
open={open}
|
||||
onClose={() => setOpen(false)}
|
||||
maxWidth="lg"
|
||||
>
|
||||
<DialogTitle>
|
||||
{module.display_name}
|
||||
</DialogTitle>
|
||||
<DialogContent>
|
||||
<Stack direction="column" spacing={1}>
|
||||
{Object.keys(module.configs).map((config_value: any) => {
|
||||
return (
|
||||
<ConfigField key={config_value} config_value={config_value} module={module} configValues={configValues} />
|
||||
);
|
||||
})}
|
||||
</Stack>
|
||||
</DialogContent>
|
||||
</Dialog>
|
||||
</>
|
||||
);
|
||||
}
|
||||
|
||||
export default StepCard;
|
||||
44
scripts/settings/src/main.tsx
Normal file
@@ -0,0 +1,44 @@
import * as React from 'react';
import * as ReactDOM from 'react-dom/client';
import { ThemeProvider } from '@mui/material/styles';
import { CssBaseline } from '@mui/material';
import App from './App';
import { createTheme } from '@mui/material/styles';
import { red } from '@mui/material/colors';
import { useState, useEffect } from 'react';

function RootApp() {
  const [mode, setMode] = useState('light');

  useEffect(() => {
    const readTheme = () => setMode(window.localStorage.getItem('theme') || 'light');
    readTheme();

    // Re-read the theme whenever the host page changes its data-theme attribute.
    const observer = new MutationObserver(readTheme);
    observer.observe(document.documentElement, { attributes: true, attributeFilter: ['data-theme'] });

    // Disconnect on unmount so a new observer is not registered on every render.
    return () => observer.disconnect();
  }, []);

  // A custom theme for this app
  const theme = createTheme({
    palette: {
      mode: mode == 'light' ? 'light' : 'dark',
    },
    cssVariables: true
  });

  return (
    <ThemeProvider theme={theme}>
      <CssBaseline />
      <App />
    </ThemeProvider>
  );
}

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <RootApp />
  </React.StrictMode>,
);

2118
scripts/settings/src/schema.json
Normal file
File diff suppressed because it is too large
21
scripts/settings/src/types.d.ts
vendored
Normal file
@@ -0,0 +1,21 @@
export interface Config {
    name: string;
    description: string;
    type?: string;
    default: any;
    help: string;
    choices: string[];
    required: boolean;
}

interface Manifest {
    description: string;
}

export interface Module {
    name: string;
    description: string;
    configs: { [key: string]: Config };
    manifest: Manifest;
    display_name: string;
}
21
scripts/settings/tsconfig.json
Normal file
@@ -0,0 +1,21 @@
{
  "compilerOptions": {
    "target": "ESNext",
    "useDefineForClassFields": true,
    "lib": ["DOM", "DOM.Iterable", "ESNext"],
    "allowJs": false,
    "skipLibCheck": true,
    "esModuleInterop": false,
    "allowSyntheticDefaultImports": true,
    "strict": true,
    "forceConsistentCasingInFileNames": true,
    "module": "ESNext",
    "moduleResolution": "Node",
    "resolveJsonModule": true,
    "isolatedModules": true,
    "noEmit": true,
    "jsx": "react-jsx"
  },
  "include": ["src"],
  "references": [{ "path": "./tsconfig.node.json" }]
}
9
scripts/settings/tsconfig.node.json
Normal file
@@ -0,0 +1,9 @@
{
  "compilerOptions": {
    "composite": true,
    "module": "ESNext",
    "moduleResolution": "Node",
    "allowSyntheticDefaultImports": true
  },
  "include": ["vite.config.ts"]
}
12
scripts/settings/vite.config.ts
Normal file
@@ -0,0 +1,12 @@
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import { viteSingleFile } from "vite-plugin-singlefile"

// https://vite.dev/config/
export default defineConfig({
  plugins: [react(), viteSingleFile()],
  build: {
    minify: false,
    sourcemap: true,
  }
});
@@ -105,8 +105,8 @@ class BaseModule(ABC):
|
||||
for key in self.authentication.keys():
|
||||
if key in site or site in key:
|
||||
logger.debug(f"Could not find exact authentication information for site '{site}'. \
|
||||
did find information for '{key}' which is close, is this what you meant? \
|
||||
If so, edit your authentication settings to make sure it exactly matches.")
|
||||
did find information for '{key}' which is close, is this what you meant? \
|
||||
If so, edit your authentication settings to make sure it exactly matches.")
|
||||
|
||||
def get_ytdlp_cookiejar(args):
|
||||
import yt_dlp
|
||||
|
||||
@@ -80,7 +80,10 @@ class ModuleFactory:
|
||||
|
||||
available = self.available_modules(limit_to_modules=[module_name], suppress_warnings=suppress_warnings)
|
||||
if not available:
|
||||
raise IndexError(f"Module '{module_name}' not found. Are you sure it's installed/exists?")
|
||||
message = f"Module '{module_name}' not found. Are you sure it's installed/exists?"
|
||||
if 'archiver' in module_name:
|
||||
message += f" Did you mean {module_name.replace('archiver', 'extractor')}?"
|
||||
raise IndexError(message)
|
||||
return available[0]
|
||||
|
||||
def available_modules(self, limit_to_modules: List[str]= [], suppress_warnings: bool = False) -> List[LazyBaseModule]:
|
||||
|
||||
@@ -15,6 +15,7 @@ from copy import copy
|
||||
|
||||
from rich_argparse import RichHelpFormatter
|
||||
from loguru import logger
|
||||
import requests
|
||||
|
||||
from .metadata import Metadata, Media
|
||||
from auto_archiver.version import __version__
|
||||
@@ -72,10 +73,20 @@ class ArchivingOrchestrator:
|
||||
|
||||
self.basic_parser = parser
|
||||
return parser
|
||||
|
||||
def check_steps(self, config):
|
||||
for module_type in MODULE_TYPES:
|
||||
if not config['steps'].get(f"{module_type}s", []):
|
||||
if module_type in ('feeder', 'formatter') and config['steps'].get(module_type):
|
||||
raise SetupError(f"It appears you have '{module_type}' set under 'steps' in your configuration file, but as of version 0.13.0 of Auto Archiver, you must use '{module_type}s'. Change this in your configuration file and try again. \
|
||||
Here's how that would look: \n\nsteps:\n {module_type}s:\n - [your_{module_type}_name_here]\n {'extractors:...' if module_type == 'feeder' else '...'}\n")
|
||||
if module_type == 'extractor' and config['steps'].get('archivers'):
|
||||
raise SetupError(f"As of version 0.13.0 of Auto Archiver, the 'archivers' step name has been changed to 'extractors'. Change this in your configuration file and try again. \
|
||||
Here's how that would look: \n\nsteps:\n extractors:\n - [your_extractor_name_here]\n enrichers:...\n")
|
||||
raise SetupError(f"No {module_type}s were configured. Make sure to set at least one {module_type} in your configuration file or on the command line (using --{module_type}s)")
|
||||
|
||||
def setup_complete_parser(self, basic_config: dict, yaml_config: dict, unused_args: list[str]) -> None:
|
||||
|
||||
|
||||
# modules parser to get the overridden 'steps' values
|
||||
modules_parser = argparse.ArgumentParser(
|
||||
add_help=False,
|
||||
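`check_steps` has to tell a legacy singular key (`feeder`, `formatter`) apart from a step that is simply missing. When such a condition combines `or` with `and`, the grouping decides which error the user sees, because Python binds `and` more tightly than `or`. A minimal sketch of the two groupings follows; the `steps` dict and both function names are hypothetical stand-ins, not part of the orchestrator.

```python
# Hypothetical stand-in for config['steps']; not taken from the real config loader.
steps = {"feeders": [], "extractors": ["generic_extractor"]}


def legacy_key_used_ungrouped(module_type: str) -> bool:
    # Parsed as: A or (B and C), so 'feeder' matches even without a legacy key.
    return bool(module_type == "feeder" or module_type == "formatter" and steps.get(module_type))


def legacy_key_used_grouped(module_type: str) -> bool:
    # Parsed as: (A or B) and C, so it only matches when the singular key is present.
    return bool(module_type in ("feeder", "formatter") and steps.get(module_type))


print(legacy_key_used_ungrouped("feeder"))  # True
print(legacy_key_used_grouped("feeder"))    # False
```

With the grouped form, an empty `feeders` list without a legacy `feeder` key falls through to the generic "No feeders were configured" error rather than the migration hint.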
@@ -100,6 +111,7 @@ class ArchivingOrchestrator:
|
||||
# but should we add them? Or should we just add them to the 'complete' parser?
|
||||
|
||||
if is_valid_config(yaml_config):
|
||||
self.check_steps(yaml_config)
|
||||
# only load the modules enabled in config
|
||||
# TODO: if some steps are empty (e.g. 'feeders' is empty), should we default to the 'simple' ones? Or only if they are ALL empty?
|
||||
enabled_modules = []
|
||||
@@ -115,10 +127,6 @@ class ArchivingOrchestrator:
|
||||
simple_modules = [module for module in self.module_factory.available_modules() if not module.requires_setup]
|
||||
self.add_individual_module_args(simple_modules, parser)
|
||||
|
||||
# for simple mode, we use the cli_feeder and any modules that don't require setup
|
||||
if not yaml_config['steps']['feeders']:
|
||||
yaml_config['steps']['feeders'] = ['cli_feeder']
|
||||
|
||||
# add them to the config
|
||||
for module in simple_modules:
|
||||
for module_type in module.type:
|
||||
@@ -171,9 +179,6 @@ class ArchivingOrchestrator:
|
||||
if not parser:
|
||||
parser = self.parser
|
||||
|
||||
# allow passing URLs directly on the command line
|
||||
parser.add_argument('urls', nargs='*', default=[], help='URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml')
|
||||
|
||||
parser.add_argument('--authentication', dest='authentication', help='A dictionary of sites and their authentication methods \
|
||||
(token, username etc.) that extractors can use to log into \
|
||||
a website. If passing this on the command line, use a JSON string. \
|
||||
@@ -193,7 +198,11 @@ class ArchivingOrchestrator:
|
||||
modules = self.module_factory.available_modules()
|
||||
|
||||
for module in modules:
|
||||
|
||||
if module.name == 'cli_feeder':
|
||||
# special case. For the CLI feeder, allow passing URLs directly on the command line without setting --cli_feeder.urls=
|
||||
parser.add_argument('urls', nargs='*', default=[], help='URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml')
|
||||
continue
|
||||
|
||||
if not module.configs:
|
||||
# this module has no configs, don't show anything in the help
|
||||
# (TODO: do we want to show something about this module though, like a description?)
|
||||
@@ -277,36 +286,16 @@ class ArchivingOrchestrator:
|
||||
raise SetupError(f"Only one {module_type} is allowed, found {len(step_items)} {module_type}s. Please remove one of the following from your configuration file: {modules_to_load}")
|
||||
|
||||
for module in modules_to_load:
|
||||
if module == 'cli_feeder':
|
||||
# cli_feeder is a pseudo module, it just takes the command line args for [URLS]
|
||||
urls = self.config['urls']
|
||||
if not urls:
|
||||
raise SetupError("No URLs provided. Please provide at least one URL via the command line, or set up an alternative feeder. Use --help for more information.")
|
||||
|
||||
def feed(self) -> Generator[Metadata]:
|
||||
for url in urls:
|
||||
logger.debug(f"Processing URL: '{url}'")
|
||||
yield Metadata().set_url(url)
|
||||
|
||||
pseudo_module = type('CLIFeeder', (Feeder,), {
|
||||
'name': 'cli_feeder',
|
||||
'display_name': 'CLI Feeder',
|
||||
'__iter__': feed
|
||||
|
||||
})()
|
||||
|
||||
pseudo_module.__iter__ = feed
|
||||
step_items.append(pseudo_module)
|
||||
continue
|
||||
|
||||
if module in invalid_modules:
|
||||
continue
|
||||
|
||||
loaded_module = None
|
||||
try:
|
||||
loaded_module: BaseModule = self.module_factory.get_module(module, self.config)
|
||||
except (KeyboardInterrupt, Exception) as e:
|
||||
logger.error(f"Error during setup of modules: {e}\n{traceback.format_exc()}")
|
||||
if module_type == 'extractor' and loaded_module.name == module:
|
||||
if loaded_module and module_type == 'extractor':
|
||||
loaded_module.cleanup()
|
||||
raise e
|
||||
|
||||
@@ -348,7 +337,23 @@ class ArchivingOrchestrator:
|
||||
yaml_config = self.load_config(basic_config.config_file)
|
||||
|
||||
return self.setup_complete_parser(basic_config, yaml_config, unused_args)
|
||||
|
||||
def check_for_updates(self):
|
||||
response = requests.get("https://pypi.org/pypi/auto-archiver/json").json()
|
||||
latest_version = response['info']['version']
|
||||
# check version compared to current version
|
||||
if latest_version != __version__:
|
||||
if os.environ.get('RUNNING_IN_DOCKER'):
|
||||
update_cmd = "`docker pull bellingcat/auto-archiver:latest`"
|
||||
else:
|
||||
update_cmd = "`pip install --upgrade auto-archiver`"
|
||||
logger.warning("")
|
||||
logger.warning("********* IMPORTANT: UPDATE AVAILABLE ********")
|
||||
logger.warning(f"A new version of auto-archiver is available (v{latest_version}, you have {__version__})")
|
||||
logger.warning(f"Make sure to update to the latest version using: {update_cmd}")
|
||||
logger.warning("")
|
||||
|
||||
|
||||
def setup(self, args: list):
|
||||
"""
|
||||
Function to configure all setup of the orchestrator: setup configs and load modules.
|
||||
@@ -356,6 +361,8 @@ class ArchivingOrchestrator:
|
||||
This method should only ever be called once
|
||||
"""
|
||||
|
||||
self.check_for_updates()
|
||||
|
||||
if self.setup_finished:
|
||||
logger.warning("The `setup_config()` function should only ever be run once. \
|
||||
If you need to re-run the setup, please re-instantiate a new instance of the orchestrator. \
|
||||
|
||||
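`check_for_updates` runs at the start of `setup`, so it is worth seeing how the version comparison could be kept separate from the network call. The sketch below is one possible arrangement, not the code in this commit: `update_notice`, `safe_latest_version` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return the parsed PyPI JSON.

```python
from typing import Callable, Optional


def update_notice(current: str, latest: str, in_docker: bool) -> Optional[str]:
    """Build the warning text, or return None when the versions match."""
    if latest == current:
        return None
    cmd = ("docker pull bellingcat/auto-archiver:latest" if in_docker
           else "pip install --upgrade auto-archiver")
    return f"A new version is available (v{latest}, you have {current}). Update with: `{cmd}`"


def safe_latest_version(fetch: Callable[[], dict]) -> Optional[str]:
    """Call the fetcher and swallow failures so an offline run can still start."""
    try:
        return fetch()["info"]["version"]
    except Exception:
        return None


def offline_fetch() -> dict:
    # Simulates a machine without network access.
    raise ConnectionError("no network")


print(update_notice("0.13.0", "0.13.1", in_docker=False))
print(safe_latest_version(lambda: {"info": {"version": "0.13.1"}}))
print(safe_latest_version(offline_fetch))
```

Passing the fetcher in as a callable keeps the comparison testable without touching PyPI, and a failed lookup degrades to "no notice" rather than aborting setup.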
23
src/auto_archiver/modules/cli_feeder/__manifest__.py
Normal file
@@ -0,0 +1,23 @@
{
    'name': 'Command Line Feeder',
    'type': ['feeder'],
    'entry_point': 'cli_feeder::CLIFeeder',
    'requires_setup': False,
    'configs': {
        'urls': {
            'default': None,
            'help': 'URL(s) to archive, either a single URL or a list of urls, should not come from config.yaml',
        },
    },
    'description': """
    The Command Line Feeder is the default enabled feeder for the Auto Archiver. It allows you to pass URLs directly to the orchestrator from the command line
    without the need to specify any additional configuration or command line arguments:

    `auto-archiver --feeder cli_feeder -- "https://example.com/1/,https://example.com/2/"`

    You can pass multiple URLs by separating them with a space. The URLs will be processed in the order they are provided.

    `auto-archiver --feeder cli_feeder -- https://example.com/1/ https://example.com/2/`
    """,
}
21
src/auto_archiver/modules/cli_feeder/cli_feeder.py
Normal file
@@ -0,0 +1,21 @@
from loguru import logger

from auto_archiver.core.feeder import Feeder
from auto_archiver.core.metadata import Metadata

class CLIFeeder(Feeder):

    def setup(self) -> None:
        self.urls = self.config['urls']
        if not self.urls:
            raise ValueError("No URLs provided. Please provide at least one URL via the command line, or set up an alternative feeder. Use --help for more information.")

    def __iter__(self) -> Metadata:
        urls = self.config['urls']
        for url in urls:
            logger.debug(f"Processing {url}")
            m = Metadata().set_url(url)
            m.set_context("folder", "cli")
            yield m

        logger.success(f"Processed {len(urls)} URL(s)")
@@ -10,7 +10,7 @@ class ConsoleDb(Database):
|
||||
"""
|
||||
|
||||
def started(self, item: Metadata) -> None:
|
||||
logger.warning(f"STARTED {item}")
|
||||
logger.info(f"STARTED {item}")
|
||||
|
||||
def failed(self, item: Metadata, reason:str) -> None:
|
||||
logger.error(f"FAILED {item}: {reason}")
|
||||
|
||||
@@ -6,7 +6,7 @@
|
||||
},
|
||||
'entry_point': 'csv_db::CSVDb',
|
||||
"configs": {
|
||||
"csv_file": {"default": "db.csv", "help": "CSV file name"}
|
||||
"csv_file": {"default": "db.csv", "help": "CSV file name to save metadata to"},
|
||||
},
|
||||
"description": """
|
||||
Handles exporting archival results to a CSV file.
|
||||
|
||||
@@ -28,6 +28,13 @@ the broader archiving framework.
|
||||
metadata objects. Some dropins are included in this generic_archiver by default, but
|
||||
custom dropins can be created to handle additional websites and passed to the archiver
|
||||
via the command line using the `--dropins` option (TODO!).
|
||||
|
||||
### Auto-Updates
|
||||
|
||||
The Generic Extractor will also automatically check for updates to `yt-dlp` (every 5 days by default).
|
||||
This can be configured using the `ytdlp_update_interval` setting (or disabled by setting it to -1).
|
||||
If you are having issues with the extractor, you can review the version of `yt-dlp` being used with `yt-dlp --version`.
|
||||
|
||||
""",
|
||||
"configs": {
|
||||
"subtitles": {"default": True, "help": "download subtitles if available", "type": "bool"},
|
||||
@@ -64,5 +71,10 @@ via the command line using the `--dropins` option (TODO!).
|
||||
"default": "inf",
|
||||
"help": "Use to limit the number of videos to download when a channel or long page is being extracted. 'inf' means no limit.",
|
||||
},
|
||||
"ytdlp_update_interval": {
|
||||
"default": 5,
|
||||
"help": "How often to check for yt-dlp updates (days). If positive, will check and update yt-dlp every [num] days. Set it to -1 to disable, or 0 to always update on every run.",
|
||||
"type": "int",
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
@@ -1,7 +1,11 @@
|
||||
import datetime, os, yt_dlp, pysubs2
|
||||
import datetime, os
|
||||
import importlib
|
||||
import subprocess
|
||||
from typing import Generator, Type
|
||||
|
||||
import yt_dlp
|
||||
from yt_dlp.extractor.common import InfoExtractor
|
||||
import pysubs2
|
||||
|
||||
from loguru import logger
|
||||
|
||||
@@ -11,6 +15,44 @@ from auto_archiver.core import Metadata, Media
|
||||
class GenericExtractor(Extractor):
|
||||
_dropins = {}
|
||||
|
||||
def setup(self):
|
||||
# check for file .ytdlp-update in the secrets folder
|
||||
if self.ytdlp_update_interval < 0:
|
||||
return
|
||||
|
||||
use_secrets = os.path.exists('secrets')
|
||||
path = os.path.join('secrets' if use_secrets else '', '.ytdlp-update')
|
||||
next_update_check = None
|
||||
if os.path.exists(path):
|
||||
with open(path, "r") as f:
|
||||
next_update_check = datetime.datetime.fromisoformat(f.read())
|
||||
|
||||
if not next_update_check or next_update_check < datetime.datetime.now():
|
||||
self.update_ytdlp()
|
||||
|
||||
next_update_check = datetime.datetime.now() + datetime.timedelta(days=self.ytdlp_update_interval)
|
||||
with open(path, "w") as f:
|
||||
f.write(next_update_check.isoformat())
|
||||
|
||||
def update_ytdlp(self):
|
||||
logger.info("Checking and updating yt-dlp...")
|
||||
logger.info(f"Tip: change the 'ytdlp_update_interval' setting to control how often yt-dlp is updated. Set to -1 to disable or 0 to enable on every run. Current setting: {self.ytdlp_update_interval}")
|
||||
from importlib.metadata import version as get_version
|
||||
old_version = get_version("yt-dlp")
|
||||
try:
|
||||
# try and update with pip (this works inside poetry environment and in a normal virtualenv)
|
||||
result = subprocess.run(["pip", "install", "--upgrade", "yt-dlp"], check=True, capture_output=True)
|
||||
|
||||
if "Successfully installed yt-dlp" in result.stdout.decode():
|
||||
new_version = importlib.metadata.version("yt-dlp")
|
||||
logger.info(f"yt-dlp successfully updated (from {old_version} to {new_version})")
|
||||
importlib.reload(yt_dlp)
|
||||
else:
|
||||
logger.info("yt-dlp already up to date")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error updating yt-dlp: {e}")
|
||||
|
||||
def suitable_extractors(self, url: str) -> Generator[str, None, None]:
|
||||
"""
|
||||
Returns a list of valid extractors for the given URL"""
|
||||
|
||||
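The scheduling in `GenericExtractor.setup` relies on a small timestamp file: it stores the next time a check is due, and the update runs only once that time has passed. A minimal sketch of that logic in isolation, assuming a hypothetical `update_due` helper and an injected `now` so it can be exercised without waiting (neither is part of the extractor itself):

```python
import datetime
import os
import tempfile


def update_due(stamp_path: str, interval_days: int, now: datetime.datetime) -> bool:
    """Return True when an update check should run, and record the next due time."""
    if interval_days < 0:
        return False  # a negative interval disables the check entirely

    next_check = None
    if os.path.exists(stamp_path):
        with open(stamp_path, "r") as f:
            next_check = datetime.datetime.fromisoformat(f.read().strip())

    if next_check is not None and next_check >= now:
        return False  # not due yet

    # Due (or never checked): schedule the next check interval_days from now.
    with open(stamp_path, "w") as f:
        f.write((now + datetime.timedelta(days=interval_days)).isoformat())
    return True


with tempfile.TemporaryDirectory() as tmp:
    stamp = os.path.join(tmp, ".ytdlp-update")
    start = datetime.datetime(2025, 1, 1)
    print(update_due(stamp, 5, start))                                # True
    print(update_due(stamp, 5, start + datetime.timedelta(days=1)))   # False
    print(update_due(stamp, 5, start + datetime.timedelta(days=6)))   # True
```

An interval of 0 writes "now" as the next due time, so every later run is due again, which matches the "always update on every run" behaviour described in the config help.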
@@ -12,7 +12,9 @@
|
||||
"default": None,
|
||||
"help": "the id of the sheet to archive (alternative to 'sheet' config)",
|
||||
},
|
||||
"header": {"default": 1, "help": "index of the header row (starts at 1)", "type": "int"},
|
||||
"header": {"default": 1,
|
||||
"type": "int",
|
||||
"help": "index of the header row (starts at 1)"},
|
||||
"service_account": {
|
||||
"default": "secrets/service_account.json",
|
||||
"help": "service account JSON file path. Learn how to create one: https://gspread.readthedocs.io/en/latest/oauth2.html",
|
||||
|
||||
@@ -7,7 +7,9 @@
|
||||
"bin": [""]
|
||||
},
|
||||
"configs": {
|
||||
"detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"}
|
||||
"detect_thumbnails": {"default": True,
|
||||
"help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'",
|
||||
"type": "bool"},
|
||||
},
|
||||
"description": """ """,
|
||||
}
|
||||
|
||||
@@ -10,25 +10,30 @@
|
||||
"requires_setup": True,
|
||||
"configs": {
|
||||
"username": {"required": True,
|
||||
"help": "a valid Instagram username"},
|
||||
"help": "A valid Instagram username."},
|
||||
"password": {
|
||||
"required": True,
|
||||
"help": "the corresponding Instagram account password",
|
||||
"help": "The corresponding Instagram account password.",
|
||||
},
|
||||
"download_folder": {
|
||||
"default": "instaloader",
|
||||
"help": "name of a folder to temporarily download content to",
|
||||
"help": "Name of a folder to temporarily download content to.",
|
||||
},
|
||||
"session_file": {
|
||||
"default": "secrets/instaloader.session",
|
||||
"help": "path to the instagram session which saves session credentials",
|
||||
"help": "Path to the Instagram session file which saves session credentials. If one doesn't exist, this is the path where a new one will be stored.",
|
||||
},
|
||||
# TODO: fine-grain
|
||||
# "download_stories": {"default": True, "help": "if the link is to a user profile: whether to get stories information"},
|
||||
},
|
||||
"description": """
|
||||
Uses the [Instaloader library](https://instaloader.github.io/as-module.html) to download content from Instagram. This class handles both individual posts
|
||||
and user profiles, downloading as much information as possible, including images, videos, text, stories,
|
||||
Uses the [Instaloader library](https://instaloader.github.io/as-module.html) to download content from Instagram.
|
||||
|
||||
> ⚠️ **Warning**
|
||||
> This module is not actively maintained due to known issues with blocking.
|
||||
> Prioritise usage of the [Instagram Tbot Extractor](./instagram_tbot_extractor.md) and [Instagram API Extractor](./instagram_api_extractor.md)
|
||||
|
||||
This class handles both individual posts and user profiles, downloading as much information as possible, including images, videos, text, stories,
|
||||
highlights, and tagged posts.
|
||||
Authentication is required via username/password or a session file.
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
highlights, and tagged posts. Authentication is required via username/password or a session file.
|
||||
|
||||
"""
|
||||
import re, os, shutil, traceback
|
||||
import re, os, shutil
|
||||
import instaloader
|
||||
from loguru import logger
|
||||
|
||||
@@ -15,10 +15,9 @@ class InstagramExtractor(Extractor):
|
||||
"""
|
||||
Uses Instaloader to download either a post (inc images, videos, text) or as much as possible from a profile (posts, stories, highlights, ...)
|
||||
"""
|
||||
|
||||
# NB: post regex should be tested before profile
|
||||
|
||||
valid_url = re.compile(r"(?:(?:http|https):\/\/)?(?:www.)?(?:instagram.com|instagr.am|instagr.com)\/")
|
||||
|
||||
# https://regex101.com/r/MGPquX/1
|
||||
post_pattern = re.compile(r"{valid_url}(?:p|reel)\/(\w+)".format(valid_url=valid_url))
|
||||
# https://regex101.com/r/6Wbsxa/1
|
||||
@@ -28,19 +27,22 @@ class InstagramExtractor(Extractor):
|
||||
def setup(self) -> None:
|
||||
|
||||
self.insta = instaloader.Instaloader(
|
||||
download_geotags=True, download_comments=True, compress_json=False, dirname_pattern=self.download_folder, filename_pattern="{date_utc}_UTC_{target}__{typename}"
|
||||
download_geotags=True,
|
||||
download_comments=True,
|
||||
compress_json=False,
|
||||
dirname_pattern=self.download_folder,
|
||||
filename_pattern="{date_utc}_UTC_{target}__{typename}"
|
||||
)
|
||||
try:
|
||||
self.insta.load_session_from_file(self.username, self.session_file)
|
||||
except Exception as e:
|
||||
logger.error(f"Unable to login from session file: {e}\n{traceback.format_exc()}")
|
||||
try:
|
||||
self.insta.login(self.username, config.instagram_self.password)
|
||||
# TODO: wait for this issue to be fixed https://github.com/instaloader/instaloader/issues/1758
|
||||
logger.opt(exception=True).debug("Session file failed")
|
||||
logger.info("No valid session file found - Attempting login with username and password.")
|
||||
self.insta.login(self.username, self.password)
|
||||
self.insta.save_session_to_file(self.session_file)
|
||||
except Exception as e2:
|
||||
logger.error(f"Unable to finish login (retrying from file): {e2}\n{traceback.format_exc()}")
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to set up Instagram Extractor with Instaloader. {e}")
|
||||
|
||||
|
||||
def download(self, item: Metadata) -> Metadata:
|
||||
|
||||
@@ -104,7 +104,7 @@ class InstagramTbotExtractor(Extractor):
|
||||
message = ""
|
||||
time.sleep(3)
|
||||
# media is added before text by the bot so it can be used as a stop-logic mechanism
|
||||
while attempts < (self.timeout - 3) and (not message or not len(seen_media)):
|
||||
while attempts < max(self.timeout - 3, 3) and (not message or not len(seen_media)):
|
||||
attempts += 1
|
||||
time.sleep(1)
|
||||
for post in self.client.iter_messages(chat, min_id=since_id):
|
||||
|
||||
@@ -17,7 +17,9 @@
|
||||
"choices": ["random", "static"],
|
||||
},
|
||||
"save_to": {"default": "./local_archive", "help": "folder where to save archived content"},
|
||||
"save_absolute": {"default": False, "help": "whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)"},
|
||||
"save_absolute": {"default": False,
|
||||
"type": "bool",
|
||||
"help": "whether the path to the stored file is absolute or relative in the output result inc. formatters (WARN: leaks the file structure)"},
|
||||
},
|
||||
"description": """
|
||||
LocalStorage: A storage module for saving archived content locally on the filesystem.
|
||||
|
||||
@@ -6,13 +6,25 @@
|
||||
"python": ["loguru", "selenium"],
|
||||
},
|
||||
"configs": {
|
||||
"width": {"default": 1280, "help": "width of the screenshots"},
|
||||
"height": {"default": 720, "help": "height of the screenshots"},
|
||||
"timeout": {"default": 60, "help": "timeout for taking the screenshot"},
|
||||
"sleep_before_screenshot": {"default": 4, "help": "seconds to wait for the pages to load before taking screenshot"},
|
||||
"width": {"default": 1280,
|
||||
"type": "int",
|
||||
"help": "width of the screenshots"},
|
||||
"height": {"default": 1024,
|
||||
"type": "int",
|
||||
"help": "height of the screenshots"},
|
||||
"timeout": {"default": 60,
|
||||
"type": "int",
|
||||
"help": "timeout for taking the screenshot"},
|
||||
"sleep_before_screenshot": {"default": 4,
|
||||
"type": "int",
|
||||
"help": "seconds to wait for the pages to load before taking screenshot"},
|
||||
"http_proxy": {"default": "", "help": "http proxy to use for the webdriver, eg http://proxy-user:password@proxy-ip:port"},
|
||||
"save_to_pdf": {"default": False, "help": "save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter"},
|
||||
"print_options": {"default": {}, "help": "options to pass to the pdf printer"}
|
||||
"save_to_pdf": {"default": False,
|
||||
"type": "bool",
|
||||
"help": "save the page as pdf along with the screenshot. PDF saving options can be adjusted with the 'print_options' parameter"},
|
||||
"print_options": {"default": {},
|
||||
"help": "options to pass to the pdf printer, in JSON format. See https://www.selenium.dev/documentation/webdriver/interactions/print_page/ for more information",
|
||||
"type": "json_loader"},
|
||||
},
|
||||
"description": """
|
||||
Captures screenshots and optionally saves web pages as PDFs using a WebDriver.
|
||||
|
||||
@@ -7,7 +7,9 @@
|
||||
},
|
||||
'entry_point': 'ssl_enricher::SSLEnricher',
|
||||
"configs": {
|
||||
"skip_when_nothing_archived": {"default": True, "help": "if true, will skip enriching when no media is archived"},
|
||||
"skip_when_nothing_archived": {"default": True,
|
||||
"type": 'bool',
|
||||
"help": "if true, will skip enriching when no media is archived"},
|
||||
},
|
||||
"description": """
|
||||
Retrieves SSL certificate information for a domain and stores it as a file.
|
||||
|
||||
@@ -14,7 +14,9 @@
    "api_hash": {"default": None, "help": "telegram API_HASH value, go to https://my.telegram.org/apps"},
    "bot_token": {"default": None, "help": "optional, but allows access to more content such as large videos, talk to @botfather"},
    "session_file": {"default": "secrets/anon", "help": "optional, records the telegram login session for future usage, '.session' will be appended to the provided value."},
    "join_channels": {"default": True, "help": "disables the initial setup with channel_invites config, useful if you have a lot and get stuck"},
    "join_channels": {"default": True,
        "type": "bool",
        "help": "disables the initial setup with channel_invites config, useful if you have a lot and get stuck"},
    "channel_invites": {
        "default": {},
        "help": "(JSON string) private channel invite links (format: t.me/joinchat/HASH OR t.me/+HASH) and (optional but important to avoid hanging for minutes on startup) channel id (format: CHANNEL_ID taken from a post url like https://t.me/c/CHANNEL_ID/1), the telegram account will join any new channels on setup",

@@ -17,11 +17,19 @@
"configs": {
    "profile": {"default": None, "help": "browsertrix-profile (for profile generation see https://github.com/webrecorder/browsertrix-crawler#creating-and-using-browser-profiles)."},
    "docker_commands": {"default": None, "help":"if a custom docker invocation is needed"},
    "timeout": {"default": 120, "help": "timeout for WACZ generation in seconds"},
    "extract_media": {"default": False, "help": "If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched."},
    "extract_screenshot": {"default": True, "help": "If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched."},
    "timeout": {"default": 120,
        "type": "int",
        "help": "timeout for WACZ generation in seconds", "type": "int"},
    "extract_media": {"default": False,
        "type": 'bool',
        "help": "If enabled all the images/videos/audio present in the WACZ archive will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched."
    },
    "extract_screenshot": {"default": True,
        "type": 'bool',
        "help": "If enabled the screenshot captured by browsertrix will be extracted into separate Media and appear in the html report. The .wacz file will be kept untouched."
    },
    "socks_proxy_host": {"default": None, "help": "SOCKS proxy host for browsertrix-crawler, use in combination with socks_proxy_port. eg: user:password@host"},
    "socks_proxy_port": {"default": None, "help": "SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234"},
    "socks_proxy_port": {"default": None, "type":"int", "help": "SOCKS proxy port for browsertrix-crawler, use in combination with socks_proxy_host. eg 1234"},
    "proxy_server": {"default": None, "help": "SOCKS server proxy URL, in development"},
},
"description": """

@@ -9,6 +9,7 @@
"configs": {
    "timeout": {
        "default": 15,
        "type": "int",
        "help": "seconds to wait for successful archive confirmation from wayback, if more than this passes the result contains the job_id so the status can later be checked manually.",
    },
    "if_not_archived_within": {

@@ -10,8 +10,12 @@
        "help": "WhisperApi api endpoint, eg: https://whisperbox-api.com/api/v1, a deployment of https://github.com/bellingcat/whisperbox-transcribe."},
    "api_key": {"required": True,
        "help": "WhisperApi api key for authentication"},
    "include_srt": {"default": False, "help": "Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players)."},
    "timeout": {"default": 90, "help": "How many seconds to wait at most for a successful job completion."},
    "include_srt": {"default": False,
        "type": "bool",
        "help": "Whether to include a subtitle SRT (SubRip Subtitle file) for the video (can be used in video players)."},
    "timeout": {"default": 90,
        "type": "int",
        "help": "How many seconds to wait at most for a successful job completion."},
    "action": {"default": "translate",
        "help": "which Whisper operation to execute",
        "choices": ["transcribe", "translate", "language_detection"]},

@@ -1,18 +1,23 @@
""" This Webdriver class acts as a context manager for the selenium webdriver. """
from __future__ import annotations
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.print_page_options import PrintOptions

from loguru import logger
from selenium.webdriver.common.by import By
import os
import time

#import domain_for_url
from urllib.parse import urlparse, urlunparse
from http.cookiejar import MozillaCookieJar

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common import exceptions as selenium_exceptions
from selenium.webdriver.common.print_page_options import PrintOptions
from selenium.webdriver.common.by import By

from loguru import logger


class CookieSettingDriver(webdriver.Firefox):

    facebook_accept_cookies: bool
@@ -20,6 +25,10 @@ class CookieSettingDriver(webdriver.Firefox):
    cookiejar: MozillaCookieJar

    def __init__(self, cookies, cookiejar, facebook_accept_cookies, *args, **kwargs):
        if os.environ.get('RUNNING_IN_DOCKER'):
            # Selenium doesn't support linux-aarch64 driver, we need to set this manually
            kwargs['service'] = webdriver.FirefoxService(executable_path='/usr/local/bin/geckodriver')

        super(CookieSettingDriver, self).__init__(*args, **kwargs)
        self.cookies = cookies
        self.cookiejar = cookiejar
@@ -64,14 +73,29 @@ class CookieSettingDriver(webdriver.Firefox):
                time.sleep(2)
            except Exception as e:
                logger.warning(f'Failed on fb accept cookies.', e)


        # now get the actual URL
        super(CookieSettingDriver, self).get(url)
        if self.facebook_accept_cookies:
            # try and click the 'close' button on the 'login' window to close it
            close_button = self.find_element(By.XPATH, "//div[@role='dialog']//div[@aria-label='Close']")
            if close_button:
                close_button.click()
            try:
                xpath = "//div[@role='dialog']//div[@aria-label='Close']"
                WebDriverWait(self, 5).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
            except selenium_exceptions.NoSuchElementException:
                logger.warning("Unable to find the 'close' button on the facebook login window")
                pass

        else:

            # for all other sites, try and use some common button text to reject/accept cookies
            for text in ["Refuse non-essential cookies", "Decline optional cookies", "Reject additional cookies", "Accept all cookies"]:
                try:
                    xpath = f"//*[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '{text.lower()}')]"
                    WebDriverWait(self, 5).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
                    break
                except selenium_exceptions.WebDriverException:
                    pass


class Webdriver:
@@ -90,7 +114,6 @@ class Webdriver:
                setattr(self.print_options, k, v)

    def __enter__(self) -> webdriver:

        options = webdriver.FirefoxOptions()
        options.add_argument("--headless")
        options.add_argument(f'--proxy-server={self.http_proxy}')
@@ -105,7 +128,7 @@ class Webdriver:
            self.driver.set_window_size(self.width, self.height)
            self.driver.set_page_load_timeout(self.timeout_seconds)
            self.driver.print_options = self.print_options
        except TimeoutException as e:
        except selenium_exceptions.TimeoutException as e:
            logger.error(f"failed to get new webdriver, possibly due to insufficient system resources or timeout settings: {e}")

        return self.driver

@@ -1,21 +1,36 @@
import pytest

from auto_archiver.modules.instagram_extractor import InstagramExtractor
from .test_extractor_base import TestExtractorBase

class TestInstagramExtractor(TestExtractorBase):

@pytest.fixture
def instagram_extractor(setup_module, mocker):

    extractor_module: str = 'instagram_extractor'
    config: dict = {}
    config: dict = {
        "username": "user_name",
        "password": "password123",
        "download_folder": "instaloader",
        "session_file": "secrets/instaloader.session",
    }
    fake_loader = mocker.MagicMock()
    fake_loader.load_session_from_file.return_value = None
    fake_loader.login.return_value = None
    fake_loader.save_session_to_file.return_value = None
    mocker.patch("instaloader.Instaloader", return_value=fake_loader,)
    return setup_module(extractor_module, config)

    @pytest.mark.parametrize("url", [
        "https://www.instagram.com/p/",
        "https://www.instagram.com/p/1234567890/",
        "https://www.instagram.com/reel/1234567890/",
        "https://www.instagram.com/username/",
        "https://www.instagram.com/username/stories/",
        "https://www.instagram.com/username/highlights/",
    ])
    def test_regex_matches(self, url):
        # post
        assert InstagramExtractor.valid_url.match(url)

@pytest.mark.parametrize("url", [
    "https://www.instagram.com/p/",
    "https://www.instagram.com/p/1234567890/",
    "https://www.instagram.com/reel/1234567890/",
    "https://www.instagram.com/username/",
    "https://www.instagram.com/username/stories/",
    "https://www.instagram.com/username/highlights/",
])
def test_regex_matches(url: str, instagram_extractor: InstagramExtractor) -> None:
    """
    Ensure that the valid_url regex matches all provided Instagram URLs.
    """
    assert instagram_extractor.valid_url.match(url)