Compare commits


11 Commits

Author SHA1 Message Date
msramalho 1970fa3c82 new instagram archiver via telegram bot 2023-02-17 16:15:25 +00:00
msramalho aa5430451e instagram archiver via telegram bot 2023-02-17 15:46:29 +00:00
msramalho f35875a94c name fix 2023-02-17 15:46:05 +00:00
msramalho 5505255ea3 url auth wall detect 2023-02-17 15:45:58 +00:00
msramalho da17b3f68a name fix 2023-02-17 15:45:35 +00:00
msramalho d6dbdec6ac example 2023-02-09 12:32:55 +00:00
msramalho 224ebe7ee8 links 2023-02-08 22:27:56 +00:00
msramalho 54a1bc2172 update readme 2023-02-08 22:26:24 +00:00
msramalho 77948207d1 update 2023-02-08 22:24:40 +00:00
msramalho 60552ae0ea update readme 2023-02-08 22:23:25 +00:00
msramalho f255271ecb update README 2023-02-08 22:17:22 +00:00
16 changed files with 215 additions and 94 deletions

View File

@@ -1,4 +1,12 @@
# Auto Archiver <h1 align="center">Auto Archiver</h1>
[![PyPI version](https://badge.fury.io/py/auto-archiver.svg)](https://badge.fury.io/py/auto-archiver)
[![Docker Image Version (latest by date)](https://img.shields.io/docker/v/bellingcat/auto-archiver?label=version&logo=docker)](https://pypi.org/project/auto-archiver/)
<!-- ![Docker Pulls](https://img.shields.io/docker/pulls/bellingcat/auto-archiver) -->
<!-- [![PyPI download month](https://img.shields.io/pypi/dm/auto-archiver.svg)](https://pypi.python.org/pypi/auto-archiver/) -->
<!-- [![Documentation Status](https://readthedocs.org/projects/vk-url-scraper/badge/?version=latest)](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest) -->
Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/).
@@ -15,6 +23,11 @@ But **you always need a configuration/orchestration file**, which is where you'l
## How to run the auto-archiver ## How to run the auto-archiver
### Option 1 - docker ### Option 1 - docker
<details><summary><code>Docker instructions</code></summary>
[![dockeri.co](https://dockerico.blankenship.io/image/bellingcat/auto-archiver)](https://hub.docker.com/r/bellingcat/auto-archiver)
Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, to pass it your orchestration file or get downloaded media out of docker you need to connect folders on your machine with folders inside docker using the `-v` volume flag.
@@ -32,14 +45,20 @@ Docker works like a virtual machine running inside your computer, it isolates ev
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
</details>
### Option 2 - python package ### Option 2 - python package
<details><summary><code>Python package instructions</code></summary>
1. make sure you have python 3.8 or higher installed
2. install the package `pip/pipenv/conda install auto-archiver`
3. test it's installed with `auto-archiver --help`
4. run it with your orchestration file and pass any flags you want in the command line `auto-archiver --config secrets/orchestration.yaml`
   1. this path applies if your orchestration file is inside a `secrets/` folder, which we advise
</details>
### Option 3 - local installation ### Option 3 - local installation
This can also be used for development. This can also be used for development.
@@ -60,13 +79,6 @@ Clone and run:
</details><br/> </details><br/>
### Examples
# Orchestration # Orchestration
The archiver work is orchestrated by the following workflow (we call each a **step**):
1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
@@ -85,7 +97,7 @@ The structure of orchestration file is split into 2 parts: `steps` (what **steps
steps: steps:
feeder: gsheet_feeder feeder: gsheet_feeder
archivers: # order matters archivers: # order matters
- youtubedl_enricher - youtubedl_archiver
enrichers: enrichers:
- thumbnail_enricher - thumbnail_enricher
formatter: html_formatter formatter: html_formatter
@@ -178,16 +190,18 @@ Note that the first row is skipped, as it is assumed to be a header row (`--gshe
## Development ## Development
Use `python -m src.auto_archiver --config secrets/orchestration.yaml` to run from the local development environment.
# Docker development #### Docker development
* working with docker locally: working with docker locally:
* `docker build . -t auto-archiver` to build a local image
* `docker run --rm -v $PWD/secrets:/app/secrets auto-archiver --config secrets/config.yaml`
* to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
* release to docker hub
release to docker hub
* `docker image tag auto-archiver bellingcat/auto-archiver:latest` * `docker image tag auto-archiver bellingcat/auto-archiver:latest`
* `docker push bellingcat/auto-archiver` * `docker push bellingcat/auto-archiver`
# RELEASE #### RELEASE
* update version in [version.py](src/auto_archiver/version.py) * update version in [version.py](src/auto_archiver/version.py)
* run `bash ./scripts/release.sh` and confirm * run `bash ./scripts/release.sh` and confirm
* package is automatically updated in pypi * package is automatically updated in pypi

View File

@@ -1,80 +1,79 @@
steps: steps:
# only 1 feeder allowed # only 1 feeder allowed
# a feeder could be in an "infinite loop" for example: gsheets_infinite feeder which holds-> this could be an easy logic addition by modifying for each to while not feeder.done() if it becomes necessary # feeder: cli_feeder # default feeder
feeder: gsheet_feeder # default -> only expects URL from CLI feeder: gsheet_feeder # default -> only expects URL from CLI
archivers: # order matters archivers: # order matters
- telethon # - vk_archiver
# - tiktok # - telethon_archiver
# - twitter # - telegram_archiver
# - instagram # - twitter_archiver
# - webarchive # this way it runs as a failsafe only # - twitter_api_archiver
# enrichers: # - instagram_archiver
# - screenshot # - instagram_tbot_archiver
# - wacz # - tiktok_archiver
# - webarchive # this way it runs for every case, webarchive extends archiver and enrichment - youtubedl_archiver
# - thumbnails # - wayback_archiver_enricher
formatters: enrichers:
- HTMLFormater - hash_enricher
- PdfFormater - screenshot_enricher
- thumbnail_enricher
# - wayback_archiver_enricher
# - wacz_enricher
formatter: html_formatter # defaults to mute_formatter
storages: storages:
- local_storage - local_storage
- s3 # - s3_storage
# - gdrive_storage
databases: databases:
- gsheets_db # - console_db
- mongo_db # - csv_db
- gsheet_db
# - mongo_db
configurations: configurations:
gsheet_feeder: gsheet_feeder:
sheet: my-auto-archiver sheet: auto-archiver-test
header: 2 # defaults to 1 in GSheetsFeeder header: 2 # defaults to 1 in GSheetsFeeder
service_account: "secrets/service_account.json" service_account: "secrets/service_account.json"
# allow_worksheets: "allowed" use_sheet_names_in_stored_paths: false
# block_worksheets: "blocked1,blocked2"
columns: columns:
'url': 'link' url: link
'status': 'archive status' status: archive status
'folder': 'destination folder' folder: destination folder
'archive': 'archive location' archive: archive location
'date': 'archive date' date: archive date
'thumbnail': 'thumbnail' thumbnail: thumbnail
'thumbnail_index': 'thumbnail index' thumbnail_index: thumbnail index
'timestamp': 'upload timestamp' timestamp: upload timestamp
'title': 'upload title' title: upload title
'duration': 'duration' text: textual content
'screenshot': 'screenshot' duration: duration
'hash': 'hash' screenshot: screenshot
'wacz': 'wacz' hash: hash
'replaywebpage': 'replaywebpage' wacz: wacz
telethon: replaywebpage: replaywebpage
api_id: "1234567"
api_hash: "examplehash"
session_file: "secrets/anon"
channel_invites:
- invite: https://t.me/+XXXXXXXXXXXXXX
id: 1000000000
- invite: https://t.me/joinchat/XXXXXXXXXXXXXX
id: 1000000001
tiktok: screenshot_enricher:
api_keys:
- username: 1
password: 2
- username: 3
password: 4
username: "abc"
password: "123"
token: "here"
screenshot:
width: 1280 width: 1280
height: 4600 height: 2300
wacz: wayback_archiver_enricher:
profile: secrets/profile.tar.gz timeout: 10
webarchive: key: ""
api_key: "12345" secret: ""
s3: hash_enricher:
- bucket: 123 algorithm: "SHA3-512"
- region: "nyc3" # wacz:
- cdn: "{region}{bucket}" # profile: secrets/profile.tar.gz
local_storage:
save_to: "./local_archive"
save_absolute: true
filename_generator: static
path_generator: flat
gdrive_storage:
path_generator: url
filename_generator: random
root_folder_id: TODO
oauth_token: secrets/gd-token.json
service_account: "secrets/service_account.json"

View File

@@ -3,6 +3,7 @@ from .telethon_archiver import TelethonArchiver
from .twitter_archiver import TwitterArchiver from .twitter_archiver import TwitterArchiver
from .twitter_api_archiver import TwitterApiArchiver from .twitter_api_archiver import TwitterApiArchiver
from .instagram_archiver import InstagramArchiver from .instagram_archiver import InstagramArchiver
from .instagram_tbot_archiver import InstagramTbotArchiver
from .tiktok_archiver import TiktokArchiver from .tiktok_archiver import TiktokArchiver
from .telegram_archiver import TelegramArchiver from .telegram_archiver import TelegramArchiver
from .vk_archiver import VkArchiver from .vk_archiver import VkArchiver

View File

@@ -0,0 +1,67 @@
from telethon.sync import TelegramClient
from loguru import logger
import time, os

from . import Archiver
from ..core import Metadata, Media


class InstagramTbotArchiver(Archiver):
    """
    calls a telegram bot to fetch instagram posts/stories...
    https://github.com/adw0rd/instagrapi
    https://t.me/instagram_load_bot
    """
    name = "instagram_tbot_archiver"

    def __init__(self, config: dict) -> None:
        super().__init__(config)
        self.assert_valid_string("api_id")
        self.assert_valid_string("api_hash")
        self.timeout = int(self.timeout)
        self.client = TelegramClient(self.session_file, self.api_id, self.api_hash)

    @staticmethod
    def configs() -> dict:
        return {
            "api_id": {"default": None, "help": "telegram API_ID value, go to https://my.telegram.org/apps"},
            "api_hash": {"default": None, "help": "telegram API_HASH value, go to https://my.telegram.org/apps"},
            "session_file": {"default": "secrets/anon", "help": "optional, records the telegram login session for future usage, '.session' will be appended to the provided value."},
            "timeout": {"default": 15, "help": "timeout to fetch the instagram content in seconds."},
        }

    def setup(self) -> None:
        logger.info(f"SETUP {self.name} checking login...")
        with self.client.start():
            logger.success(f"SETUP {self.name} login works.")

    def download(self, item: Metadata) -> Metadata:
        url = item.get_url()
        if not "instagram.com" in url: return False

        result = Metadata()
        tmp_dir = item.get_tmp_dir()
        with self.client.start():
            chat = self.client.get_entity("instagram_load_bot")
            since_id = self.client.send_message(entity=chat, message=url).id

            attempts = 0
            media = None
            message = ""
            time.sleep(4)
            while attempts < self.timeout and (not message or not media):
                attempts += 1
                time.sleep(1)
                for post in self.client.iter_messages(chat, min_id=since_id):
                    since_id = max(since_id, post.id)
                    if post.media and not media:
                        filename_dest = os.path.join(tmp_dir, f'{chat.id}_{post.id}')
                        media = self.client.download_media(post.media, filename_dest)
                        if media: result.add_media(Media(media))
                    if post.message: message += post.message

            if message:
                result.set_content(message).set_title(message[:128])

            return result.success("insta-via-bot")
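
The reply-polling pattern in `download` can be tried out without a Telegram account. The sketch below is a simplified stand-in, not the archiver's code: `poll_replies`, `fetch_since`, and `fake_bot` are hypothetical names, and the fake fetcher takes the place of `client.iter_messages(chat, min_id=since_id)`. It illustrates the same idea of advancing `since_id` after each reply and stopping once both text and media have arrived or the attempt budget is spent.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical reply shape: (message_id, text_or_None, media_or_None)
Reply = Tuple[int, Optional[str], Optional[str]]


def poll_replies(fetch_since: Callable[[int], Iterable[Reply]],
                 since_id: int,
                 max_attempts: int = 15,
                 delay: float = 1.0) -> Tuple[str, Optional[str]]:
    """Poll until both text and media were seen, or attempts run out."""
    message, media = "", None
    attempts = 0
    while attempts < max_attempts and (not message or not media):
        attempts += 1
        time.sleep(delay)
        for post_id, text, post_media in fetch_since(since_id):
            # Track the newest id so later polls skip replies already handled.
            since_id = max(since_id, post_id)
            if post_media and not media:
                media = post_media
            if text:
                message += text
    return message, media


def fake_bot():
    """Stand-in for the bot: one more reply becomes visible on each poll."""
    queued = [(11, "caption text", None), (12, None, "photo.jpg")]
    visible = []

    def fetch_since(min_id: int):
        if queued:
            visible.append(queued.pop(0))
        # Like min_id in the diff, only ids strictly greater are returned.
        return [r for r in visible if r[0] > min_id]

    return fetch_since


message, media = poll_replies(fake_bot(), since_id=10, delay=0.0)
print(message, media)
```

With the fake bot, the caption arrives on the first poll and the media on the second, so the loop exits after two attempts instead of using the full budget.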

View File

@@ -114,7 +114,7 @@ class TelethonArchiver(Archiver):
with self.client.start(): with self.client.start():
# with self.client.start(bot_token=self.bot_token): # with self.client.start(bot_token=self.bot_token):
try: try:
post = self.client.get_messages(chat, ids=post_id) post = self.client.get_messages(chat, ids=post_id)
except ValueError as e: except ValueError as e:
logger.error(f"Could not fetch telegram {url} possibly it's private: {e}") logger.error(f"Could not fetch telegram {url} possibly it's private: {e}")
return False return False

View File

@@ -37,7 +37,7 @@ class TwitterArchiver(Archiver):
return self.link_clean_pattern.sub("\\1", url) return self.link_clean_pattern.sub("\\1", url)
def is_rearchivable(self, url: str) -> bool: def is_rearchivable(self, url: str) -> bool:
# Twitter posts are static # Twitter posts are static (for now)
return False return False
def download(self, item: Metadata) -> Metadata: def download(self, item: Metadata) -> Metadata:
@@ -86,7 +86,7 @@ class TwitterArchiver(Archiver):
media.filename = self.download_from_url(media.get("src"), f'{slugify(url)}_{i}{ext}', item) media.filename = self.download_from_url(media.get("src"), f'{slugify(url)}_{i}{ext}', item)
result.add_media(media) result.add_media(media)
return result.success("twitter") return result.success("twitter-snscrape")
def download_alternative(self, item: Metadata, url: str, tweet_id: str) -> Metadata: def download_alternative(self, item: Metadata, url: str, tweet_id: str) -> Metadata:
""" """

View File

@@ -6,7 +6,7 @@ from ..core import Metadata, Media
class YoutubeDLArchiver(Archiver): class YoutubeDLArchiver(Archiver):
name = "youtubedl_enricher" name = "youtubedl_archiver"
def __init__(self, config: dict) -> None: def __init__(self, config: dict) -> None:
super().__init__(config) super().__init__(config)

View File

@@ -63,6 +63,9 @@ class Metadata:
def is_success(self) -> bool: def is_success(self) -> bool:
return "success" in self.status return "success" in self.status
def is_empty(self) -> bool:
return not self.is_success() and len(self.media) == 0 and len(self.get_clean_metadata()) <= 2 # url, processed_at
@property # getter .netloc @property # getter .netloc
def netloc(self) -> str: def netloc(self) -> str:
return urlparse(self.get_url()).netloc return urlparse(self.get_url()).netloc
@@ -122,7 +125,7 @@ class Metadata:
for m in self.media: for m in self.media:
if m.get("id") == id: return m if m.get("id") == id: return m
return default return default
def get_first_image(self, default=None) -> Media: def get_first_image(self, default=None) -> Media:
for m in self.media: for m in self.media:
if "image" in m.mimetype: return m if "image" in m.mimetype: return m
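
The new `is_empty` check combines three conditions with `and`. The sketch below uses a hypothetical `MiniMetadata` stand-in (not the project's `Metadata` class) and assumes, following the inline comment in the diff, that `url` and `processed_at` are the two keys that are always present.

```python
from dataclasses import dataclass, field


@dataclass
class MiniMetadata:
    """Hypothetical, simplified stand-in for the Metadata class."""
    status: str = ""
    media: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

    def is_success(self) -> bool:
        return "success" in self.status

    def is_empty(self) -> bool:
        # Empty = not successful, no media, and nothing beyond the two
        # bookkeeping keys (assumed here to be url and processed_at).
        return (not self.is_success()
                and len(self.media) == 0
                and len(self.metadata) <= 2)


base = {"url": "https://example.com", "processed_at": "2023-02-17"}
bare = MiniMetadata(metadata=dict(base))
titled = MiniMetadata(metadata={**base, "title": "a post"})
print(bare.is_empty(), titled.is_empty())
```

An item that only carries its URL and processing time counts as empty, while any extra metadata key, any media, or a success status makes it non-empty.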

View File

@@ -123,6 +123,9 @@ class ArchivingOrchestrator:
s.store(final_media, result) s.store(final_media, result)
result.set_final_media(final_media) result.set_final_media(final_media)
if result.is_empty():
result.status = "nothing archived"
# signal completion to databases (DBs, Google Sheets, CSV, ...) # signal completion to databases (DBs, Google Sheets, CSV, ...)
for d in self.databases: d.done(result) for d in self.databases: d.done(result)

View File

@@ -2,10 +2,8 @@ from typing import Union, Tuple
import datetime import datetime
from urllib.parse import quote from urllib.parse import quote
# from metadata import Metadata
from loguru import logger from loguru import logger
# from . import Enricher
from . import Database from . import Database
from ..core import Metadata from ..core import Metadata
from ..core import Media from ..core import Media
@@ -61,13 +59,13 @@ class GsheetsDb(Database):
cell_updates.append((row, 'status', item.status)) cell_updates.append((row, 'status', item.status))
media: Media = item.get_final_media() media: Media = item.get_final_media()
if hasattr(media, "urls"):
batch_if_valid('archive', "\n".join(media.urls)) batch_if_valid('archive', "\n".join(media.urls))
batch_if_valid('date', True, datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc).isoformat()) batch_if_valid('date', True, datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc).isoformat())
batch_if_valid('title', item.get_title()) batch_if_valid('title', item.get_title())
batch_if_valid('text', item.get("content", "")[:500]) batch_if_valid('text', item.get("content", "")[:500])
batch_if_valid('timestamp', item.get_timestamp()) batch_if_valid('timestamp', item.get_timestamp())
if (screenshot := item.get_media_by_id("screenshot")): if (screenshot := item.get_media_by_id("screenshot")) and hasattr(screenshot, "urls"):
batch_if_valid('screenshot', "\n".join(screenshot.urls)) batch_if_valid('screenshot', "\n".join(screenshot.urls))
if (thumbnail := item.get_first_image("thumbnail")): if (thumbnail := item.get_first_image("thumbnail")):

View File

@@ -3,7 +3,7 @@ import time, uuid, os
from selenium.common.exceptions import TimeoutException from selenium.common.exceptions import TimeoutException
from . import Enricher from . import Enricher
from ..utils import Webdriver from ..utils import Webdriver, UrlUtil
from ..core import Media, Metadata from ..core import Media, Metadata
class ScreenshotEnricher(Enricher): class ScreenshotEnricher(Enricher):
@@ -19,6 +19,10 @@ class ScreenshotEnricher(Enricher):
def enrich(self, to_enrich: Metadata) -> None: def enrich(self, to_enrich: Metadata) -> None:
url = to_enrich.get_url() url = to_enrich.get_url()
if UrlUtil.is_auth_wall(url):
logger.debug(f"[SKIP] SCREENSHOT since url is behind AUTH WALL: {url=}")
return
logger.debug(f"Enriching screenshot for {url=}") logger.debug(f"Enriching screenshot for {url=}")
with Webdriver(self.width, self.height, self.timeout, 'facebook.com' in url) as driver: with Webdriver(self.width, self.height, self.timeout, 'facebook.com' in url) as driver:
try: try:

View File

@@ -1,8 +1,10 @@
from loguru import logger from loguru import logger
import time, requests import time, requests
from . import Enricher from . import Enricher
from ..archivers import Archiver from ..archivers import Archiver
from ..utils import UrlUtil
from ..core import Metadata from ..core import Metadata
class WaybackArchiverEnricher(Enricher, Archiver): class WaybackArchiverEnricher(Enricher, Archiver):
@@ -33,6 +35,10 @@ class WaybackArchiverEnricher(Enricher, Archiver):
def enrich(self, to_enrich: Metadata) -> bool: def enrich(self, to_enrich: Metadata) -> bool:
url = to_enrich.get_url() url = to_enrich.get_url()
if UrlUtil.is_auth_wall(url):
logger.debug(f"[SKIP] WAYBACK since url is behind AUTH WALL: {url=}")
return
logger.debug(f"calling wayback for {url=}") logger.debug(f"calling wayback for {url=}")
if to_enrich.get("wayback"): if to_enrich.get("wayback"):

View File

@@ -3,6 +3,7 @@ from dataclasses import dataclass
import mimetypes, uuid, os, pathlib import mimetypes, uuid, os, pathlib
from jinja2 import Environment, FileSystemLoader from jinja2 import Environment, FileSystemLoader
from urllib.parse import quote from urllib.parse import quote
from loguru import logger
from ..version import __version__ from ..version import __version__
from ..core import Metadata, Media from ..core import Metadata, Media
@@ -26,12 +27,17 @@ class HtmlFormatter(Formatter):
@staticmethod @staticmethod
def configs() -> dict: def configs() -> dict:
return { return {
"detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"}, "detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"}
} }
def format(self, item: Metadata) -> Media: def format(self, item: Metadata) -> Media:
url = item.get_url()
if item.is_empty():
logger.debug(f"[SKIP] FORMAT there is no media or metadata to format: {url=}")
return
content = self.template.render( content = self.template.render(
url=item.get_url(), url=url,
title=item.get_title(), title=item.get_title(),
media=item.media, media=item.media,
metadata=item.get_clean_metadata(), metadata=item.get_clean_metadata(),

View File

@@ -2,4 +2,5 @@
from .gworksheet import GWorksheet from .gworksheet import GWorksheet
from .misc import * from .misc import *
from .webdriver import Webdriver from .webdriver import Webdriver
from .gsheet import Gsheets from .gsheet import Gsheets
from .url import UrlUtil

View File

@@ -0,0 +1,19 @@
import re


class UrlUtil:
    telegram_private = re.compile(r"https:\/\/t\.me(\/c)\/(.+)\/(\d+)")
    is_istagram = re.compile(r"https:\/\/www\.instagram\.com")

    @staticmethod
    def clean(url): return url

    @staticmethod
    def is_auth_wall(url):
        """
        checks if URL is behind an authentication wall meaning steps like wayback, wacz, ... may not work
        """
        if UrlUtil.telegram_private.match(url): return True
        if UrlUtil.is_istagram.match(url): return True

        return False
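
To see which URLs the two patterns flag, they can be run against a few samples. The sketch below copies the regular expressions from the diff into a standalone function; the function wrapper and the sample URLs are made up for illustration. Because `re.match` anchors only at the start of the string, an Instagram URL without the `www.` prefix is not flagged.

```python
import re

# Patterns copied from the diff; the standalone wrapper is only a sketch.
telegram_private = re.compile(r"https:\/\/t\.me(\/c)\/(.+)\/(\d+)")
instagram = re.compile(r"https:\/\/www\.instagram\.com")


def is_auth_wall(url: str) -> bool:
    """Return True when the URL matches one of the auth-walled patterns."""
    return bool(telegram_private.match(url) or instagram.match(url))


samples = [
    "https://t.me/c/1234567890/42",         # private Telegram channel post
    "https://t.me/some_public_channel/42",  # public channel, no /c/ segment
    "https://www.instagram.com/p/abc123/",  # Instagram with www.
    "https://instagram.com/p/abc123/",      # no www., so the pattern misses it
    "https://example.com/",
]
for url in samples:
    print(is_auth_wall(url), url)
```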

View File

@@ -1,9 +1,9 @@
_MAJOR = "0" _MAJOR = "0"
_MINOR = "3" _MINOR = "4"
# On main and in a nightly release the patch should be one ahead of the last # On main and in a nightly release the patch should be one ahead of the last
# released build. # released build.
_PATCH = "0" _PATCH = "1"
# This is mainly for nightly builds which have the suffix ".dev$DATE". See # This is mainly for nightly builds which have the suffix ".dev$DATE". See
# https://semver.org/#is-v123-a-semantic-version for the semantics. # https://semver.org/#is-v123-a-semantic-version for the semantics.
_SUFFIX = "" _SUFFIX = ""