mirror of
https://github.com/bellingcat/auto-archiver.git
synced 2026-06-12 05:08:28 +03:00
Compare commits
11 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 1970fa3c82 | | |
| | aa5430451e | | |
| | f35875a94c | | |
| | 5505255ea3 | | |
| | da17b3f68a | | |
| | d6dbdec6ac | | |
| | 224ebe7ee8 | | |
| | 54a1bc2172 | | |
| | 77948207d1 | | |
| | 60552ae0ea | | |
| | f255271ecb | | |
40
README.md
@@ -1,4 +1,12 @@
|
|||||||
# Auto Archiver
|
<h1 align="center">Auto Archiver</h1>
|
||||||
|
|
||||||
|
[](https://badge.fury.io/py/auto-archiver)
|
||||||
|
[](https://pypi.org/project/auto-archiver/)
|
||||||
|
<!--  -->
|
||||||
|
<!-- [](https://pypi.python.org/pypi/auto-archiver/) -->
|
||||||
|
<!-- [](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest) -->
|
||||||
|
|
||||||
|
|
||||||
Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/).
|
Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/).
|
||||||
|
|
||||||
|
|
||||||
@@ -15,6 +23,11 @@ But **you always need a configuration/orchestration file**, which is where you'l
|
|||||||
## How to run the auto-archiver
|
## How to run the auto-archiver
|
||||||
|
|
||||||
### Option 1 - docker
|
### Option 1 - docker
|
||||||
|
|
||||||
|
<details><summary><code>Docker instructions</code></summary>
|
||||||
|
|
||||||
|
[](https://hub.docker.com/r/bellingcat/auto-archiver)
|
||||||
|
|
||||||
Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, you need to connect folders on your machine with folders inside docker using the `-v` volume flag whenever you pass it your orchestration file or get downloaded media out of docker.
|
Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, you need to connect folders on your machine with folders inside docker using the `-v` volume flag whenever you pass it your orchestration file or get downloaded media out of docker.
|
||||||
|
|
||||||
|
|
||||||
@@ -32,14 +45,20 @@ Docker works like a virtual machine running inside your computer, it isolates ev
|
|||||||
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
|
2. `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
|
||||||
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
|
3. `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
### Option 2 - python package
|
### Option 2 - python package
|
||||||
|
|
||||||
|
<details><summary><code>Python package instructions</code></summary>
|
||||||
|
|
||||||
1. make sure you have python 3.8 or higher installed
|
1. make sure you have python 3.8 or higher installed
|
||||||
2. install the package `pip/pipenv/conda install auto-archiver`
|
2. install the package `pip/pipenv/conda install auto-archiver`
|
||||||
3. test it's installed with `auto-archiver --help`
|
3. test it's installed with `auto-archiver --help`
|
||||||
4. run it with your orchestration file and pass any flags you want in the command line `auto-archiver --config secrets/orchestration.yaml`
|
4. run it with your orchestration file and pass any flags you want in the command line `auto-archiver --config secrets/orchestration.yaml`
|
||||||
1. if your orchestration file is inside a `secrets/` which we advise
|
1. if your orchestration file is inside a `secrets/` which we advise
|
||||||
|
|
||||||
|
</details>
|
||||||
|
|
||||||
|
|
||||||
### Option 3 - local installation
|
### Option 3 - local installation
|
||||||
This can also be used for development.
|
This can also be used for development.
|
||||||
@@ -60,13 +79,6 @@ Clone and run:
|
|||||||
|
|
||||||
</details><br/>
|
</details><br/>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Examples
|
|
||||||
|
|
||||||
|
|
||||||
# Orchestration
|
# Orchestration
|
||||||
The archiver work is orchestrated by the following workflow (we call each a **step**):
|
The archiver work is orchestrated by the following workflow (we call each a **step**):
|
||||||
1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
|
1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
|
||||||
@@ -85,7 +97,7 @@ The structure of orchestration file is split into 2 parts: `steps` (what **steps
|
|||||||
steps:
|
steps:
|
||||||
feeder: gsheet_feeder
|
feeder: gsheet_feeder
|
||||||
archivers: # order matters
|
archivers: # order matters
|
||||||
- youtubedl_enricher
|
- youtubedl_archiver
|
||||||
enrichers:
|
enrichers:
|
||||||
- thumbnail_enricher
|
- thumbnail_enricher
|
||||||
formatter: html_formatter
|
formatter: html_formatter
|
||||||
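To make the step order in the `steps` block above concrete, here is a minimal sketch of how such a pipeline could be wired. It is not the project's orchestrator: `run_pipeline`, the stand-in callables, and the rule that the first archiver returning a result wins are assumptions made for illustration. Only the step names, the "order matters" comment, and the "nothing archived" status come from this diff.

```python
from typing import Callable, Iterable, Optional

# Hypothetical step signatures, for illustration only.
Archiver = Callable[[str], Optional[dict]]
Enricher = Callable[[dict], None]


def run_pipeline(urls: Iterable[str], archivers: list[Archiver], enrichers: list[Enricher]) -> list[dict]:
    results = []
    for url in urls:  # the feeder supplies the links
        result = None
        for archive in archivers:  # order matters: earlier archivers are tried first
            result = archive(url)
            if result is not None:
                break  # assumption: stop at the first archiver that succeeds
        if result is None:
            result = {"url": url, "status": "nothing archived"}
        for enrich in enrichers:  # every enricher sees the (possibly empty) result
            enrich(result)
        results.append(result)
    return results


def video_archiver(url: str) -> Optional[dict]:
    # Stand-in for something like youtubedl_archiver.
    return {"url": url, "status": "success"} if "video" in url else None


def add_thumbnail(result: dict) -> None:
    # Stand-in for something like thumbnail_enricher.
    result["thumbnail"] = result["status"] == "success"


if __name__ == "__main__":
    demo_urls = ["https://example.com/video/1", "https://example.com/page"]
    for item in run_pipeline(demo_urls, [video_archiver], [add_thumbnail]):
        print(item)
```

Keeping each step a plain callable makes the ordering explicit: reordering the `archivers` list changes which archiver gets the first attempt at a URL.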
@@ -178,16 +190,18 @@ Note that the first row is skipped, as it is assumed to be a header row (`--gshe
|
|||||||
## Development
|
## Development
|
||||||
Use `python -m src.auto_archiver --config secrets/orchestration.yaml` to run from the local development environment.
|
Use `python -m src.auto_archiver --config secrets/orchestration.yaml` to run from the local development environment.
|
||||||
|
|
||||||
# Docker development
|
#### Docker development
|
||||||
* working with docker locally:
|
working with docker locally:
|
||||||
* `docker build . -t auto-archiver` to build a local image
|
* `docker build . -t auto-archiver` to build a local image
|
||||||
* `docker run --rm -v $PWD/secrets:/app/secrets aa --config secrets/config.yaml`
|
* `docker run --rm -v $PWD/secrets:/app/secrets aa --config secrets/config.yaml`
|
||||||
* to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
|
* to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
|
||||||
* release to docker hub
|
|
||||||
|
|
||||||
|
release to docker hub
|
||||||
* `docker image tag auto-archiver bellingcat/auto-archiver:latest`
|
* `docker image tag auto-archiver bellingcat/auto-archiver:latest`
|
||||||
* `docker push bellingcat/auto-archiver`
|
* `docker push bellingcat/auto-archiver`
|
||||||
|
|
||||||
# RELEASE
|
#### RELEASE
|
||||||
* update version in [version.py](src/auto_archiver/version.py)
|
* update version in [version.py](src/auto_archiver/version.py)
|
||||||
* run `bash ./scripts/release.sh` and confirm
|
* run `bash ./scripts/release.sh` and confirm
|
||||||
* package is automatically updated in pypi
|
* package is automatically updated in pypi
|
||||||
|
|||||||
@@ -1,80 +1,79 @@
|
|||||||
steps:
|
steps:
|
||||||
# only 1 feeder allowed
|
# only 1 feeder allowed
|
||||||
# a feeder could be in an "infinite loop" for example: gsheets_infinite feeder which holds-> this could be an easy logic addiction by modifying for each to while not feeder.done() if it becomes necessary
|
# feeder: cli_feeder # default feeder
|
||||||
feeder: gsheet_feeder # default -> only expects URL from CLI
|
feeder: gsheet_feeder # default -> only expects URL from CLI
|
||||||
archivers: # order matters
|
archivers: # order matters
|
||||||
- telethon
|
# - vk_archiver
|
||||||
# - tiktok
|
# - telethon_archiver
|
||||||
# - twitter
|
# - telegram_archiver
|
||||||
# - instagram
|
# - twitter_archiver
|
||||||
# - webarchive # this way it runs as a failsafe only
|
# - twitter_api_archiver
|
||||||
# enrichers:
|
# - instagram_archiver
|
||||||
# - screenshot
|
# - instagram_tbot_archiver
|
||||||
# - wacz
|
# - tiktok_archiver
|
||||||
# - webarchive # this way it runs for every case, webarchive extends archiver and enrichment
|
- youtubedl_archiver
|
||||||
# - thumbnails
|
# - wayback_archiver_enricher
|
||||||
formatters:
|
enrichers:
|
||||||
- HTMLFormater
|
- hash_enricher
|
||||||
- PdfFormater
|
- screenshot_enricher
|
||||||
|
- thumbnail_enricher
|
||||||
|
# - wayback_archiver_enricher
|
||||||
|
# - wacz_enricher
|
||||||
|
|
||||||
|
formatter: html_formatter # defaults to mute_formatter
|
||||||
storages:
|
storages:
|
||||||
- local_storage
|
- local_storage
|
||||||
- s3
|
# - s3_storage
|
||||||
|
# - gdrive_storage
|
||||||
databases:
|
databases:
|
||||||
- gsheets_db
|
# - console_db
|
||||||
- mongo_db
|
# - csv_db
|
||||||
|
- gsheet_db
|
||||||
|
# - mongo_db
|
||||||
|
|
||||||
configurations:
|
configurations:
|
||||||
gsheet_feeder:
|
gsheet_feeder:
|
||||||
sheet: my-auto-archiver
|
sheet: auto-archiver-test
|
||||||
header: 2 # defaults to 1 in GSheetsFeeder
|
header: 2 # defaults to 1 in GSheetsFeeder
|
||||||
service_account: "secrets/service_account.json"
|
service_account: "secrets/service_account.json"
|
||||||
# allow_worksheets: "allowed"
|
use_sheet_names_in_stored_paths: false
|
||||||
# block_worksheets: "blocked1,blocked2"
|
|
||||||
columns:
|
columns:
|
||||||
'url': 'link'
|
url: link
|
||||||
'status': 'archive status'
|
status: archive status
|
||||||
'folder': 'destination folder'
|
folder: destination folder
|
||||||
'archive': 'archive location'
|
archive: archive location
|
||||||
'date': 'archive date'
|
date: archive date
|
||||||
'thumbnail': 'thumbnail'
|
thumbnail: thumbnail
|
||||||
'thumbnail_index': 'thumbnail index'
|
thumbnail_index: thumbnail index
|
||||||
'timestamp': 'upload timestamp'
|
timestamp: upload timestamp
|
||||||
'title': 'upload title'
|
title: upload title
|
||||||
'duration': 'duration'
|
text: textual content
|
||||||
'screenshot': 'screenshot'
|
duration: duration
|
||||||
'hash': 'hash'
|
screenshot: screenshot
|
||||||
'wacz': 'wacz'
|
hash: hash
|
||||||
'replaywebpage': 'replaywebpage'
|
wacz: wacz
|
||||||
telethon:
|
replaywebpage: replaywebpage
|
||||||
api_id: "1234567"
|
|
||||||
api_hash: "examplehash"
|
|
||||||
session_file: "secrets/anon"
|
|
||||||
channel_invites:
|
|
||||||
- invite: https://t.me/+XXXXXXXXXXXXXX
|
|
||||||
id: 1000000000
|
|
||||||
- invite: https://t.me/joinchat/XXXXXXXXXXXXXX
|
|
||||||
id: 1000000001
|
|
||||||
|
|
||||||
tiktok:
|
screenshot_enricher:
|
||||||
api_keys:
|
|
||||||
- username: 1
|
|
||||||
password: 2
|
|
||||||
- username: 3
|
|
||||||
password: 4
|
|
||||||
username: "abc"
|
|
||||||
password: "123"
|
|
||||||
token: "here"
|
|
||||||
screenshot:
|
|
||||||
width: 1280
|
width: 1280
|
||||||
height: 4600
|
height: 2300
|
||||||
wacz:
|
wayback_archiver_enricher:
|
||||||
profile: secrets/profile.tar.gz
|
timeout: 10
|
||||||
webarchive:
|
key: ""
|
||||||
api_key: "12345"
|
secret: ""
|
||||||
s3:
|
hash_enricher:
|
||||||
- bucket: 123
|
algorithm: "SHA3-512"
|
||||||
- region: "nyc3"
|
# wacz:
|
||||||
- cdn: "{region}{bucket}"
|
# profile: secrets/profile.tar.gz
|
||||||
|
local_storage:
|
||||||
|
save_to: "./local_archive"
|
||||||
|
save_absolute: true
|
||||||
|
filename_generator: static
|
||||||
|
path_generator: flat
|
||||||
|
|
||||||
|
gdrive_storage:
|
||||||
|
path_generator: url
|
||||||
|
filename_generator: random
|
||||||
|
root_folder_id: TODO
|
||||||
|
oauth_token: secrets/gd-token.json
|
||||||
|
service_account: "secrets/service_account.json"
|
||||||
|
|||||||
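The example configuration above sets `hash_enricher.algorithm` to `"SHA3-512"`. As a rough sketch of what hashing an archived file with that setting could look like, the snippet below uses Python's standard `hashlib`. The `ALGORITHMS` mapping, the `"SHA-256"` entry, and the `hash_file` helper are assumptions for illustration, not the enricher's actual code.

```python
import hashlib
import os
import tempfile

# Assumed mapping from the config string to a hashlib algorithm name.
ALGORITHMS = {"SHA-256": "sha256", "SHA3-512": "sha3_512"}


def hash_file(path: str, algorithm: str = "SHA3-512", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    hasher = hashlib.new(ALGORITHMS[algorithm])
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"archived media bytes")
    try:
        print(hash_file(tmp.name))
    finally:
        os.remove(tmp.name)
```

Reading in chunks matters for large video files, which would otherwise be loaded into memory in one piece.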
@@ -3,6 +3,7 @@ from .telethon_archiver import TelethonArchiver
|
|||||||
from .twitter_archiver import TwitterArchiver
|
from .twitter_archiver import TwitterArchiver
|
||||||
from .twitter_api_archiver import TwitterApiArchiver
|
from .twitter_api_archiver import TwitterApiArchiver
|
||||||
from .instagram_archiver import InstagramArchiver
|
from .instagram_archiver import InstagramArchiver
|
||||||
|
from .instagram_tbot_archiver import InstagramTbotArchiver
|
||||||
from .tiktok_archiver import TiktokArchiver
|
from .tiktok_archiver import TiktokArchiver
|
||||||
from .telegram_archiver import TelegramArchiver
|
from .telegram_archiver import TelegramArchiver
|
||||||
from .vk_archiver import VkArchiver
|
from .vk_archiver import VkArchiver
|
||||||
|
|||||||
67
src/auto_archiver/archivers/instagram_tbot_archiver.py
Normal file
@@ -0,0 +1,67 @@
|
|||||||
|
|
||||||
|
from telethon.sync import TelegramClient
|
||||||
|
from loguru import logger
|
||||||
|
import time, os
|
||||||
|
|
||||||
|
from . import Archiver
|
||||||
|
from ..core import Metadata, Media
|
||||||
|
|
||||||
|
|
||||||
|
class InstagramTbotArchiver(Archiver):
|
||||||
|
"""
|
||||||
|
calls a telegram bot to fetch instagram posts/stories...
|
||||||
|
https://github.com/adw0rd/instagrapi
|
||||||
|
https://t.me/instagram_load_bot
|
||||||
|
"""
|
||||||
|
name = "instagram_tbot_archiver"
|
||||||
|
|
||||||
|
def __init__(self, config: dict) -> None:
|
||||||
|
super().__init__(config)
|
||||||
|
self.assert_valid_string("api_id")
|
||||||
|
self.assert_valid_string("api_hash")
|
||||||
|
self.timeout = int(self.timeout)
|
||||||
|
self.client = TelegramClient(self.session_file, self.api_id, self.api_hash)
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def configs() -> dict:
|
||||||
|
return {
|
||||||
|
"api_id": {"default": None, "help": "telegram API_ID value, go to https://my.telegram.org/apps"},
|
||||||
|
"api_hash": {"default": None, "help": "telegram API_HASH value, go to https://my.telegram.org/apps"},
|
||||||
|
"session_file": {"default": "secrets/anon", "help": "optional, records the telegram login session for future usage, '.session' will be appended to the provided value."},
|
||||||
|
"timeout": {"default": 15, "help": "timeout to fetch the instagram content in seconds."},
|
||||||
|
}
|
||||||
|
|
||||||
|
def setup(self) -> None:
|
||||||
|
logger.info(f"SETUP {self.name} checking login...")
|
||||||
|
with self.client.start():
|
||||||
|
logger.success(f"SETUP {self.name} login works.")
|
||||||
|
|
||||||
|
def download(self, item: Metadata) -> Metadata:
|
||||||
|
url = item.get_url()
|
||||||
|
if "instagram.com" not in url: return False
|
||||||
|
|
||||||
|
result = Metadata()
|
||||||
|
tmp_dir = item.get_tmp_dir()
|
||||||
|
with self.client.start():
|
||||||
|
chat = self.client.get_entity("instagram_load_bot")
|
||||||
|
since_id = self.client.send_message(entity=chat, message=url).id
|
||||||
|
|
||||||
|
attempts = 0
|
||||||
|
media = None
|
||||||
|
message = ""
|
||||||
|
time.sleep(4)
|
||||||
|
while attempts < self.timeout and (not message or not media):
|
||||||
|
attempts += 1
|
||||||
|
time.sleep(1)
|
||||||
|
for post in self.client.iter_messages(chat, min_id=since_id):
|
||||||
|
since_id = max(since_id, post.id)
|
||||||
|
if post.media and not media:
|
||||||
|
filename_dest = os.path.join(tmp_dir, f'{chat.id}_{post.id}')
|
||||||
|
media = self.client.download_media(post.media, filename_dest)
|
||||||
|
if media: result.add_media(Media(media))
|
||||||
|
if post.message: message += post.message
|
||||||
|
|
||||||
|
if message:
|
||||||
|
result.set_content(message).set_title(message[:128])
|
||||||
|
|
||||||
|
return result.success("insta-via-bot")
|
||||||
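The `download` method above waits for the bot's reply by polling: it keeps reading messages newer than `since_id` until it has both a media file and some text, or until `timeout` attempts have passed. The sketch below isolates that loop so it can be run without Telegram. `poll_for_reply` and `fetch_new` are hypothetical names, and `fetch_new` stands in for `client.iter_messages(chat, min_id=...)`. The initial `time.sleep(4)` from the original is left out.

```python
import time
from typing import Callable, Optional


def poll_for_reply(
    fetch_new: Callable[[int], list[dict]],
    since_id: int,
    max_attempts: int = 15,
    delay: float = 1.0,
) -> tuple[Optional[str], str]:
    """Poll until both a media item and some text have arrived, or attempts run out."""
    media: Optional[str] = None
    message = ""
    attempts = 0
    while attempts < max_attempts and (not message or not media):
        attempts += 1
        time.sleep(delay)
        for post in fetch_new(since_id):
            since_id = max(since_id, post["id"])  # never re-read older posts
            if post.get("media") and not media:
                media = post["media"]  # keep only the first media item
            if post.get("message"):
                message += post["message"]
    return media, message


if __name__ == "__main__":
    inbox = [{"id": 2, "message": "caption"}, {"id": 3, "media": "photo.jpg"}]

    def fake_fetch(min_id: int) -> list[dict]:
        # Hypothetical inbox that releases one new post per call.
        return [p for p in inbox if p["id"] > min_id][:1]

    print(poll_for_reply(fake_fetch, since_id=1, delay=0))
```

Because each attempt sleeps for `delay` seconds, `max_attempts` only approximates a timeout in seconds; time spent fetching messages or downloading media comes on top of it.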
@@ -114,7 +114,7 @@ class TelethonArchiver(Archiver):
|
|||||||
with self.client.start():
|
with self.client.start():
|
||||||
# with self.client.start(bot_token=self.bot_token):
|
# with self.client.start(bot_token=self.bot_token):
|
||||||
try:
|
try:
|
||||||
post = self.client.get_messages(chat, ids=post_id)
|
post = self.client.get_messages(chat, ids=post_id)
|
||||||
except ValueError as e:
|
except ValueError as e:
|
||||||
logger.error(f"Could not fetch telegram {url} possibly it's private: {e}")
|
logger.error(f"Could not fetch telegram {url} possibly it's private: {e}")
|
||||||
return False
|
return False
|
||||||
|
|||||||
@@ -37,7 +37,7 @@ class TwitterArchiver(Archiver):
|
|||||||
return self.link_clean_pattern.sub("\\1", url)
|
return self.link_clean_pattern.sub("\\1", url)
|
||||||
|
|
||||||
def is_rearchivable(self, url: str) -> bool:
|
def is_rearchivable(self, url: str) -> bool:
|
||||||
# Twitter posts are static
|
# Twitter posts are static (for now)
|
||||||
return False
|
return False
|
||||||
|
|
||||||
def download(self, item: Metadata) -> Metadata:
|
def download(self, item: Metadata) -> Metadata:
|
||||||
@@ -86,7 +86,7 @@ class TwitterArchiver(Archiver):
|
|||||||
media.filename = self.download_from_url(media.get("src"), f'{slugify(url)}_{i}{ext}', item)
|
media.filename = self.download_from_url(media.get("src"), f'{slugify(url)}_{i}{ext}', item)
|
||||||
result.add_media(media)
|
result.add_media(media)
|
||||||
|
|
||||||
return result.success("twitter")
|
return result.success("twitter-snscrape")
|
||||||
|
|
||||||
def download_alternative(self, item: Metadata, url: str, tweet_id: str) -> Metadata:
|
def download_alternative(self, item: Metadata, url: str, tweet_id: str) -> Metadata:
|
||||||
"""
|
"""
|
||||||
|
|||||||
@@ -6,7 +6,7 @@ from ..core import Metadata, Media
 
 
 class YoutubeDLArchiver(Archiver):
-    name = "youtubedl_enricher"
+    name = "youtubedl_archiver"
 
     def __init__(self, config: dict) -> None:
         super().__init__(config)
@@ -63,6 +63,9 @@ class Metadata:
|
|||||||
def is_success(self) -> bool:
|
def is_success(self) -> bool:
|
||||||
return "success" in self.status
|
return "success" in self.status
|
||||||
|
|
||||||
|
def is_empty(self) -> bool:
|
||||||
|
return not self.is_success() and len(self.media) == 0 and len(self.get_clean_metadata()) <= 2 # url, processed_at
|
||||||
|
|
||||||
@property # getter .netloc
|
@property # getter .netloc
|
||||||
def netloc(self) -> str:
|
def netloc(self) -> str:
|
||||||
return urlparse(self.get_url()).netloc
|
return urlparse(self.get_url()).netloc
|
||||||
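The new `is_empty` check treats a result as empty when it did not succeed, carries no media, and holds at most the two bookkeeping keys named in the comment (`url`, `processed_at`). A reduced, runnable sketch of that rule follows. `MiniMetadata` is a hypothetical stand-in rather than the project's `Metadata` class, and it reads a plain `metadata` dict where the original calls `get_clean_metadata()`.

```python
from dataclasses import dataclass, field


@dataclass
class MiniMetadata:
    """Hypothetical, reduced stand-in for Metadata; not the project's class."""

    status: str = ""
    metadata: dict = field(default_factory=dict)
    media: list = field(default_factory=list)

    def is_success(self) -> bool:
        return "success" in self.status

    def is_empty(self) -> bool:
        # Empty means: not successful, no media, and nothing beyond the
        # two bookkeeping keys (url, processed_at).
        return not self.is_success() and len(self.media) == 0 and len(self.metadata) <= 2


if __name__ == "__main__":
    bare = MiniMetadata(metadata={"url": "https://example.com", "processed_at": "2023-01-01"})
    archived = MiniMetadata(status="twitter: success", metadata={"url": "https://example.com"})
    print(bare.is_empty(), archived.is_empty())
```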
@@ -122,7 +125,7 @@ class Metadata:
|
|||||||
for m in self.media:
|
for m in self.media:
|
||||||
if m.get("id") == id: return m
|
if m.get("id") == id: return m
|
||||||
return default
|
return default
|
||||||
|
|
||||||
def get_first_image(self, default=None) -> Media:
|
def get_first_image(self, default=None) -> Media:
|
||||||
for m in self.media:
|
for m in self.media:
|
||||||
if "image" in m.mimetype: return m
|
if "image" in m.mimetype: return m
|
||||||
|
|||||||
@@ -123,6 +123,9 @@ class ArchivingOrchestrator:
|
|||||||
s.store(final_media, result)
|
s.store(final_media, result)
|
||||||
result.set_final_media(final_media)
|
result.set_final_media(final_media)
|
||||||
|
|
||||||
|
if result.is_empty():
|
||||||
|
result.status = "nothing archived"
|
||||||
|
|
||||||
# signal completion to databases (DBs, Google Sheets, CSV, ...)
|
# signal completion to databases (DBs, Google Sheets, CSV, ...)
|
||||||
for d in self.databases: d.done(result)
|
for d in self.databases: d.done(result)
|
||||||
|
|
||||||
|
|||||||
@@ -2,10 +2,8 @@ from typing import Union, Tuple
|
|||||||
import datetime
|
import datetime
|
||||||
from urllib.parse import quote
|
from urllib.parse import quote
|
||||||
|
|
||||||
# from metadata import Metadata
|
|
||||||
from loguru import logger
|
from loguru import logger
|
||||||
|
|
||||||
# from . import Enricher
|
|
||||||
from . import Database
|
from . import Database
|
||||||
from ..core import Metadata
|
from ..core import Metadata
|
||||||
from ..core import Media
|
from ..core import Media
|
||||||
@@ -61,13 +59,13 @@ class GsheetsDb(Database):
|
|||||||
cell_updates.append((row, 'status', item.status))
|
cell_updates.append((row, 'status', item.status))
|
||||||
|
|
||||||
media: Media = item.get_final_media()
|
media: Media = item.get_final_media()
|
||||||
|
if hasattr(media, "urls"):
|
||||||
batch_if_valid('archive', "\n".join(media.urls))
|
batch_if_valid('archive', "\n".join(media.urls))
|
||||||
batch_if_valid('date', True, datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc).isoformat())
|
batch_if_valid('date', True, datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc).isoformat())
|
||||||
batch_if_valid('title', item.get_title())
|
batch_if_valid('title', item.get_title())
|
||||||
batch_if_valid('text', item.get("content", "")[:500])
|
batch_if_valid('text', item.get("content", "")[:500])
|
||||||
batch_if_valid('timestamp', item.get_timestamp())
|
batch_if_valid('timestamp', item.get_timestamp())
|
||||||
if (screenshot := item.get_media_by_id("screenshot")):
|
if (screenshot := item.get_media_by_id("screenshot")) and hasattr(screenshot, "urls"):
|
||||||
batch_if_valid('screenshot', "\n".join(screenshot.urls))
|
batch_if_valid('screenshot', "\n".join(screenshot.urls))
|
||||||
|
|
||||||
if (thumbnail := item.get_first_image("thumbnail")):
|
if (thumbnail := item.get_first_image("thumbnail")):
|
||||||
|
|||||||
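The `hasattr(media, "urls")` guards added above protect the sheet update from results that have no stored URLs. A small sketch of why a single `hasattr` check is enough: it is false both for `None` and for an object that never received a `urls` attribute. `FakeMedia` and `archive_cell` are hypothetical names, and the idea that `urls` is only set after storage is an assumption.

```python
from typing import Optional


class FakeMedia:
    """Hypothetical stand-in for Media; `urls` is assumed to be set only after storage."""

    def __init__(self, urls: Optional[list[str]] = None) -> None:
        if urls is not None:
            self.urls = urls


def archive_cell(media: Optional[FakeMedia]) -> Optional[str]:
    # hasattr() is False for None and for a media object without `urls`,
    # so one check covers "nothing formatted" and "nothing stored".
    if hasattr(media, "urls"):
        return "\n".join(media.urls)
    return None


if __name__ == "__main__":
    print(archive_cell(None))
    print(archive_cell(FakeMedia()))
    print(archive_cell(FakeMedia(["https://example.com/a", "https://example.com/b"])))
```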
@@ -3,7 +3,7 @@ import time, uuid, os
|
|||||||
from selenium.common.exceptions import TimeoutException
|
from selenium.common.exceptions import TimeoutException
|
||||||
|
|
||||||
from . import Enricher
|
from . import Enricher
|
||||||
from ..utils import Webdriver
|
from ..utils import Webdriver, UrlUtil
|
||||||
from ..core import Media, Metadata
|
from ..core import Media, Metadata
|
||||||
|
|
||||||
class ScreenshotEnricher(Enricher):
|
class ScreenshotEnricher(Enricher):
|
||||||
@@ -19,6 +19,10 @@ class ScreenshotEnricher(Enricher):
|
|||||||
|
|
||||||
def enrich(self, to_enrich: Metadata) -> None:
|
def enrich(self, to_enrich: Metadata) -> None:
|
||||||
url = to_enrich.get_url()
|
url = to_enrich.get_url()
|
||||||
|
if UrlUtil.is_auth_wall(url):
|
||||||
|
logger.debug(f"[SKIP] SCREENSHOT since url is behind AUTH WALL: {url=}")
|
||||||
|
return
|
||||||
|
|
||||||
logger.debug(f"Enriching screenshot for {url=}")
|
logger.debug(f"Enriching screenshot for {url=}")
|
||||||
with Webdriver(self.width, self.height, self.timeout, 'facebook.com' in url) as driver:
|
with Webdriver(self.width, self.height, self.timeout, 'facebook.com' in url) as driver:
|
||||||
try:
|
try:
|
||||||
|
|||||||
@@ -1,8 +1,10 @@
|
|||||||
from loguru import logger
|
from loguru import logger
|
||||||
import time, requests
|
import time, requests
|
||||||
|
|
||||||
|
|
||||||
from . import Enricher
|
from . import Enricher
|
||||||
from ..archivers import Archiver
|
from ..archivers import Archiver
|
||||||
|
from ..utils import UrlUtil
|
||||||
from ..core import Metadata
|
from ..core import Metadata
|
||||||
|
|
||||||
class WaybackArchiverEnricher(Enricher, Archiver):
|
class WaybackArchiverEnricher(Enricher, Archiver):
|
||||||
@@ -33,6 +35,10 @@ class WaybackArchiverEnricher(Enricher, Archiver):
|
|||||||
|
|
||||||
def enrich(self, to_enrich: Metadata) -> bool:
|
def enrich(self, to_enrich: Metadata) -> bool:
|
||||||
url = to_enrich.get_url()
|
url = to_enrich.get_url()
|
||||||
|
if UrlUtil.is_auth_wall(url):
|
||||||
|
logger.debug(f"[SKIP] WAYBACK since url is behind AUTH WALL: {url=}")
|
||||||
|
return
|
||||||
|
|
||||||
logger.debug(f"calling wayback for {url=}")
|
logger.debug(f"calling wayback for {url=}")
|
||||||
|
|
||||||
if to_enrich.get("wayback"):
|
if to_enrich.get("wayback"):
|
||||||
|
|||||||
@@ -3,6 +3,7 @@ from dataclasses import dataclass
|
|||||||
import mimetypes, uuid, os, pathlib
|
import mimetypes, uuid, os, pathlib
|
||||||
from jinja2 import Environment, FileSystemLoader
|
from jinja2 import Environment, FileSystemLoader
|
||||||
from urllib.parse import quote
|
from urllib.parse import quote
|
||||||
|
from loguru import logger
|
||||||
|
|
||||||
from ..version import __version__
|
from ..version import __version__
|
||||||
from ..core import Metadata, Media
|
from ..core import Metadata, Media
|
||||||
@@ -26,12 +27,17 @@ class HtmlFormatter(Formatter):
|
|||||||
@staticmethod
|
@staticmethod
|
||||||
def configs() -> dict:
|
def configs() -> dict:
|
||||||
return {
|
return {
|
||||||
"detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"},
|
"detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
def format(self, item: Metadata) -> Media:
|
def format(self, item: Metadata) -> Media:
|
||||||
|
url = item.get_url()
|
||||||
|
if item.is_empty():
|
||||||
|
logger.debug(f"[SKIP] FORMAT there is no media or metadata to format: {url=}")
|
||||||
|
return
|
||||||
|
|
||||||
content = self.template.render(
|
content = self.template.render(
|
||||||
url=item.get_url(),
|
url=url,
|
||||||
title=item.get_title(),
|
title=item.get_title(),
|
||||||
media=item.media,
|
media=item.media,
|
||||||
metadata=item.get_clean_metadata(),
|
metadata=item.get_clean_metadata(),
|
||||||
|
|||||||
@@ -2,4 +2,5 @@
 from .gworksheet import GWorksheet
 from .misc import *
 from .webdriver import Webdriver
 from .gsheet import Gsheets
+from .url import UrlUtil
19
src/auto_archiver/utils/url.py
Normal file
@@ -0,0 +1,19 @@
|
|||||||
|
import re
|
||||||
|
|
||||||
|
class UrlUtil:
|
||||||
|
telegram_private = re.compile(r"https:\/\/t\.me(\/c)\/(.+)\/(\d+)")
|
||||||
|
is_istagram = re.compile(r"https:\/\/www\.instagram\.com")
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def clean(url): return url
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def is_auth_wall(url):
|
||||||
|
"""
|
||||||
|
checks if URL is behind an authentication wall meaning steps like wayback, wacz, ... may not work
|
||||||
|
"""
|
||||||
|
if UrlUtil.telegram_private.match(url): return True
|
||||||
|
if UrlUtil.is_istagram.match(url): return True
|
||||||
|
|
||||||
|
return False
|
||||||
|
|
||||||
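To see which URLs the new `is_auth_wall` check would flag, the two regular expressions can be tried in isolation. The patterns below are copied from `utils/url.py` above; the module-level names and the standalone `is_auth_wall` function are a simplified restatement for illustration, not the class itself.

```python
import re

# Patterns copied from the new utils/url.py in this diff.
TELEGRAM_PRIVATE = re.compile(r"https:\/\/t\.me(\/c)\/(.+)\/(\d+)")
INSTAGRAM = re.compile(r"https:\/\/www\.instagram\.com")


def is_auth_wall(url: str) -> bool:
    """True when the URL matches one of the known login-gated patterns."""
    return bool(TELEGRAM_PRIVATE.match(url) or INSTAGRAM.match(url))


if __name__ == "__main__":
    samples = [
        "https://t.me/c/1234567890/42",      # private Telegram channel post
        "https://t.me/somechannel/42",       # public Telegram channel post
        "https://www.instagram.com/p/abc/",  # Instagram with www
        "https://instagram.com/p/abc/",      # Instagram without www
    ]
    for url in samples:
        print(url, is_auth_wall(url))
```

Since both patterns are anchored with `match` and spell out `https://` (and `www.` for Instagram), variants such as `http://` links or `instagram.com` without `www` are not flagged by these expressions as written.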
@@ -1,9 +1,9 @@
 
 _MAJOR = "0"
-_MINOR = "3"
+_MINOR = "4"
 # On main and in a nightly release the patch should be one ahead of the last
 # released build.
-_PATCH = "0"
+_PATCH = "1"
 # This is mainly for nightly builds which have the suffix ".dev$DATE". See
 # https://semver.org/#is-v123-a-semantic-version for the semantics.
 _SUFFIX = ""