new instagram archiver via telegram bot

instagram archiver via telegram bot
name fix
2026-06-10 04:08:28 +03:00 · 2023-02-17 16:15:25 +00:00 · 2023-02-17 15:46:29 +00:00 · 2023-02-17 15:46:05 +00:00 · 2023-02-17 15:45:58 +00:00 · 2023-02-17 15:45:35 +00:00
16 changed files with 215 additions and 94 deletions
--- a/README.md
+++ b/README.md
@@ -1,4 +1,12 @@
-# Auto Archiver
+<h1 align="center">Auto Archiver</h1>
+
+[![PyPI version](https://badge.fury.io/py/auto-archiver.svg)](https://badge.fury.io/py/auto-archiver)
+[![Docker Image Version (latest by date)](https://img.shields.io/docker/v/bellingcat/auto-archiver?label=version&logo=docker)](https://pypi.org/project/auto-archiver/)
+<!-- ![Docker Pulls](https://img.shields.io/docker/pulls/bellingcat/auto-archiver) -->
+<!-- [![PyPI download month](https://img.shields.io/pypi/dm/auto-archiver.svg)](https://pypi.python.org/pypi/auto-archiver/) -->
+<!-- [![Documentation Status](https://readthedocs.org/projects/vk-url-scraper/badge/?version=latest)](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest) -->
+
+
 Read the [article about Auto Archiver on bellingcat.com](https://www.bellingcat.com/resources/2022/09/22/preserve-vital-online-content-with-bellingcats-auto-archiver-tool/).


@@ -15,6 +23,11 @@ But **you always need a configuration/orchestration file**, which is where you'l
 ## How to run the auto-archiver

 ### Option 1 - docker
+
+<details><summary><code>Docker instructions</code></summary>
+
+[![dockeri.co](https://dockerico.blankenship.io/image/bellingcat/auto-archiver)](https://hub.docker.com/r/bellingcat/auto-archiver)
+
 Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple. Because it is an isolated environment, you need the `-v` volume flag to connect folders on your machine with folders inside docker whenever you pass it your orchestration file or want to get downloaded media out of docker.


@@ -32,14 +45,20 @@ Docker works like a virtual machine running inside your computer, it isolates ev
       2.  `$PWD/local_archive` is a folder `local_archive/` in case you want to archive locally and have the files accessible outside docker
       3.  `/app/local_archive` is a folder inside docker that you can reference in your orchestration.yml file 

+</details>

 ### Option 2 - python package
+
+<details><summary><code>Python package instructions</code></summary>
+
 1. make sure you have python 3.8 or higher installed
 2. install the package `pip/pipenv/conda install auto-archiver`
 3. test it's installed with `auto-archiver --help`
 4. run it with your orchestration file and pass any flags you want in the command line `auto-archiver --config secrets/orchestration.yaml`
   1. this assumes your orchestration file is inside a `secrets/` folder, which we advise

+</details>
+

 ### Option 3 - local installation
 This can also be used for development.
@@ -60,13 +79,6 @@ Clone and run:

 </details><br/>

-
-
-
-
-### Examples
-
-
 # Orchestration
 The archiver work is orchestrated by the following workflow (we call each a **step**): 
 1. **Feeder** gets the links (from a spreadsheet, from the console, ...)
@@ -85,7 +97,7 @@ The structure of orchestration file is split into 2 parts: `steps` (what **steps
 steps:
  feeder: gsheet_feeder
  archivers: # order matters
-    - youtubedl_enricher
+    - youtubedl_archiver
  enrichers:
    - thumbnail_enricher
  formatter: html_formatter
@@ -178,16 +190,18 @@ Note that the first row is skipped, as it is assumed to be a header row (`--gshe
 ## Development
 Use `python -m src.auto_archiver --config secrets/orchestration.yaml` to run from the local development environment.

-# Docker development
-* working with docker locally:
+#### Docker development
+working with docker locally:
  * `docker build . -t auto-archiver` to build a local image
  * `docker run --rm -v $PWD/secrets:/app/secrets auto-archiver --config secrets/config.yaml`
    * to use local archive, also create a volume `-v` for it by adding `-v $PWD/local_archive:/app/local_archive`
-* release to docker hub
+
+
+release to docker hub
  * `docker image tag auto-archiver bellingcat/auto-archiver:latest`
  * `docker push bellingcat/auto-archiver`

-# RELEASE
+#### RELEASE
 * update version in [version.py](src/auto_archiver/version.py)
 * run `bash ./scripts/release.sh` and confirm
 * package is automatically updated in pypi
--- a/example.orchestration.yaml
+++ b/example.orchestration.yaml
@@ -1,80 +1,79 @@
 steps:
  # only 1 feeder allowed
-  # a feeder could be in an "infinite loop" for example: gsheets_infinite feeder which holds-> this could be an easy logic addiction by modifying for each to while not feeder.done() if it becomes necessary
+  # feeder: cli_feeder # default feeder
  feeder: gsheet_feeder # reads URLs from a Google Sheet
  archivers: # order matters
-    - telethon
-    # - tiktok
-    # - twitter
-    # - instagram
-    # - webarchive # this way it runs as a failsafe only
-  # enrichers:
-  #   - screenshot
-    # - wacz
-    # - webarchive # this way it runs for every case, webarchive extends archiver and enrichment
-    # - thumbnails
-  formatters:
-    - HTMLFormater
-    - PdfFormater
+    # - vk_archiver
+    # - telethon_archiver
+    # - telegram_archiver
+    # - twitter_archiver
+    # - twitter_api_archiver
+    # - instagram_archiver
+    # - instagram_tbot_archiver
+    # - tiktok_archiver
+    - youtubedl_archiver
+    # - wayback_archiver_enricher
+  enrichers:
+    - hash_enricher
+    - screenshot_enricher
+    - thumbnail_enricher
+    # - wayback_archiver_enricher
+    # - wacz_enricher
+    
+  formatter: html_formatter # defaults to mute_formatter
  storages:
    - local_storage
-    - s3
+    # - s3_storage
+    # - gdrive_storage
  databases:
-    - gsheets_db
-    - mongo_db
-
-
+    # - console_db
+    # - csv_db
+    - gsheet_db
+    # - mongo_db

 configurations:
  gsheet_feeder:
-    sheet: my-auto-archiver
+    sheet: auto-archiver-test
    header: 2 # defaults to 1 in GSheetsFeeder
    service_account: "secrets/service_account.json"
-    # allow_worksheets: "allowed"
-    # block_worksheets: "blocked1,blocked2"
+    use_sheet_names_in_stored_paths: false
    columns:
-        'url': 'link'
-        'status': 'archive status'
-        'folder': 'destination folder'
-        'archive': 'archive location'
-        'date': 'archive date'
-        'thumbnail': 'thumbnail'
-        'thumbnail_index': 'thumbnail index'
-        'timestamp': 'upload timestamp'
-        'title': 'upload title'
-        'duration': 'duration'
-        'screenshot': 'screenshot'
-        'hash': 'hash'
-        'wacz': 'wacz'
-        'replaywebpage': 'replaywebpage'
-  telethon:
-    api_id: "1234567"
-    api_hash: "examplehash"
-    session_file: "secrets/anon"
-    channel_invites:
-      - invite: https://t.me/+XXXXXXXXXXXXXX
-        id: 1000000000
-      - invite: https://t.me/joinchat/XXXXXXXXXXXXXX
-        id: 1000000001
+      url: link
+      status: archive status
+      folder: destination folder
+      archive: archive location
+      date: archive date
+      thumbnail: thumbnail
+      thumbnail_index: thumbnail index
+      timestamp: upload timestamp
+      title: upload title
+      text: textual content
+      duration: duration
+      screenshot: screenshot
+      hash: hash
+      wacz: wacz
+      replaywebpage: replaywebpage

-  tiktok:
-    api_keys:
-      - username: 1
-        password: 2
-      - username: 3
-        password: 4
-    username: "abc"
-    password: "123"
-    token: "here"
-  screenshot:
+  screenshot_enricher:
    width: 1280
-    height: 4600
-  wacz:
-    profile: secrets/profile.tar.gz
-  webarchive:
-    api_key: "12345"
-  s3: 
-    - bucket: 123
-    - region: "nyc3"
-    - cdn: "{region}{bucket}"
+    height: 2300
+  wayback_archiver_enricher:
+    timeout: 10
+    key: ""
+    secret: ""
+  hash_enricher:
+    algorithm: "SHA3-512"
+  # wacz:
+    # profile: secrets/profile.tar.gz
+  local_storage:
+    save_to: "./local_archive"
+    save_absolute: true
+    filename_generator: static
+    path_generator: flat

+  gdrive_storage:
+    path_generator: url
+    filename_generator: random
+    root_folder_id: TODO
+    oauth_token: secrets/gd-token.json
+    service_account: "secrets/service_account.json"
--- a/src/auto_archiver/archivers/__init__.py
+++ b/src/auto_archiver/archivers/__init__.py
@@ -3,6 +3,7 @@ from .telethon_archiver import TelethonArchiver
 from .twitter_archiver import TwitterArchiver
 from .twitter_api_archiver import TwitterApiArchiver
 from .instagram_archiver import InstagramArchiver
+from .instagram_tbot_archiver import InstagramTbotArchiver
 from .tiktok_archiver import TiktokArchiver
 from .telegram_archiver import TelegramArchiver
 from .vk_archiver import VkArchiver
--- a/src/auto_archiver/archivers/instagram_tbot_archiver.py
+++ b/src/auto_archiver/archivers/instagram_tbot_archiver.py
@@ -0,0 +1,67 @@
+
+from telethon.sync import TelegramClient
+from loguru import logger
+import time, os
+
+from . import Archiver
+from ..core import Metadata, Media
+
+
+class InstagramTbotArchiver(Archiver):
+    """
+    calls a telegram bot to fetch instagram posts/stories...
+    https://github.com/adw0rd/instagrapi
+    https://t.me/instagram_load_bot
+    """
+    name = "instagram_tbot_archiver"
+
+    def __init__(self, config: dict) -> None:
+        super().__init__(config)
+        self.assert_valid_string("api_id")
+        self.assert_valid_string("api_hash")
+        self.timeout = int(self.timeout)
+        self.client = TelegramClient(self.session_file, self.api_id, self.api_hash)
+
+    @staticmethod
+    def configs() -> dict:
+        return {
+            "api_id": {"default": None, "help": "telegram API_ID value, go to https://my.telegram.org/apps"},
+            "api_hash": {"default": None, "help": "telegram API_HASH value, go to https://my.telegram.org/apps"},
+            "session_file": {"default": "secrets/anon", "help": "optional, records the telegram login session for future usage, '.session' will be appended to the provided value."},
+            "timeout": {"default": 15, "help": "timeout to fetch the instagram content in seconds."},
+        }
+
+    def setup(self) -> None:
+        logger.info(f"SETUP {self.name} checking login...")
+        with self.client.start():
+            logger.success(f"SETUP {self.name} login works.")
+
+    def download(self, item: Metadata) -> Metadata:
+        url = item.get_url()
+        if "instagram.com" not in url: return False
+
+        result = Metadata()
+        tmp_dir = item.get_tmp_dir()
+        with self.client.start():
+            chat = self.client.get_entity("instagram_load_bot")
+            since_id = self.client.send_message(entity=chat, message=url).id
+
+            attempts = 0
+            media = None
+            message = ""
+            time.sleep(4)
+            while attempts < self.timeout and (not message or not media):
+                attempts += 1
+                time.sleep(1)
+                for post in self.client.iter_messages(chat, min_id=since_id):
+                    since_id = max(since_id, post.id)
+                    if post.media and not media:
+                        filename_dest = os.path.join(tmp_dir, f'{chat.id}_{post.id}')
+                        media = self.client.download_media(post.media, filename_dest)
+                        if media: result.add_media(Media(media))
+                    if post.message: message += post.message
+
+            if message:
+                result.set_content(message).set_title(message[:128])
+
+            return result.success("insta-via-bot")
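Review note: the `download` method above waits 4 seconds, then polls the bot chat once per second for up to `timeout` attempts, stopping early once both a media file and a text reply have arrived. The sketch below isolates that polling pattern so it can be exercised without Telegram. `FakePost`, `poll_for_reply`, and `fake_fetch` are hypothetical names introduced for illustration and are not part of the repository or of Telethon.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class FakePost:
    # Hypothetical stand-in for a Telegram message; not a Telethon type.
    id: int
    message: str = ""
    media: Optional[str] = None


def poll_for_reply(
    fetch_newer: Callable[[int], Iterable[FakePost]],
    since_id: int,
    max_attempts: int,
    delay: float = 1.0,
) -> Tuple[Optional[str], str]:
    """Poll until both media and text have arrived, or attempts run out."""
    media, message = None, ""
    attempts = 0
    while attempts < max_attempts and (not message or not media):
        attempts += 1
        time.sleep(delay)
        for post in fetch_newer(since_id):
            since_id = max(since_id, post.id)  # do not re-read older posts
            if post.media and not media:
                media = post.media  # keep only the first media item
            if post.message:
                message += post.message
    return media, message


# Simulated bot whose replies only become visible on the second poll.
calls = {"n": 0}
inbox = [FakePost(11, media="photo.jpg"), FakePost(12, message="caption")]


def fake_fetch(min_id: int) -> Iterable[FakePost]:
    calls["n"] += 1
    if calls["n"] < 2:
        return []
    return [p for p in inbox if p.id > min_id]


result = poll_for_reply(fake_fetch, since_id=10, max_attempts=5, delay=0)
timed_out = poll_for_reply(lambda _min_id: [], since_id=0, max_attempts=3, delay=0)
```

With this shape, a bot that never answers yields `(None, "")` after `max_attempts` polls. In the diff, the equivalent case still reaches `result.success("insta-via-bot")`, which may be worth guarding against.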
--- a/src/auto_archiver/archivers/telethon_archiver.py
+++ b/src/auto_archiver/archivers/telethon_archiver.py
@@ -114,7 +114,7 @@ class TelethonArchiver(Archiver):
        with self.client.start():
        # with self.client.start(bot_token=self.bot_token):
            try:
-                post = self.client.get_messages(chat,   ids=post_id)
+                post = self.client.get_messages(chat, ids=post_id)
            except ValueError as e:
                logger.error(f"Could not fetch telegram {url} possibly it's private: {e}")
                return False
--- a/src/auto_archiver/archivers/twitter_archiver.py
+++ b/src/auto_archiver/archivers/twitter_archiver.py
@@ -37,7 +37,7 @@ class TwitterArchiver(Archiver):
        return self.link_clean_pattern.sub("\\1", url)

    def is_rearchivable(self, url: str) -> bool:
-        # Twitter posts are static
+        # Twitter posts are static (for now)
        return False

    def download(self, item: Metadata) -> Metadata:
@@ -86,7 +86,7 @@ class TwitterArchiver(Archiver):
            media.filename = self.download_from_url(media.get("src"), f'{slugify(url)}_{i}{ext}', item)
            result.add_media(media)

-        return result.success("twitter")
+        return result.success("twitter-snscrape")

    def download_alternative(self, item: Metadata, url: str, tweet_id: str) -> Metadata:
        """
--- a/src/auto_archiver/archivers/youtubedl_archiver.py
+++ b/src/auto_archiver/archivers/youtubedl_archiver.py
@@ -6,7 +6,7 @@ from ..core import Metadata, Media


 class YoutubeDLArchiver(Archiver):
-    name = "youtubedl_enricher"
+    name = "youtubedl_archiver"

    def __init__(self, config: dict) -> None:
        super().__init__(config)
--- a/src/auto_archiver/core/metadata.py
+++ b/src/auto_archiver/core/metadata.py
@@ -63,6 +63,9 @@ class Metadata:
    def is_success(self) -> bool:
        return "success" in self.status

+    def is_empty(self) -> bool:
+        return not self.is_success() and len(self.media) == 0 and len(self.get_clean_metadata()) <= 2  # url, processed_at
+
    @property  # getter .netloc
    def netloc(self) -> str:
        return urlparse(self.get_url()).netloc
@@ -122,7 +125,7 @@ class Metadata:
        for m in self.media:
            if m.get("id") == id: return m
        return default
-    
+
    def get_first_image(self, default=None) -> Media:
        for m in self.media:
            if "image" in m.mimetype: return m
--- a/src/auto_archiver/core/orchestrator.py
+++ b/src/auto_archiver/core/orchestrator.py
@@ -123,6 +123,9 @@ class ArchivingOrchestrator:
                s.store(final_media, result)
            result.set_final_media(final_media)

+        if result.is_empty():
+            result.status = "nothing archived"
+
        # signal completion to databases (DBs, Google Sheets, CSV, ...)
        for d in self.databases: d.done(result)

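Review note: the new `is_empty()` check and the orchestrator change work together, so a result with no success status, no media, and at most two metadata entries (`url`, `processed_at`) gets the status "nothing archived". The sketch below reproduces that logic on a simplified stand-in. `MiniMetadata`, its `"pending"` default status, and `finalize` are hypothetical; the `metadata` dict here stands in for whatever `get_clean_metadata()` returns in the real class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MiniMetadata:
    # Hypothetical, simplified stand-in for core.Metadata.
    status: str = "pending"
    metadata: Dict[str, Any] = field(default_factory=dict)
    media: List[str] = field(default_factory=list)

    def is_success(self) -> bool:
        return "success" in self.status

    def is_empty(self) -> bool:
        # Same three conditions as the diff: no success, no media,
        # and nothing beyond the two bookkeeping keys.
        return (
            not self.is_success()
            and len(self.media) == 0
            and len(self.metadata) <= 2  # url, processed_at
        )


def finalize(result: MiniMetadata) -> MiniMetadata:
    # Mirrors the orchestrator step added before databases are notified.
    if result.is_empty():
        result.status = "nothing archived"
    return result


base = {"url": "https://example.com", "processed_at": "2023-02-17"}

bare = finalize(MiniMetadata(metadata=dict(base)))
with_media = finalize(MiniMetadata(metadata=dict(base), media=["video.mp4"]))
succeeded = finalize(MiniMetadata(status="archiver: success", metadata=dict(base)))
```

Only `bare` has its status overwritten; a result with media or a success status is left untouched.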
--- a/src/auto_archiver/databases/gsheet_db.py
+++ b/src/auto_archiver/databases/gsheet_db.py
@@ -2,10 +2,8 @@ from typing import Union, Tuple
 import datetime
 from urllib.parse import quote

-# from metadata import Metadata
 from loguru import logger

-# from . import Enricher
 from . import Database
 from ..core import Metadata
 from ..core import Media
@@ -61,13 +59,13 @@ class GsheetsDb(Database):
        cell_updates.append((row, 'status', item.status))

        media: Media = item.get_final_media()
-
-        batch_if_valid('archive', "\n".join(media.urls))
+        if hasattr(media, "urls"):
+            batch_if_valid('archive', "\n".join(media.urls))
        batch_if_valid('date', True, datetime.datetime.utcnow().replace(tzinfo=datetime.timezone.utc).isoformat())
        batch_if_valid('title', item.get_title())
        batch_if_valid('text', item.get("content", "")[:500])
        batch_if_valid('timestamp', item.get_timestamp())
-        if (screenshot := item.get_media_by_id("screenshot")):
+        if (screenshot := item.get_media_by_id("screenshot")) and hasattr(screenshot, "urls"):
            batch_if_valid('screenshot', "\n".join(screenshot.urls))

        if (thumbnail := item.get_first_image("thumbnail")):
--- a/src/auto_archiver/enrichers/screenshot_enricher.py
+++ b/src/auto_archiver/enrichers/screenshot_enricher.py
@@ -3,7 +3,7 @@ import time, uuid, os
 from selenium.common.exceptions import TimeoutException

 from . import Enricher
-from ..utils import Webdriver
+from ..utils import Webdriver, UrlUtil
 from ..core import Media, Metadata

 class ScreenshotEnricher(Enricher):
@@ -19,6 +19,10 @@ class ScreenshotEnricher(Enricher):

    def enrich(self, to_enrich: Metadata) -> None:
        url = to_enrich.get_url()
+        if UrlUtil.is_auth_wall(url):
+            logger.debug(f"[SKIP] SCREENSHOT since url is behind AUTH WALL: {url=}")
+            return
+
        logger.debug(f"Enriching screenshot for {url=}")
        with Webdriver(self.width, self.height, self.timeout, 'facebook.com' in url) as driver:
            try:
--- a/src/auto_archiver/enrichers/wayback_enricher.py
+++ b/src/auto_archiver/enrichers/wayback_enricher.py
@@ -1,8 +1,10 @@
 from loguru import logger
 import time, requests

+
 from . import Enricher
 from ..archivers import Archiver
+from ..utils import UrlUtil
 from ..core import Metadata

 class WaybackArchiverEnricher(Enricher, Archiver):
@@ -33,6 +35,10 @@ class WaybackArchiverEnricher(Enricher, Archiver):

    def enrich(self, to_enrich: Metadata) -> bool:
        url = to_enrich.get_url()
+        if UrlUtil.is_auth_wall(url):
+            logger.debug(f"[SKIP] WAYBACK since url is behind AUTH WALL: {url=}")
+            return False
+
        logger.debug(f"calling wayback for {url=}")

        if to_enrich.get("wayback"):
--- a/src/auto_archiver/formatters/html_formatter.py
+++ b/src/auto_archiver/formatters/html_formatter.py
@@ -3,6 +3,7 @@ from dataclasses import dataclass
 import mimetypes, uuid, os, pathlib
 from jinja2 import Environment, FileSystemLoader
 from urllib.parse import quote
+from loguru import logger

 from ..version import __version__
 from ..core import Metadata, Media
@@ -26,12 +27,17 @@ class HtmlFormatter(Formatter):
    @staticmethod
    def configs() -> dict:
        return {
-            "detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"},
-
+            "detect_thumbnails": {"default": True, "help": "if true will group by thumbnails generated by thumbnail enricher by id 'thumbnail_00'"}
        }
+
    def format(self, item: Metadata) -> Media:
+        url = item.get_url()
+        if item.is_empty():
+            logger.debug(f"[SKIP] FORMAT there is no media or metadata to format: {url=}")
+            return
+
        content = self.template.render(
-            url=item.get_url(),
+            url=url,
            title=item.get_title(),
            media=item.media,
            metadata=item.get_clean_metadata(),
--- a/src/auto_archiver/utils/__init__.py
+++ b/src/auto_archiver/utils/__init__.py
@@ -2,4 +2,5 @@
 from .gworksheet import GWorksheet
 from .misc import *
 from .webdriver import Webdriver
-from .gsheet import Gsheets
+from .gsheet import Gsheets
+from .url import UrlUtil
--- a/src/auto_archiver/utils/url.py
+++ b/src/auto_archiver/utils/url.py
@@ -0,0 +1,19 @@
+import re
+
+class UrlUtil:
+    telegram_private = re.compile(r"https:\/\/t\.me(\/c)\/(.+)\/(\d+)")
+    is_instagram = re.compile(r"https:\/\/www\.instagram\.com")
+
+    @staticmethod
+    def clean(url): return url
+
+    @staticmethod
+    def is_auth_wall(url):
+        """
+        checks if the URL is behind an authentication wall, meaning steps like wayback, wacz, ... may not work
+        """
+        if UrlUtil.telegram_private.match(url): return True
+        if UrlUtil.is_instagram.match(url): return True
+
+        return False
+
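Review note: a quick check of which URLs the two patterns above actually treat as behind an auth wall. The regular expressions are copied from the diff; the constant names, the standalone `is_auth_wall` function, and the sample URLs are made up for illustration.

```python
import re

# Patterns copied verbatim from the diff above.
TELEGRAM_PRIVATE = re.compile(r"https:\/\/t\.me(\/c)\/(.+)\/(\d+)")
INSTAGRAM = re.compile(r"https:\/\/www\.instagram\.com")


def is_auth_wall(url: str) -> bool:
    # re.match anchors at the start of the string, so scheme and host must
    # appear exactly as written in the pattern.
    return bool(TELEGRAM_PRIVATE.match(url) or INSTAGRAM.match(url))


samples = [
    "https://t.me/c/1234567/89",            # private Telegram channel post
    "https://t.me/bellingcat/123",          # public Telegram channel post
    "https://www.instagram.com/p/abc123/",  # Instagram with www
    "https://instagram.com/p/abc123/",      # Instagram without www
    "http://www.instagram.com/p/abc123/",   # Instagram over http
]

checks = {url: is_auth_wall(url) for url in samples}
```

Since both patterns require `https://` and the Instagram one requires the `www.` prefix, the last two samples are not detected. Normalising the URL first, or loosening the patterns, would be one way to cover them if that matters.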
--- a/src/auto_archiver/version.py
+++ b/src/auto_archiver/version.py
@@ -1,9 +1,9 @@

 _MAJOR = "0"
-_MINOR = "3"
+_MINOR = "4"
 # On main and in a nightly release the patch should be one ahead of the last
 # released build.
-_PATCH = "0"
+_PATCH = "1"
 # This is mainly for nightly builds which have the suffix ".dev$DATE". See
 # https://semver.org/#is-v123-a-semantic-version for the semantics.
 _SUFFIX = ""
Author	SHA1	Message	Date
msramalho	1970fa3c82	new instagram archiver via telegram bot	2023-02-17 16:15:25 +00:00
msramalho	aa5430451e	instagram archiver via telegram bot	2023-02-17 15:46:29 +00:00
msramalho	f35875a94c	name fix	2023-02-17 15:46:05 +00:00
msramalho	5505255ea3	url auth wall detect	2023-02-17 15:45:58 +00:00
msramalho	da17b3f68a	name fix	2023-02-17 15:45:35 +00:00
msramalho	d6dbdec6ac	example	2023-02-09 12:32:55 +00:00
msramalho	224ebe7ee8	links	2023-02-08 22:27:56 +00:00
msramalho	54a1bc2172	update readme	2023-02-08 22:26:24 +00:00
msramalho	77948207d1	update	2023-02-08 22:24:40 +00:00
msramalho	60552ae0ea	update readme	2023-02-08 22:23:25 +00:00
msramalho	f255271ecb	update README	2023-02-08 22:17:22 +00:00