mirror of
https://github.com/bellingcat/cisticola.git
synced 2026-06-07 19:08:35 +03:00
added table diagram, and brief developer guide and deployment info for docs
@@ -7,7 +7,7 @@
 viewBox="0 0 51.688999 11.797"
 version="1.1"
 id="svg5"
-inkscape:version="1.1.2 (76b9e6a115, 2022-02-25)"
+inkscape:version="1.2.2 (1:1.2.2+202305151915+b0a8486541)"
 sodipodi:docname="cisticola_logo.svg"
 xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
 xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
@@ -28,14 +28,16 @@
 fit-margin-right="0"
 fit-margin-bottom="0"
 inkscape:zoom="2.0838024"
-inkscape:cx="52.548168"
+inkscape:cx="141.56813"
 inkscape:cy="115.65396"
 inkscape:window-width="1920"
-inkscape:window-height="999"
+inkscape:window-height="1005"
 inkscape:window-x="0"
 inkscape:window-y="0"
 inkscape:window-maximized="1"
-inkscape:current-layer="layer4" />
+inkscape:current-layer="layer3"
+inkscape:showpageshadow="2"
+inkscape:deskcolor="#d1d1d1" />
 <defs
 id="defs2" />
 <g
@@ -44,7 +46,7 @@
 inkscape:label="background"
 transform="translate(-60.255096,9.177412)">
 <rect
-style="fill:#000000;fill-opacity:1;stroke-width:0.723711"
+style="fill:#292a2b;fill-opacity:1;stroke-width:0.723711"
 id="rect16437"
 width="51.688999"
 height="11.797"
Before Width: | Height: | Size: 7.0 KiB After Width: | Height: | Size: 7.0 KiB |
1367
docs/images/database_schema.svg
Normal file
File diff suppressed because it is too large
After Width: | Height: | Size: 81 KiB |
@@ -8,20 +8,37 @@ Definitions
 - *Platform*: a social media website, for example Telegram, YouTube, or Rumble.
 - *Channel*: an account or group on a platform, for example Twitter users, Telegram private chat groups, YouTube channels, and Gab groups.
 - *Post*: a single item created by a channel, for example a Telegram message, a Tweet, or a YouTube video. Posts can contain one or more media attachments.
-- *Media*: a file uploaded to a platform by a channel as part of a post.
+- *Media*: a file uploaded to a platform by a channel as part of a post. Media are often images or video, but can include audio or, on some platforms, arbitrary file types (such as PDFs).
 
 Components
 ----------
-Cisticola has many components
+Cisticola has many components, including:
 
-- :py:mod:`cisticola.base`: contains Object Relational Mapping (ORM) dataclasses that imperatively map to pre-defined SQL tables
-- :py:mod:`cisticola.scraper`: contains platform-specific modules for scraping raw data from platforms. For example, the :py:mod:`cisticola.scraper.bitchute` module extracts raw data from Bitchute.
-- :py:mod:`cisticola.transformer`: contains platform-specific modules for converting raw data into a standardized, cross-platform format.
+- The :py:mod:`cisticola.base` module contains Object Relational Mapping (ORM) dataclasses that imperatively map to pre-defined SQL tables.
+- The :py:mod:`cisticola.scraper` subpackage contains platform-specific modules for scraping raw data from platforms. For example, the :py:mod:`cisticola.scraper.bitchute` module extracts raw data from Bitchute.
+- The :py:mod:`cisticola.transformer` subpackage contains platform-specific modules for converting raw data into a standardized, cross-platform format.
 
 The data extracted by scrapers varies by platform, but typically includes media files attached to posts.
 
 Separating the "scraping" and "transforming" steps is useful because it ensures that no data is thrown away during the transformation. There may be some fields in the raw data that aren't included in the transformed format, but could prove useful in the future.
 
+Tables
+------
+The database Cisticola uses to archive and store data consists of 6 tables. Their names, respective ORM mappings in :py:mod:`cisticola.base`, and brief descriptions are listed below:
+
+- ``channels`` (:py:class:`cisticola.base.Channel`): User-specified information about a channel
+- ``raw_posts`` (:py:class:`cisticola.base.ScraperResult`): Minimally processed information scraped from a post
+- ``posts`` (:py:class:`cisticola.base.Post`): Processed information about a post
+- ``raw_channel_info`` (:py:class:`cisticola.base.RawChannelInfo`): Minimally processed information scraped from a channel
+- ``channel_info`` (:py:class:`cisticola.base.ChannelInfo`): Processed information about a channel
+- ``media`` (:py:class:`cisticola.base.Media`): Processed information about a media file attached to a post
+
+The diagram below shows all columns in each table and their data types, with certain shared primary and foreign key columns colored differently to distinguish them.
+
+.. image:: ../images/database_schema.svg
+    :target: _images/database_schema.svg
+    :width: 100%
+
 TODO
 - Add diagram
 - Describe common workflow and steps
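The rationale given in the hunk above for keeping "scraping" and "transforming" separate can be illustrated with a minimal Python sketch. The `RawPost` and `StandardPost` classes, their fields, the `transform` function, and the example payload are all hypothetical, invented for illustration only; they are not the classes defined in `cisticola.base`.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class RawPost:
    """Hypothetical scraped record: the platform payload is stored untouched."""
    platform: str
    raw_data: str  # JSON text exactly as scraped


@dataclass
class StandardPost:
    """Hypothetical cross-platform format holding only the shared fields."""
    platform: str
    author: str
    content: str
    date: datetime


def transform(raw: RawPost) -> StandardPost:
    """Build a standardized record without modifying the raw one."""
    data = json.loads(raw.raw_data)
    return StandardPost(
        platform=raw.platform,
        author=data["author"],
        content=data["text"],
        date=datetime.fromtimestamp(data["timestamp"], tz=timezone.utc),
    )


# Made-up payload; "views" has no counterpart in StandardPost.
raw = RawPost(
    platform="Telegram",
    raw_data=json.dumps(
        {"author": "example_channel", "text": "hello", "timestamp": 0, "views": 120}
    ),
)
post = transform(raw)

# The transformed record drops "views", but the raw record still holds it,
# so a later version of the transformer could start using that field.
print(post)
print(json.loads(raw.raw_data)["views"])
```

Because the raw record is kept alongside the transformed one, changing the standardized format later only requires re-running the transform step, not scraping again.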
@@ -57,4 +57,4 @@ autodoc_default_options = {'exclude-members': '_sa_class_manager'}
 html_favicon = '../images/favicon.ico'
 html_logo = '../images/cisticola_logo.svg'
 
-html_theme_options = {'style_nav_header_background': '#000000'}
+html_theme_options = {'style_nav_header_background': '#292a2b'}
16
docs/source/deployment.rst
Normal file
@@ -0,0 +1,16 @@
+Deployment
+==========
+
+.. warning::
+
+    We are working on making cisticola easier to install, configure, and use. If you're confused by these steps, don't worry: it will get more accessible.
+
+Docker
+------
+The easiest way to deploy Cisticola is to use Docker. Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple.
+
+1. Install `Docker <https://docs.docker.com/get-docker/>`_
+
+Manual Installation
+-------------------
+TODO
23
docs/source/developer_guide.rst
Normal file
@@ -0,0 +1,23 @@
+Developer Guide
+===============
+
+Installation
+------------
+
+To install the necessary dependencies for building the documentation and running unit tests, run the following command from the package root directory:
+
+.. code-block::
+
+    pipenv install --dev
+
+Documentation
+-------------
+If changes are made to the package structure or additional modules are created, you can update the Sphinx source ``cisticola.*.rst`` files by running the following command from the ``docs/`` directory:
+
+.. code-block::
+
+    pipenv run make apidoc
+
+Formatting
+----------
+Cisticola uses `black <https://github.com/psf/black>`_ to format source code.
@@ -6,4 +6,6 @@ Welcome to Cisticola's documentation!
 
    about
    quickstart
+   deployment
+   developer_guide
    cisticola
@@ -16,35 +16,10 @@ and then install the dependencies using the following command from the package r
 
     pipenv install
 
-To install the necessary dependencies for building the documentation and running unit tests, run the following command from the package root directory:
-
-.. code-block::
-
-    pipenv install --dev
-
 Environment Variables
 ---------------------
 
-Three of the scrapers in *cisticola* (:py:mod:`~cisticola.scraper.gab.GabScraper`, :py:mod:`~cisticola.scraper.instagram.InstagramScraper`, and :py:mod:`~cisticola.scraper.telegram_telethon.TelegramTelethonScraper`) require platform credentials to work correctly.
-
-Gab
-"""
-
-The Gab credentials can be configured by running the following command from the root directory:
-
-.. code-block::
-
-    pipenv run garc configure
-
-which will direct you to provide the username and password for your Gab account.
-
-Instagram
-"""""""""
-
-The Instagram credentials can be configured by setting the following environment variables, either in the project's ``.env`` file or in the system's environment:
-
-- ``INSTAGRAM_USERNAME``: username of your Instagram account
-- ``INSTAGRAM_PASSWORD``: password of your Instagram account
+One of the scrapers in *cisticola* (:py:mod:`~cisticola.scraper.telegram_telethon.TelegramTelethonScraper`) requires platform credentials to work correctly.
 
 Telegram Telethon
 """""""""""""""""
@@ -57,6 +32,12 @@ The Telegram credentials can be configured by setting the following environment
 
 If you do not already have a Telegram application, you can create one by following the instructions on `this page`_.
 
+To initialize a Telegram session, run the following script from the package's root directory using the command-line:
+
+.. bash::
+
+    bash telethon_session_init.py
+
 Documentation
 -------------
 
@@ -86,11 +67,7 @@ To see the logging output from a test run, add the ``--capture=no`` flag to the
 Examples
 --------
 
-An example of a *cisticola* ingest file ``russian_telegram_ingest.py`` is included in the package root directory, showing how the list of channels to scrape is defined, and how the :py:mod:`~cisticola.scraper.base.ScraperController` and :py:mod:`~cisticola.transformer.base.Transformer` classes are used. To run the ingest script, run the following command from the package root directory:
-
-.. code-block::
-
-    pipenv run python russian_telegram_ingest.py
+The script ``app.py`` is included in the package root directory, showing how the list of channels to scrape is defined, and how the :py:mod:`~cisticola.scraper.base.ScraperController` and :py:mod:`~cisticola.transformer.base.Transformer` classes are used.
 
 .. _pipenv: https://pipenv.pypa.io/en/latest/
 .. _Sphinx: https://www.sphinx-doc.org/en/master/
@@ -11,10 +11,6 @@ addopts =
     --cov-report html:reports/coverage
     --html='reports/tests.html'
     --self-contained-html
-markers =
-    profile: marks tests for only extracting channel metadata (deselect with '-m "not profile"')
-    media: marks tests for archiving all media attachments (deselect with '-m "not media"')
-    unarchived: marks tests for archiving all unarchived media attachments (deselect with '-m "not unarchived"')
 filterwarnings =
     ignore:the imp module is deprecated:DeprecationWarning
     ignore:The localize method is no longer necessary, as this time zone supports the fold attribute