added table diagram, and brief developer guide and deployment info for docs

Tristan Lee
2023-08-03 23:58:12 -05:00
parent d3b8e1a3b3
commit ef9292bc90
9 changed files with 1446 additions and 46 deletions


@@ -7,7 +7,7 @@
viewBox="0 0 51.688999 11.797"
version="1.1"
id="svg5"
inkscape:version="1.1.2 (76b9e6a115, 2022-02-25)"
inkscape:version="1.2.2 (1:1.2.2+202305151915+b0a8486541)"
sodipodi:docname="cisticola_logo.svg"
xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
@@ -28,14 +28,16 @@
fit-margin-right="0"
fit-margin-bottom="0"
inkscape:zoom="2.0838024"
inkscape:cx="52.548168"
inkscape:cx="141.56813"
inkscape:cy="115.65396"
inkscape:window-width="1920"
inkscape:window-height="999"
inkscape:window-height="1005"
inkscape:window-x="0"
inkscape:window-y="0"
inkscape:window-maximized="1"
inkscape:current-layer="layer4" />
inkscape:current-layer="layer3"
inkscape:showpageshadow="2"
inkscape:deskcolor="#d1d1d1" />
<defs
id="defs2" />
<g
@@ -44,7 +46,7 @@
inkscape:label="background"
transform="translate(-60.255096,9.177412)">
<rect
style="fill:#000000;fill-opacity:1;stroke-width:0.723711"
style="fill:#292a2b;fill-opacity:1;stroke-width:0.723711"
id="rect16437"
width="51.688999"
height="11.797"


File diff suppressed because it is too large

After

Width:  |  Height:  |  Size: 81 KiB


@@ -8,20 +8,37 @@ Definitions
- *Platform*: a social media website, for example Telegram, YouTube, or Rumble.
- *Channel*: an account or group on a platform, for example Twitter users, Telegram private chat groups, YouTube channels, and Gab groups.
- *Post*: a single item created by a channel, for example a Telegram message, a Tweet, or a YouTube video. Posts can contain one or more media attachments.
- *Media*: a file uploaded to a platform by a channel as part of a post.
- *Media*: a file uploaded to a platform by a channel as part of a post. Media are often images or videos, but can also include audio or, on some platforms, arbitrary file types (such as PDFs).
Components
----------
Cisticola has many components
Cisticola has many components, including:
- :py:mod:`cisticola.base`: contains Object Relational Mapping (ORM) dataclasses that imperatively map to pre-defined SQL tables
- :py:mod:`cisticola.scraper`: contains platform-specific modules for scraping raw data from platforms. For example, the :py:mod:`cisticola.scraper.bitchute` module extracts raw data from Bitchute.
- :py:mod:`cisticola.transformer`: contains platform-specific modules for converting raw data into a standardized, cross-platform format.
- The :py:mod:`cisticola.base` module contains Object Relational Mapping (ORM) dataclasses that imperatively map to pre-defined SQL tables.
- The :py:mod:`cisticola.scraper` subpackage contains platform-specific modules for scraping raw data from platforms. For example, the :py:mod:`cisticola.scraper.bitchute` module extracts raw data from Bitchute.
- The :py:mod:`cisticola.transformer` subpackage contains platform-specific modules for converting raw data into a standardized, cross-platform format.
The data extracted by scrapers varies by platform, but typically includes media files attached to posts.
Separating the "scraping" and "transforming" steps is useful because it ensures that no data is thrown away during the transformation. Some fields in the raw data may not be included in the transformed format, but could prove useful in the future.
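The benefit of keeping the two steps separate can be illustrated with a minimal sketch. The class names ``RawPost`` and ``StandardPost`` and the payload fields ``sender``, ``message``, and ``views`` are hypothetical and are not taken from the Cisticola code base; the sketch only shows the idea of keeping the raw payload untouched while deriving a standardized record from it.

```python
import json
from dataclasses import dataclass


@dataclass
class RawPost:
    """Hypothetical stand-in for a scraped record; the payload is stored untouched."""

    platform: str
    raw_data: str  # JSON string exactly as it was scraped


@dataclass
class StandardPost:
    """Hypothetical cross-platform format holding only the shared fields."""

    platform: str
    author: str
    text: str


def transform(raw: RawPost) -> StandardPost:
    """Build a standardized record from a raw one without modifying the raw one."""
    payload = json.loads(raw.raw_data)
    return StandardPost(
        platform=raw.platform,
        author=payload["sender"],  # assumed field name, for illustration only
        text=payload["message"],  # assumed field name, for illustration only
    )


raw = RawPost(
    platform="Telegram",
    raw_data=json.dumps(
        {"sender": "example_channel", "message": "hello", "views": 120}
    ),
)
post = transform(raw)

# "views" is not part of StandardPost, but it can still be recovered later
# because the raw record was archived alongside the transformed one.
views = json.loads(raw.raw_data)["views"]
```

If a later version of the standardized format needs the ``views`` field, it can be re-derived from the archived raw records without scraping the platform again.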
Tables
------
The database that Cisticola uses to archive and store data consists of six tables. Each table's name, its ORM mapping in :py:mod:`cisticola.base`, and a brief description are shown below:
- ``channels`` (:py:class:`cisticola.base.Channel`): User-specified information about a channel
- ``raw_posts`` (:py:class:`cisticola.base.ScraperResult`): Minimally processed information scraped from a post
- ``posts`` (:py:class:`cisticola.base.Post`): Processed information about a post
- ``raw_channel_info`` (:py:class:`cisticola.base.RawChannelInfo`): Minimally processed information scraped from a channel
- ``channel_info`` (:py:class:`cisticola.base.ChannelInfo`): Processed information about a channel
- ``media`` (:py:class:`cisticola.base.Media`): Processed information about a media file attached to a post
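The table-to-class correspondence listed above can be written down as a plain lookup, which may be handy when moving between SQL queries and the API reference. This is only a sketch: the names ``TABLE_TO_CLASS``, ``orm_class_for``, and ``RAW_TABLES`` are hypothetical helpers and do not exist in Cisticola; the table and class names are the ones given in the list.

```python
# Table name -> dotted path of the ORM class, copied from the list above.
TABLE_TO_CLASS = {
    "channels": "cisticola.base.Channel",
    "raw_posts": "cisticola.base.ScraperResult",
    "posts": "cisticola.base.Post",
    "raw_channel_info": "cisticola.base.RawChannelInfo",
    "channel_info": "cisticola.base.ChannelInfo",
    "media": "cisticola.base.Media",
}


def orm_class_for(table_name: str) -> str:
    """Return the dotted path of the ORM class documented for a table (hypothetical helper)."""
    try:
        return TABLE_TO_CLASS[table_name]
    except KeyError:
        raise ValueError(f"unknown table: {table_name!r}") from None


# Tables holding minimally processed data, identified by their "raw_" prefix.
RAW_TABLES = [name for name in TABLE_TO_CLASS if name.startswith("raw_")]
```

Note that ``raw_posts`` maps to ``ScraperResult`` rather than to a class named after the table, so a lookup like this avoids guessing the class from the table name.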
The diagram below shows all columns in each table and their data types, with certain shared primary and foreign key columns colored differently to distinguish them.
.. image:: ../images/database_schema.svg
:target: _images/database_schema.svg
:width: 100%
TODO
- Add diagram
- Describe common workflow and steps


@@ -57,4 +57,4 @@ autodoc_default_options = {'exclude-members': '_sa_class_manager'}
html_favicon = '../images/favicon.ico'
html_logo = '../images/cisticola_logo.svg'
html_theme_options = {'style_nav_header_background': '#000000'}
html_theme_options = {'style_nav_header_background': '#292a2b'}


@@ -0,0 +1,16 @@
Deployment
==========
.. warning::
We are working on making Cisticola easier to install, configure, and use. If you're confused by these steps, don't worry: it will become more accessible.
Docker
------
The easiest way to deploy Cisticola is to use Docker. Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple.
1. Install `Docker <https://docs.docker.com/get-docker/>`_
Manual Installation
-------------------
TODO


@@ -0,0 +1,23 @@
Developer Guide
===============
Installation
------------
To install the necessary dependencies for building the documentation and running unit tests, run the following command from the package root directory:
.. code-block::
pipenv install --dev
Documentation
-------------
If changes are made to the package structure or additional modules are created, you can update the Sphinx source ``cisticola.*.rst`` files by running the following command from the ``docs/`` directory:
.. code-block::
pipenv run make apidoc
Formatting
----------
Cisticola uses `black <https://github.com/psf/black>`_ to format source code.


@@ -6,4 +6,6 @@ Welcome to Cisticola's documentation!
about
quickstart
deployment
developer_guide
cisticola


@@ -16,35 +16,10 @@ and then install the dependencies using the following command from the package r
pipenv install
To install the necessary dependencies for building the documentation and running unit tests, run the following command from the package root directory:
.. code-block::
pipenv install --dev
Environment Variables
---------------------
Three of the scrapers in *cisticola* (:py:mod:`~cisticola.scraper.gab.GabScraper`, :py:mod:`~cisticola.scraper.instagram.InstagramScraper`, and :py:mod:`~cisticola.scraper.telegram_telethon.TelegramTelethonScraper`) require platform credentials to work correctly.
Gab
"""
The Gab credentials can be configured by running the following command from the root directory:
.. code-block::
pipenv run garc configure
which will direct you to provide the username and password for your Gab account.
Instagram
"""""""""
The Instagram credentials can be configured by setting the following environment variables, either in the project's ``.env`` file or in the system's environment:
- ``INSTAGRAM_USERNAME``: username of your Instagram account
- ``INSTAGRAM_PASSWORD``: password of your Instagram account
One of the scrapers in *cisticola* (:py:class:`~cisticola.scraper.telegram_telethon.TelegramTelethonScraper`) requires platform credentials to work correctly.
Telegram Telethon
"""""""""""""""""
@@ -57,6 +32,12 @@ The Telegram credentials can be configured by setting the following environment
If you do not already have a Telegram application, you can create one by following the instructions on `this page`_.
To initialize a Telegram session, run the following command from the package root directory:
.. code-block:: bash
python telethon_session_init.py
Documentation
-------------
@@ -86,11 +67,7 @@ To see the logging output from a test run, add the ``--capture=no`` flag to the
Examples
--------
An example of a *cisticola* ingest file ``russian_telegram_ingest.py`` is included in the package root directory, showing how the list of channels to scrape is defined, and how the :py:mod:`~cisticola.scraper.base.ScraperController` and :py:mod:`~cisticola.transformer.base.Transformer` classes are used. To run the ingest script, run the following command from the package root directory:
.. code-block::
pipenv run python russian_telegram_ingest.py
The script ``app.py``, included in the package root directory, shows how the list of channels to scrape is defined and how the :py:class:`~cisticola.scraper.base.ScraperController` and :py:class:`~cisticola.transformer.base.Transformer` classes are used.
.. _pipenv: https://pipenv.pypa.io/en/latest/
.. _Sphinx: https://www.sphinx-doc.org/en/master/


@@ -11,10 +11,6 @@ addopts =
--cov-report html:reports/coverage
--html='reports/tests.html'
--self-contained-html
markers =
profile: marks tests for only extracting channel metadata (deselect with '-m "not profile"')
media: marks tests for archiving all media attachments (deselect with '-m "not media"')
unarchived: marks tests for archiving all unarchived media attachments (deselect with '-m "not unarchived"')
filterwarnings =
ignore:the imp module is deprecated:DeprecationWarning
ignore:The localize method is no longer necessary, as this time zone supports the fold attribute