diff --git a/docs/images/cisticola_logo.svg b/docs/images/cisticola_logo.svg
index f570be8..92c0ca5 100644
--- a/docs/images/cisticola_logo.svg
+++ b/docs/images/cisticola_logo.svg
@@ -7,7 +7,7 @@
    viewBox="0 0 51.688999 11.797"
    version="1.1"
    id="svg5"
-   inkscape:version="1.1.2 (76b9e6a115, 2022-02-25)"
+   inkscape:version="1.2.2 (1:1.2.2+202305151915+b0a8486541)"
    sodipodi:docname="cisticola_logo.svg"
    xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
    xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
@@ -28,14 +28,16 @@
      fit-margin-right="0"
      fit-margin-bottom="0"
      inkscape:zoom="2.0838024"
-     inkscape:cx="52.548168"
+     inkscape:cx="141.56813"
      inkscape:cy="115.65396"
      inkscape:window-width="1920"
-     inkscape:window-height="999"
+     inkscape:window-height="1005"
      inkscape:window-x="0"
      inkscape:window-y="0"
      inkscape:window-maximized="1"
-     inkscape:current-layer="layer4" />
+     inkscape:current-layer="layer3"
+     inkscape:showpageshadow="2"
+     inkscape:deskcolor="#d1d1d1" />
[Text labels from the database schema diagram (SVG); per table, column and type]
posts: id integer, raw_id integer, forwarded_from integer, reply_to integer, named_entities json, cryptocurrency_addresses json, hashtags json, outlinks json, mentions json, likes integer, forwards integer, views integer, video_duration integer, channel integer, date timestamp, date_archived timestamp, date_transformed timestamp, platform_id character, scraper character, transformer character, platform character, content character, detected_language character, video_title character, normalized_content character, url character, author_id character, author_username character
channel_info: id integer, raw_channel_info_id integer, channel integer, followers integer, following integer, verified boolean, date_created timestamp, date_archived timestamp, date_transformed timestamp, description character, description_url character, description_location character, platform_id character, scraper character, transformer character, platform character, screenname character, name character
raw_channel_info: id integer, channel integer, date_archived timestamp, scraper character, platform character, raw_data character
channels: id integer, public boolean, chat boolean, country jsonb, platform character, url character, screenname character, influencer character, notes character, source character, name character, platform_id character, category character
media: id integer, post integer, date timestamp, date_archived timestamp, date_transformed timestamp, raw_id integer, scraper character, transformer character, type character, url character, original_url character, exif character, ocr character
raw_posts: id integer, platform_id character, media_archived timestamp, date timestamp, date_archived timestamp, archived_urls json, channel integer, scraper character, platform character, raw_data character
Labels: transformer.transform, transformer.transform_info
diff --git a/docs/source/about.rst b/docs/source/about.rst
index a233fe0..233d829 100644
--- a/docs/source/about.rst
+++ b/docs/source/about.rst
@@ -8,20 +8,37 @@ Definitions
 - *Platform*: a social media website, for example Telegram, YouTube, or Rumble.
 - *Channel*: an account or group on a platform, for example Twitter users, Telegram private chat groups, YouTube channels, and Gab groups.
 - *Post*: a single item created by a channel, for example a Telegram message, a Tweet, or a YouTube video. Posts can contain one or more media attachments.
-- *Media*: a file uploaded to a platform by a channel as part of a post.
+- *Media*: a file uploaded to a platform by a channel as part of a post. Media files are often images or video, but can include audio or, on some platforms, arbitrary file types (such as PDFs).
 Components
 ----------
-Cisticola has many components
+Cisticola has many components, including:
-- :py:mod:`cisticola.base`: contains Object Relational Mapping (ORM) dataclasses that imperatively map to pre-defined SQL tables
-- :py:mod:`cisticola.scraper`: contains platform-specific modules for scraping raw data from platforms. For example, the :py:mod:`cisticola.scraper.bitchute` module extracts raw data from Bitchute.
-- :py:mod:`cisticola.transformer`: contains platform-specific modules for converting raw data into a standardized, cross-platform format.
+- The :py:mod:`cisticola.base` module contains Object Relational Mapping (ORM) dataclasses that imperatively map to pre-defined SQL tables
+- The :py:mod:`cisticola.scraper` subpackage contains platform-specific modules for scraping raw data from platforms. For example, the :py:mod:`cisticola.scraper.bitchute` module extracts raw data from Bitchute.
+- The :py:mod:`cisticola.transformer` subpackage contains platform-specific modules for converting raw data into a standardized, cross-platform format.
 The data extracted by scrapers varies by platform, but typically includes media files attached to posts.
 Separating the "scraping" and "transforming" steps is useful because it ensures that no data is thrown away during the transformation. There may be some fields in the raw data that aren't included in the transformed format, but could be found to be useful in the future.
+Tables
+------
+The database Cisticola uses to archive and store data consists of 6 tables.
Their names, respective ORM mappings in :py:mod:`cisticola.base`, and brief descriptions are shown below:
+
+- ``channels`` (:py:class:`cisticola.base.Channel`): User-specified information about a channel
+- ``raw_posts`` (:py:class:`cisticola.base.ScraperResult`): Minimally processed information scraped from a post
+- ``posts`` (:py:class:`cisticola.base.Post`): Processed information about a post
+- ``raw_channel_info`` (:py:class:`cisticola.base.RawChannelInfo`): Minimally processed information scraped from a channel
+- ``channel_info`` (:py:class:`cisticola.base.ChannelInfo`): Processed information about a channel
+- ``media`` (:py:class:`cisticola.base.Media`): Processed information about a media file attached to a post
+
+The diagram below shows all columns in each table and their data types, with certain shared primary and foreign key columns colored differently to distinguish them.
+
+.. image:: ../images/database_schema.svg
+    :target: _images/database_schema.svg
+    :width: 100%
+
 TODO
 - Add diagram
 - Describe common workflow and steps
diff --git a/docs/source/conf.py b/docs/source/conf.py
index c291fb8..fc3607f 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -57,4 +57,4 @@ autodoc_default_options = {'exclude-members': '_sa_class_manager'}
 html_favicon = '../images/favicon.ico'
 html_logo = '../images/cisticola_logo.svg'
-html_theme_options = {'style_nav_header_background': '#000000'}
\ No newline at end of file
+html_theme_options = {'style_nav_header_background': '#292a2b'}
\ No newline at end of file
diff --git a/docs/source/deployment.rst b/docs/source/deployment.rst
new file mode 100644
index 0000000..91e0c56
--- /dev/null
+++ b/docs/source/deployment.rst
@@ -0,0 +1,16 @@
+Deployment
+==========
+
+.. warning::
+
+    We are working on making Cisticola easier to install, configure, and use. If these steps are confusing, don't worry: the process will become more accessible.
+
+Docker
+------
+The easiest way to deploy Cisticola is to use Docker.
Docker works like a virtual machine running inside your computer: it isolates everything and makes installation simple.
+
+1. Install `Docker `_
+
+Manual Installation
+-------------------
+TODO
\ No newline at end of file
diff --git a/docs/source/developer_guide.rst b/docs/source/developer_guide.rst
new file mode 100644
index 0000000..2fb512f
--- /dev/null
+++ b/docs/source/developer_guide.rst
@@ -0,0 +1,23 @@
+Developer Guide
+===============
+
+Installation
+------------
+
+To install the necessary dependencies for building the documentation and running unit tests, run the following command from the package root directory:
+
+.. code-block::
+
+    pipenv install --dev
+
+Documentation
+-------------
+If changes are made to the package structure or additional modules are created, you can update the Sphinx source ``cisticola.*.rst`` files by running the following command from the ``docs/`` directory:
+
+.. code-block::
+
+    pipenv run make apidoc
+
+Formatting
+----------
+Cisticola uses `black `_ to format source code.
\ No newline at end of file
diff --git a/docs/source/index.rst b/docs/source/index.rst
index be9e5a3..e3dfcdf 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -6,4 +6,6 @@ Welcome to Cisticola's documentation!
    about
    quickstart
+   deployment
+   developer_guide
    cisticola
\ No newline at end of file
diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
index f6ca747..5d8621e 100644
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
@@ -16,35 +16,10 @@ and then install the dependencies using the following command from the package r
     pipenv install
-To install the necessary dependencies for building the documentation and running unit tests, run the following command from the package root directory:
-
-.. code-block::
-
-    pipenv install --dev
-
 Environment Variables
 ---------------------
-Three of the scrapers in *cisticola* (:py:mod:`~cisticola.scraper.gab.GabScraper`, :py:mod:`~cisticola.scraper.instagram.InstagramScraper`, and :py:mod:`~cisticola.scraper.telegram_telethon.TelegramTelethonScraper`) require platform credentials to work correctly.
-
-Gab
-"""
-
-The Gab credentials can be configured by running the following command from the root directory:
-
-.. code-block::
-
-    pipenv run garc configure
-
-which will direct you to provide the username and password for your Gab account.
-
-Instagram
-"""""""""
-
-The Instagram credentials can be configured by setting the following environment variables, either in the project's ``.env`` file or in the system's environment:
-
-- ``INSTAGRAM_USERNAME``: username of your Instagram account
-- ``INSTAGRAM_PASSWORD``: password of your Instagram account
+One of the scrapers in *cisticola* (:py:mod:`~cisticola.scraper.telegram_telethon.TelegramTelethonScraper`) requires platform credentials to work correctly.
 Telegram Telethon
 """""""""""""""""
@@ -57,6 +32,12 @@ The Telegram credentials can be configured by setting the following environment
 If you do not already have a Telegram application, you can create one by following the instructions on `this page`_.
+To initialize a Telegram session, run the following script from the package's root directory on the command line:
+
+.. code-block:: bash
+
+    pipenv run python telethon_session_init.py
+
 Documentation
 -------------
@@ -86,11 +67,7 @@ To see the logging output from a test run, add the ``--capture=no`` flag to the
 Examples
 --------
-An example of a *cisticola* ingest file ``russian_telegram_ingest.py`` is included in the package root directory, showing how the list of channels to scrape is defined, and how the :py:mod:`~cisticola.scraper.base.ScraperController` and :py:mod:`~cisticola.transformer.base.Transformer` classes are used.
To run the ingest script, run the following command from the package root directory:
-
-.. code-block::
-
-    pipenv run python russian_telegram_ingest.py
+The script ``app.py`` is included in the package root directory, showing how the list of channels to scrape is defined, and how the :py:mod:`~cisticola.scraper.base.ScraperController` and :py:mod:`~cisticola.transformer.base.Transformer` classes are used.
 .. _pipenv: https://pipenv.pypa.io/en/latest/
 .. _Sphinx: https://www.sphinx-doc.org/en/master/
diff --git a/pytest.ini b/pytest.ini
index ae2a8b6..f3545f6 100644
--- a/pytest.ini
+++ b/pytest.ini
@@ -11,10 +11,6 @@ addopts =
     --cov-report html:reports/coverage
     --html='reports/tests.html'
     --self-contained-html
-markers =
-    profile: marks tests for only extracting channel metadata (deselect with '-m "not profile"')
-    media: marks tests for archiving all media attachments (deselect with '-m "not media"')
-    unarchived: marks tests for archiving all unarchived media attachments (deselect with '-m "not unarchived"')
 filterwarnings =
     ignore:the imp module is deprecated:DeprecationWarning
     ignore:The localize method is no longer necessary, as this time zone supports the fold attribute