Commit Graph

79 Commits

Author SHA1 Message Date
Patrick Robertson
dbc564e18b Add sphinx_book_theme theme to poetry 2025-02-10 22:58:52 +00:00
Patrick Robertson
2650cd8fb2 Use a script to auto-generate documentation for the core modules from the manifest file 2025-02-10 22:51:04 +00:00
Patrick Robertson
824728739a Start fleshing out the docs more - rearrange, separate out modules section, move files over to md (from rst) 2025-02-10 16:24:16 +00:00
erinhmclark
dd402b456f Fix and add types to manifest 2025-01-24 18:50:11 +00:00
erinhmclark
170f8d18a6 Add instructions to README.md, include build directories in .gitignore and do a bit more tidying 2025-01-16 20:46:10 +00:00
Patrick Robertson
100996f1e5 Add docker-compose for easy building and running of docker image in dev
Just use docker compose up
2025-01-15 14:36:02 +01:00
Patrick Robertson
cef4037ad5 Add documentation on running tests to the readme 2025-01-14 11:30:06 +01:00
Patrick Robertson
bdfedfcf61 Merge branch 'main' into feat/unittest 2025-01-13 19:50:47 +01:00
Patrick Robertson
52f064908e Add unit test badges to readme 2025-01-13 18:33:22 +01:00
erinhmclark
72a8e76fbb Update README.md for usage with Poetry. 2025-01-12 20:21:23 +00:00
msramalho
11c3288267 closes #146 2024-08-21 13:33:58 +01:00
msramalho
dc9e64397e bumping yt-dlp 2024-07-18 11:23:09 +01:00
R. Miles McCain
f603400d0d Add direct Atlos integration (#137)
* Add Atlos feeder

* Add Atlos db

* Add Atlos storage

* Fix Atlos storages

* Fix Atlos feeder

* Only include URLs in Atlos feeder once they're processed

* Remove print

* Add Atlos documentation to README

* Formatting fixes

* Don't archive existing material

* avoid KeyError in atlos_db

* version bump

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2024-04-15 19:25:17 +01:00
Miguel Sozinho Ramalho
7a21ae96af V0.9.0 - closes several open issues: new enrichers and bug fixes (#133)
* clean orchestrator code, add archiver cleanup logic

* improves documentation for database.py

* telethon archivers isolate sessions into copied files

* closes #127

* closes #125

* closes #84

* meta enricher applies to all media

* closes #61 adds subtitles and comments

* minor update

* minor fixes to yt-dlp subtitles and comments

* closes #17 but logic is imperfect.

* closes #85 ssl enhancer

* minifies HTML, JS refactor for preview of certificates

* closes #91 adds freetsa timestamp authority

* version bump

* simplify download_url method

* skip ssl if nothing archived

* html preview improvements

* adds retrying lib

* manual download archiver improvements

* meta only runs when relevant data available

* new metadata convenience method

* html template improvements

* removes debug message

* does not close #91 yet, will need more certificate chain logging

* adds verbosity config

* new instagram api archiver

* adds proxy support we

* adds proxy/end support and bug fix for yt-dlp

* proxy support for webdriver

* adds socks proxy to wacz_enricher

* refactor recursivity in inner media and display

* infinite recursive display

* foolproofing timestamping authorities

* version to 0.9.0

* minor fixes from code-review
2024-02-20 18:05:29 +00:00
Tomas Apodaca
590d3fe824 Fix typo in readme (#121) 2024-01-24 21:17:31 +00:00
msramalho
b7889a182d readme update 2023-06-26 18:18:46 +01:00
msramalho
04f827f183 Bump version to v0.5.25 for release 2023-06-26 18:15:45 +01:00
Miguel Sozinho Ramalho
cc03ad7c49 Update README.md 2023-05-11 13:55:28 +01:00
Logan Williams
6d2aa3dd7a Add invocation example 2023-05-11 14:32:23 +02:00
Logan Williams
f2e580de4e Update README images 2023-05-11 14:30:27 +02:00
Logan Williams
80ea912d0e Update README 2023-05-11 11:32:46 +02:00
Logan Williams
26373d4545 Re-order README slightly 2023-05-10 11:48:34 +02:00
Miguel Sozinho Ramalho
b67a7b818a Merge pull request #75 from bellingcat/feature/browsertrix 2023-05-10 10:14:40 +01:00
Logan Williams
2e63cb8411 Update README with new entrypoint 2023-05-10 11:13:47 +02:00
msramalho
e150370657 updates docker instructions 2023-05-10 09:51:53 +01:00
msramalho
ae3e607705 fix: deprecating thumbnail_index 2023-05-09 11:29:05 +01:00
msramalho
876988b587 detect invalid url messages instagram bot 2023-02-20 12:22:52 +00:00
msramalho
d1e4574c6c readme updates 2023-02-17 16:30:50 +00:00
msramalho
f35875a94c name fix 2023-02-17 15:46:05 +00:00
msramalho
224ebe7ee8 links 2023-02-08 22:27:56 +00:00
msramalho
54a1bc2172 update readme 2023-02-08 22:26:24 +00:00
msramalho
77948207d1 update 2023-02-08 22:24:40 +00:00
msramalho
60552ae0ea update readme 2023-02-08 22:23:25 +00:00
msramalho
f255271ecb update README 2023-02-08 22:17:22 +00:00
msramalho
2a7ece5dcc cleanups and docs 2023-02-08 22:13:19 +00:00
msramalho
d31b3dda52 Bump version to v0.2.17 for release 2023-02-07 23:56:42 +00:00
msramalho
f81ff14faa license to publish 2023-02-07 23:43:50 +00:00
msramalho
5ed38ffaab clean readme 2023-02-07 23:37:53 +00:00
msramalho
9b4a41e654 Bump version to v0.2.0 for release 2023-02-07 22:07:23 +00:00
msramalho
b3860cfec1 telethon join channels working 2022-12-14 14:01:39 +00:00
msramalho
65dd155c90 WIP refactor logic 2022-11-15 15:00:52 +00:00
msramalho
22363cb8b9 adds information on browsertrix usage 2022-10-20 11:59:23 +01:00
msramalho
ac4f1b6132 readme updates 2022-10-19 11:37:04 +01:00
msramalho
26903190fd adds wacz link 2022-10-17 14:41:34 +01:00
msramalho
57464f1506 refactors for edges in browsertrix and s3 upload, adds timeout parameter 2022-10-17 14:07:31 +01:00
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. In
order to not require the passing of the browsertrix config to every
Archiver, the Archiver constructors (including the base) were modified to
accept a Storage and Config instance. Some of the constructors then pick
out the pieces they need from the Config, in addition to calling the
parent constructor. In order to avoid a circular import that this
created, the Config object now defines the default hash function to use,
rather than having it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
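
The constructor change described in the commit above can be illustrated with a minimal sketch. This is not the project's actual code: the class bodies, the `hash_algorithm` and `browsertrix_profile` attribute names, and the `ScreenshotArchiver` subclass are all hypothetical, chosen only to show the shape of the refactor (Config owns the default hash function, and every Archiver receives a Storage and a Config).

```python
import hashlib
from typing import Callable, Optional


class Storage:
    """Hypothetical stand-in for a storage backend."""


class Config:
    """Hypothetical config object; attribute names are assumptions."""

    def __init__(self,
                 browsertrix_profile: Optional[str] = None,
                 hash_algorithm: Callable = hashlib.sha256):
        # The default hash function lives here, so the config module
        # never has to import the archiver module (no circular import).
        self.hash_algorithm = hash_algorithm
        self.browsertrix_profile = browsertrix_profile


class Archiver:
    """Base archiver: every constructor takes a Storage and a Config."""

    def __init__(self, storage: Storage, config: Config):
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm

    def get_hash(self, data: bytes) -> str:
        # Hash with whatever function the Config supplied.
        return self.hash_algorithm(data).hexdigest()


class ScreenshotArchiver(Archiver):
    """Hypothetical subclass that also needs the browsertrix profile."""

    def __init__(self, storage: Storage, config: Config):
        super().__init__(storage, config)
        # Pick out only the piece of the Config this archiver needs.
        self.browsertrix_profile = config.browsertrix_profile
```

With this shape, a new option only has to be added to the Config and read by the subclasses that care about it; the other constructors stay unchanged.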
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format which is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc) which can then be replayed using the [ReplayWeb.page] web
component, or unzipped to get the original WARC data (the ISO standard
format used by the Internet Archive Wayback Machine).

This PR adds browsertrix-crawler to archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
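
Because a WACZ is a ZIP file, as the commit above notes, the original WARC data can be pulled out with a standard ZIP library. The following is a rough sketch using only Python's standard library; the function name `extract_warcs` is hypothetical, and selecting members by their `.warc` / `.warc.gz` suffix is an assumption made for illustration rather than something taken from the commit.

```python
import zipfile
from typing import List


def extract_warcs(wacz_path: str, out_dir: str) -> List[str]:
    """Extract WARC members from a WACZ (ZIP) file.

    Returns the archive-internal names of the members that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(wacz_path) as zf:
        for name in zf.namelist():
            # Assumption: WARC payloads are identified by file suffix.
            if name.endswith((".warc", ".warc.gz")):
                zf.extract(name, out_dir)
                extracted.append(name)
    return extracted
```

Calling `extract_warcs("capture.wacz", "out")` on a hypothetical capture file would write the WARC members under `out/`, keeping their paths inside the archive.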
Miguel Sozinho Ramalho
0bdd06f641 Update README.md 2022-09-22 15:58:41 +02:00
msramalho
34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho
ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00