Commit Graph

15 Commits

Author SHA1 Message Date
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can then be replayed using the [ReplayWeb.page] web
component, or unzipped to get the original WARC data (the ISO standard
format used by the Internet Archive Wayback Machine).
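
Because a WACZ is a ZIP container, the standard library is enough to look inside one. The following is a minimal sketch, not code from this PR: the function name `list_warc_members` is hypothetical, and it simply assumes WARC members can be recognized by their `.warc` or `.warc.gz` extension.

```python
import zipfile


def list_warc_members(wacz_path):
    """Return the names of WARC files stored inside a WACZ archive.

    Hypothetical helper for illustration: a WACZ is a ZIP file, so it
    can be opened with zipfile and filtered by file extension.
    """
    with zipfile.ZipFile(wacz_path) as wacz:
        return [
            name
            for name in wacz.namelist()
            if name.endswith((".warc", ".warc.gz"))
        ]
```

Calling `zipfile.ZipFile(wacz_path).extractall(target_dir)` instead would unpack the whole archive, including the WARC data.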

This PR adds browsertrix-crawler to the archiver classes where screenshots
are made. The WACZ is uploaded to storage and then added to a new column in
the spreadsheet. A further column can be added that will display the WACZ,
loaded from cloud storage (S3, DigitalOcean, etc.), using the client-side
ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
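
The fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `docker_available`, `capture_wacz`, and the `run_crawler` callable are hypothetical names, and Docker is detected simply by looking for a `docker` executable on the `PATH`.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available():
    """Return True if a `docker` executable is found on the PATH."""
    return shutil.which("docker") is not None


def capture_wacz(url, run_crawler):
    """Run the crawl if Docker is present; otherwise log and skip.

    `run_crawler` is a hypothetical callable that performs the actual
    browsertrix-crawler run and returns the path of the WACZ it wrote.
    """
    if not docker_available():
        # Log the problem and return None so the caller can carry on
        # with the rest of the archiving.
        logger.error("Docker not found; skipping WACZ capture for %s", url)
        return None
    return run_crawler(url)
```

Returning `None` rather than raising keeps a missing Docker install from aborting the rest of the archiving run.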

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
e180b82b0d removing useless constructors 2022-07-25 12:29:42 +01:00
msramalho
9317b5e035 turning HASH_ALGORITHM into global archiver prop 2022-07-25 12:27:50 +01:00
Dave Mateer
363a8ef67a Added hash_algorithm to config to choose between SHA256 and SHA3_512 2022-07-18 13:15:48 +01:00
msramalho
3095ce3054 fix: missing key bug 2022-07-04 18:25:33 +02:00
msramalho
34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho
179528562b minor updates 2022-06-27 01:07:59 +02:00
msramalho
ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
msramalho
2be539d39e twitter archiver improvements 2022-06-14 20:55:43 +02:00
msramalho
159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
Dave Mateer
dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
Dave Mateer
cb18289e4f Get Twitter original size image quality 2022-05-09 10:55:38 +01:00
Logan Williams
2d50703489 Generate archivers for Telegram posts with images; move generation to function in base_archiver 2022-02-28 08:41:45 +01:00
Logan Williams
09dc5b5b81 Fix issue with query parameters by using urllib 2022-02-25 15:29:56 +01:00
Logan Williams
6a62c5798c Add Twitter non-video archiver 2022-02-25 13:55:43 +01:00