Commit Graph

15 Commits

Author SHA1 Message Date
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.). A WACZ file can be replayed using the
[ReplayWeb.page] web component, or unzipped to get the original WARC
data (the ISO standard format used by the Internet Archive Wayback
Machine).
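
Because a WACZ file is a ZIP container, the standard library is enough to inspect one. The following is a minimal sketch, not code from this PR: the helper name `list_warc_members` is hypothetical, and it assumes only that WARC files inside the archive end in `.warc` or `.warc.gz`.

```python
import zipfile


def list_warc_members(wacz_path):
    """Return the names of WARC files stored inside a WACZ (ZIP) archive.

    Hypothetical helper for illustration; it matches members by file
    extension only and does not validate the WACZ layout.
    """
    with zipfile.ZipFile(wacz_path) as wacz:
        return [
            name
            for name in wacz.namelist()
            if name.endswith((".warc", ".warc.gz"))
        ]
```

To unzip the archive and get at the WARC data itself, `zipfile.ZipFile(wacz_path).extractall(target_dir)` does the job.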

This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and its location is written to a new column in the spreadsheet. A further column can be added to display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and everything else continues as
normal.
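
The fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `maybe_capture_wacz` and the `capture` callable are hypothetical names, and the check only looks for a `docker` executable on `PATH`, which does not prove the Docker daemon is running.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def maybe_capture_wacz(url, capture, which=shutil.which):
    """Run `capture(url)` only if a `docker` executable is on PATH.

    `capture` is a hypothetical callable that would invoke
    browsertrix-crawler and return the path of the WACZ it produced.
    `which` is injectable so the lookup can be replaced in tests.
    """
    if which("docker") is None:
        # Log the problem and carry on; the caller treats None as "no WACZ".
        logger.error("Docker not found; skipping WACZ capture for %s", url)
        return None
    return capture(url)
```

Returning `None` rather than raising lets the other archiving steps proceed without the WACZ.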

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
c622f941d7 tiktok bug fix 2022-06-08 11:44:49 +02:00
msramalho
13e7d0bf1b improving path operations 2022-06-08 11:11:09 +02:00
msramalho
f87acb6d1d refactor 2022-06-07 18:41:58 +02:00
msramalho
e2d1a5d6be import cleanups 2022-06-03 18:30:12 +02:00
msramalho
10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho
159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
Dave Mateer
dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
msramalho
6bd6f88b46 refactor 2022-05-09 17:45:54 +02:00
msramalho
0d65798308 wip: configurations and logic 2022-05-09 14:54:48 +02:00
Logan Williams
1eb17e4de5 Add hash and screenshot methods; switch to more recent ytdl fork 2022-02-25 13:54:40 +01:00
msramalho
9a264a7dfe cleanup and docs 2022-02-23 16:07:58 +01:00
msramalho
2d145802b5 extracted worksheet operations 2022-02-23 09:54:03 +01:00
msramalho
e4603a9423 refactoring storage and bringing changes from origin 2022-02-22 16:03:35 +01:00
msramalho
f3ce226665 split into multiple files MVP 2022-02-21 14:19:09 +01:00