Add browsertrix-crawler capture

mirror of https://github.com/bellingcat/auto-archiver.git synced 2026-06-13 05:38:29 +03:00

The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. It creates archives in the [WACZ] format, which
is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.).
A WACZ can be replayed using the [ReplayWeb.page] web component, or
unzipped to get the original WARC data (the ISO standard format used by
the Internet Archive Wayback Machine).
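
Because a WACZ is a ZIP file, the WARC data can be located with standard ZIP tooling. The following is a minimal sketch, not code from this PR: the helper names are hypothetical, and the demo archive only imitates the layout described in the WACZ spec (`datapackage.json`, `archive/`, `pages/`).

```python
import io
import zipfile


def list_warc_members(wacz_file):
    """Return the names of WARC files stored inside a WACZ (ZIP) archive."""
    with zipfile.ZipFile(wacz_file) as archive:
        return [
            name
            for name in archive.namelist()
            if name.endswith((".warc", ".warc.gz"))
        ]


def build_demo_wacz():
    """Build a tiny in-memory ZIP laid out like a WACZ (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("datapackage.json", "{}")
        archive.writestr("archive/data.warc.gz", b"")
        archive.writestr("pages/pages.jsonl", "")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_warc_members(build_demo_wacz()))
```

`list_warc_members` also accepts a path to a real `.wacz` file, since `zipfile.ZipFile` takes either a path or a file-like object.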

This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and then recorded in a new column in the spreadsheet. A further column can be added that displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.), using the client-side ReplayWeb.page. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0
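
One way such a display column could be populated is with a link to the hosted ReplayWeb.page viewer. This is a hedged sketch: the function name is hypothetical, and the `source` query parameter and `#view=pages&url=` fragment are assumed from ReplayWeb.page's link format rather than taken from this PR.

```python
from urllib.parse import quote


def replay_url(wacz_url, page_url):
    """Build a ReplayWeb.page link that loads a remote WACZ and opens one page."""
    # Percent-encode both URLs fully so they survive as single parameter values.
    source = quote(wacz_url, safe="")
    page = quote(page_url, safe="")
    return f"https://replayweb.page/?source={source}#view=pages&url={page}"


if __name__ == "__main__":
    print(replay_url("https://example.com/a.wacz", "https://example.com/"))
```

The storage bucket would need to allow cross-origin reads for the viewer to fetch the WACZ from the browser.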

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
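
The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `build_crawl_command` and its parameters are hypothetical, and the crawler options shown (`--url`, `--generateWACZ`, `--collection`) are a minimal subset of the browsertrix-crawler CLI.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def build_crawl_command(url, output_dir, collection, which=shutil.which):
    """Return a docker command for a one-page WACZ capture, or None without Docker."""
    docker = which("docker")
    if docker is None:
        # Log and let the caller carry on without a WACZ for this URL.
        logger.error("Docker is not installed; skipping WACZ capture for %s", url)
        return None
    return [
        docker, "run", "--rm",
        "-v", f"{output_dir}:/crawls/",  # crawler output lands under /crawls/
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--generateWACZ",
        "--collection", collection,
    ]


if __name__ == "__main__":
    print(build_crawl_command("https://example.com/", "/tmp/crawls", "demo"))
```

Returning `None` instead of raising keeps the rest of the archiving run unaffected; the caller can pass the list to `subprocess.run` when it is not `None`.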

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page


Ed Summers

2022-09-25 19:40:20 +00:00

parent 0bdd06f641

commit 3b87dffe6b

11 changed files with 63 additions and 13 deletions

									

example.config.yaml
									
												
@@ -119,4 +119,5 @@ execution:
    duration: duration
    screenshot: screenshot
    hash: hash
    wacz: wacz
