Commit Graph

19 Commits

Author SHA1 Message Date
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. It creates archives in the [WACZ] format, which is
essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.). A WACZ
can be replayed using the [ReplayWeb.page] web component, or unzipped to get
the original WARC data (the ISO standard format used by the Internet Archive
Wayback Machine).
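
Since a WACZ is a ZIP file, the "unzipped to get the original WARC data" step can be sketched with Python's standard `zipfile` module. This is a minimal illustration and not code from this commit: the function name `list_warc_members` is hypothetical, and it assumes the WARC files sit under the `archive/` directory, as the WACZ specification lays out.

```python
import zipfile
from typing import List


def list_warc_members(wacz) -> List[str]:
    """Return the names of WARC files stored inside a WACZ archive.

    `wacz` may be a filesystem path or a binary file-like object.
    Assumption: WARC data lives under `archive/`, per the WACZ spec.
    """
    with zipfile.ZipFile(wacz) as zf:
        return [
            name
            for name in zf.namelist()
            if name.startswith("archive/")
            and name.endswith((".warc", ".warc.gz"))
        ]
```

The returned names can be passed to `ZipFile.extractall(dest, members=...)` to pull out only the WARC data.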

This PR adds browsertrix-crawler to archiver classes where screenshots are
made. The WACZ is uploaded to storage and its location is then added to a new
column in the spreadsheet. A further column can be added that displays the
WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side
ReplayWeb.page. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
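
The "log an error and carry on" behaviour described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this commit: `run_crawl_if_docker` is a hypothetical name, the Docker arguments are supplied by the caller rather than spelled out here, and the `which` and `runner` parameters exist only so the function can be exercised without Docker present.

```python
import logging
import shutil
import subprocess
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def run_crawl_if_docker(
    docker_args: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
    runner: Callable[..., object] = subprocess.run,
) -> bool:
    """Run `docker <docker_args>` if Docker is on PATH; otherwise log and skip.

    Returns True when the command ran without error, False otherwise, so the
    caller can continue archiving either way.
    """
    if which("docker") is None:
        logger.error("Docker is not installed; skipping browsertrix-crawler capture")
        return False
    try:
        # check=True makes subprocess.run raise on a non-zero exit status.
        runner(["docker", *docker_args], check=True)
    except (subprocess.CalledProcessError, OSError) as exc:
        logger.error("browsertrix-crawler capture failed: %s", exc)
        return False
    return True
```

Returning a boolean instead of raising keeps the WACZ capture optional: the other archiving steps (such as screenshots) do not depend on it.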

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
5cc21fa4e0 bug fix 2022-06-15 17:04:56 +02:00
msramalho
13e7d0bf1b improving path operations 2022-06-08 11:11:09 +02:00
msramalho
e2d1a5d6be import cleanups 2022-06-03 18:30:12 +02:00
msramalho
6bd6f88b46 refactor 2022-05-09 17:45:54 +02:00
msramalho
0d65798308 wip: configurations and logic 2022-05-09 14:54:48 +02:00
Logan Williams
538bb05395 Merge branch 'main' of github.com:bellingcat/auto-archiver into main 2022-03-18 09:53:29 +01:00
Logan Williams
d611aa1e14 Some videos don't render a duration for some reason 2022-03-18 09:44:17 +01:00
msramalho
3b9b42b854 minor code cleanup 2022-03-15 11:32:39 +01:00
msramalho
077c71f941 fixes index out of range bug 2022-03-09 12:18:06 +01:00
Logan Williams
82ca6792c4 Fix issue with extracting time from Telegram media posts 2022-03-02 14:45:36 +01:00
Logan Williams
2d50703489 Generate archivers for Telegram posts with images; move generation to function in base_archiver 2022-02-28 08:41:45 +01:00
Logan Williams
63a2847ac9 Add header argument; set up webdriver 2022-02-25 16:09:35 +01:00
Logan Williams
1eb17e4de5 Add hash and screenshot methods; switch to more recent ytdl fork 2022-02-25 13:54:40 +01:00
msramalho
8bce84082a minor updates 2022-02-23 18:32:40 +01:00
msramalho
9a264a7dfe cleanup and docs 2022-02-23 16:07:58 +01:00
msramalho
9550cd509e making code more resilient to exceptions 2022-02-23 13:57:11 +01:00
msramalho
e4603a9423 refactoring storage and bringing changes from origin 2022-02-22 16:03:35 +01:00
msramalho
f3ce226665 split into multiple files MVP 2022-02-21 14:19:09 +01:00