Commit Graph

31 Commits

Author SHA1 Message Date
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid passing the browsertrix config to every Archiver, the Archiver
constructors (including the base) were modified to accept a Storage and
a Config instance. Some of the constructors then pick out the pieces
they need from the Config, in addition to calling the parent
constructor. To avoid the circular import this created, the Config
object now defines the default hash function to use, rather than having
it be a static property of the Archiver class.
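
The constructor arrangement described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the attribute names `browsertrix_profile` and `hash_algorithm`, the `ScreenshotArchiver` subclass, and the `get_hash` helper are all assumptions made for illustration.

```python
import hashlib


class Storage:
    """Placeholder for a storage backend (hypothetical stand-in)."""


class Config:
    """Holds settings shared by all archivers (attribute names are assumed)."""

    def __init__(self, browsertrix_profile=None, hash_algorithm=hashlib.sha256):
        self.browsertrix_profile = browsertrix_profile
        # The default hash function is defined here, not on Archiver,
        # so Config never needs to import the archiver module.
        self.hash_algorithm = hash_algorithm


class Archiver:
    """Base archiver: takes a Storage and a Config instance."""

    def __init__(self, storage, config):
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm

    def get_hash(self, data: bytes) -> str:
        # Hash raw bytes with whatever function the Config supplied.
        return self.hash_algorithm(data).hexdigest()


class ScreenshotArchiver(Archiver):
    """Hypothetical subclass that also needs the browsertrix profile."""

    def __init__(self, storage, config):
        super().__init__(storage, config)
        # Pick out only the piece of the Config this archiver needs.
        self.browsertrix_profile = config.browsertrix_profile
```

Passing the whole Config keeps constructor signatures uniform across subclasses, at the cost of each subclass knowing which Config attributes to read.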
2022-10-11 16:21:42 -04:00
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can then be replayed using the
[ReplayWeb.page] web component, or unzipped to get the original WARC
data (the ISO standard format used by the Internet Archive Wayback
Machine).
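
Because a WACZ is a ZIP file, its contents can be inspected with ordinary ZIP tooling. The sketch below uses Python's standard `zipfile` module; the function names are hypothetical, and the sample archive is a tiny stand-in built in memory, not real crawler output.

```python
import io
import zipfile


def list_wacz_members(wacz) -> list:
    """Return the member names of a WACZ, read as an ordinary ZIP archive.

    `wacz` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz) as zf:
        return zf.namelist()


def build_sample_wacz() -> io.BytesIO:
    """Build a tiny in-memory stand-in for a WACZ (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", "{}")      # package metadata
        zf.writestr("archive/data.warc.gz", b"")   # WARC data would live here
    buf.seek(0)
    return buf
```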

This PR adds browsertrix-crawler to the archiver classes where
screenshots are made. The WACZ is uploaded to storage and its location
is then added to a new column in the spreadsheet. A further column can
be added that displays the WACZ, loaded from cloud storage (S3,
DigitalOcean, etc.) using the client-side [ReplayWeb.page] component.
You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and processing continues as normal.
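
One way to implement that fallback is to check for the `docker` executable before attempting a capture. This is a sketch under assumptions: `capture_wacz`, `run_crawler`, and the injectable `which` parameter are hypothetical names, and the actual crawler invocation is left to the caller.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available(which=shutil.which) -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return which("docker") is not None


def capture_wacz(url, run_crawler, which=shutil.which):
    """Run the crawler for `url`, or log an error and return None without Docker.

    `run_crawler` is a caller-supplied callable that performs the real
    capture and returns the path of the resulting WACZ (assumed interface).
    """
    if not docker_available(which):
        logger.error("Docker is not installed; skipping browsertrix-crawler capture")
        return None  # the caller carries on with the rest of the archiving
    return run_crawler(url)
```

Returning `None` instead of raising lets the rest of the archiving run proceed when no WACZ can be produced.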

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
c77b4a080a update comment 2022-09-21 18:52:23 +02:00
msramalho
992dee022a format 2022-07-25 14:59:04 +01:00
msramalho
6124bc5f72 refactored and simplified obtaining credentials 2022-07-25 14:52:50 +01:00
Dave Mateer
9f9b9d8f63 adding in GD token 2022-07-18 13:25:05 +01:00
msramalho
6d8be4c07f s3 allow online preview instead of forced download 2022-07-14 18:16:06 +02:00
msramalho
03e542a0fc isolate into function 2022-07-14 17:45:28 +02:00
Dave Mateer
16bd54b8d3 Put in fix for leading / in Google Drive 2022-07-12 12:44:29 +01:00
msramalho
22f20ba744 improve debugging 2022-06-14 19:54:53 +02:00
msramalho
6dcb59fea6 removing unnecessary method 2022-06-08 11:46:00 +02:00
msramalho
f0a276e3a5 bug fix 2022-06-08 11:45:38 +02:00
msramalho
f87acb6d1d refactor 2022-06-07 18:41:58 +02:00
msramalho
a2fdfacb26 config refactor and cleanup 2022-06-03 17:32:25 +02:00
msramalho
c679e02c73 updated storages init 2022-06-03 17:32:02 +02:00
msramalho
d33daabee1 refactoring storages 2022-06-03 15:46:00 +02:00
msramalho
10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho
159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
msramalho
b895def432 method customization to children 2022-05-25 12:23:52 +02:00
Dave Mateer
dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
Dave Mateer
b3599dee71 working 2022-05-11 14:01:22 +01:00
msramalho
0d65798308 wip: configurations and logic 2022-05-09 14:54:48 +02:00
msramalho
03a6611c86 adds local storage 2022-05-03 20:33:02 +02:00
msramalho
24340190af s3 storage config refactor 2022-05-03 20:32:53 +02:00
Logan Williams
398f296789 Fix Selenium driver issues with telegram links 2022-03-18 11:10:27 +01:00
Logan Williams
538bb05395 Merge branch 'main' of github.com:bellingcat/auto-archiver into main 2022-03-18 09:53:29 +01:00
Logan Williams
050b04e31d Add flag for storage privacy 2022-03-18 09:53:21 +01:00
msramalho
30787506a1 additional logging 2022-03-16 19:50:29 +01:00
msramalho
3b9b42b854 minor code cleanup 2022-03-15 11:32:39 +01:00
msramalho
59027ac477 simplification 2022-03-09 11:44:19 +01:00
msramalho
e4603a9423 refactoring storage and bringing changes from origin 2022-02-22 16:03:35 +01:00