Commit Graph

268 Commits

Author SHA1 Message Date
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid passing the browsertrix config to every Archiver individually, the
Archiver constructors (including the base) were modified to accept a
Storage and a Config instance. Some of the constructors then pick out
the pieces they need from the Config, in addition to calling the parent
constructor. To avoid the circular import this created, the Config
object now defines the default hash function to use, rather than having
it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
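The constructor change described in c34fb9cf10 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the names `browsertrix_profile`, `hash_function`, `get_hash`, and `ScreenshotArchiver` are hypothetical, and the real `Storage` and `Config` classes carry more than is shown here.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


class Storage:
    """Hypothetical stand-in for a storage backend."""


@dataclass
class Config:
    # Hypothetical fields; the real Config may look different.
    browsertrix_profile: Optional[str] = None
    # The default hash function lives on Config, so Config never has to
    # import Archiver (which is what would create the circular import).
    hash_function: Callable = hashlib.sha256


class Archiver:
    def __init__(self, storage: Storage, config: Config):
        self.storage = storage
        self.hash_function = config.hash_function

    def get_hash(self, data: bytes) -> str:
        return self.hash_function(data).hexdigest()


class ScreenshotArchiver(Archiver):
    def __init__(self, storage: Storage, config: Config):
        # Call the parent constructor, then pick out the extra piece needed.
        super().__init__(storage, config)
        self.browsertrix_profile = config.browsertrix_profile


archiver = ScreenshotArchiver(Storage(), Config(browsertrix_profile="profile.tar.gz"))
```

Passing the whole Config keeps every constructor signature the same, so adding a new option later does not require touching each subclass.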
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.). A WACZ can be replayed using the
[ReplayWeb.page] web component, or unzipped to get the original WARC
data (the ISO standard format used by the Internet Archive Wayback
Machine).
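
Because a WACZ is a ZIP file, the WARC data can be located with standard ZIP tooling. The sketch below is illustrative only: `list_warc_members` is a hypothetical helper, and the in-memory archive is a placeholder built so the example runs without a real capture (its contents are not valid WARC data).

```python
import io
import zipfile


def list_warc_members(wacz_file):
    """Return the names of WARC files stored inside a WACZ (ZIP) archive."""
    with zipfile.ZipFile(wacz_file) as zf:
        return [
            name
            for name in zf.namelist()
            if name.endswith((".warc", ".warc.gz"))
        ]


# Build a tiny in-memory stand-in for a WACZ (assumed layout, placeholder data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("datapackage.json", "{}")
    zf.writestr("archive/data.warc.gz", b"placeholder")
buffer.seek(0)

warcs = list_warc_members(buffer)
```

With a real capture, the same function could be called with a path to the `.wacz` file instead of the in-memory buffer.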

This PR adds browsertrix-crawler to archiver classes where screenshots
are made. The WACZ is uploaded to storage and then added to a new column
in the spreadsheet. A column can be added that will display the WACZ,
loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side
ReplayWeb.page. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
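
One way to get this "log and carry on" behaviour is to check for the `docker` executable before attempting a capture. This is a sketch under assumptions, not the code from this PR: `capture_wacz` and its `run_crawler` and `which` parameters are hypothetical names, and the actual crawler invocation is left to the caller.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def capture_wacz(url, run_crawler, which=shutil.which):
    """Run a crawl if Docker is available; otherwise log an error and return None.

    run_crawler is a caller-supplied function (hypothetical) that performs
    the capture; which is injectable so the check can be exercised without
    depending on the local machine.
    """
    if which("docker") is None:
        logger.error("Docker is not installed; skipping WACZ capture of %s", url)
        return None
    return run_crawler(url)
```

Returning `None` instead of raising lets the calling archiver treat the WACZ as optional and continue with its other outputs.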

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
Miguel Sozinho Ramalho
0bdd06f641 Update README.md 2022-09-22 15:58:41 +02:00
Miguel Sozinho Ramalho
0bd9e043ed Merge pull request #58 from bellingcat/dev 2022-09-21 18:53:13 +02:00
msramalho
c77b4a080a update comment 2022-09-21 18:52:23 +02:00
Miguel Sozinho Ramalho
e813249520 Merge pull request #56 from djhmateer/oauth 2022-07-25 15:03:49 +01:00
msramalho
992dee022a format 2022-07-25 14:59:04 +01:00
msramalho
961dcdb4ef Merge branch 'dev' into oauth 2022-07-25 14:58:56 +01:00
msramalho
6124bc5f72 refactored and simplified obtaining credentials 2022-07-25 14:52:50 +01:00
Miguel Sozinho Ramalho
12918d4fce Merge pull request #55 from djhmateer/dev-upstream 2022-07-25 12:38:33 +01:00
msramalho
63140d69c1 format 2022-07-25 12:35:27 +01:00
msramalho
e180b82b0d removing useless constructors 2022-07-25 12:29:42 +01:00
msramalho
9317b5e035 turning HASH_ALGORITHM into global archiver prop 2022-07-25 12:27:50 +01:00
msramalho
2d7d8c4e08 renaming and making default SHA-256 2022-07-25 12:12:43 +01:00
msramalho
7b8be95e25 removing empty line 2022-07-25 12:12:14 +01:00
Dave Mateer
524b40b869 Added Google OAuth flow for Google Drive so a real user, rather than a service account, can be used to save files 2022-07-18 13:39:00 +01:00
Dave Mateer
9f9b9d8f63 adding in GD token 2022-07-18 13:25:05 +01:00
Dave Mateer
363a8ef67a Added hash_algorithm to config to choose between SHA256 and SHA3_512 2022-07-18 13:15:48 +01:00
msramalho
6d8be4c07f s3 allow online preview instead of forced download 2022-07-14 18:16:06 +02:00
Miguel Sozinho Ramalho
d701141c1b Merge pull request #51 from djhmateer/whitelist 2022-07-14 17:13:29 +01:00
msramalho
37e1fcd540 comment 2022-07-14 18:10:53 +02:00
msramalho
90cb080c81 refactoring and renaming 2022-07-14 18:10:02 +02:00
Miguel Sozinho Ramalho
4a7aac59de Merge pull request #50 from djhmateer/leadingslash
Put in fix for leading / in Google Drive
2022-07-14 16:45:49 +01:00
msramalho
03e542a0fc isolate into function 2022-07-14 17:45:28 +02:00
Dave Mateer
42172566f2 Added whitelist and blacklist for worksheets (not spreadsheet) 2022-07-12 12:53:59 +01:00
Dave Mateer
16bd54b8d3 Put in fix for leading / in Google Drive 2022-07-12 12:44:29 +01:00
msramalho
3095ce3054 fix: missing key bug 2022-07-04 18:25:33 +02:00
Miguel Sozinho Ramalho
7aeb38a773 Merge pull request #45 from bellingcat/twitter-api-and-hack
Twitter: new archiver, new hack, ready
2022-06-27 13:37:52 +01:00
msramalho
4b423dfc34 fix telethon exception 2022-06-27 14:36:58 +02:00
msramalho
34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho
179528562b minor updates 2022-06-27 01:07:59 +02:00
msramalho
ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
Miguel Sozinho Ramalho
76b531c56a Merge pull request #44 from bellingcat/vk-url-lib 2022-06-21 14:40:08 +01:00
msramalho
b4e9d6a2a8 removes log 2022-06-21 15:39:54 +02:00
msramalho
c4efa6e597 adding thumbnails 2022-06-21 15:39:13 +02:00
msramalho
8a8251d622 fix in upstream lib for filenames 2022-06-21 01:44:48 +02:00
msramalho
74d421dc94 update lib 2022-06-21 00:05:32 +02:00
msramalho
88ede91304 refactoring to use vk_url_scraper 2022-06-20 14:44:06 +02:00
msramalho
177e3a623e improve log 2022-06-16 20:04:43 +02:00
msramalho
14dae0b938 remove unused import 2022-06-16 20:01:23 +02:00
Miguel Sozinho Ramalho
74cef2f21b Merge pull request #43 from bellingcat/refactor-tmp-dir-logic 2022-06-16 19:00:19 +01:00
msramalho
7ab8d0e825 tmp folder randomly created in folder 2022-06-16 19:58:26 +02:00
msramalho
d2e29f85d2 selenium: quit and close 2022-06-16 18:45:47 +02:00
msramalho
3efb835222 fix: telethon bad regex for ?single 2022-06-16 18:06:17 +02:00
Miguel Sozinho Ramalho
3d9a2622c3 Update README.md 2022-06-16 16:23:53 +01:00
Miguel Sozinho Ramalho
2650dcc680 Merge pull request #42 from bellingcat/dev 2022-06-16 16:20:32 +01:00
Miguel Sozinho Ramalho
4c6f3ea688 Merge pull request #33 from bellingcat/refactor-configs
breaking changes: refactor configs + fixes
2022-06-16 16:19:28 +01:00
Miguel Sozinho Ramalho
b7f1ec5404 Merge pull request #40 from bellingcat/vk-archiver 2022-06-16 16:18:48 +01:00
msramalho
14add43923 fixing auto_auto_archive 2022-06-16 17:17:25 +02:00
msramalho
cdd66fb7da returning empty string thumbs 2022-06-16 16:30:08 +02:00