Ed Summers
3b87dffe6b
Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. It creates archives in the [WACZ] format, which
is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.).
A WACZ can be replayed using the [ReplayWeb.page] web component, or
unzipped to get the original WARC data (the ISO standard format used by
the Internet Archive Wayback Machine).
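Since a WACZ is a ZIP file, the WARC data can be pulled out with any ZIP library. The following is a minimal sketch using only Python's standard `zipfile` module; the function names `list_warc_members` and `extract_warcs` are hypothetical and not part of this PR, and the sketch assumes WARC members can be recognised by a `.warc` or `.warc.gz` suffix.

```python
import zipfile
from pathlib import Path

# Suffixes assumed to identify WARC members inside the archive.
WARC_SUFFIXES = (".warc", ".warc.gz")


def list_warc_members(wacz_path):
    """Return the names of WARC files stored inside a WACZ (ZIP) archive."""
    with zipfile.ZipFile(wacz_path) as wacz:
        return [n for n in wacz.namelist() if n.endswith(WARC_SUFFIXES)]


def extract_warcs(wacz_path, dest_dir):
    """Extract only the WARC members of a WACZ into dest_dir.

    Returns the paths of the extracted files.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    members = list_warc_members(wacz_path)
    with zipfile.ZipFile(wacz_path) as wacz:
        wacz.extractall(dest, members=members)
    return [dest / name for name in members]
```

Calling `extract_warcs("capture.wacz", "out")` on a hypothetical `capture.wacz` would leave the non-WARC members (indexes, page lists, metadata) inside the archive and write only the WARC files under `out/`.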
This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and its location is then added to a new column in the spreadsheet. A further column can be added that displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:
https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0
browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal,
without the WACZ capture.
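The "log an error and carry on" behaviour can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `docker_available` and `maybe_capture_wacz` are hypothetical names, `capture` stands in for whatever function actually runs the crawler, and the check only looks for a `docker` executable on `PATH` (it does not verify that the Docker daemon is running).

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available():
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def maybe_capture_wacz(url, capture):
    """Call capture(url) only when Docker is available.

    `capture` is a placeholder for the real crawler invocation. When
    Docker is missing, an error is logged and None is returned so the
    caller can continue with the rest of the archiving steps.
    """
    if not docker_available():
        logger.error("Docker not found; skipping WACZ capture for %s", url)
        return None
    return capture(url)
```

Returning `None` rather than raising keeps the WACZ step optional, so the other archiving steps are unaffected on machines without Docker.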
[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
Miguel Sozinho Ramalho
0bdd06f641
Update README.md
2022-09-22 15:58:41 +02:00
Miguel Sozinho Ramalho
0bd9e043ed
Merge pull request #58 from bellingcat/dev
2022-09-21 18:53:13 +02:00
msramalho
c77b4a080a
update comment
2022-09-21 18:52:23 +02:00
Miguel Sozinho Ramalho
e813249520
Merge pull request #56 from djhmateer/oauth
2022-07-25 15:03:49 +01:00
msramalho
992dee022a
format
2022-07-25 14:59:04 +01:00
msramalho
961dcdb4ef
Merge branch 'dev' into oauth
2022-07-25 14:58:56 +01:00
msramalho
6124bc5f72
refactored and simplified obtaining credentials
2022-07-25 14:52:50 +01:00
Miguel Sozinho Ramalho
12918d4fce
Merge pull request #55 from djhmateer/dev-upstream
2022-07-25 12:38:33 +01:00
msramalho
63140d69c1
format
2022-07-25 12:35:27 +01:00
msramalho
e180b82b0d
removing useless constructors
2022-07-25 12:29:42 +01:00
msramalho
9317b5e035
turning HASH_ALGORITHM into global archiver prop
2022-07-25 12:27:50 +01:00
msramalho
2d7d8c4e08
renaming and making default SHA-256
2022-07-25 12:12:43 +01:00
msramalho
7b8be95e25
removing empty line
2022-07-25 12:12:14 +01:00
Dave Mateer
524b40b869
Added Google OAuth flow for Google Drive so can use a real user and not a service account to save files
2022-07-18 13:39:00 +01:00
Dave Mateer
9f9b9d8f63
adding in GD token
2022-07-18 13:25:05 +01:00
Dave Mateer
363a8ef67a
Added hash_algorithm to config to choose between SHA256 and SHA3_512
2022-07-18 13:15:48 +01:00
msramalho
6d8be4c07f
s3 allow online preview instead of forced download
2022-07-14 18:16:06 +02:00
Miguel Sozinho Ramalho
d701141c1b
Merge pull request #51 from djhmateer/whitelist
2022-07-14 17:13:29 +01:00
msramalho
37e1fcd540
comment
2022-07-14 18:10:53 +02:00
msramalho
90cb080c81
refactoring and renaming
2022-07-14 18:10:02 +02:00
Miguel Sozinho Ramalho
4a7aac59de
Merge pull request #50 from djhmateer/leadingslash
Put in fix for leading / in Google Drive
2022-07-14 16:45:49 +01:00
msramalho
03e542a0fc
isolate into function
2022-07-14 17:45:28 +02:00
Dave Mateer
42172566f2
Added whitelist and blacklist for worksheets (not spreadsheet)
2022-07-12 12:53:59 +01:00
Dave Mateer
16bd54b8d3
Put in fix for leading / in Google Drive
2022-07-12 12:44:29 +01:00
msramalho
3095ce3054
fix: missing key bug
2022-07-04 18:25:33 +02:00
Miguel Sozinho Ramalho
7aeb38a773
Merge pull request #45 from bellingcat/twitter-api-and-hack
Twitter: new archiver, new hack, ready
2022-06-27 13:37:52 +01:00
msramalho
4b423dfc34
fix telethon exception
2022-06-27 14:36:58 +02:00
msramalho
34536e7f14
added explanation for 2 twitter archivers
2022-06-27 11:17:23 +02:00
msramalho
179528562b
minor updates
2022-06-27 01:07:59 +02:00
msramalho
ffe1c425a0
new archiver, new hack, ready
2022-06-27 01:07:55 +02:00
Miguel Sozinho Ramalho
76b531c56a
Merge pull request #44 from bellingcat/vk-url-lib
2022-06-21 14:40:08 +01:00
msramalho
b4e9d6a2a8
removes log
2022-06-21 15:39:54 +02:00
msramalho
c4efa6e597
adding thumbnails
2022-06-21 15:39:13 +02:00
msramalho
8a8251d622
fix in upstream lib for filenames
2022-06-21 01:44:48 +02:00
msramalho
74d421dc94
update lib
2022-06-21 00:05:32 +02:00
msramalho
88ede91304
refactoring to use vk_url_scraper
2022-06-20 14:44:06 +02:00
msramalho
177e3a623e
improve log
2022-06-16 20:04:43 +02:00
msramalho
14dae0b938
remove unused import
2022-06-16 20:01:23 +02:00
Miguel Sozinho Ramalho
74cef2f21b
Merge pull request #43 from bellingcat/refactor-tmp-dir-logic
2022-06-16 19:00:19 +01:00
msramalho
7ab8d0e825
tmp folder randomly created in folder
2022-06-16 19:58:26 +02:00
msramalho
d2e29f85d2
selenium: quit and close
2022-06-16 18:45:47 +02:00
msramalho
3efb835222
fix: telethon bad regex for ?single
2022-06-16 18:06:17 +02:00
Miguel Sozinho Ramalho
3d9a2622c3
Update README.md
2022-06-16 16:23:53 +01:00
Miguel Sozinho Ramalho
2650dcc680
Merge pull request #42 from bellingcat/dev
2022-06-16 16:20:32 +01:00
Miguel Sozinho Ramalho
4c6f3ea688
Merge pull request #33 from bellingcat/refactor-configs
breaking changes: refactor configs + fixes
2022-06-16 16:19:28 +01:00
Miguel Sozinho Ramalho
b7f1ec5404
Merge pull request #40 from bellingcat/vk-archiver
2022-06-16 16:18:48 +01:00
msramalho
14add43923
fixing auto_auto_archive
2022-06-16 17:17:25 +02:00
msramalho
cdd66fb7da
returning empty string thumbs
2022-06-16 16:30:08 +02:00
msramalho
afc7e133cf
simplifying telethon
2022-06-16 16:26:30 +02:00