Commit Graph

21 Commits

Author SHA1 Message Date
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid passing the browsertrix config to every Archiver, the Archiver
constructors (including the base) were modified to accept a Storage and
a Config instance. Some of the constructors then pick out the pieces
they need from the Config, in addition to calling the parent
constructor. To avoid the circular import this created, the Config
object now defines the default hash function to use, rather than having
it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
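The constructor change described in c34fb9cf10 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the attribute names `browsertrix_profile` and `hash_algorithm`, the `ScreenshotArchiver` subclass, and the choice of SHA-256 as the default are all assumptions made for the example.

```python
import hashlib


class Storage:
    """Placeholder storage backend (hypothetical, for illustration only)."""


class Config:
    """Holds options shared by all archivers.

    The field names here are assumptions; the commit only states that the
    Config carries the browsertrix profile option and now defines the
    default hash function.
    """

    def __init__(self, browsertrix_profile=None, hash_algorithm=hashlib.sha256):
        self.browsertrix_profile = browsertrix_profile
        self.hash_algorithm = hash_algorithm


class Archiver:
    """Base archiver: takes a Storage and a Config instead of loose options."""

    def __init__(self, storage, config):
        self.storage = storage
        # The hash function comes from the Config rather than a static
        # property of Archiver, so Config never needs to import Archiver.
        self.hash_algorithm = config.hash_algorithm

    def get_hash(self, data: bytes) -> str:
        return self.hash_algorithm(data).hexdigest()


class ScreenshotArchiver(Archiver):
    """Hypothetical subclass that also needs the browsertrix profile."""

    def __init__(self, storage, config):
        super().__init__(storage, config)
        # Pick out only the piece of the Config this archiver needs.
        self.browsertrix_profile = config.browsertrix_profile
```

With this shape, adding a new option means touching the Config and the one archiver that uses it, rather than every constructor signature.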
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can be replayed using the [ReplayWeb.page]
web component, or unzipped to get the original WARC data (the ISO
standard format used by the Internet Archive Wayback Machine).
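
Because a WACZ is a ZIP file, the WARC data can be located with Python's standard `zipfile` module. The helper below is a small sketch; the function name `list_warc_members` is hypothetical, and it simply filters member names by extension rather than relying on any particular directory layout.

```python
import zipfile


def list_warc_members(wacz_path):
    """Return the names of WARC files stored inside a WACZ archive."""
    with zipfile.ZipFile(wacz_path) as wacz:
        return [
            name
            for name in wacz.namelist()
            if name.endswith((".warc", ".warc.gz"))
        ]
```

Each returned name can then be passed to `ZipFile.extract` or `ZipFile.open` to get at the WARC data itself.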

This PR adds browsertrix-crawler to the archiver classes where
screenshots are made. The WACZ is uploaded to storage and then added to
a new column in the spreadsheet. A further column can be added that
displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.)
using the client-side ReplayWeb.page component. You can see an example
of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
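
One way to get the "log and carry on" behaviour described above is to check for the `docker` executable before attempting a capture. The commit does not say how Docker is detected, so the use of `shutil.which`, the function name `docker_available`, and the log message are all assumptions made for this sketch.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available(which=shutil.which):
    """Return True if a docker executable is on PATH, else log and return False.

    The `which` parameter exists only so the lookup can be swapped out in
    tests; by default it is shutil.which.
    """
    if which("docker") is None:
        logger.error("Docker not found; skipping browsertrix-crawler capture")
        return False
    return True
```

A caller would skip the WACZ step when this returns `False` and continue with the rest of the archiving work.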

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
2ac08a34f6 ydl timestamp bug fix 2022-06-16 13:45:02 +02:00
msramalho
e2d1a5d6be import cleanups 2022-06-03 18:30:12 +02:00
msramalho
10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho
159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
Dave Mateer
dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
msramalho
bca960b228 merge from master and fixes 2022-05-10 23:09:33 +02:00
msramalho
f6e8da34b8 Merge remote-tracking branch 'origin/main' into refactor-configs 2022-05-10 22:37:09 +02:00
msramalho
6bd6f88b46 refactor 2022-05-09 17:45:54 +02:00
Miguel Sozinho Ramalho
3a3d3c6690 Merge pull request #29 from djhmateer/fb-cookie 2022-05-09 15:53:07 +01:00
msramalho
0d65798308 wip: configurations and logic 2022-05-09 14:54:48 +02:00
Dave Mateer
7ae6e0c6f8 fb cookie in ytd 2022-05-09 11:38:08 +01:00
Dave Mateer
bd235347ac Added catch in youtubedl_archiver for twitter.com to see if a linked video is in there eg vk.com 2022-05-09 11:23:42 +01:00
Dave Mateer
fec380e93d Fixed wwww (4 w's) to www in youtubedl 2022-04-27 10:18:10 +01:00
Logan Williams
aa4b175dea Fix issue with timestamps being converted to user format 2022-02-28 12:54:58 +01:00
Logan Williams
6ebce974f0 WIP: Make timezones more consistent in UTC 2022-02-28 08:42:59 +01:00
Logan Williams
1eb17e4de5 Add hash and screenshot methods; switch to more recent ytdl fork 2022-02-25 13:54:40 +01:00
msramalho
9a264a7dfe cleanup and docs 2022-02-23 16:07:58 +01:00
msramalho
e4603a9423 refactoring storage and bringing changes from origin 2022-02-22 16:03:35 +01:00
msramalho
f3ce226665 split into multiple files MVP 2022-02-21 14:19:09 +01:00