Commit Graph

104 Commits

Author SHA1 Message Date
msramalho
dc0ca8bdd6 adds browsertrix to all archivers flows 2022-10-17 14:06:50 +01:00
Ed Summers
20ca50dc90 Clean up browsertrix-crawler files
Remove any local browsertrix-crawler files after the WACZ has been
copied to storage. Note: until the fix for this issue is released on
DockerHub, the local files cannot be deleted, because Docker on Linux
creates them as root:

https://github.com/webrecorder/browsertrix-crawler/issues/170

The code will catch this exception and log a warning instead of failing
and losing the work that has been completed.
2022-10-11 16:49:19 -04:00
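
The catch-and-warn behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `remove_crawl_dir` and the logger setup are assumptions.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def remove_crawl_dir(path) -> bool:
    """Try to delete a local crawl directory (hypothetical helper).

    Returns True if the directory was removed. Any OSError, including the
    PermissionError raised for root-owned files, is logged as a warning
    instead of being raised, so completed work is not lost.
    """
    try:
        shutil.rmtree(path)
        return True
    except OSError as e:
        logger.warning("could not remove %s: %s", path, e)
        return False
```

Returning a boolean rather than raising lets the caller carry on with the already uploaded WACZ whether or not the cleanup succeeded.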
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. In
order to not require the passing of the browsertrix config to every
Archiver, the Archiver constructors (including the base) were modified to
accept a Storage and Config instance. Some of the constructors then pick
out the pieces they need from the Config, in addition to calling the
parent constructor. To avoid the circular import that this change
created, the Config object now defines the default hash function to use,
rather than having it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
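
The constructor arrangement described in the commit above might look something like the sketch below. Only the `Storage`, `Config` and `Archiver` names come from the commit message; the field names, the `ScreenshotArchiver` subclass and the `get_hash` method are assumptions made for illustration, and the real classes are not reproduced here.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Config:
    # The default hash function lives on the Config rather than on the
    # Archiver class, so Config never needs to import Archiver.
    hash_algorithm: Callable[..., Any] = hashlib.sha256
    browsertrix_profile: Optional[str] = None  # assumed field name


class Storage:
    """Placeholder standing in for a real storage backend."""


class Archiver:
    def __init__(self, storage: Storage, config: Config):
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm

    def get_hash(self, data: bytes) -> str:
        return self.hash_algorithm(data).hexdigest()


class ScreenshotArchiver(Archiver):  # hypothetical subclass
    def __init__(self, storage: Storage, config: Config):
        super().__init__(storage, config)
        # Pick out only the piece of the Config this archiver needs.
        self.browsertrix_profile = config.browsertrix_profile
```

A different algorithm could then be chosen in one place, for example `Config(hash_algorithm=hashlib.sha3_512)`.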
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) which can then be replayed using the
[ReplayWeb.page] web
component, or unzipped to get the original WARC data (the ISO standard
format used by the Internet Archive Wayback Machine).

This PR adds browsertrix-crawler to archiver classes where screenshots
are made. The WACZ is uploaded to storage and then added to a new column
in the spreadsheet. A column can be added that will display the WACZ,
loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side
ReplayWeb.page. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed an error message will be logged and things continue as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
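
Two points from the commit above lend themselves to a short sketch: checking for Docker before attempting a crawl, and treating a WACZ as an ordinary ZIP file to reach the WARC data inside. The helper names below are hypothetical and the code is not taken from the commit. It assumes Docker is detected by looking for a `docker` executable on the PATH.

```python
import logging
import shutil
import zipfile

logger = logging.getLogger(__name__)


def docker_available() -> bool:
    """Return True if a `docker` executable is on the PATH; log an error otherwise."""
    if shutil.which("docker") is None:
        logger.error("Docker is not installed; skipping browsertrix-crawler capture")
        return False
    return True


def warc_members(wacz_path) -> list:
    """List the WARC files stored inside a WACZ, read as a plain ZIP archive."""
    with zipfile.ZipFile(wacz_path) as zf:
        return [n for n in zf.namelist() if n.endswith((".warc", ".warc.gz"))]
```

The members returned by `warc_members` can be extracted with `zipfile.ZipFile.extract` to get at the original WARC data.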
msramalho
e180b82b0d removing useless constructors 2022-07-25 12:29:42 +01:00
msramalho
9317b5e035 turning HASH_ALGORITHM into global archiver prop 2022-07-25 12:27:50 +01:00
msramalho
2d7d8c4e08 renaming and making default SHA-256 2022-07-25 12:12:43 +01:00
Dave Mateer
363a8ef67a Added hash_algorithm to config to choose between SHA256 and SHA3_512 2022-07-18 13:15:48 +01:00
msramalho
3095ce3054 fix: missing key bug 2022-07-04 18:25:33 +02:00
msramalho
4b423dfc34 fix telethon exception 2022-06-27 14:36:58 +02:00
msramalho
34536e7f14 added explanation for 2 twitter archivers 2022-06-27 11:17:23 +02:00
msramalho
179528562b minor updates 2022-06-27 01:07:59 +02:00
msramalho
ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
msramalho
b4e9d6a2a8 removes log 2022-06-21 15:39:54 +02:00
msramalho
c4efa6e597 adding thumbnails 2022-06-21 15:39:13 +02:00
msramalho
8a8251d622 fix in upstream lib for filenames 2022-06-21 01:44:48 +02:00
msramalho
88ede91304 refactoring to use vk_url_scraper 2022-06-20 14:44:06 +02:00
msramalho
177e3a623e improve log 2022-06-16 20:04:43 +02:00
msramalho
3efb835222 fix: telethon bad regex for ?single 2022-06-16 18:06:17 +02:00
msramalho
cdd66fb7da returning empty string thumbs 2022-06-16 16:30:08 +02:00
msramalho
afc7e133cf simplifying telethon 2022-06-16 16:26:30 +02:00
msramalho
81eb00a767 handle deleted telegram 2022-06-16 16:19:57 +02:00
msramalho
81ce27bdb3 fix 2022-06-16 14:34:33 +02:00
msramalho
ec1993c5dc telethon fix 2022-06-16 14:33:50 +02:00
msramalho
b37f7adc8f another telethon fix 2022-06-16 14:29:51 +02:00
msramalho
277d81d687 telethon minor fix 2022-06-16 14:16:18 +02:00
msramalho
2ac08a34f6 ydl timestamp bug fix 2022-06-16 13:45:02 +02:00
msramalho
c6bcb59005 improvement for albums 2022-06-15 23:36:10 +02:00
msramalho
659097c072 better error log 2022-06-15 22:54:18 +02:00
msramalho
3b6678818e title for vk photo 2022-06-15 22:47:55 +02:00
msramalho
ed4b193ae7 walrus 2022-06-15 22:30:08 +02:00
msramalho
c08b5268f7 using API instead of scraping 2022-06-15 21:25:15 +02:00
msramalho
86e1d3545e fix for missing telethon config 2022-06-15 17:17:46 +02:00
msramalho
b1f70bb818 minor improvements 2022-06-15 17:14:08 +02:00
msramalho
5cc21fa4e0 bug fix 2022-06-15 17:04:56 +02:00
msramalho
2dbdf9b8d3 check if exists 2022-06-15 17:04:50 +02:00
msramalho
771c5376c4 simplify display 2022-06-15 16:47:20 +02:00
msramalho
951b16ba9c improving media page with images and videos 2022-06-15 16:38:30 +02:00
msramalho
59afe7fd63 vk-archiver implemented 2022-06-15 16:38:18 +02:00
msramalho
64c083b37b wayback should re-archive even if old version exists 2022-06-14 20:55:59 +02:00
msramalho
2be539d39e twitter archiver improvements 2022-06-14 20:55:43 +02:00
msramalho
12648bbce9 centralizing slugify url method 2022-06-14 20:15:14 +02:00
msramalho
6499161f5c fixing gd bug on twitter images 2022-06-14 19:55:05 +02:00
msramalho
d9b8c48af0 missing parameter bug fix 2022-06-14 19:19:14 +02:00
msramalho
c8a02cb93a new wayback error action 2022-06-08 18:25:58 +02:00
msramalho
bd5146ac3e bug fixes 2022-06-08 18:17:25 +02:00
msramalho
562d2f51ad bot token 2022-06-08 13:39:57 +02:00
msramalho
067e6d8954 retry mechanism 2022-06-08 13:39:52 +02:00
msramalho
a0be3c8a22 todo 2022-06-08 11:44:57 +02:00
msramalho
c622f941d7 tiktok bug fix 2022-06-08 11:44:49 +02:00