Ed Summers
3b87dffe6b
Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. It creates archives in the [WACZ] format, which
is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR,
etc.). A WACZ can be replayed using the [ReplayWeb.page] web component,
or unzipped to get the original WARC data (the ISO standard format used
by the Internet Archive Wayback Machine).
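Because a WACZ is a ZIP file, its WARC data can be located with standard ZIP tooling. The following is a minimal sketch using Python's `zipfile` module. The helper name `list_warc_members` is hypothetical, and the `archive/` prefix is an assumption based on the layout described in the WACZ spec. The in-memory archive is a stand-in so the example runs without a real crawl.

```python
import io
import zipfile


def list_warc_members(wacz_file):
    """Return the names of WARC files stored inside a WACZ (ZIP) archive."""
    with zipfile.ZipFile(wacz_file) as zf:
        return [
            name
            for name in zf.namelist()
            # Assumption: WARC data sits under archive/, as in the WACZ spec layout.
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz"))
        ]


# Build a tiny in-memory stand-in for a WACZ (not a valid crawl, just the layout).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("datapackage.json", "{}")
    zf.writestr("archive/data.warc.gz", b"")
    zf.writestr("pages/pages.jsonl", "")
buffer.seek(0)

print(list_warc_members(buffer))
```

The same function accepts a path to a real `.wacz` file on disk, since `zipfile.ZipFile` takes either a path or a file-like object.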
This PR adds browsertrix-crawler to the archiver classes where
screenshots are made. The WACZ is uploaded to storage and then added to
a new column in the spreadsheet. A column can also be added that
displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.)
using the client-side ReplayWeb.page. You can see an example of the
spreadsheet here:
https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0
browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
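The Docker fallback described above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `docker_available`, `capture_wacz`, and the `run_crawl` callable are hypothetical names, and the real browsertrix-crawler invocation is left to the caller.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available(which=shutil.which):
    """Return True if a `docker` executable is on PATH; log an error otherwise."""
    if which("docker") is None:
        logger.error("Docker not found; skipping browsertrix-crawler capture")
        return False
    return True


def capture_wacz(url, run_crawl, which=shutil.which):
    """Hypothetical helper: run the crawl only when Docker is present.

    `run_crawl` stands in for whatever actually launches browsertrix-crawler.
    Returning None lets the calling archiver carry on without a WACZ.
    """
    if not docker_available(which):
        return None
    return run_crawl(url)
```

Passing `which` as a parameter is only a convenience for testing the missing-Docker path without changing the machine's PATH.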
[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
e180b82b0d
removing useless constructors
2022-07-25 12:29:42 +01:00
msramalho
9317b5e035
turning HASH_ALGORITHM into global archiver prop
2022-07-25 12:27:50 +01:00
msramalho
2d7d8c4e08
renaming and making default SHA-256
2022-07-25 12:12:43 +01:00
Dave Mateer
363a8ef67a
Added hash_algorithm to config to choose between SHA256 and SHA3_512
2022-07-18 13:15:48 +01:00
msramalho
3095ce3054
fix: missing key bug
2022-07-04 18:25:33 +02:00
msramalho
4b423dfc34
fix telethon exception
2022-06-27 14:36:58 +02:00
msramalho
34536e7f14
added explanation for 2 twitter archivers
2022-06-27 11:17:23 +02:00
msramalho
179528562b
minor updates
2022-06-27 01:07:59 +02:00
msramalho
ffe1c425a0
new archiver, new hack, ready
2022-06-27 01:07:55 +02:00
msramalho
b4e9d6a2a8
removes log
2022-06-21 15:39:54 +02:00
msramalho
c4efa6e597
adding thumbnails
2022-06-21 15:39:13 +02:00
msramalho
8a8251d622
fix in upstream lib for filenames
2022-06-21 01:44:48 +02:00
msramalho
88ede91304
refactoring to use vk_url_scraper
2022-06-20 14:44:06 +02:00
msramalho
177e3a623e
improve log
2022-06-16 20:04:43 +02:00
msramalho
3efb835222
fix: telethon bad regex for ?single
2022-06-16 18:06:17 +02:00
msramalho
cdd66fb7da
returning empty string thumbs
2022-06-16 16:30:08 +02:00
msramalho
afc7e133cf
simplifying telethon
2022-06-16 16:26:30 +02:00
msramalho
81eb00a767
handle deleted telegram
2022-06-16 16:19:57 +02:00
msramalho
81ce27bdb3
fix
2022-06-16 14:34:33 +02:00
msramalho
ec1993c5dc
telethon fix
2022-06-16 14:33:50 +02:00
msramalho
b37f7adc8f
another telethon fix
2022-06-16 14:29:51 +02:00
msramalho
277d81d687
telethon minor fix
2022-06-16 14:16:18 +02:00
msramalho
2ac08a34f6
ydl timestamp bug fix
2022-06-16 13:45:02 +02:00
msramalho
c6bcb59005
improvement for albums
2022-06-15 23:36:10 +02:00
msramalho
659097c072
better error log
2022-06-15 22:54:18 +02:00
msramalho
3b6678818e
title for vk photo
2022-06-15 22:47:55 +02:00
msramalho
ed4b193ae7
walrus
2022-06-15 22:30:08 +02:00
msramalho
c08b5268f7
using API instead of scraping
2022-06-15 21:25:15 +02:00
msramalho
86e1d3545e
fix for missing telethon config
2022-06-15 17:17:46 +02:00
msramalho
b1f70bb818
minor improvements
2022-06-15 17:14:08 +02:00
msramalho
5cc21fa4e0
bug fix
2022-06-15 17:04:56 +02:00
msramalho
2dbdf9b8d3
check if exists
2022-06-15 17:04:50 +02:00
msramalho
771c5376c4
simplify display
2022-06-15 16:47:20 +02:00
msramalho
951b16ba9c
improving media page with images and videos
2022-06-15 16:38:30 +02:00
msramalho
59afe7fd63
vk-archiver implemented
2022-06-15 16:38:18 +02:00
msramalho
64c083b37b
wayback should re-archive even if old version exists
2022-06-14 20:55:59 +02:00
msramalho
2be539d39e
twitter archiver improvements
2022-06-14 20:55:43 +02:00
msramalho
12648bbce9
centralizing slugify url method
2022-06-14 20:15:14 +02:00
msramalho
6499161f5c
fixing gd bug on twitter images
2022-06-14 19:55:05 +02:00
msramalho
d9b8c48af0
missing parameter bug fix
2022-06-14 19:19:14 +02:00
msramalho
c8a02cb93a
new wayback error action
2022-06-08 18:25:58 +02:00
msramalho
bd5146ac3e
bug fixes
2022-06-08 18:17:25 +02:00
msramalho
562d2f51ad
bot token
2022-06-08 13:39:57 +02:00
msramalho
067e6d8954
retry mechanism
2022-06-08 13:39:52 +02:00
msramalho
a0be3c8a22
todo
2022-06-08 11:44:57 +02:00
msramalho
c622f941d7
tiktok bug fix
2022-06-08 11:44:49 +02:00
msramalho
13e7d0bf1b
improving path operations
2022-06-08 11:11:09 +02:00
msramalho
f87acb6d1d
refactor
2022-06-07 18:41:58 +02:00
msramalho
e2d1a5d6be
import cleanups
2022-06-03 18:30:12 +02:00