auto-archiver

mirror of https://github.com/bellingcat/auto-archiver.git synced 2026-06-08 03:18:28 +03:00

Author	SHA1	Message	Date
Ed Summers	c34fb9cf10	Add browsertrix profile config option. This commit adds a browsertrix profile option to the configuration. To avoid passing the browsertrix config to every Archiver, the Archiver constructors (including the base) were modified to accept a Storage and a Config instance. Some of the constructors then pick out the pieces they need from the Config, in addition to calling the parent constructor. To avoid the circular import this created, the Config object now defines the default hash function to use, rather than having it be a static property of the Archiver class.	2022-10-11 16:21:42 -04:00
Ed Summers	3b87dffe6b	Add browsertrix-crawler capture. The [browsertrix-crawler] utility is a browser-based crawler that can crawl one or more pages. browsertrix-crawler creates archives in the [WACZ] format, which is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.) that can then be replayed using the [ReplayWeb.page] web component, or unzipped to get the original WARC data (the ISO standard format used by the Internet Archive Wayback Machine). This PR adds browsertrix-crawler to archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page. You can see an example of the spreadsheet here: https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0 browsertrix-crawler requires Docker to be installed. If Docker is not installed, an error message is logged and archiving continues as normal. [browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler [WACZ]: https://specs.webrecorder.net/wacz/latest/ [ReplayWeb.page]: https://replayweb.page	2022-09-25 19:46:29 +00:00
msramalho	8a8251d622	fix in upstream lib for filenames	2022-06-21 01:44:48 +02:00
msramalho	88ede91304	refactoring to use vk_url_scraper	2022-06-20 14:44:06 +02:00
msramalho	562d2f51ad	bot token	2022-06-08 13:39:57 +02:00
msramalho	f87acb6d1d	refactor	2022-06-07 18:41:58 +02:00
msramalho	5135e97d3f	cleanup auto_archive and config	2022-06-03 18:03:49 +02:00
msramalho	10f03cb888	Merge branch 'dev' into refactor-configs	2022-06-02 17:30:47 +02:00
msramalho	159adf9afe	refactoring filenumber into subfolder	2022-05-26 19:18:29 +02:00
Dave Mateer	dbac5accbd	Save to folders for S3 and GD. Google Drive (GD) storage	2022-05-11 15:39:44 +01:00
msramalho	d469967c03	fix index out of range for empty sheets	2022-05-10 22:24:21 +02:00
msramalho	e0276dfab1	additional cleanup	2022-05-09 18:19:38 +02:00
msramalho	3b9b42b854	minor code cleanup	2022-03-15 11:32:39 +01:00
msramalho	07bbf443ca	improves documentation	2022-03-13 12:05:09 +01:00
msramalho	4c54926548	offset fix	2022-03-12 20:29:43 +01:00
msramalho	d8d9cf17dc	fix offset	2022-03-12 20:25:52 +01:00
msramalho	f121c9dab7	enable tolower	2022-03-12 20:14:16 +01:00
msramalho	67b16064bb	offby1	2022-03-12 20:11:38 +01:00
msramalho	ec4ae84487	case-insensitive is a bad idea	2022-03-12 20:06:31 +01:00
msramalho	69483d432c	adds logs	2022-03-12 20:04:08 +01:00
msramalho	6e5e7212c2	fixes header offset	2022-03-12 19:56:00 +01:00
msramalho	6c5d6f521e	implements fresh status retrieval if needed	2022-03-10 19:00:02 +01:00
msramalho	ff874fe0d3	simplifies access to google sheets, single get_values	2022-03-09 12:17:51 +01:00
Logan Williams	63a2847ac9	Add header argument; set up webdriver	2022-02-25 16:09:35 +01:00
msramalho	3cafc444fc	creates tmp folder if not exists	2022-02-23 16:32:38 +01:00
msramalho	1d62009c4f	creates utils module and moves gworksheet there	2022-02-23 16:24:59 +01:00

26 Commits
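
Commit 3b87dffe6b notes that a WACZ is essentially a ZIP file that can be unzipped to get the original WARC data. The sketch below illustrates that point using only Python's standard zipfile module. It is not code from auto-archiver: the helper names list_warc_members and build_demo_wacz are hypothetical, and the layout (WARC files under archive/, a datapackage.json at the root) is assumed from the WACZ specification. The demo archive holds placeholder bytes, not real WARC records.

```python
import io
import zipfile


def list_warc_members(wacz_file):
    """Return the sorted names of WARC files stored inside a WACZ archive.

    Assumes the WACZ layout in which WARC data lives under "archive/".
    `wacz_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_file) as wacz:
        return sorted(
            name
            for name in wacz.namelist()
            if name.startswith("archive/")
            and name.endswith((".warc", ".warc.gz"))
        )


def build_demo_wacz():
    """Build a tiny in-memory ZIP that mimics a WACZ layout (placeholder data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wacz:
        wacz.writestr("datapackage.json", "{}")
        wacz.writestr("pages/pages.jsonl", "")
        wacz.writestr("archive/data.warc.gz", b"placeholder")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_warc_members(build_demo_wacz()))
```

For a real capture, the same function could be pointed at the path of a WACZ file produced by browsertrix-crawler, and zipfile.ZipFile.extract could then pull out the listed members.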