Commit Graph

78 Commits

Author SHA1 Message Date
msramalho
93be1af93f adds instagram post/profile 2022-10-18 15:45:10 +01:00
msramalho
f0f844a569 improves browsertrix configurations 2022-10-18 11:21:10 +01:00
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. In
order to not require the passing of the browsertrix config to every
Archiver, the Archiver constructors (including the base) were modified to
accept a Storage and Config instance. Some of the constructors then pick
out the pieces they need from the Config, in addition to calling the
parent constructor. In order to avoid a circular import that this
created, the Config object now defines the default hash function to use,
rather than having it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
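
The constructor change described in the commit above can be pictured with a minimal sketch. This is not the project's actual code: the class bodies, the `browsertrix_profile` and `hash_algorithm` field names, and the `ScreenshotArchiver` subclass are all assumptions made for illustration. It only shows the pattern of passing a Storage and a Config to every archiver, and of keeping the default hash function on the Config so the config module never has to import the archiver module.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


class Storage:
    """Placeholder for a storage backend (hypothetical, for illustration)."""


@dataclass
class Config:
    # The default hash function lives on the config object, so nothing here
    # needs to import the archiver module (which would be circular).
    hash_algorithm: Callable = hashlib.sha256  # assumed field name
    browsertrix_profile: Optional[str] = None  # assumed field name


class Archiver:
    """Base archiver: receives a Storage and a Config instance."""

    def __init__(self, storage: Storage, config: Config):
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm

    def get_hash(self, data: bytes) -> str:
        # Hash raw bytes with whatever function the config selected.
        return self.hash_algorithm(data).hexdigest()


class ScreenshotArchiver(Archiver):
    """Hypothetical subclass that picks the extra piece it needs."""

    def __init__(self, storage: Storage, config: Config):
        super().__init__(storage, config)
        self.browsertrix_profile = config.browsertrix_profile
```

With this shape, adding a new option means touching only the Config and the archivers that read it, rather than every constructor signature.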
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.) that can then be replayed using the [ReplayWeb.page] web
component, or unzipped to get the original WARC data (the ISO standard
format used by the Internet Archive Wayback Machine).
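
Because a WACZ is a ZIP file, the standard library is enough to look inside one. The sketch below is only an illustration and is not part of this commit: the function name `list_warc_members` is made up, and it assumes the WARC data sits under an `archive/` directory inside the package, as laid out in the WACZ specification.

```python
import zipfile


def list_warc_members(wacz_file):
    """Return the names of WARC files stored inside a WACZ package.

    `wacz_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_file) as wacz:
        return [
            name
            for name in wacz.namelist()
            # Assumption: WARC data is kept under archive/ in the package.
            if name.startswith("archive/")
            and name.endswith((".warc", ".warc.gz"))
        ]
```

The returned names can then be passed to `ZipFile.extract` to recover the original WARC data.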

This PR adds browsertrix-crawler to archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
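
The Docker fallback mentioned in the commit above could look roughly like the following. This is a sketch under stated assumptions, not the code from the commit: `docker_available`, `capture_wacz`, and the `run_crawler` callable are hypothetical names, and the actual invocation of browsertrix-crawler is deliberately left to the caller.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available(which=shutil.which) -> bool:
    """Return True if a `docker` executable is found on PATH."""
    if which("docker") is None:
        logger.error("Docker is not installed; skipping browsertrix-crawler capture")
        return False
    return True


def capture_wacz(url, run_crawler, which=shutil.which):
    """Run the crawler for `url`, or return None when Docker is missing.

    `run_crawler` is a hypothetical callable that starts the crawl and
    returns the path of the resulting WACZ file.
    """
    if not docker_available(which):
        return None  # the caller carries on with the rest of the archiving
    return run_crawler(url)
```

Taking `which` as a parameter keeps the check easy to exercise on machines with or without Docker.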
msramalho
90cb080c81 refactoring and renaming 2022-07-14 18:10:02 +02:00
Dave Mateer
42172566f2 Added whitelist and blacklist for worksheets (not spreadsheet) 2022-07-12 12:53:59 +01:00
msramalho
ffe1c425a0 new archiver, new hack, ready 2022-06-27 01:07:55 +02:00
msramalho
14dae0b938 remove unused import 2022-06-16 20:01:23 +02:00
msramalho
7ab8d0e825 tmp folder randomly created in folder 2022-06-16 19:58:26 +02:00
msramalho
59afe7fd63 vk-archiver implemented 2022-06-15 16:38:18 +02:00
msramalho
6872d8e103 check if exists to configuration, save_logs to command line 2022-06-14 21:37:02 +02:00
msramalho
eca10023b0 detecting errors at a higher level to avoid false "in progress" messages 2022-06-14 19:28:34 +02:00
msramalho
067e6d8954 retry mechanism 2022-06-08 13:39:52 +02:00
msramalho
f87acb6d1d refactor 2022-06-07 18:41:58 +02:00
msramalho
5135e97d3f cleanup auto_archive and config 2022-06-03 18:03:49 +02:00
msramalho
aaa1d299da started cleaning auto_archive 2022-06-03 17:32:55 +02:00
msramalho
10f03cb888 Merge branch 'dev' into refactor-configs 2022-06-02 17:30:47 +02:00
msramalho
159adf9afe refactoring filenumber into subfolder 2022-05-26 19:18:29 +02:00
msramalho
ea261635a2 cleanup 2022-05-25 10:32:26 +02:00
Dave Mateer
dbac5accbd Save to folders for S3 and GD. Google Drive (GD) storage 2022-05-11 15:39:44 +01:00
Dave Mateer
b3599dee71 working 2022-05-11 14:01:22 +01:00
msramalho
d7f44b948f wayback fix 2022-05-10 23:15:58 +02:00
msramalho
bca960b228 merge from master and fixes 2022-05-10 23:09:33 +02:00
msramalho
f6e8da34b8 Merge remote-tracking branch 'origin/main' into refactor-configs 2022-05-10 22:37:09 +02:00
msramalho
39f27ec1bc reenable telethon 2022-05-10 20:23:13 +02:00
msramalho
e0276dfab1 additional cleanup 2022-05-09 18:19:38 +02:00
Miguel Sozinho Ramalho
bba510b8c2 Merge pull request #30 from djhmateer/logging 2022-05-09 15:59:49 +01:00
Miguel Sozinho Ramalho
6e8eccefd8 self-documenting info message 2022-05-09 15:59:35 +01:00
Miguel Sozinho Ramalho
05a3adfc36 Merge pull request #31 from djhmateer/firefox 2022-05-09 15:59:02 +01:00
msramalho
0d65798308 wip: configurations and logic 2022-05-09 14:54:48 +02:00
Dave Mateer
bb599f702d Reload firefox driver on every spreadsheet row 2022-05-09 12:16:18 +01:00
Dave Mateer
f52d8cdef8 add back in d_dotenv() 2022-05-09 12:02:43 +01:00
Dave Mateer
e3c0ae1d45 dotenv 2022-05-09 11:57:54 +01:00
Dave Mateer
e18a9779db added log directory and file creation 2022-05-09 11:55:10 +01:00
Dave Mateer
51f635ce50 get env variable FACEBOOK_COOKIE patch through from auto_archive 2022-05-09 11:44:25 +01:00
msramalho
f592c7fcfe refactor to use config.py 2022-05-03 20:34:04 +02:00
msramalho
3bdeec1d2f fix deprecation warning for selenium 2022-03-30 11:05:31 +02:00
Logan Williams
398f296789 Fix Selenium driver issues with telegram links 2022-03-18 11:10:27 +01:00
Logan Williams
538bb05395 Merge branch 'main' of github.com:bellingcat/auto-archiver into main 2022-03-18 09:53:29 +01:00
Logan Williams
050b04e31d Add flag for storage privacy 2022-03-18 09:53:21 +01:00
msramalho
0035603bfb telethon-poc 2022-03-15 18:45:53 +01:00
Logan Williams
0304860bce Don't check status for empty URL rows 2022-03-14 11:10:51 +01:00
msramalho
f121c9dab7 enable tolower 2022-03-12 20:14:16 +01:00
msramalho
69483d432c adds logs 2022-03-12 20:04:08 +01:00
msramalho
486c3295b5 log 2022-03-12 19:54:10 +01:00
msramalho
6c5d6f521e implements fresh status retrieval if needed 2022-03-10 19:00:02 +01:00
msramalho
52333874c9 making column names configurable through the command line 2022-03-09 12:38:04 +01:00
msramalho
ff874fe0d3 simplifies access to google sheets, single get_values 2022-03-09 12:17:51 +01:00
msramalho
544e7578a6 removes duplicate code 2022-03-09 11:46:14 +01:00
Logan Williams
aa4b175dea Fix issue with timestamps being converted to user format 2022-02-28 12:54:58 +01:00