Ed Summers
3b87dffe6b
Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. It creates archives in the [WACZ] format, which
is essentially a standardized ZIP file (similar to DOCX, EPUB, JAR, etc.).
A WACZ can be replayed using the [ReplayWeb.page] web component, or
unzipped to get the original WARC data (the ISO standard format used by
the Internet Archive Wayback Machine).
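Because a WACZ is an ordinary ZIP file, the WARC data can be located with standard tooling. The following is a minimal sketch, not code from this PR: the function name `list_warc_members` is hypothetical, and it assumes the WARC files sit under the `archive/` directory, as the WACZ specification describes.

```python
import zipfile


def list_warc_members(wacz_file):
    """Return the names of WARC files stored inside a WACZ (ZIP) archive.

    `wacz_file` may be a filesystem path or a binary file-like object.
    Assumption: WARC data lives under the `archive/` directory of the ZIP.
    """
    with zipfile.ZipFile(wacz_file) as zf:
        return [
            name
            for name in zf.namelist()
            if name.startswith("archive/")
            and name.endswith((".warc", ".warc.gz"))
        ]
```

Each returned name can be passed to `ZipFile.extract` to get the WARC file itself.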
This PR adds browsertrix-crawler to the archiver classes where screenshots are made. The WACZ is uploaded to storage and then added to a new column in the spreadsheet. A column can be added that will display the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.) using the client-side ReplayWeb.page component. You can see an example of the spreadsheet here:
https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0
browsertrix-crawler requires Docker to be installed. If Docker is not
installed, an error message is logged and archiving continues as normal.
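The fallback behaviour could look roughly like the sketch below. It is an illustration only: the names `docker_available` and `maybe_capture_wacz`, the `capture` callback, and the use of `shutil.which` to detect Docker are all assumptions, not the code this PR adds.

```python
import logging
import shutil

logger = logging.getLogger(__name__)


def docker_available(which=shutil.which):
    """Return True if a `docker` executable is found on PATH."""
    return which("docker") is not None


def maybe_capture_wacz(url, capture, which=shutil.which):
    """Run `capture(url)` only when Docker is present.

    `capture` is a hypothetical callable that runs the crawler and returns
    the WACZ location. Without Docker, an error is logged and None is
    returned so the caller can carry on with the rest of the archiving.
    """
    if not docker_available(which):
        logger.error("Docker is not installed; skipping WACZ capture for %s", url)
        return None
    return capture(url)
```

Passing `which` as a parameter keeps the check easy to exercise without Docker actually being installed.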
[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
msramalho
90cb080c81
refactoring and renaming
2022-07-14 18:10:02 +02:00
Dave Mateer
42172566f2
Added whitelist and blacklist for worksheets (not spreadsheet)
2022-07-12 12:53:59 +01:00
msramalho
ffe1c425a0
new archiver, new hack, ready
2022-06-27 01:07:55 +02:00
msramalho
14dae0b938
remove unused import
2022-06-16 20:01:23 +02:00
msramalho
7ab8d0e825
tmp folder randomly created in folder
2022-06-16 19:58:26 +02:00
msramalho
59afe7fd63
vk-archiver implemented
2022-06-15 16:38:18 +02:00
msramalho
6872d8e103
check if exists to configuration, save_logs to command line
2022-06-14 21:37:02 +02:00
msramalho
eca10023b0
detecting errors at a higher level to avoid false "in progress" messages
2022-06-14 19:28:34 +02:00
msramalho
067e6d8954
retry mechanism
2022-06-08 13:39:52 +02:00
msramalho
f87acb6d1d
refactor
2022-06-07 18:41:58 +02:00
msramalho
5135e97d3f
cleanup auto_archive and config
2022-06-03 18:03:49 +02:00
msramalho
aaa1d299da
started cleaning auto_archive
2022-06-03 17:32:55 +02:00
msramalho
10f03cb888
Merge branch 'dev' into refactor-configs
2022-06-02 17:30:47 +02:00
msramalho
159adf9afe
refactoring filenumber into subfolder
2022-05-26 19:18:29 +02:00
msramalho
ea261635a2
cleanup
2022-05-25 10:32:26 +02:00
Dave Mateer
dbac5accbd
Save to folders for S3 and GD. Google Drive (GD) storage
2022-05-11 15:39:44 +01:00
Dave Mateer
b3599dee71
working
2022-05-11 14:01:22 +01:00
msramalho
d7f44b948f
wayback fix
2022-05-10 23:15:58 +02:00
msramalho
bca960b228
merge from master and fixes
2022-05-10 23:09:33 +02:00
msramalho
f6e8da34b8
Merge remote-tracking branch 'origin/main' into refactor-configs
2022-05-10 22:37:09 +02:00
msramalho
39f27ec1bc
reenable telethon
2022-05-10 20:23:13 +02:00
msramalho
e0276dfab1
additional cleanup
2022-05-09 18:19:38 +02:00
Miguel Sozinho Ramalho
bba510b8c2
Merge pull request #30 from djhmateer/logging
2022-05-09 15:59:49 +01:00
Miguel Sozinho Ramalho
6e8eccefd8
self-documenting info message
2022-05-09 15:59:35 +01:00
Miguel Sozinho Ramalho
05a3adfc36
Merge pull request #31 from djhmateer/firefox
2022-05-09 15:59:02 +01:00
msramalho
0d65798308
wip: configurations and logic
2022-05-09 14:54:48 +02:00
Dave Mateer
bb599f702d
Reload firefox driver on every spreadsheet row
2022-05-09 12:16:18 +01:00
Dave Mateer
f52d8cdef8
add back in d_dotenv()
2022-05-09 12:02:43 +01:00
Dave Mateer
e3c0ae1d45
dotenv
2022-05-09 11:57:54 +01:00
Dave Mateer
e18a9779db
added log directory and file creation
2022-05-09 11:55:10 +01:00
Dave Mateer
51f635ce50
get env variable FACEBOOK_COOKIE patch through from auto_archive
2022-05-09 11:44:25 +01:00
msramalho
f592c7fcfe
refactor to use config.py
2022-05-03 20:34:04 +02:00
msramalho
3bdeec1d2f
fix deprecation warning for selenium
2022-03-30 11:05:31 +02:00
Logan Williams
398f296789
Fix Selenium driver issues with telegram links
2022-03-18 11:10:27 +01:00
Logan Williams
538bb05395
Merge branch 'main' of github.com:bellingcat/auto-archiver into main
2022-03-18 09:53:29 +01:00
Logan Williams
050b04e31d
Add flag for storage privacy
2022-03-18 09:53:21 +01:00
msramalho
0035603bfb
telethon-poc
2022-03-15 18:45:53 +01:00
Logan Williams
0304860bce
Don't check status for empty URL rows
2022-03-14 11:10:51 +01:00
msramalho
f121c9dab7
enable tolower
2022-03-12 20:14:16 +01:00
msramalho
69483d432c
adds logs
2022-03-12 20:04:08 +01:00
msramalho
486c3295b5
log
2022-03-12 19:54:10 +01:00
msramalho
6c5d6f521e
implements fresh status retrieval if needed
2022-03-10 19:00:02 +01:00
msramalho
52333874c9
making column names configurable through the command line
2022-03-09 12:38:04 +01:00
msramalho
ff874fe0d3
simplifies access to google sheets, single get_values
2022-03-09 12:17:51 +01:00
msramalho
544e7578a6
removes duplicate code
2022-03-09 11:46:14 +01:00
Logan Williams
aa4b175dea
Fix issue with timestamps being converted to user format
2022-02-28 12:54:58 +01:00
Logan Williams
c6b159905b
Switch to headless Firefox
2022-02-28 11:45:32 +01:00
Logan Williams
6ebce974f0
WIP: Make timezones more consistent in UTC
2022-02-28 08:42:59 +01:00
Logan Williams
63a2847ac9
Add header argument; set up webdriver
2022-02-25 16:09:35 +01:00