311 Commits

Author SHA1 Message Date
msramalho
0cb593fd21 wayback enricher ready 2023-01-11 00:03:47 +00:00
msramalho
d4825196f1 html template working with jinja templates 2023-01-10 00:22:16 +00:00
msramalho
aac16fa8c2 minor comments 2023-01-09 22:24:44 +00:00
msramalho
1cdc006b27 s3 storaging + WIP gsheets DB 2023-01-04 18:02:44 +00:00
msramalho
bb512b36c9 gsheet feeder + db WIP 2023-01-04 16:37:36 +00:00
msramalho
96845305a3 media concept implemented 2022-12-14 19:01:20 +00:00
msramalho
9c056d001c merge logic started 2022-12-14 16:11:06 +00:00
msramalho
53ffa2d4ae telethon_archiver working for multiple media 2022-12-14 15:37:34 +00:00
msramalho
b3860cfec1 telethon join channels working 2022-12-14 14:01:39 +00:00
msramalho
955891a411 WIP feeder 2022-12-10 12:03:46 +00:00
msramalho
9dc709d3b9 demo feeder logic working 2022-11-24 15:44:25 +00:00
msramalho
618e7ed0a3 subproperties in config 2022-11-24 11:53:21 +00:00
msramalho
65dd155c90 WIP refactor logic 2022-11-15 15:00:52 +00:00
msramalho
6a0ce5ced1 orchestrator design structure 2022-11-11 02:08:48 +00:00
msramalho
04263094ad WIP docker changes for cli and auto_archiver 2022-11-10 17:46:40 +00:00
msramalho
390b84eb22 dockerization complete 2022-11-08 15:55:33 +00:00
msramalho
81eadd4672 disable browsertrix on docker, see #66 2022-11-08 14:22:13 +00:00
msramalho
a8f7055696 reduces uncontrolled exceptions 2022-11-08 13:59:59 +00:00
msramalho
09f47383a3 dockerfile improvements 2022-11-08 13:59:35 +00:00
msramalho
629cd586db adds session_file for missing archivers 2022-11-08 13:59:09 +00:00
msramalho
889eb1d270 Merge branch 'dev' into dockerize 2022-11-02 17:01:00 +00:00
msramalho
50e03ba565 closes #65 with simpler solution 2022-11-02 16:59:44 +00:00
msramalho
a9df992f66 WiP 2022-11-02 16:51:32 +00:00
msramalho
c8fa077df7 docker initial files 2022-10-31 17:10:55 +00:00
msramalho
29e1872e87 fix: rm stopped containers only 2022-10-31 10:41:27 +00:00
msramalho
7a700acd8e hotfix for #65 2022-10-31 10:35:01 +00:00
msramalho
22363cb8b9 adds information on browsertrix usage 2022-10-20 11:59:23 +01:00
msramalho
ac4f1b6132 readme updates 2022-10-19 11:37:04 +01:00
msramalho
4d2b7b4040 reverse order of login attempts 2022-10-19 11:27:17 +01:00
msramalho
54c572258c fix tty 2022-10-18 17:46:40 +01:00
msramalho
6c80a5b82d session file logic 2022-10-18 17:35:59 +01:00
msramalho
63f53358d3 adds traceback 2022-10-18 16:38:12 +01:00
msramalho
3f121d800e catch bad instagram login 2022-10-18 16:36:27 +01:00
msramalho
93be1af93f adds instagram post/profile 2022-10-18 15:45:10 +01:00
msramalho
f0f844a569 improves browsertrix configurations 2022-10-18 11:21:10 +01:00
msramalho
df502f3bde updates yt-dlp 2022-10-18 11:20:53 +01:00
msramalho
26903190fd adds wacz link 2022-10-17 14:41:34 +01:00
Miguel Sozinho Ramalho
683f2d7500 Merge pull request #64 from bellingcat/dev 2022-10-17 14:40:15 +01:00
Miguel Sozinho Ramalho
23a4dc20c5 Merge pull request #63 from edsu/browsertrix-crawler 2022-10-17 14:39:34 +01:00
msramalho
57464f1506 refactors for edges in browsertrix and s3 upload, adds timeout parameter 2022-10-17 14:07:31 +01:00
msramalho
dc0ca8bdd6 adds browsertrix to all archivers flows 2022-10-17 14:06:50 +01:00
Ed Summers
20ca50dc90 Clean up browsertrix-crawler files
Remove any local browsertrix-crawler files after the WACZ has been
copied to storage. Note: until the fix for this issue is released on
DockerHub, the local files cannot be deleted, because Docker on Linux
creates them as root:

https://github.com/webrecorder/browsertrix-crawler/issues/170

The code will catch this exception and log a warning instead of failing
and losing the work that has been completed.
2022-10-11 16:49:19 -04:00
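The cleanup behaviour described in commit 20ca50dc90 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `remove_crawl_dir` is hypothetical, and it assumes the exception raised for root-owned files is `PermissionError`, which the commit message does not state.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger(__name__)


def remove_crawl_dir(crawl_dir: Path) -> bool:
    """Try to delete a local crawl directory (hypothetical helper).

    Returns True if the directory is gone afterwards, False if deletion
    was refused. A refusal is logged as a warning instead of being
    raised, so work that has already been uploaded is not lost.
    """
    try:
        shutil.rmtree(crawl_dir)
    except FileNotFoundError:
        # Nothing to clean up.
        return True
    except PermissionError as e:
        # Assumption: root-owned files created by Docker on Linux
        # surface as PermissionError when deleted by a normal user.
        logger.warning("could not remove %s: %s", crawl_dir, e)
        return False
    return True
```

Returning a boolean rather than raising lets the caller carry on with the next URL whether or not the local files could be removed.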
Ed Summers
c34fb9cf10 Add browsertrix profile config option
This commit adds a browsertrix profile option to the configuration. To
avoid having to pass the browsertrix config to every Archiver, the
Archiver constructors (including the base) were modified to accept a
Storage and Config instance. Some of the constructors then pick out the
pieces they need from the Config, in addition to calling the parent
constructor. To avoid the circular import that this created, the Config
object now defines the default hash function to use, rather than having
it be a static property of the Archiver class.
2022-10-11 16:21:42 -04:00
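The constructor pattern described in commit c34fb9cf10 might look something like the sketch below. All names here (`hash_algorithm`, `browsertrix_profile`, `example_api_key`, `ExampleArchiver`) are illustrative assumptions and are not taken from the repository. The sketch only shows the idea of handing each archiver a Storage and a Config, and of keeping the default hash function on the Config.

```python
import hashlib
from typing import Callable, Optional


class Storage:
    """Stand-in for a storage backend (hypothetical)."""


class Config:
    """Holds settings shared by all archivers (hypothetical fields)."""

    def __init__(
        self,
        hash_algorithm: Callable = hashlib.sha256,
        browsertrix_profile: Optional[str] = None,
        example_api_key: Optional[str] = None,
    ):
        # The default hash function lives on the Config, so this module
        # never needs to import the Archiver class.
        self.hash_algorithm = hash_algorithm
        self.browsertrix_profile = browsertrix_profile
        self.example_api_key = example_api_key


class Archiver:
    """Base class: takes a Storage and a Config, keeps what it needs."""

    def __init__(self, storage: Storage, config: Config):
        self.storage = storage
        self.hash_algorithm = config.hash_algorithm
        self.browsertrix_profile = config.browsertrix_profile

    def get_hash(self, data: bytes) -> str:
        return self.hash_algorithm(data).hexdigest()


class ExampleArchiver(Archiver):
    """Subclass: calls the parent, then picks out its own settings."""

    def __init__(self, storage: Storage, config: Config):
        super().__init__(storage, config)
        self.api_key = config.example_api_key
```

With this shape, adding a new option means touching only the Config and the archivers that use it, not every constructor call.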
Miguel Sozinho Ramalho
82fcf74450 Merge pull request #62 from bellingcat/main 2022-10-06 08:24:51 +01:00
Ed Summers
3b87dffe6b Add browsertrix-crawler capture
The [browsertrix-crawler] utility is a browser-based crawler that can
crawl one or more pages. browsertrix-crawler creates archives in the
[WACZ] format, which is essentially a standardized ZIP file (similar to
DOCX, EPUB, JAR, etc.) that can be replayed using the [ReplayWeb.page]
web component, or unzipped to get the original WARC data (the ISO
standard format used by the Internet Archive Wayback Machine).

This PR adds browsertrix-crawler to the archiver classes where
screenshots are made. The WACZ is uploaded to storage and then added to
a new column in the spreadsheet. A further column can be added that
displays the WACZ, loaded from cloud storage (S3, DigitalOcean, etc.)
using the client-side ReplayWeb.page component. You can see an example
of the spreadsheet here:

https://docs.google.com/spreadsheets/d/1Tk-iJWzT9Sx2-YccuPttL9HcMdZEnhv_OR7Bc6tfeu8/edit#gid=0

browsertrix-crawler requires Docker to be installed. If Docker is not
installed an error message will be logged and things continue as normal.

[browsertrix-crawler]: https://github.com/webrecorder/browsertrix-crawler
[WACZ]: https://specs.webrecorder.net/wacz/latest/
[ReplayWeb.page]: https://replayweb.page
2022-09-25 19:46:29 +00:00
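Because a WACZ file is a ZIP archive, as commit 3b87dffe6b notes, the WARC data inside it can be found with the standard library alone. The sketch below is only an illustration: the helper name `list_warc_members` is hypothetical, and it assumes the WARC files sit under an `archive/` directory, as laid out in the WACZ specification.

```python
import zipfile
from typing import List


def list_warc_members(wacz_file) -> List[str]:
    """Return the names of WARC files stored inside a WACZ (hypothetical helper).

    `wacz_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_file) as zf:
        return [
            name
            for name in zf.namelist()
            # Assumption: WARC data is stored under archive/ in the WACZ.
            if name.startswith("archive/")
            and name.endswith((".warc", ".warc.gz"))
        ]
```

To extract one of the listed members, open the same file with `zipfile.ZipFile` and pass the member name to its `extract` method.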
Miguel Sozinho Ramalho
0bdd06f641 Update README.md 2022-09-22 15:58:41 +02:00
Miguel Sozinho Ramalho
0bd9e043ed Merge pull request #58 from bellingcat/dev 2022-09-21 18:53:13 +02:00
msramalho
c77b4a080a update comment 2022-09-21 18:52:23 +02:00
Miguel Sozinho Ramalho
e813249520 Merge pull request #56 from djhmateer/oauth 2022-07-25 15:03:49 +01:00
msramalho
992dee022a format 2022-07-25 14:59:04 +01:00