Commit Graph

128 Commits

Author SHA1 Message Date
Patrick Robertson
49b6c32058 Fix the 'full' mode which creates a complete config file 2025-02-20 11:34:05 +00:00
Patrick Robertson
4b51ec9ad5 Remove dangling import 2025-02-20 11:20:16 +00:00
Patrick Robertson
7734a551fa Move 'assert_valid_url' out into utils, don't use assert but raise
assert is recommended only for debugging
2025-02-20 11:19:29 +00:00
Patrick Robertson
77b2b099c6 Replace exit() with raise exceptions. Better for code implementations
exit() is reserved solely for command line-called areas now
also assert is only recommended for debugging
2025-02-20 11:19:13 +00:00
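The two commits above (7734a551fa and 77b2b099c6) describe one pattern: library code raises exceptions, and only the command-line layer turns an error into an exit status. Below is a minimal Python sketch of that pattern. The body of `assert_valid_url`, the `archive` function and the `main` wrapper are all hypothetical illustrations; the real signatures in the project's utils are not shown in this log and may differ.

```python
import sys
from urllib.parse import urlparse


def assert_valid_url(url: str) -> bool:
    """Hypothetical validator. It raises rather than using `assert`,
    because assert statements are removed when Python runs with -O."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL: {url!r}")
    return True


def archive(url: str) -> str:
    """Hypothetical library-level function: it never calls exit()."""
    assert_valid_url(url)
    return f"archived {url}"


def main(argv: list[str]) -> int:
    """Hypothetical command-line wrapper: the only place where an
    error is converted into a process exit code."""
    try:
        for url in argv:
            print(archive(url))
    except ValueError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0
```

A command-line script would call `sys.exit(main(sys.argv[1:]))`, while code that imports `archive` directly can catch `ValueError` and carry on.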
Patrick Robertson
7dde8d609d Merge main 2025-02-20 10:29:57 +00:00
Patrick Robertson
a9802dd004 Remove the global _LAZY_LOADED_MODULES and allow each instance of ArchivingOrchestrator to load its own modules 2025-02-19 12:25:35 +00:00
Patrick Robertson
222a94563f WIP: Docs tidyups+add howto on logging and authentication
(Authentication is WIP)
2025-02-19 10:37:04 +00:00
Patrick Robertson
eb60b271b9 Fix issue #200 2025-02-19 10:35:14 +00:00
Patrick Robertson
3c543a3a6a Various fixes for issues with new architecture (#208)
* Add formatters to the TOC - fixes #204

* Add 'steps' settings to the example YAML in the docs. Fixes #206

* Improved docs on authentication architecture

* Fix setting modules on the command line - they now override any module settings in the orchestration as opposed to appending

* Fix tests for gsheet-feeder: add a test service_account.json (note: not real keys in there)

* Rename the command line entrypoint to _command_line_run

Also: make it clear that code implementation should not call this
Make sure the command line entry returns (we don't want a generator)

* Fix unit tests to use new code-entry points

* Version bump

* Move iterating of generator up to __main__

* Breakpoint

* two minor fixes

* Fix unit tests + add new '__main__' entry point implementation test

* Skip youtube tests if running on CI. Should still run them locally

* Fix full implementation run on GH actions

* Fix skipif test for GH Actions CI

* Add skipifs for truth - it blocks GH:

---------

Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2025-02-18 19:10:09 +00:00
Patrick Robertson
6d43bc7d4d Fix generator programmatic setup (#197)
* Fix returning a generator of a generator

* Move download test to pytest.mark.download
2025-02-15 17:36:44 +00:00
Miguel Sozinho Ramalho
9297697ef5 makes orchestrator.run return the results to allow for code integration (#196) 2025-02-15 12:41:26 +00:00
Patrick Robertson
460a71649c Merge pull request #190 from bellingcat/docs_update
Docs improvement
2025-02-12 12:38:04 +01:00
Patrick Robertson
a0c4a82825 Improved docstrings for base modules 2025-02-12 11:32:13 +00:00
msramalho
e507fc81d2 improves mimetype guessing, previously file.sub.something would not have an extension 2025-02-11 15:02:49 +00:00
Patrick Robertson
29901da601 Merge branch 'load_modules' into docs_update 2025-02-11 14:10:56 +00:00
Patrick Robertson
2f51d3917a Further addition to docs: creating modules, configurations, installation 2025-02-11 13:49:30 +00:00
erinhmclark
c8cd7ea63c Merge branch 'load_modules' into add_module_tests
# Conflicts:
#	src/auto_archiver/modules/telethon_extractor/telethon_extractor.py
2025-02-11 13:08:08 +00:00
msramalho
e6594ad3dc merge result into cached results for context preservation 2025-02-11 12:52:42 +00:00
msramalho
7309cd32e7 fix: context to be updated on Metadata.merge 2025-02-11 12:51:17 +00:00
Patrick Robertson
ed81dcdaf0 Remove dangling 'b = ' from config.py 2025-02-10 23:07:03 +00:00
erinhmclark
8d894066f2 Merge branch 'load_modules' into add_module_tests
# Conflicts:
#	src/auto_archiver/modules/gsheet_feeder/gsheet_feeder.py
#	src/auto_archiver/utils/misc.py
2025-02-10 19:00:05 +00:00
erinhmclark
3dae2337a1 remove cdn_url check before storage. 2025-02-10 18:56:46 +00:00
erinhmclark
e97ccf8a73 Separate setup() and module_setup(). 2025-02-10 18:07:47 +00:00
erinhmclark
2c3d1f591f Separate setup() and module_setup(). 2025-02-10 17:25:15 +00:00
msramalho
12f14cccc9 fixes gsheet feeder<->db connection via context. 2025-02-10 16:58:35 +00:00
erinhmclark
c4bb667cec Merge branch 'load_modules' into add_module_tests
# Conflicts:
#	src/auto_archiver/modules/s3_storage/s3_storage.py
#	src/auto_archiver/utils/gsheet.py
#	src/auto_archiver/utils/misc.py
2025-02-10 16:17:08 +00:00
msramalho
15abf686b1 decouples s3_storage from hash_enricher 2025-02-10 15:48:54 +00:00
msramalho
7c848046e8 adds better info about wrong/missing modules 2025-02-10 14:59:32 +00:00
Patrick Robertson
74207d7821 Implementation tests for auto-archiver 2025-02-10 13:27:11 +01:00
Patrick Robertson
e9dd321dcd Fix setting cli_feeder as default feeder on clean install 2025-02-10 13:06:24 +01:00
Patrick Robertson
1fad37fd93 Remove blank file 2025-02-07 23:08:30 +01:00
Patrick Robertson
63aba6ad39 Fix sphinx-autoapi imports 2025-02-07 21:54:49 +01:00
erinhmclark
e9ad1e1b85 Pass media to storage cdn_call 2025-02-06 22:01:55 +00:00
Patrick Robertson
a506f2a88f Clarify that an extractor's method can also return False if no valid data was found 2025-02-06 10:20:05 +01:00
Patrick Robertson
6ab8fd2ee4 Tidy up setting modules as Orchestrator attributes on startup.
Don't override the values in config['steps'] – the config should be left as is
2025-02-06 10:20:05 +01:00
Patrick Robertson
b301f60ea3 Fix using validators set in __manifest__.py
E.g. you can use the validator 'is_file' to check if a config is a valid file
2025-02-04 13:37:26 +01:00
Patrick Robertson
c574b694ed Set up screenshot enricher to use authentication/cookies 2025-02-03 17:25:59 +01:00
Patrick Robertson
7a2be5a0da Add cookie extraction to 'authentication' options, get generic_extractor working using this info 2025-02-03 16:03:07 +01:00
Patrick Robertson
9a8c94b641 Fix getting/setting folder context for metadata 2025-02-03 16:02:17 +01:00
Patrick Robertson
c25d5cae84 Remove ArchivingContext completely
Context for a specific url/item is now passed around via the metadata (metadata.set_context('key', 'val') and metadata.get_context('key', default='something'))
The only other thing that was passed around in ArchivingContext was the storage info, which is already accessible now via self.config
2025-01-30 17:50:54 +01:00
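The per-item context described in c25d5cae84 can be sketched as follows. This is a minimal stand-in, not the project's actual `Metadata` class: only the method names `set_context` and `get_context` and the `default` keyword come from the commit message. The internal dictionary and the example keys are assumptions.

```python
from typing import Any


class Metadata:
    """Hypothetical stand-in for the project's Metadata class."""

    def __init__(self) -> None:
        # Assumed storage: a plain dict holding context for one url/item.
        self._context: dict[str, Any] = {}

    def set_context(self, key: str, val: Any) -> None:
        """Attach a context value to this item."""
        self._context[key] = val

    def get_context(self, key: str, default: Any = None) -> Any:
        """Return the context value for `key`, or `default` if it is unset."""
        return self._context.get(key, default)


item = Metadata()
item.set_context("folder", "my-archive")                    # example key, assumed
folder = item.get_context("folder")                         # value set above
missing = item.get_context("missing", default="something")  # falls back to default
```

Because the context lives on the item itself, two items processed in the same run do not share state. Avoiding that shared state is presumably the motivation for removing a global ArchivingContext.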
Patrick Robertson
d6b4b7a932 Further cleanup
* Removes (partly) the ArchivingOrchestrator
* Removes the cli_feeder module, and makes it the 'default', allowing you to pass URLs directly on the command line, without having to use the cumbersome --cli_feeder.urls. Just do auto-archiver https://my.url.com
* More unit tests
* Improved error handling
2025-01-30 16:44:40 +01:00
Patrick Robertson
953011f368 Don't make modules 'dataclasses' 2025-01-30 16:44:40 +01:00
Patrick Robertson
fade68c6f4 Fix up unit tests - dataclass + subclasses not having @dataclass was breaking it 2025-01-30 13:45:24 +01:00
Patrick Robertson
b7d9145f6c Further tidyups + refactoring for new structure
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
2025-01-30 13:21:10 +01:00
erinhmclark
cddae65a90 Update modules for new core structure. 2025-01-30 08:42:23 +00:00
Patrick Robertson
00a7018f36 Fix up dependency checking (use 'dependencies' instead of 'external_dependencies' -> simpler/easier to remember) 2025-01-29 19:25:22 +01:00
Patrick Robertson
3d37c494aa Tidy ups + unit tests:
1. Allow loading modules from --module_paths=/extra/path/here
2. Improved unit tests for module loading
3. Further small tidy ups/clean ups
2025-01-29 18:42:49 +01:00
Patrick Robertson
7a4871db6b Fix up unit tests for new structure 2025-01-28 14:40:12 +01:00
Patrick Robertson
9635449ac0 more user friendly error logging when config issues are found 2025-01-28 11:44:52 +01:00
Patrick Robertson
27b25c5bd4 Validate orchestration.yaml file inputs - so if a user enters invalid values, it also validates them 2025-01-28 11:37:23 +01:00