Commit Graph

20 Commits

Author SHA1 Message Date
msramalho
cd19181d8f minor improvements 2025-06-11 16:51:42 +01:00
msramalho
3cf51dd874 adds tracker remove feature and tests 2025-06-11 11:56:42 +01:00
msramalho
8314833ae8 removes exclude_media_extensions option 2025-06-10 18:34:33 +01:00
msramalho
287e823f43 improves twitter URL cleaning and introduces another bestquality check 2025-06-10 16:09:38 +01:00
msramalho
c815488daa adds new URLs to ignore 2025-06-10 15:44:52 +01:00
Patrick Robertson
168dfb6254 Unit tests for url utils 2025-03-21 11:53:47 +04:00
erinhmclark
85abe1837a Ruff format with defaults. 2025-03-10 18:44:54 +00:00
Patrick Robertson
7734a551fa Move 'assert_valid_url' out into utils, don't use assert but raise
assert is recommended only for debugging
2025-02-20 11:19:29 +00:00
Patrick Robertson
c574b694ed Set up screenshot enricher to use authentication/cookies 2025-02-03 17:25:59 +01:00
Patrick Robertson
b7d9145f6c Further tidyups + refactoring for new structure
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
2025-01-30 13:21:10 +01:00
Galen Reich
381940f5a8 Fix Selenium headless invocation (#106)
Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2023-11-13 11:56:35 +01:00
msramalho
ceb717ea65 exclude vk emojis 2023-08-17 18:11:26 +01:00
msramalho
6e4fb76940 exclude ok resource images from wacz enricher 2023-08-09 11:26:46 +01:00
msramalho
60a1f3a27a minor fixes 2023-07-31 16:08:48 +01:00
msramalho
fb197f1064 excluding telegram embeds 2023-07-28 12:57:15 +01:00
msramalho
aa71c85a98 improving ignored content from waczs 2023-07-28 12:19:14 +01:00
msramalho
59551b3b20 minor improvements: finding best twitter image quality 2023-07-27 21:36:15 +01:00
msramalho
f086d89111 new escape message 2023-07-27 20:14:59 +01:00
msramalho
dd034da844 feat: WACZ enricher can now be probed for media, and used as an archiver OR enricher 2023-07-27 15:42:10 +01:00
msramalho
5505255ea3 url auth wall detect 2023-02-17 15:45:58 +00:00