Commit Graph

81 Commits

Author SHA1 Message Date
msramalho
cd19181d8f minor improvements 2025-06-11 16:51:42 +01:00
msramalho
3cf51dd874 adds tracker remove feature and tests 2025-06-11 11:56:42 +01:00
msramalho
8314833ae8 removes exclude_media_extensions option 2025-06-10 18:34:33 +01:00
msramalho
287e823f43 improves twitter URL cleaning and introduces another bestquality check 2025-06-10 16:09:38 +01:00
msramalho
c815488daa adds new URLs to ignore 2025-06-10 15:44:52 +01:00
msramalho
07ff5baf07 adds Dropin flexible integration for antibot 2025-06-07 19:09:37 +01:00
msramalho
c7a84bc97a generalizes ydl info to filename method for reusing 2025-06-07 18:14:08 +01:00
msramalho
5f68c151a0 removes webdriver utils used by screenshot enricher 2025-06-04 14:17:19 +01:00
Patrick Robertson
aacb874b56 removeprefix for www. is required here 2025-03-21 12:23:45 +04:00
Patrick Robertson
14c56f4916 Provide better logs for screenshot enricher when auth is/isn't supported (cookies only) 2025-03-21 12:05:47 +04:00
Patrick Robertson
168dfb6254 Unit tests for url utils 2025-03-21 11:53:47 +04:00
Patrick Robertson
99e9ac2465 Fix 'Syntax Error' warning in python3.12+ 2025-03-17 09:29:51 +00:00
Patrick Robertson
19715c8ec2 Merge branch 'main' into webdriver-cookies 2025-03-14 12:44:48 +00:00
Patrick Robertson
f6b13327f0 Tweaks and additional debug logging 2025-03-13 17:41:41 +00:00
Patrick Robertson
0efeaaabb1 Revert to using time.sleep and .click() - since we only want to be waiting the first time (for the page to load) 2025-03-13 17:41:16 +00:00
Patrick Robertson
7a81ab617a Better checking of cookies to add to webdriver 2025-03-11 11:57:25 +00:00
erinhmclark
e7fa88f1c7 Implementing ruff suggestions. 2025-03-10 21:45:30 +00:00
erinhmclark
ca44a40b88 Ruff fix on src. 2025-03-10 19:03:45 +00:00
erinhmclark
85abe1837a Ruff format with defaults. 2025-03-10 18:44:54 +00:00
Patrick Robertson
e519ba2433 Add 'reject all' cookie button 2025-03-07 16:40:34 +00:00
Patrick Robertson
dba44b1ac1 Use WebDriverWait when waiting for elements in screenshot enricher 2025-03-07 12:07:54 +00:00
Patrick Robertson
0dfab2d1bc Add some code to attempt to click the cookies banners on various websites 2025-03-03 15:55:04 +00:00
Patrick Robertson
dea0a49600 Download correct gecko-driver for the platform + fix setting executable path when running in Docker
Fixes #232
2025-03-03 15:41:44 +00:00
erinhmclark
8124bb831d Merge branch 'main' into small_issues
# Conflicts:
#	src/auto_archiver/core/base_module.py
#	src/auto_archiver/utils/misc.py
2025-02-26 13:19:49 +00:00
erinhmclark
9157846930 Add docstrings to explain date formats. 2025-02-26 10:01:52 +00:00
erinhmclark
83a08dd215 Update date parsing to use dateutil.parser in misc.py 2025-02-25 20:17:31 +00:00
Patrick Robertson
eda359a1ef Fix json loader - it should go in 'validators' not 'utils'
Fixes #214
2025-02-20 13:10:39 +00:00
Patrick Robertson
7734a551fa Move 'assert_valid_url' out into utils, don't use assert but raise
assert is recommended only for debugging
2025-02-20 11:19:29 +00:00
erinhmclark
ce5a200d1f Added tests, updated instagram_tbot_extractor.py raise failure. 2025-02-18 12:59:10 +00:00
Patrick Robertson
1b976f4c09 Remove unused atlos util functions 2025-02-11 18:49:54 +00:00
Patrick Robertson
756f46012b Remove empty file 2025-02-11 18:47:54 +00:00
erinhmclark
8d894066f2 Merge branch 'load_modules' into add_module_tests
# Conflicts:
#	src/auto_archiver/modules/gsheet_feeder/gsheet_feeder.py
#	src/auto_archiver/utils/misc.py
2025-02-10 19:00:05 +00:00
msramalho
ab6cf52533 fixes bad hash initialization 2025-02-10 16:45:28 +00:00
erinhmclark
c4bb667cec Merge branch 'load_modules' into add_module_tests
# Conflicts:
#	src/auto_archiver/modules/s3_storage/s3_storage.py
#	src/auto_archiver/utils/gsheet.py
#	src/auto_archiver/utils/misc.py
2025-02-10 16:17:08 +00:00
erinhmclark
f311621e58 Small fixes.
Add timestamp helper method.
2025-02-10 15:57:42 +00:00
msramalho
15abf686b1 decouples s3_storage from hash_enricher 2025-02-10 15:48:54 +00:00
Patrick Robertson
63aba6ad39 Fix sphinx-autoapi imports 2025-02-07 21:54:49 +01:00
erinhmclark
266c7a14e6 Context related fixes, some more tests. 2025-02-06 16:53:00 +00:00
Patrick Robertson
0633e17998 Close the facebook 'login' window if it's there - to allow for proper screenshots 2025-02-04 14:18:46 +01:00
Patrick Robertson
72b5ea9ab6 Restore headless arg 2025-02-03 17:40:40 +01:00
Patrick Robertson
c574b694ed Set up screenshot enricher to use authentication/cookies 2025-02-03 17:25:59 +01:00
Patrick Robertson
b7d9145f6c Further tidyups + refactoring for new structure
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
2025-01-30 13:21:10 +01:00
Patrick Robertson
3d37c494aa Tidy ups + unit tests:
1. Allow loading modules from --module_paths=/extra/path/here
2. Improved unit tests for module loading
3. Further small tidy ups/clean ups
2025-01-29 18:42:49 +01:00
erinhmclark
e1a9373336 Refactoring for new config setup 2025-01-27 19:03:02 +00:00
erinhmclark
96b35a272c Rm gsheet references in utils 2025-01-24 18:51:15 +00:00
erinhmclark
dd402b456f Fix and add types to manifest 2025-01-24 18:50:11 +00:00
erinhmclark
1942e8b819 Gsheets utility revert 2025-01-24 13:34:30 +00:00
erinhmclark
024fe58377 fix config parsing in manifests, remove module level configs 2025-01-24 13:33:12 +00:00
erinhmclark
0453d95f56 fix config parsing in manifests 2025-01-24 13:24:54 +00:00
Patrick Robertson
6388983815 Merge branch 'main' into youtubedlp-rewrite 2025-01-21 16:43:14 +01:00