Patrick Robertson
aacb874b56
removeprefix for www. is required here
2025-03-21 12:23:45 +04:00
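The reason `removeprefix` matters for a `www.` prefix, rather than `lstrip` or `strip`, can be shown with a short sketch. The helper name `strip_www` and the sample domains are hypothetical and are not taken from the repository; `str.removeprefix` requires Python 3.9 or later.

```python
def strip_www(domain: str) -> str:
    """Drop a single leading 'www.' from a domain, if present (hypothetical helper)."""
    return domain.removeprefix("www.")


# lstrip treats its argument as a set of characters, not as a prefix,
# so it also eats leading 'w' and '.' characters that belong to the domain.
print("www.wikipedia.org".lstrip("www."))  # ikipedia.org
print(strip_www("www.wikipedia.org"))      # wikipedia.org
print(strip_www("wikipedia.org"))          # wikipedia.org (unchanged)
```

Only an exact leading `www.` is removed, so domains that merely start with `w` are left intact.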
Patrick Robertson
14c56f4916
Provide better logs for screenshot enricher when auth is/isn't supported (cookies only)
2025-03-21 12:05:47 +04:00
Patrick Robertson
168dfb6254
Unit tests for url utils
2025-03-21 11:53:47 +04:00
Patrick Robertson
99e9ac2465
Fix 'Syntax Error' warning in python3.12+
2025-03-17 09:29:51 +00:00
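The commit message does not say which construct triggered the warning. One common cause in Python 3.12+, assumed here purely for illustration, is an invalid escape sequence such as `\d` in a non-raw string literal, which now emits a `SyntaxWarning` (earlier versions emitted a `DeprecationWarning`). The helper `escape_warnings` and the sample pattern are hypothetical.

```python
import re
import warnings


def escape_warnings(source: str) -> list:
    """Compile a snippet and return the warning categories it triggers."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<snippet>", "exec")
    return [w.category for w in caught]


# Non-raw literal: '\d' is not a valid string escape, so the compiler warns.
bad = escape_warnings('pattern = "\\d+"')
# Raw literal: the backslash is kept as-is and nothing is reported.
good = escape_warnings('pattern = r"\\d+"')

print(bad)   # warning categories reported for the non-raw literal
print(good)  # []

# The raw-string form is the usual fix for regex patterns.
assert re.fullmatch(r"\d+", "2025")
```

If this is the cause, prefixing the affected literals with `r` (or doubling the backslashes) silences the warning without changing behaviour.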
Patrick Robertson
19715c8ec2
Merge branch 'main' into webdriver-cookies
2025-03-14 12:44:48 +00:00
Patrick Robertson
f6b13327f0
Tweaks and additional debug logging
2025-03-13 17:41:41 +00:00
Patrick Robertson
0efeaaabb1
Revert to using time.sleep and .click(), since we only want to wait the first time (for the page to load)
2025-03-13 17:41:16 +00:00
Patrick Robertson
7a81ab617a
Better checking of cookies to add to webdriver
2025-03-11 11:57:25 +00:00
erinhmclark
e7fa88f1c7
Implementing ruff suggestions.
2025-03-10 21:45:30 +00:00
erinhmclark
ca44a40b88
Ruff fix on src.
2025-03-10 19:03:45 +00:00
erinhmclark
85abe1837a
Ruff format with defaults.
2025-03-10 18:44:54 +00:00
Patrick Robertson
e519ba2433
Add 'reject all' cookie button
2025-03-07 16:40:34 +00:00
Patrick Robertson
dba44b1ac1
Use WebDriverWait when waiting for elements in screenshot enricher
2025-03-07 12:07:54 +00:00
Patrick Robertson
0dfab2d1bc
Add some code to attempt to click the cookies banners on various websites
2025-03-03 15:55:04 +00:00
Patrick Robertson
dea0a49600
Download correct gecko-driver for the platform + fix setting executable path when running in Docker
Fixes #232
2025-03-03 15:41:44 +00:00
erinhmclark
8124bb831d
Merge branch 'main' into small_issues
# Conflicts:
# src/auto_archiver/core/base_module.py
# src/auto_archiver/utils/misc.py
2025-02-26 13:19:49 +00:00
erinhmclark
9157846930
Add docstrings to explain date formats.
2025-02-26 10:01:52 +00:00
erinhmclark
83a08dd215
Update date parsing to use dateutil.parser in misc.py
2025-02-25 20:17:31 +00:00
Patrick Robertson
eda359a1ef
Fix json loader - it should go in 'validators' not 'utils'
Fixes #214
2025-02-20 13:10:39 +00:00
Patrick Robertson
7734a551fa
Move 'assert_valid_url' out into utils, don't use assert but raise
assert is recommended only for debugging
2025-02-20 11:19:29 +00:00
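The reasoning in this commit, that `assert` is meant for debugging only, follows from Python removing `assert` statements when run with `-O`. A minimal sketch of a URL check that raises instead is shown below. The function name `validate_url`, the accepted schemes, and the error messages are assumptions for illustration and are not the project's actual `assert_valid_url` implementation.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> str:
    """Return the URL unchanged if it looks valid, otherwise raise ValueError.

    Hypothetical helper: unlike `assert`, these checks still run under `python -O`.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Invalid URL scheme: {url!r}")
    if not parsed.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    return url


print(validate_url("https://example.com/post/1"))  # https://example.com/post/1

try:
    validate_url("ftp://example.com/file")
except ValueError as err:
    print(f"rejected: {err}")
```

Raising a specific exception also lets callers catch and report bad input, whereas a failed `assert` surfaces as a generic `AssertionError`.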
erinhmclark
ce5a200d1f
Added tests, updated instagram_tbot_extractor.py to raise on failure.
2025-02-18 12:59:10 +00:00
Patrick Robertson
1b976f4c09
Remove unused atlos util functions
2025-02-11 18:49:54 +00:00
Patrick Robertson
756f46012b
Remove empty file
2025-02-11 18:47:54 +00:00
erinhmclark
8d894066f2
Merge branch 'load_modules' into add_module_tests
# Conflicts:
# src/auto_archiver/modules/gsheet_feeder/gsheet_feeder.py
# src/auto_archiver/utils/misc.py
2025-02-10 19:00:05 +00:00
msramalho
ab6cf52533
fixes bad hash initialization
2025-02-10 16:45:28 +00:00
erinhmclark
c4bb667cec
Merge branch 'load_modules' into add_module_tests
# Conflicts:
# src/auto_archiver/modules/s3_storage/s3_storage.py
# src/auto_archiver/utils/gsheet.py
# src/auto_archiver/utils/misc.py
2025-02-10 16:17:08 +00:00
erinhmclark
f311621e58
Small fixes.
Add timestamp helper method.
2025-02-10 15:57:42 +00:00
msramalho
15abf686b1
decouples s3_storage from hash_enricher
2025-02-10 15:48:54 +00:00
Patrick Robertson
63aba6ad39
Fix sphinx-autoapi imports
2025-02-07 21:54:49 +01:00
erinhmclark
266c7a14e6
Context related fixes, some more tests.
2025-02-06 16:53:00 +00:00
Patrick Robertson
0633e17998
Close the facebook 'login' window if it's there - to allow for proper screenshots
2025-02-04 14:18:46 +01:00
Patrick Robertson
72b5ea9ab6
Restore headless arg
2025-02-03 17:40:40 +01:00
Patrick Robertson
c574b694ed
Set up screenshot enricher to use authentication/cookies
2025-02-03 17:25:59 +01:00
Patrick Robertson
b7d9145f6c
Further tidyups + refactoring for new structure
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
2025-01-30 13:21:10 +01:00
Patrick Robertson
3d37c494aa
Tidy ups + unit tests:
1. Allow loading modules from --module_paths=/extra/path/here
2. Improved unit tests for module loading
3. Further small tidy ups/clean ups
2025-01-29 18:42:49 +01:00
erinhmclark
e1a9373336
Refactoring for new config setup
2025-01-27 19:03:02 +00:00
erinhmclark
96b35a272c
Rm gsheet references in utils
2025-01-24 18:51:15 +00:00
erinhmclark
dd402b456f
Fix and add types to manifest
2025-01-24 18:50:11 +00:00
erinhmclark
1942e8b819
Gsheets utility revert
2025-01-24 13:34:30 +00:00
erinhmclark
024fe58377
fix config parsing in manifests, remove module level configs
2025-01-24 13:33:12 +00:00
erinhmclark
0453d95f56
fix config parsing in manifests
2025-01-24 13:24:54 +00:00
Patrick Robertson
6388983815
Merge branch 'main' into youtubedlp-rewrite
2025-01-21 16:43:14 +01:00
Patrick Robertson
4bb4ebdf82
Further cleanup, abstracts 'dropins' out into generic files
2025-01-21 16:36:45 +01:00
Patrick Robertson
fd2e7f973b
Further tidy-ups, also adds some ytdlp utils to 'utils'
2025-01-20 16:31:28 +01:00
erinhmclark
d3eec5d90f
Basic docs structure for RTD
2025-01-15 21:45:29 +00:00
Patrick Robertson
663c8ad93a
Add 'print_pdf' option to the screenshot enricher. Fixes #132
2024-12-20 07:14:03 +01:00
R. Miles McCain
f603400d0d
Add direct Atlos integration (#137)
* Add Atlos feeder
* Add Atlos db
* Add Atlos storage
* Fix Atlos storages
* Fix Atlos feeder
* Only include URLs in Atlos feeder once they're processed
* Remove print
* Add Atlos documentation to README
* Formatting fixes
* Don't archive existing material
* avoid KeyError in atlos_db
* version bump
---------
Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2024-04-15 19:25:17 +01:00
Miguel Sozinho Ramalho
7a21ae96af
V0.9.0 - closes several open issues: new enrichers and bug fixes (#133)
* clean orchestrator code, add archiver cleanup logic
* improves documentation for database.py
* telethon archivers isolate sessions into copied files
* closes #127
* closes #125
* closes #84
* meta enricher applies to all media
* closes #61 adds subtitles and comments
* minor update
* minor fixes to yt-dlp subtitles and comments
* closes #17 but logic is imperfect.
* closes #85 ssl enhancer
* minifies HTML, JS refactor for preview of certificates
* closes #91 adds freetsa timestamp authority
* version bump
* simplify download_url method
* skip ssl if nothing archived
* html preview improvements
* adds retrying lib
* manual download archiver improvements
* meta only runs when relevant data available
* new metadata convenience method
* html template improvements
* removes debug message
* does not close #91 yet; will need some more certificate chain logging
* adds verbosity config
* new instagram api archiver
* adds proxy support we
* adds proxy/end support and bug fix for yt-dlp
* proxy support for webdriver
* adds socks proxy to wacz_enricher
* refactor recursivity in inner media and display
* infinite recursive display
* foolproofing timestamping authorities
* version to 0.9.0
* minor fixes from code-review
2024-02-20 18:05:29 +00:00
Miguel Sozinho Ramalho
3e56ef137d
reduce s3 duplicating while keeping random urls via hash (#112)
2023-12-12 19:12:03 +00:00
Galen Reich
381940f5a8
Fix Selenium headless invocation (#106)
Co-authored-by: msramalho <19508417+msramalho@users.noreply.github.com>
2023-11-13 11:56:35 +01:00