Compare commits

22 Commits

Author SHA1 Message Date
msramalho
283bc35658 Bump version to v0.3.11 for release 2023-02-23 16:52:59 +01:00
msramalho
cef70fb80d update yt-dlp 2023-02-23 16:52:52 +01:00
msramalho
e66ef4f477 fix tests 2023-02-23 16:52:45 +01:00
msramalho
1f6a8368fd updates 2022-11-03 17:07:34 +00:00
msramalho
9a046fd1cb Bump version to v0.3.9 for release 2022-11-03 16:35:59 +00:00
msramalho
aae2bb5999 Bump version to v0.3.8 for release 2022-11-03 16:19:30 +00:00
msramalho
9e30b81d16 Bump version to v0.3.7 for release 2022-11-03 16:05:18 +00:00
msramalho
72bc355606 updates readme 2022-11-03 16:03:12 +00:00
msramalho
7f59eefb73 Merge branch 'main' of https://github.com/bellingcat/vk-url-scraper 2022-11-03 16:02:50 +00:00
msramalho
30003c524e Bump version to v0.3.6 for release 2022-11-03 16:01:15 +00:00
msramalho
d1b27bef1d adds session_file name customization 2022-11-03 16:00:58 +00:00
Miguel Sozinho Ramalho
e5e9e08ee6 Update README.md 2022-09-30 16:27:07 +01:00
Miguel Sozinho Ramalho
3a8a3f54c0 Merge pull request #12 from bellingcat/dependabot/pip/yt-dlp-2022.7.18
Bump yt-dlp from 2022.5.18 to 2022.7.18
2022-07-20 12:54:08 +02:00
dependabot[bot]
4d73864dbb Bump yt-dlp from 2022.5.18 to 2022.7.18
Bumps [yt-dlp](https://github.com/yt-dlp/yt-dlp) from 2022.5.18 to 2022.7.18.
- [Release notes](https://github.com/yt-dlp/yt-dlp/releases)
- [Changelog](https://github.com/yt-dlp/yt-dlp/blob/master/Changelog.md)
- [Commits](https://github.com/yt-dlp/yt-dlp/compare/2022.05.18...2022.07.18)

---
updated-dependencies:
- dependency-name: yt-dlp
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-07-19 17:44:01 +00:00
Miguel Sozinho Ramalho
ceaa8e45f3 Merge pull request #10 from bellingcat/dependabot/pip/yt-dlp-2022.7.18 2022-07-19 10:13:13 +02:00
dependabot[bot]
007c8e07a8 Bump yt-dlp from 2022.5.18 to 2022.7.18
Bumps [yt-dlp](https://github.com/yt-dlp/yt-dlp) from 2022.5.18 to 2022.7.18.
- [Release notes](https://github.com/yt-dlp/yt-dlp/releases)
- [Changelog](https://github.com/yt-dlp/yt-dlp/blob/master/Changelog.md)
- [Commits](https://github.com/yt-dlp/yt-dlp/compare/2022.05.18...2022.07.18)

---
updated-dependencies:
- dependency-name: yt-dlp
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-07-18 17:58:45 +00:00
Miguel Sozinho Ramalho
a515b2c3de Update README.md 2022-06-27 16:12:55 +01:00
Miguel Sozinho Ramalho
54540cd132 Update README.md 2022-06-27 16:07:09 +01:00
Miguel Sozinho Ramalho
cfb13e5d82 Merge pull request #2 from bellingcat/dependabot/pip/yt-dlp-2022.6.22.1
Bump yt-dlp from 2022.5.18 to 2022.6.22.1
2022-06-24 13:11:52 +01:00
Miguel Sozinho Ramalho
926c3cb8a4 Merge pull request #3 from bellingcat/dependabot/pip/furo-2022.6.21
Bump furo from 2022.6.4.1 to 2022.6.21
2022-06-24 11:38:27 +01:00
dependabot[bot]
15ebe2e66c Bump furo from 2022.6.4.1 to 2022.6.21
Bumps [furo](https://github.com/pradyunsg/furo) from 2022.6.4.1 to 2022.6.21.
- [Release notes](https://github.com/pradyunsg/furo/releases)
- [Changelog](https://github.com/pradyunsg/furo/blob/main/docs/changelog.md)
- [Commits](https://github.com/pradyunsg/furo/compare/2022.06.04.1...2022.06.21)

---
updated-dependencies:
- dependency-name: furo
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-06-22 22:44:17 +00:00
dependabot[bot]
eaff88b2d9 Bump yt-dlp from 2022.5.18 to 2022.6.22.1
Bumps [yt-dlp](https://github.com/yt-dlp/yt-dlp) from 2022.5.18 to 2022.6.22.1.
- [Release notes](https://github.com/yt-dlp/yt-dlp/releases)
- [Changelog](https://github.com/yt-dlp/yt-dlp/blob/master/Changelog.md)
- [Commits](https://github.com/yt-dlp/yt-dlp/compare/2022.05.18...2022.06.22.1)

---
updated-dependencies:
- dependency-name: yt-dlp
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-06-22 22:42:51 +00:00
9 changed files with 777 additions and 499 deletions

1
.gitignore vendored
View File

@@ -1,6 +1,7 @@
.env
vk_config.v2.json
output/
tmp*/
# build artifacts
.eggs/

View File

@@ -18,7 +18,7 @@ pytest-sphinx = "*"
pytest-cov = "*"
twine = ">=1.11.0"
sphinx = ">=4.3.0,<5.1.0"
furo = "==2022.6.4.1"
furo = "==2022.6.21"
myst-parser = ">=0.15.2,<0.19.0"
sphinx-autobuild = "==2021.3.14"
sphinx-autodoc-typehints = "*"
@@ -26,3 +26,6 @@ python-dotenv = "*"
[requires]
python_version = "3.9"
[pipenv]
allow_prereleases = true

1211
Pipfile.lock generated

File diff suppressed because it is too large

View File

@@ -1,7 +1,13 @@
# vk-url-scraper
Library to scrape data and especially media links (videos and photos) from vk.com URLs.
Python library to scrape data, and especially media links like videos and photos, from vk.com URLs.
You can use it via the [command line](#command-line-usage) or as a [python library](#python-library-usage).
[![PyPI version](https://badge.fury.io/py/vk-url-scraper.svg)](https://badge.fury.io/py/vk-url-scraper)
[![PyPI download month](https://img.shields.io/pypi/dm/vk-url-scraper.svg)](https://pypi.python.org/pypi/vk-url-scraper/)
[![Documentation Status](https://readthedocs.org/projects/vk-url-scraper/badge/?version=latest)](https://vk-url-scraper.readthedocs.io/en/latest/?badge=latest)
You can use it via the [command line](#command-line-usage) or as a [python library](#python-library-usage); check the **[documentation](https://vk-url-scraper.readthedocs.io/en/latest/)**.
## Installation
You can install the most recent release from [pypi](https://pypi.org/project/vk-url-scraper/) via `pip install vk-url-scraper`.
@@ -21,7 +27,8 @@ vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall1
vk_url_scraper -u "username here" -p "password here" --urls https://vk.com/wall12345_6789 https://vk.com/photo-12345_6789 https://vk.com/video12345_6789
# you can pass a token as well to avoid always authenticating
# and possibly getting captch prompts
# and possibly getting captcha prompts
# you can fetch the token from the generated vk_config.v2.json file by searching for "access_token"
vk_url_scraper -u "username" -p "password" -t "vktoken goes here" --urls https://vk.com/wall12345_6789
# save the JSON output into a file
@@ -48,7 +55,7 @@ res = vks.scrape("https://vk.com/wall-1_398461")
# scrape any "video" URL
res = vks.scrape("https://vk.com/video-6596301_145810025")
print(res[0]["text]) # eg: -> to get the text from code
print(res[0]["text"]) # eg: -> to get the text from code
```
```python
@@ -93,6 +100,7 @@ To test the command line interface available in [__main__.py](__vk_url_scraper/_
2. run `./scripts/release.sh` to create a tag and push, alternatively
1. `git tag vx.y.z` to tag version
2. `git push origin vx.y.z` -> this will trigger workflow and put project on [pypi](https://pypi.org/project/vk-url-scraper/)
3. go to https://readthedocs.org/ to deploy new docs version (if webhook is not set up)
### Fixing a failed release
@@ -102,4 +110,4 @@ If for some reason the GitHub Actions release workflow failed with an error that
git tag -l | xargs git tag -d && git fetch -t
```
Then repeat the steps above.
Then repeat the steps above.
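
The README hunk above suggests finding the token by searching the session file for "access_token". A minimal sketch of that search follows. The helper names `iter_access_tokens` and `first_access_token` are hypothetical and not part of vk-url-scraper, and the sketch deliberately assumes nothing about how the JSON is nested: it walks the whole document and returns any string stored under an `access_token` key.

```python
import json
from typing import Any, Iterator, Optional


def iter_access_tokens(node: Any) -> Iterator[str]:
    """Yield every string stored under an "access_token" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "access_token" and isinstance(value, str):
                yield value
            else:
                # Keep descending: the exact nesting of the file is not assumed.
                yield from iter_access_tokens(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_access_tokens(item)


def first_access_token(path: str = "vk_config.v2.json") -> Optional[str]:
    """Return the first token found in the session file, or None if there is none."""
    with open(path, encoding="utf-8") as handle:
        return next(iter_access_tokens(json.load(handle)), None)
```

The returned value could then be passed to the `-t` option shown in the README, provided the session file exists and holds a token.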

View File

@@ -12,4 +12,4 @@ requests==2.28.0
urllib3==1.26.9
vk-api==11.9.8
python-dotenv==0.20.0
yt-dlp==2022.5.18
yt-dlp==2022.7.18

View File

@@ -44,7 +44,10 @@ setup(
"Programming Language :: Python :: 3",
],
keywords=["scraper", "vk", "vkontakte", "vk-api", "media-downloader"],
url="https://github.com/bellingcat/vk-url-scraper",
project_urls={
"Code": "https://github.com/bellingcat/vk-url-scraper",
"Documentation": "https://vk-url-scraper.readthedocs.io/en/latest/",
},
author="Bellingcat",
author_email="tech@bellingcat.com",
license="MIT",

View File

@@ -14,6 +14,17 @@ def test_login_fail():
VkScraper("invalid", "combination")
def test_login_custom_file():
session_filename = "test-session.json"
VkScraper(
os.environ["VK_USERNAME"],
os.environ["VK_PASSWORD"],
session_file=session_filename,
)
assert os.path.isfile(session_filename)
os.unlink(session_filename)
def test_login_success():
global vks
vks = VkScraper(
@@ -69,7 +80,7 @@ def test_scrape_wall_url_with_photos():
== "Хабаровск\nАллея героев\nПомолимся об укокоении воинов:\nАлександра, Игоря, Эдуарда, \nДионисия, Евгения, Александра, Артемия, Иннокентия, Андрея."
)
assert str(res[0]["datetime"]) == str(datetime.datetime(2022, 6, 15, 10, 37, 24))
assert len(res[0]["payload"]) == 16
assert len(res[0]["payload"]) == 17
assert len(res[0]["attachments"].keys()) == 1
assert list(res[0]["attachments"].keys()) == ["photo"]
assert len(res[0]["attachments"]["photo"]) == 9
@@ -81,7 +92,7 @@ def test_scrape_wall_url_with_photos_inner_videos_and_links_with_inner_photos():
assert res[0]["id"] == "wall-17315087_74182"
assert res[0]["text"] == ""
assert str(res[0]["datetime"]) == str(datetime.datetime(2022, 3, 24, 11, 1, 9))
assert len(res[0]["payload"]) == 15
assert len(res[0]["payload"]) == 17
assert len(res[0]["attachments"].keys()) == 3
for k in ["photo", "link", "video"]:
assert k in list(res[0]["attachments"].keys())

View File

@@ -40,7 +40,12 @@ class VkScraper:
VIDEO_PATTERN = re.compile(r"(video.{0,1}\d+_\d+)")
def __init__(
self, username: str, password: str, token: str = None, captcha_handler=captcha_handler
self,
username: str,
password: str,
token: str = None,
session_file="vk_config.v2.json",
captcha_handler=captcha_handler,
) -> None:
"""Initializes the scraper.
@@ -55,9 +60,17 @@ class VkScraper:
Matching password on vk.com
token : str
Access token received after authenticating, can be found in the vk_config.v2.json file
session_file : str
File name where the VK session is saved so future logins are easier; this will not be created if token is passed
captcha_handler : func
Function that can receive a vk_api captcha instance and help the user solve it, default is a complete CLI handler
"""
self.session = vk_api.VkApi(
username, password, token=token, captcha_handler=captcha_handler
username,
password,
token=token,
config_filename=session_file,
captcha_handler=captcha_handler,
)
if token is None or len(token) == 0:
self.session.auth(token_only=True)
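
As an illustration of the `VIDEO_PATTERN` shown in the hunk above, the following sketch applies the same regular expression to the example URLs from the README. The pattern is copied from the diff; the `video_ids` wrapper is a hypothetical helper added only for this example and is not claimed to be how the class uses the pattern internally.

```python
import re

# Pattern copied verbatim from the VkScraper class in the diff.
VIDEO_PATTERN = re.compile(r"(video.{0,1}\d+_\d+)")


def video_ids(url: str) -> list[str]:
    """Return every video identifier the pattern finds in the given URL."""
    return VIDEO_PATTERN.findall(url)


print(video_ids("https://vk.com/video-6596301_145810025"))  # ['video-6596301_145810025']
print(video_ids("https://vk.com/video12345_6789"))          # ['video12345_6789']
print(video_ids("https://vk.com/wall12345_6789"))           # []
```

The optional `.{0,1}` accepts the leading `-` of community-owned videos, which is why both the signed and unsigned identifiers match.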

View File

@@ -2,7 +2,7 @@ _MAJOR = "0"
_MINOR = "3"
# On main and in a nightly release the patch should be one ahead of the last
# released build.
_PATCH = "5"
_PATCH = "11"
# This is mainly for nightly builds which have the suffix ".dev$DATE". See
# https://semver.org/#is-v123-a-semantic-version for the semantics.
_SUFFIX = ""
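
For orientation, here is a sketch of how the four components above could combine into the released version string. The names `VERSION_SHORT` and `VERSION` are assumptions, since the diff does not show the rest of the file; only the component values come from the hunk.

```python
# Component values as they stand after this change.
_MAJOR = "0"
_MINOR = "3"
_PATCH = "11"
_SUFFIX = ""

# Assumed assembly: dotted major.minor.patch followed by the optional suffix.
VERSION_SHORT = f"{_MAJOR}.{_MINOR}"
VERSION = f"{_MAJOR}.{_MINOR}.{_PATCH}{_SUFFIX}"

print(VERSION)  # 0.3.11
```

Under that assumption the result lines up with the "Bump version to v0.3.11 for release" commit at the top of the list.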