bellingcat/vk-url-scraper

mirror of https://github.com/bellingcat/vk-url-scraper.git synced 2026-06-07 19:08:38 +03:00

Latest commit: c4a1333428 "cleanup" by msramalho, 2022-06-20 13:44:05 +02:00

vk-url-scraper

Library to scrape data, especially media links (videos and photos), from vk.com URLs.

Quick usage API

Install with pip install vk-url-scraper.

from vk_url_scraper import VkScraper

vks = VkScraper("username", "password")

# scrape any "photo" URL
res = vks.scrape("https://vk.com/photo1_278184324?rev=1")

# scrape any "wall" URL
res = vks.scrape("https://vk.com/wall-1_398461")

# scrape any "video" URL
res = vks.scrape("https://vk.com/video-6596301_145810025")
print(res[0]["text"])  # e.g. get the text of the first result

# Every scrape* function returns a list of dicts like
{
    "id": "wall_id",
    "text": "text in this post",
    "datetime": ...,  # UTC datetime of the post
    "attachments": {
        # each key is only present if a photo, video, or link exists
        "photo": [...],  # list of URLs with max quality
        "video": [...],  # list of URLs with max quality
        "link": [...],  # list of URLs with max quality
    },
    "payload": ...,  # original JSON response converted to a dict, which you can parse for more data
}
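Because every result follows the structure shown above, the media links can be gathered with a small helper. The following is a minimal sketch that assumes only the documented keys ("attachments" with optional "photo", "video", and "link" lists); the function name collect_media_urls and the sample data are hypothetical and not part of the library.

```python
from typing import Any, Dict, List


def collect_media_urls(results: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group attachment URLs by kind across all scraped results."""
    grouped: Dict[str, List[str]] = {"photo": [], "video": [], "link": []}
    for item in results:
        # "attachments" only holds the kinds that exist for a given post
        attachments = item.get("attachments") or {}
        for kind in grouped:
            grouped[kind].extend(attachments.get(kind, []))
    return grouped


# Hypothetical sample shaped like the documented return value
sample = [
    {"id": "wall-1_1", "text": "first", "attachments": {"photo": ["https://example.com/a.jpg"]}},
    {"id": "wall-1_2", "text": "second", "attachments": {"video": ["https://example.com/b.mp4"]}},
]
media = collect_media_urls(sample)
```

In real use, the list returned by vks.scrape(...) would take the place of sample.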

See [docs] for all available functions.
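The quick-usage examples cover three kinds of URL: photo, wall, and video. When handling a mixed batch of links, it can help to tell them apart before scraping. The sketch below is a hypothetical helper, not part of the library, and its pattern is inferred only from the three example URLs above, so real vk.com URLs may take other forms.

```python
import re
from typing import Optional

# Assumption: pattern inferred only from the three example URLs in the README
_KIND_RE = re.compile(r"vk\.com/(photo|wall|video)-?\d+_\d+")


def url_kind(url: str) -> Optional[str]:
    """Return "photo", "wall" or "video", or None if the URL matches none of them."""
    match = _KIND_RE.search(url)
    return match.group(1) if match else None
```

URLs for which url_kind returns None could then be skipped or logged rather than scraped.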

TODO

Publish the Sphinx docs online.

Development

(More info in CONTRIBUTING.md.)

Set up the dev environment with pip install -r dev-requirements.txt or pipenv install -r dev-requirements.txt
Set up the runtime environment with pip install -r requirements.txt or pipenv install -r requirements.txt
To run all checks, use make run-checks (which also fixes style), or run them individually:
1. To fix style: black . and isort ., then flake8 . to validate lint
2. To do type checking: mypy .
3. To test: pytest . (or pytest -v --color=yes --doctest-modules tests/ vk_url_scraper/ to use verbose output, colors, and test docstring examples)
Run make docs to generate the Sphinx docs; edit config.py if needed
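The individual checks listed above can also be chained from a short script. This is only a sketch under the assumption that the tools are installed and on the PATH; it is not the recipe behind make run-checks, which is not shown here, and the names CHECKS and run_checks are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands taken from the list above, in the order they are described
CHECKS: List[List[str]] = [
    ["black", "."],
    ["isort", "."],
    ["flake8", "."],
    ["mypy", "."],
    ["pytest", "-v", "--color=yes", "--doctest-modules", "tests/", "vk_url_scraper/"],
]


def run_checks(run: Callable[[Sequence[str]], int] = subprocess.call) -> List[str]:
    """Run each check in order and return the names of the ones that failed."""
    failed = []
    for cmd in CHECKS:
        # A non-zero exit code means the tool reported a problem
        if run(cmd) != 0:
            failed.append(cmd[0])
    return failed
```

Calling run_checks() with no arguments executes the real tools; passing a stand-in for run makes the function easy to exercise without them.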

Releasing new version

Edit version.py with the proper version number
Run ./scripts/release.sh to create a tag and push it; alternatively:
1. git tag vx.y.z to tag the version
2. git push origin vx.y.z, which triggers the workflow and publishes the project to PyPI
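Since the tag has to follow the vx.y.z form, a small check before tagging can catch a malformed version. The sketch below is a hypothetical helper, not something taken from scripts/release.sh, and it assumes versions are plain three-part numbers.

```python
import re

# Assumption: versions are plain "x.y.z" numbers, matching the vx.y.z tag form
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def tag_for_version(version: str) -> str:
    """Return the git tag name (vx.y.z) for a version string."""
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"expected a version like 1.2.3, got {version!r}")
    return f"v{version}"
```

The returned string is what would be passed to git tag and git push origin in the manual steps.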

Fixing a failed release

If the GitHub Actions release workflow fails with an error that needs to be fixed, you'll have to delete both the tag and the corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with

git tag -l | xargs git tag -d && git fetch -t

This command deletes all local tags and then re-fetches the tags that still exist on the remote. Then repeat the steps above.