17 Commits
main ... v0.1.3

Author SHA1 Message Date
msramalho
e048a63bb8 missing comma 2022-07-01 14:47:25 +02:00
msramalho
c6e3671a16 fixing kepler json and version 2022-07-01 14:45:51 +02:00
Miguel Ramalho
f287cb8d02 Bump version to v0.1.0 for release 2022-07-01 14:28:32 +02:00
msramalho
6f83246478 renaming to geoclustering due to pypi 2022-07-01 14:24:22 +02:00
msramalho
b02139c50f test:release 2022-07-01 14:13:03 +02:00
msramalho
0c789c3335 cleanup 2022-07-01 13:32:41 +02:00
msramalho
55cdec2fc8 delete lint 2022-07-01 13:23:32 +02:00
msramalho
aa228bcde2 simplify 2022-07-01 13:22:39 +02:00
msramalho
fa4983aea6 main.yml fix 2022-07-01 13:13:22 +02:00
msramalho
2596b3d87c trigger ga 2022-07-01 13:12:40 +02:00
Miguel Sozinho Ramalho
c91b0cd94d Merge branch 'main' into feat-pypi-workflow 2022-07-01 12:12:07 +01:00
msramalho
e6f56d6c62 updates 2022-07-01 13:10:42 +02:00
msramalho
4c46ff44a8 vresion fix 2022-07-01 13:04:57 +02:00
msramalho
2e63491f72 steps 2022-07-01 13:03:12 +02:00
msramalho
03e132ff03 build dist 2022-07-01 13:02:08 +02:00
msramalho
3b47f2343d on push 2022-07-01 12:48:19 +02:00
msramalho
6eb9007ece testing workflow without 2022-07-01 12:46:49 +02:00
27 changed files with 778 additions and 2315 deletions

View File

@@ -30,7 +30,8 @@ runs:
id: virtualenv-cache
with:
path: .venv
key: ${{ inputs.cache-prefix }}-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('Pipfile.lock') }}
key: ${{ inputs.cache-prefix }}-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('requirements.txt', 'dev-requirements.txt') }}
- if: steps.virtualenv-cache.outputs.cache-hit != 'true'
shell: bash
run: |

View File

@@ -36,19 +36,13 @@ jobs:
run: |
python setup.py check
python setup.py bdist_wheel sdist
- python: "3.10"
task:
name: "Lint"
name: "Style"
run: |
black --check .
- python: "3.10"
task:
name: "Test"
run: pytest --exitfirst --failed-first --assert=plain
- python: "3.8"
task:
name: "Test (3.8)"
run: pytest --exitfirst --failed-first --assert=plain
steps:
- uses: actions/checkout@v3
@@ -65,7 +59,7 @@ jobs:
- name: Upload package distribution files
if: matrix.task.name == 'Build'
uses: actions/upload-artifact@v4
uses: actions/upload-artifact@v3
with:
name: package
path: dist
@@ -99,7 +93,7 @@ jobs:
echo "TAG=${GITHUB_REF#refs/tags/}" >> $GITHUB_ENV
- name: Download package distribution files
uses: actions/download-artifact@v4
uses: actions/download-artifact@v3
with:
name: package
path: dist

View File

@@ -1,10 +0,0 @@
repos:
- repo: https://github.com/psf/black
rev: 22.3.0
hooks:
- id: black
# It is recommended to specify the latest version of Python
# supported by your project here, or alternatively use
# pre-commit's default_language_version, see
# https://pre-commit.com/#top_level-default_language_version
language_version: python3.9

View File

@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2022, Stichting Bellingcat
Copyright (c) 2022, Felix Spöttel
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

View File

@@ -13,10 +13,7 @@ scikit-learn = "*"
[dev-packages]
black = "*"
pre-commit = "*"
pytest = "*"
wheel = "*"
geoclustering = {editable = true, path = "."}
[requires]
python_version = "3.9"

2654
Pipfile.lock generated

File diff suppressed because it is too large

View File

@@ -10,39 +10,35 @@
### Clustering Method
A cluster is created when a certain number of points (defined with `--size`) each are within a given distance (defined with `--distance`) of at least one other point in the cluster.
A cluster is created when a certain number of points (=> `--size`) each are within a given distance (=> `--distance`) of at least one other point in the cluster.
## Install
Install with pip:
Clone the repository:
```sh
# with kepler.gl visualization support
pip install geoclustering[full]
# only text-based output
pip install geoclustering
git clone https://github.com/bellingcat/geoclustering
cd geoclustering
```
If the `full` install fails, you might need to install kepler.gl build dependencies:
Install keplergl build dependencies:
```sh
# macos
brew install proj gdal
```
Install project with pip:
```sh
pip install .
```
## Usage
```
Usage: geoclustering [OPTIONS] FILENAME
Tool to cluster geolocations. A cluster is created when a certain number of
points (defined with --size) each are within a given distance (defined with
--distance) of at least one other point in the cluster. Input is supplied as
a csv file. At a minimum, each row needs to have a 'lat' and a 'lon' column.
Other rows are reflected to the output.
Options:
-d, --distance FLOAT (in km) Max. distance between two points in
a cluster. [required]
@@ -54,15 +50,12 @@ Options:
Clustering algorithm to be used. `optics`
produces tighter clusters but is slower.
Default: dbscan
--open Open the generated visualization in the
default browser automatically.
--debug Print debug output.
--help Show this message and exit.
```
## Input
Inputs are supplied as a `.csv` file. At a minimum, each row needs to have a `lat` and a `lon` column. Other rows are reflected to the output.
Inputs are supplied as a `.csv` file. The only required fields are `lat` and `lon`, all other fields are reflected to the output.
```csv
id,name,lat,lon
@@ -72,7 +65,7 @@ id,name,lat,lon
## Output
If at least one cluster was found, the tool outputs a folder with output as `json`, `geojson`, `txt`, `csv` files. A kepler.gl `html` file is generated as well.
If at least one cluster was found, the tool outputs a folder with `json`, `geojson`, `text` and a kepler.gl `html` files.
### JSON
@@ -121,7 +114,7 @@ Encodes a single `FeatureCollection`, containing all points as `Feature` objects
}
```
### Text
### txt
Encodes cluster as blocks separated by a newline, where each line in a cluster block contains one point.
@@ -132,39 +125,6 @@ id 9, name Rosanna Foggo, lat -6.2074293, lon 106.8915948
// ...
```
### CSV
Encodes each event in one line with `cluster_id` information associated.
```csv
cluster_id,name,lat,lon
9,Rosanna Foggo,-6.2074293,106.8915948
...
```
### kepler.gl
![kepler.gl instance](https://user-images.githubusercontent.com/1682504/176478177-c0446b51-4060-495c-803d-79e2bbd3e966.png)
## Develop
It is assumed that you are using **Python3.9+**. It is encouraged to [setup a virtualenv](https://wiki.archlinux.org/title/Python/Virtual_environment#venv>) for development.
```sh
# install dependencies & dev-dependencies
# PIP
pip install -e .[dev,full]
# PIPENV
pipenv install --dev -e .
# install a git hook that runs the code formatter before each commit.
pre-commit install
```
We use [Black](https://github.com/psf/black) as our code formatter. If you don't want to use the `pre-commit` hook, you can run the formatter manually or via an editor plugin.
## Release
1. Update [version.py](geoclustering/version.py)
2. Run `scripts/release.sh`
3. Confirm GH action completed successfully

View File

@@ -1,4 +1,3 @@
from pathlib import Path
import click
import webbrowser
@@ -7,9 +6,7 @@ import geoclustering.encoding as encoding
import geoclustering.io as io
@click.command(
help="Tool to cluster geolocations. A cluster is created when a certain number of points (defined with --size) each are within a given distance (defined with --distance) of at least one other point in the cluster. Input is supplied as a csv file. At a minimum, each row needs to have a 'lat' and a 'lon' column. Other rows are reflected to the output."
)
@click.command()
@click.option(
"--distance",
"-d",
@@ -41,55 +38,26 @@ import geoclustering.io as io
default="dbscan",
help="Clustering algorithm to be used. `optics` produces tighter clusters but is slower. Default: dbscan",
)
@click.option(
"--open",
"_open",
is_flag=True,
help="Open the generated visualization in the default browser automatically.",
)
@click.option("--debug", is_flag=True, help="Print debug output.")
@click.argument("filename", type=click.Path(exists=True))
def main(distance, size, output, filename, algorithm, _open, debug):
def print_debug(s):
if debug:
click.secho(s, fg="bright_black")
def main(distance, size, output, filename, algorithm):
df = io.read_csv_file(filename)
print_debug(f"Read {len(df)} valid coordinates from {Path(filename).absolute()}")
clusters = clustering.cluster_locations(
df=df, algorithm=algorithm, radius_km=distance, min_cluster_size=size
)
if not bool(clusters):
click.secho("Did not find clusters matching input parameters.", fg="yellow")
click.echo("Did not find clusters matching input parameters.")
return
print_debug(f"Found {len(clusters)} valid clusters using {algorithm}")
encoded = encoding.encode_clusters(clusters)
io.write_output_file(output, "result.txt", encoded["string"])
io.write_output_file(output, "result.json", encoded["json"])
io.write_output_file(output, "result.geojson", encoded["geojson"])
io.write_output_file(output, "result.csv", encoded["csv"])
vis = io.write_visualization(output, "result.html", encoded["geojson"])
if vis is None:
print_debug("Skipped generating visualization: kepler is not installed.")
click.echo(f"Output files saved to {Path(output).absolute()}")
if _open:
if vis:
webbrowser.open_new_tab("file://" + str(vis.absolute()))
print_debug("Opened visualization in default browser.")
else:
click.secho(
"Can't open kepler.gl: package not installed. Please re-install geoclustering with `pip install geoclustering[full]`.",
fg="yellow",
)
click.secho("Clustering completed.", fg="green")
webbrowser.open_new_tab("file://" + str(vis.absolute()))
if __name__ == "__main__":

View File

@@ -14,6 +14,8 @@ def to_cluster_dict(df, clustering):
"""
clusters_by_id = {}
print(clustering.labels_)
for idx, cluster_id in enumerate(clustering.labels_):
# ignore "noise" locations that don't belong to any cluster.
if cluster_id > -1:

View File

@@ -1,8 +1,6 @@
import json
import numpy as np
import geojson
import csv
import io # not io.py
class NpEncoder(json.JSONEncoder):
@@ -49,7 +47,7 @@ class JSONEncoder:
for record in cluster:
cluster_data["points"].append(record)
self.state.append(cluster_data)
self.state.append(cluster_data)
def get(self):
return json.dumps(self.state, cls=NpEncoder)
@@ -76,37 +74,13 @@ class GeoJSONEncoder:
return json.dumps(geojson.FeatureCollection(self.state), cls=NpEncoder)
class CSVEncoder:
"""Encodes clustering result as a CSV"""
def __init__(self):
self.state = io.StringIO()
self.writer = False
def visitor(self, cluster_id, cluster):
if not self.writer:
self.writer = csv.DictWriter(
self.state,
fieldnames=["cluster_id"] + list(cluster[0].keys()),
quoting=csv.QUOTE_NONNUMERIC,
lineterminator="\n",
)
self.writer.writeheader()
for record in cluster:
self.writer.writerow({**record, "cluster_id": cluster_id})
def get(self):
return self.state.getvalue()
def encode_clusters(clusters):
json_encoder = JSONEncoder()
geojson_encoder = GeoJSONEncoder()
string_encoder = StringEncoder()
csv_encoder = CSVEncoder()
encoders = [json_encoder, geojson_encoder, string_encoder, csv_encoder]
encoders = [json_encoder, geojson_encoder, string_encoder]
for cluster_id, cluster in clusters.items():
for encoder in encoders:
encoder.visitor(cluster_id, cluster)
@@ -115,5 +89,4 @@ def encode_clusters(clusters):
"json": json_encoder.get(),
"geojson": geojson_encoder.get(),
"string": string_encoder.get(),
"csv": csv_encoder.get(),
}
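
For reference, the CSV output dropped in this comparison can be reproduced with the standard library alone. The sketch below is a minimal, standalone approximation of the removed `CSVEncoder`. The function name `clusters_to_csv` is hypothetical and not part of the package, and it assumes each cluster is a non-empty list of dicts that share the same keys.

```python
import csv
import io


def clusters_to_csv(clusters):
    """Serialize {cluster_id: [record, ...]} to CSV text, one point per line."""
    buffer = io.StringIO()
    writer = None
    for cluster_id, cluster in clusters.items():
        if writer is None:
            # Column order comes from the first record, with cluster_id first.
            writer = csv.DictWriter(
                buffer,
                fieldnames=["cluster_id"] + list(cluster[0].keys()),
                quoting=csv.QUOTE_NONNUMERIC,  # quote strings, leave numbers bare
                lineterminator="\n",
            )
            writer.writeheader()
        for record in cluster:
            writer.writerow({**record, "cluster_id": cluster_id})
    return buffer.getvalue()
```

Creating the writer lazily, as the removed class did, lets the header follow whatever extra columns the input CSV carried.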

View File

@@ -1,30 +1,10 @@
from keplergl import KeplerGl
from pathlib import Path
from pkg_resources import resource_filename
import json
import json
import pandas as pd
import numpy as np
import os
import sys
# kepler is optional, check if installed.
try:
from keplergl import KeplerGl
except:
has_kepler = False
else:
has_kepler = True
class HiddenPrints:
"""Disables stdout prints for a block of code."""
def __enter__(self):
self._original_stdout = sys.stdout
sys.stdout = open(os.devnull, "w")
def __exit__(self, exc_type, exc_val, exc_tb):
sys.stdout.close()
sys.stdout = self._original_stdout
def is_valid_lat(val: str) -> bool:
@@ -45,35 +25,23 @@ def is_valid_lon(val: str) -> bool:
return False
def is_not_none(val: any) -> bool:
return val is not None
def read_csv_file(filename):
"""Read input csv file, dropping rows that don't have valid location data."""
# replace NaN for all fields not to break kepler parsing.
df = pd.read_csv(filename).replace({np.nan: None})
df = pd.read_csv(filename)
initial_rows = len(df)
# construct an index of values with valid lat & lon.
valid_index = df.lat.apply(is_valid_lat) & df.lon.apply(is_valid_lon)
df_invalid = df[~valid_index]
count_invalid = len(df_invalid)
if count_invalid:
df_not_empty = df_invalid[
(df_invalid.lat.apply(is_not_none) | df_invalid.lon.apply(is_not_none))
]
count_not_empty = len(df_not_empty)
count_empty = count_invalid - count_not_empty
if count_empty:
print(f"Removed {count_empty} empty coordinate pairs.")
if count_not_empty:
print(f"Removed {count_not_empty} invalid coordinate pairs:")
print(df_not_empty[["lat", "lon"]].to_string())
df = df.dropna(subset=["lat", "lon"])
df = df.replace(
{np.nan: None}
) # replace for other fields not to break kepler parsing
print(f"Ignored {initial_rows - len(df)} coordinates with NaN")
valid_index = df.lat.astype(str).apply(is_valid_lat) & df.lon.astype(str).apply(
is_valid_lon
)
if len(df_invalid := df[~valid_index]):
print(f"Found {len(df_invalid)} invalid coordinate pairs, ignoring:")
print(df_invalid[["lat", "lon"]].to_string())
return df[valid_index]
@@ -96,14 +64,7 @@ def write_output_file(dirname, filename, data):
def write_visualization(dirname, filename, data):
"""Write a visualization, ensuring parent directories."""
if not has_kepler:
return None
# Hide kepler stdout output.
with HiddenPrints():
map = KeplerGl()
map = KeplerGl()
map.add_data(data=data, name="clusters")
# config configures a default color scheme for our clusters layer.
@@ -112,9 +73,6 @@ def write_visualization(dirname, filename, data):
map.config = json.loads(f.read())
filepath = ensure_file_path(dirname, filename)
# Hide kepler stdout output.
with HiddenPrints():
map.save_to_html(file_name=str(filepath), center_map=True)
map.save_to_html(file_name=str(filepath), center_map=True)
return filepath
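
The row filtering that `read_csv_file` performs can be illustrated without pandas. The bodies of `is_valid_lat` and `is_valid_lon` are not shown in this diff, so the bounds below (latitude within ±90, longitude within ±180) are an assumption, and the helper names `is_valid_coordinate` and `filter_valid_rows` are hypothetical. This is a rough sketch of the idea, not the package's implementation.

```python
import csv
import io
import math


def is_valid_coordinate(val, limit):
    """Return True if val parses as a finite float within [-limit, limit]."""
    try:
        number = float(val)
    except (TypeError, ValueError):
        return False
    return math.isfinite(number) and -limit <= number <= limit


def filter_valid_rows(csv_text):
    """Keep only rows whose 'lat' and 'lon' columns hold usable coordinates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if is_valid_coordinate(row["lat"], 90)
        and is_valid_coordinate(row["lon"], 180)
    ]
```

Under these assumptions, empty cells, out-of-range values, and non-numeric strings such as `foo` are all rejected by the same check.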

View File

@@ -9,7 +9,7 @@
"config": {
"dataId": "clusters",
"label": "clusters",
"color": [248, 149, 112],
"color": [179, 173, 158],
"highlightColor": [252, 242, 26, 255],
"columns": { "geojson": "_geojson" },
"isVisible": true,
@@ -19,30 +19,16 @@
"thickness": 0.5,
"strokeColor": null,
"colorRange": {
"name": "Uber Viz Qualitative 4",
"type": "qualitative",
"name": "Global Warming",
"type": "sequential",
"category": "Uber",
"colors": [
"#12939A",
"#DDB27C",
"#88572C",
"#FF991F",
"#F15C17",
"#223F9A",
"#DA70BF",
"#125C77",
"#4DC19C",
"#776E57",
"#17B8BE",
"#F6D18A",
"#B7885E",
"#FFCB99",
"#F89570",
"#829AE3",
"#E79FD5",
"#1E96BE",
"#89DAC1",
"#B3AD9E"
"#5A1846",
"#900C3F",
"#C70039",
"#E3611C",
"#F1920E",
"#FFC300"
]
},
"strokeColorRange": {

View File

@@ -1,8 +1,8 @@
_MAJOR = "0"
_MINOR = "4"
_MINOR = "1"
# On main and in a nightly release the patch should be one ahead of the last
# released build.
_PATCH = "1"
_PATCH = "3"
# This is mainly for nightly builds which have the suffix ".dev$DATE". See
# https://semver.org/#is-v123-a-semantic-version for the semantics.
_SUFFIX = ""
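
The hunk shows only the four components; how they are combined lies outside it. A minimal sketch of one common way to assemble them, using the v0.1.3 values from this comparison. The names `VERSION_SHORT` and `VERSION` are assumptions, not taken from the diff.

```python
_MAJOR = "0"
_MINOR = "1"
_PATCH = "3"
_SUFFIX = ""

# Assumed names: the short form omits the patch level,
# the full form appends any nightly suffix such as ".dev20220701".
VERSION_SHORT = f"{_MAJOR}.{_MINOR}"
VERSION = f"{_MAJOR}.{_MINOR}.{_PATCH}{_SUFFIX}"
```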

View File

@@ -1,3 +0,0 @@
[pytest]
testpaths = tests/
python_files = *.py

0
scripts/release.sh Executable file → Normal file
View File

View File

@@ -28,14 +28,12 @@ setup(
install_requires=[
"click",
"geojson",
"keplergl",
"numpy",
"pandas",
"scikit-learn",
],
extras_require={
"dev": ["black", "wheel", "pre-commit", "pytest"],
"full": ["keplergl"],
},
extras_require={"dev": ["black", "wheel"]},
include_package_data=True,
zip_safe=False,
)

View File

View File

@@ -1,41 +0,0 @@
from geoclustering.clustering import cluster_locations
from tests.helpers import read_fixture_csv
df = read_fixture_csv("clustering.csv")
def has_member(list, name):
return any(x for x in list if x["name"] == name)
def test_clustering_all():
# there should be one cluster with all members but Erin.
res = cluster_locations(
df=df, algorithm="dbscan", radius_km=1.97, min_cluster_size=4
)
assert len(res.values()) == 1
assert len(res[0]) == 4
def test_clustering_split():
res = cluster_locations(
df=df, algorithm="dbscan", radius_km=0.5, min_cluster_size=2
)
# there should be two cluster: Alice & Bob and Carol & Dan
assert len(res.values()) == 2
cluster_one = res[0]
cluster_two = res[1]
assert len(cluster_one) == 2
assert has_member(cluster_one, "Alice")
assert has_member(cluster_one, "Bob")
assert has_member(cluster_two, "Carol")
assert has_member(cluster_two, "Dan")
def test_clustering_none():
# there should be no clusters now.
res = cluster_locations(
df=df, algorithm="dbscan", radius_km=0.5, min_cluster_size=3
)
assert len(res.values()) == 0
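
The expectations in these removed tests can be checked against a literal reading of the README's clustering rule: points belong together when each is within the given distance of at least one other member, and a group counts as a cluster once it reaches the minimum size. The sketch below implements that reading with the haversine distance and a small union-find, using only the standard library. It is not the package's DBSCAN/OPTICS code and may differ from it on other data. The names `haversine_km` and `cluster_points` are hypothetical, and the Earth radius of 6371 km is an assumption.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cluster_points(points, radius_km, min_cluster_size):
    """Group (name, lat, lon) tuples linked by hops <= radius_km; drop small groups."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Link every pair of points that lies within the radius.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i][1:], points[j][1:]) <= radius_km:
                parent[find(i)] = find(j)

    # Collect names per connected group and keep groups that are large enough.
    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point[0])
    return [sorted(g) for g in groups.values() if len(g) >= min_cluster_size]
```

The pairwise loop is quadratic in the number of points, which is fine for a five-row fixture but is one reason a real implementation would rely on an indexed algorithm instead.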

View File

@@ -1,30 +0,0 @@
from geoclustering.encoding import encode_clusters
from tests.helpers import read_fixture_csv, read_fixture_content
df = read_fixture_csv("clustering.csv")
def test_encoders():
clusters = {
0: [
{"id": 1, "name": "Alice", "lat": 52.523955, "lon": 13.442362},
{"id": 2, "name": "Bob", "lat": 52.526659, "lon": 13.448097},
],
1: [
{"id": 3, "name": "Carol", "lat": 52.525626, "lon": 13.419246},
{
"id": 4,
"name": "Dan",
"lat": 52.52443559865125,
"lon": 13.41261723049818,
},
],
}
res = encode_clusters(clusters)
assert res["string"] == read_fixture_content("snapshots/result.txt")
assert res["json"] == read_fixture_content("snapshots/result.json")
assert res["geojson"] == read_fixture_content("snapshots/result.geojson")
assert res["csv"] == read_fixture_content("snapshots/result.csv")

View File

@@ -1,6 +0,0 @@
id,name,lat,lon
1,Alice,52.523955,13.442362
2,Bob,52.526659,13.448097
3,Carol,52.525626,13.419246
4,Dan,52.52443559865125,13.41261723049818
5,Erin,52.524838991760774,13.383188597040382

View File

@@ -1,9 +0,0 @@
id,name,lat,lon
1,Alice,,
2,,52.523955,13.442362
,,-90.12,132.23
4,,78.234,-180.1212
5,Bob,52.524838991760774,13.383188597040382
6,Peter,91.234,
7,Horst,,23.23
7,Erin,foo,bar

View File

@@ -1,5 +0,0 @@
"cluster_id","id","name","lat","lon"
0,1,"Alice",52.523955,13.442362
0,2,"Bob",52.526659,13.448097
1,3,"Carol",52.525626,13.419246
1,4,"Dan",52.52443559865125,13.41261723049818

View File

@@ -1 +0,0 @@
{"type": "FeatureCollection", "features": [{"type": "Feature", "geometry": {"type": "Point", "coordinates": [13.442362, 52.523955]}, "properties": {"id": 1, "name": "Alice", "cluster_id": 0}}, {"type": "Feature", "geometry": {"type": "Point", "coordinates": [13.448097, 52.526659]}, "properties": {"id": 2, "name": "Bob", "cluster_id": 0}}, {"type": "Feature", "geometry": {"type": "Point", "coordinates": [13.419246, 52.525626]}, "properties": {"id": 3, "name": "Carol", "cluster_id": 1}}, {"type": "Feature", "geometry": {"type": "Point", "coordinates": [13.412617, 52.524436]}, "properties": {"id": 4, "name": "Dan", "cluster_id": 1}}]}

View File

@@ -1 +0,0 @@
[{"cluster_id": 0, "points": [{"id": 1, "name": "Alice", "lat": 52.523955, "lon": 13.442362}, {"id": 2, "name": "Bob", "lat": 52.526659, "lon": 13.448097}]}, {"cluster_id": 1, "points": [{"id": 3, "name": "Carol", "lat": 52.525626, "lon": 13.419246}, {"id": 4, "name": "Dan", "lat": 52.52443559865125, "lon": 13.41261723049818}]}]

View File

@@ -1,7 +0,0 @@
Cluster 0
id 1, name Alice, lat 52.523955, lon 13.442362
id 2, name Bob, lat 52.526659, lon 13.448097
Cluster 1
id 3, name Carol, lat 52.525626, lon 13.419246
id 4, name Dan, lat 52.52443559865125, lon 13.41261723049818

View File

@@ -1,16 +0,0 @@
import os
from geoclustering.io import read_csv_file
def get_fixture_path(filename):
dir_path = os.path.dirname(os.path.realpath(__file__))
return os.path.join(dir_path, "fixtures", filename)
def read_fixture_csv(filename):
return read_csv_file(get_fixture_path(filename))
def read_fixture_content(filename):
with open(get_fixture_path(filename)) as f:
return f.read()

View File

@@ -1,25 +0,0 @@
from pathlib import Path
import shutil
from geoclustering.io import write_output_file
from tests.helpers import read_fixture_csv
def test_csv_filters():
df = read_fixture_csv("io.csv")
# entries 2 & 5 in fixture are valid.
assert len(df) == 2
assert df.iloc[0]["name"] == None
assert df.iloc[1]["name"] == "Bob"
def test_write_output_file():
p = "./this/dir/does/not/exist"
f = "test.txt"
write_output_file(p, f, "test")
path = Path(p) / f
with open(path) as f:
assert f.read() == "test"
shutil.rmtree(Path("./this"))