mirror of
https://github.com/bellingcat/geoclustering.git
synced 2026-06-07 19:18:30 +03:00
# geoclustering

> 📍 command-line tool for clustering geolocations.

### Features

- Uses [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or [OPTICS](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) to perform clustering.
- Outputs clustering results as `json`, `geojson`, `txt` and `csv`.
- Creates a [kepler.gl](https://kepler.gl) visualization of clusters.

### Clustering Method

A cluster is created when a minimum number of points (set with `--size`) are each within a given distance (set with `--distance`, in km) of at least one other point in the cluster.

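
The rule above can be illustrated with a small, dependency-free sketch. This is not the tool's implementation (which relies on scikit-learn's DBSCAN or OPTICS); it is a simplified single-linkage grouping that applies the same "within a given distance of at least one other point" idea. The function names `haversine_km` and `cluster_points` are hypothetical, and the use of great-circle (haversine) distance is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_points(points, distance_km, min_size):
    """Group points linked by chains of neighbours within distance_km.

    Returns a list of clusters (lists of point indices), keeping only
    groups with at least min_size members.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive O(n^2) pass: merge the groups of every pair within the threshold.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= distance_km:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) >= min_size]


# Three nearby points in Jakarta and one far away (illustrative coordinates).
sample_points = [
    (-6.2074, 106.8916),
    (-6.2080, 106.8920),
    (-6.2090, 106.8930),
    (40.1324, 64.4911),
]
sample_clusters = cluster_points(sample_points, distance_km=0.2, min_size=3)
```

Note that the first and third sample points are more than 0.2 km apart, yet they end up in the same group because each is close to the second point. DBSCAN additionally distinguishes core points from border points, so its results can differ from this sketch.
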
## Install

Install with pip:

```sh
# with kepler.gl visualization support
# (quoted so that shells such as zsh do not expand the brackets)
pip install "geoclustering[full]"

# only text-based output
pip install geoclustering
```

If the `full` install fails, you might need to install kepler.gl build dependencies:

```sh
# macOS
brew install proj gdal
```

## Usage

```
Usage: geoclustering [OPTIONS] FILENAME

  Tool to cluster geolocations. A cluster is created when a certain number of
  points (defined with --size) each are within a given distance (defined with
  --distance) of at least one other point in the cluster. Input is supplied as
  a csv file. At a minimum, each row needs to have a 'lat' and a 'lon' column.
  Other rows are reflected to the output.

Options:
  -d, --distance FLOAT            (in km) Max. distance between two points in
                                  a cluster. [required]
  -s, --size INTEGER              Min. number of points in a cluster.
                                  [required]
  -o, --output PATH               Output directory for results. Default:
                                  ./output
  -a, --algorithm [dbscan|optics]
                                  Clustering algorithm to be used. `optics`
                                  produces tighter clusters but is slower.
                                  Default: dbscan
  --open                          Open the generated visualization in the
                                  default browser automatically.
  --debug                         Print debug output.
  --help                          Show this message and exit.
```

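
When the tool is driven from a script, the options listed above can be assembled into an argument list. The following is a minimal sketch: `build_command` is a hypothetical helper, `locations.csv` is a placeholder file name, and the distance and size values are arbitrary. Only flags shown in the help text are used.

```python
import shlex


def build_command(filename, distance_km, size, output="./output",
                  algorithm="dbscan", open_browser=False):
    """Assemble the argument list for a geoclustering run."""
    if algorithm not in ("dbscan", "optics"):
        raise ValueError(f"unsupported algorithm: {algorithm}")
    argv = [
        "geoclustering",
        "--distance", str(distance_km),  # max. distance in km
        "--size", str(size),             # min. number of points per cluster
        "--output", output,
        "--algorithm", algorithm,
    ]
    if open_browser:
        argv.append("--open")
    argv.append(filename)
    return argv


# "locations.csv" is a placeholder; replace it with your own input file.
command = build_command("locations.csv", distance_km=0.5, size=5)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the package is installed.
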
## Input

Input is supplied as a `.csv` file. At a minimum, each row needs a `lat` and a `lon` column. Other columns are carried over to the output.

```csv
id,name,lat,lon
1,Bonnibelle Mathwen,40.1324085,64.4911086
...
```

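
To check a file against these requirements before running the tool, a short validation step can help. This is a sketch using only the standard library; `read_points` is a hypothetical helper, not part of geoclustering.

```python
import csv
import io


def read_points(csv_file):
    """Parse rows from an open CSV file, converting lat/lon to floats."""
    reader = csv.DictReader(csv_file)
    missing = {"lat", "lon"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    rows = []
    for row in reader:
        # float() raises ValueError on malformed coordinates.
        row["lat"] = float(row["lat"])
        row["lon"] = float(row["lon"])
        rows.append(row)
    return rows


# In-memory stand-in for a real input file.
sample = io.StringIO(
    "id,name,lat,lon\n"
    "1,Bonnibelle Mathwen,40.1324085,64.4911086\n"
)
points = read_points(sample)
```

Columns other than `lat` and `lon` are kept as strings, mirroring the way extra columns pass through to the output.
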
## Output

If at least one cluster is found, the tool writes a folder containing the results as `json`, `geojson`, `txt` and `csv` files. A kepler.gl `html` file is generated as well.

### JSON

Encodes an array of clusters, each containing an array of points.

```json
[
  {
    "cluster_id": 0,
    "points": [
      {
        "id": 9,
        "name": "Rosanna Foggo",
        "lat": -6.2074293,
        "lon": 106.8915948
      }
    ]
  }
]
```

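
A structure like this is straightforward to post-process. The sketch below parses the example above from an inline string (the name of the generated file is not assumed here) and derives per-cluster sizes and a rough centre. `cluster_sizes` and `centroid` are hypothetical helpers.

```python
import json

# The JSON example from above, inlined for illustration.
clusters_json = """
[
  {
    "cluster_id": 0,
    "points": [
      {"id": 9, "name": "Rosanna Foggo", "lat": -6.2074293, "lon": 106.8915948}
    ]
  }
]
"""


def cluster_sizes(clusters):
    """Map each cluster_id to the number of points it contains."""
    return {c["cluster_id"]: len(c["points"]) for c in clusters}


def centroid(points):
    """Arithmetic mean of lat/lon; adequate for small, compact clusters only."""
    n = len(points)
    return (sum(p["lat"] for p in points) / n,
            sum(p["lon"] for p in points) / n)


clusters = json.loads(clusters_json)
sizes = cluster_sizes(clusters)
centre = centroid(clusters[0]["points"])
```

The plain average ignores the curvature of the Earth and longitude wrap-around at ±180°, so treat it as an approximation.
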
### GeoJSON

Encodes a single `FeatureCollection`, containing all points as `Feature` objects.

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.891595,
          -6.207429
        ]
      },
      "properties": {
        "id": 9,
        "name": "Rosanna Foggo",
        "cluster_id": 0
      }
    }
  ]
}
```

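
The relationship between the JSON and GeoJSON outputs can be sketched as a small conversion. Note that GeoJSON orders coordinates as longitude first, then latitude. The helpers `to_feature` and `to_feature_collection` are hypothetical, and rounding to six decimals is an assumption inferred from the example above, not confirmed behaviour of the tool.

```python
def to_feature(point, cluster_id, ndigits=6):
    """Build a GeoJSON Feature from a point dict with 'lat' and 'lon' keys."""
    # Everything except the coordinates becomes a property.
    properties = {k: v for k, v in point.items() if k not in ("lat", "lon")}
    properties["cluster_id"] = cluster_id
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude].
            "coordinates": [round(point["lon"], ndigits),
                            round(point["lat"], ndigits)],
        },
        "properties": properties,
    }


def to_feature_collection(clusters):
    """Flatten a list of clusters into a single FeatureCollection."""
    features = [
        to_feature(point, cluster["cluster_id"])
        for cluster in clusters
        for point in cluster["points"]
    ]
    return {"type": "FeatureCollection", "features": features}


example_clusters = [
    {
        "cluster_id": 0,
        "points": [
            {"id": 9, "name": "Rosanna Foggo",
             "lat": -6.2074293, "lon": 106.8915948},
        ],
    }
]
collection = to_feature_collection(example_clusters)
```
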
### Text

Encodes clusters as blocks separated by a blank line, where each line in a cluster block contains one point.

```txt
Cluster 0
id 9, name Rosanna Foggo, lat -6.2074293, lon 106.8915948

// ...
```

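
As an illustration of this layout, the following sketch renders a list of clusters in the same shape. `format_clusters` is a hypothetical helper that reproduces the format shown above; it is not taken from the tool's source.

```python
def format_clusters(clusters):
    """Render clusters as text blocks separated by a blank line."""
    blocks = []
    for cluster in clusters:
        lines = [f"Cluster {cluster['cluster_id']}"]
        for point in cluster["points"]:
            # One point per line, as "key value" pairs joined by commas.
            lines.append(", ".join(f"{key} {value}" for key, value in point.items()))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


text = format_clusters([
    {
        "cluster_id": 0,
        "points": [
            {"id": 9, "name": "Rosanna Foggo",
             "lat": -6.2074293, "lon": 106.8915948},
        ],
    }
])
```
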
### CSV

Encodes each point on one line, together with its `cluster_id`.

```csv
cluster_id,name,lat,lon
9,Rosanna Foggo,-6.2074293,106.8915948
...
```

### kepler.gl

## Develop

It is assumed that you are using **Python 3.9+**. We encourage you to [set up a virtualenv](https://wiki.archlinux.org/title/Python/Virtual_environment#venv) for development.

```sh
# install dependencies & dev-dependencies
# PIP (quoted so that shells such as zsh do not expand the brackets)
pip install -e ".[dev,full]"
# PIPENV
pipenv install --dev -e .

# install a git hook that runs the code formatter before each commit.
pre-commit install
```

We use [Black](https://github.com/psf/black) as our code formatter. If you don't want to use the `pre-commit` hook, you can run the formatter manually or via an editor plugin.

## Release

1. Update [version.py](geoclustering/version.py).
2. Run `scripts/release.sh`.
3. Confirm that the GitHub Action completed successfully.