geoclustering/README.md

# geoclustering

> 📍 command-line tool for clustering geolocations.

### Features

 - Uses [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) or [OPTICS](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) to perform clustering.
 - Outputs clustering results as `json`, `txt` and `geojson`.
 - Creates a [kepler.gl](https://kepler.gl) visualization of clusters.

### Clustering Method

A cluster is created when a certain number of points (defined with `--size`) each are within a given distance (defined with `--distance`) of at least one other point in the cluster.


## Install

Install with pip:

```sh
# with kepler.gl visualization support
pip install geoclustering[full]

# only text-based output
pip install geoclustering
```

If the `full` install fails, you might need to install kepler.gl build dependencies:

```sh
# macos
brew install proj gdal
```

## Usage

```
Usage: geoclustering [OPTIONS] FILENAME

  Tool to cluster geolocations. A cluster is created when a certain number of
  points (defined with --size) each are within a given distance (defined with
  --distance) of at least one other point in the cluster. Input is supplied as
  a csv file. At a minimum, each row needs to have a 'lat' and a 'lon' column.
  Other rows are reflected to the output.

Options:
  -d, --distance FLOAT            (in km) Max. distance between two points in
                                  a cluster.  [required]
  -s, --size INTEGER              Min. number of points in a cluster.
                                  [required]
  -o, --output PATH               Output directory for results. Default:
                                  ./output
  -a, --algorithm [dbscan|optics]
                                  Clustering algorithm to be used. `optics`
                                  produces tighter clusters but is slower.
                                  Default: dbscan
  --open                          Open the generated visualization in the
                                  default browser automatically.
  --debug                         Print debug output.
  --help                          Show this message and exit.
```

## Input

Inputs are supplied as a `.csv` file. At a minimum, each row needs to have a `lat` and a `lon`` column. Other rows are reflected to the output.

```csv
id,name,lat,lon
1,Bonnibelle Mathwen,40.1324085,64.4911086
...
```

## Output

If at least one cluster was found, the tool outputs a folder with output as `json`, `geojson`, `txt`, `csv` files. A kepler.gl `html` file is generated as well.

### JSON

Encodes an array of clusters, each containing an array of points.

```json
[
  {
    "cluster_id": 0,
    "points": [
      {
        "id": 9,
        "name": "Rosanna Foggo",
        "lat": -6.2074293,
        "lon": 106.8915948
      }
    ]
  }
]
```

### GeoJSON

Encodes a single `FeatureCollection`, containing all points as `Feature` objects.

```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [
          106.891595,
          -6.207429
        ]
      },
      "properties": {
        "id": 9,
        "name": "Rosanna Foggo",
        "cluster_id": 0
      }
    }
  ]
}
```

### Text

Encodes cluster as blocks separated by a newline, where each line in a cluster block contains one point.

```txt
Cluster 0
id 9, name Rosanna Foggo, lat -6.2074293, lon 106.8915948

// ...
```

### CSV

Encodes each event in one line with `cluster_id` information associated.

```csv
cluster_id,name,lat,lon
9,Rosanna Foggo,-6.2074293,106.8915948
...
```

### kepler.gl

![kepler.gl instance](https://user-images.githubusercontent.com/1682504/176478177-c0446b51-4060-495c-803d-79e2bbd3e966.png)

## Develop

It is assumed that you are using **Python3.9+**. It is encouraged to [setup a virtualenv](https://wiki.archlinux.org/title/Python/Virtual_environment#venv>) for development.

```sh
    # install dependencies & dev-dependencies
    # PIP
    pip install -e .[dev,full]
    # PIPENV
    pipenv install --dev -e .

    # install a git hook that runs the code formatter before each commit.
    pre-commit install
```

We use [Black](https://github.com/psf/black) as our code formatter. If you don't want to use the `pre-commit` hook, you can run the formatter manually or via an editor plugin.

## Release

1. Update [version.py](geoclustering/version.py)
2. Run `scripts/release.sh`
3. Confirm GH action completed successfully