extensive update
172
Quickstart.ipynb
Normal file
@@ -0,0 +1,172 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b6926e35",
|
||||
"metadata": {},
|
||||
"source": [
"*Quickstart hands-on exercise. For an in-depth introduction, check out Tutorial 1.*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f17ebdd2",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sugartrail import mapview, api, base\n",
|
||||
"from ipywidgets import VBox, HBox"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d5f9b6ad",
|
||||
"metadata": {},
|
||||
"source": [
"Insert a valid [Companies House Public Data API key](https://developer.company-information.service.gov.uk/get-started/) as the `username` string value below. If you don't want to use the API and would prefer to load a pre-built network, uncomment and run the cell below, then run the final cell to build and load the map. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "4a9639e6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# # network build from Domain Foundation, company_id = \"11951034\"\n",
|
||||
"# import pickle\n",
|
||||
"\n",
|
||||
"# with open('assets/networks/domain_corp_network.pickle', 'rb') as handle:\n",
|
||||
"# network = pickle.load(handle)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "89b0082a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"api.basic_auth.username = \"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "63220f29",
|
||||
"metadata": {},
|
||||
"source": [
"Enter the company number (as a string) for a company you would like to explore. An example value is provided: "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "8aca6a54",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"company_id = \"11951034\"\n",
|
||||
"network = base.Network(company_id=company_id)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7de31e72",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"5"
|
||||
]
|
||||
},
|
||||
"source": [
"Perform `n` hops (3 or fewer is advised at first to keep the network manageable in size):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d80be86d",
|
||||
"metadata": {
|
||||
"tags": [
|
||||
"6"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"n = 2\n",
|
||||
"network.perform_hop(n)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4481c80d",
|
||||
"metadata": {},
|
||||
"source": [
"Now let's visualise the connections:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "022f026e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.run_map_preprocessing()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "01dca0cf",
|
||||
"metadata": {
|
||||
"scrolled": true,
|
||||
"tags": [
|
||||
"7"
|
||||
]
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"map_data,path_table = mapview.build_map(network) \n",
|
||||
"hbox = HBox([path_table])\n",
|
||||
"vbox = VBox([map_data, hbox])\n",
|
||||
"vbox"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "457bf4d0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Each marker represents a company in the network. Green markers represent active companies based at the address, red markers represent active companies no longer based at the address and black markers represent dissolved companies once based at the address. \n",
|
||||
"\n",
|
||||
"Select a marker to display additional information: \n",
|
||||
"- pop-up with the selected company's name and address\n",
|
||||
"- table containing the most efficient paths from the origin to the selected company\n",
|
||||
"- antpaths for each company in the network. Red antpath represents the path through all the historic addresses for the selected company. Black antpath represents the path from the network origin through all the addresses in the path to the selected company as displayed in the table. "
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.15"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
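The commented-out cell in the Quickstart notebook loads a pre-built network with `pickle`. As a minimal sketch, a save/load pair could look like the following. The helper names `save_network` and `load_network` are hypothetical and not part of Sugartrail; the only addition over the notebook code is that the target folder is created if it is missing.

```python
import pickle
from pathlib import Path


def save_network(network, path):
    """Pickle `network` to `path`, creating parent folders if needed.

    Hypothetical helper for illustration; not part of Sugartrail.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(network, handle)


def load_network(path):
    """Load a previously pickled network from `path`."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

Unpickling can execute arbitrary code, so it is safest to load only network files from a source you trust.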
|
||||
12
README.md
@@ -1,13 +1,8 @@
|
||||
# Sugartrail
|
||||
|
||||
## Tool Description
|
||||

|
||||
|
||||
Sugartrail is a work-in-progress network analysis tool and workflow that helps researchers to use a suspicious officer to discover other suspicious officers, companies and locations through Companies House.
|
||||
|
||||
The workflow is based on the following observations:
|
||||
|
||||
- suspicious directors often have many active companies registered to multiple historic addresses
|
||||
- addresses with many registered companies can contain multiple scam companies
|
||||
Sugartrail is a network analysis and visualisation tool that makes it easier and faster for researchers to explore connections between companies, persons and addresses within Companies House.
|
||||
|
||||
## Requirements
|
||||
|
||||
@@ -28,10 +23,9 @@ git clone https://github.com/ribenamaplesyrup/sugartrail.git
|
||||
```bash
|
||||
conda env create -f environment.yml
|
||||
conda activate candystore
|
||||
jupyter nbextension enable --py --sys-prefix ipyleaflet
|
||||
jupyter notebook
|
||||
```
|
||||
4. Open `Tutorial 1 - Exit Through the Candy Shop`
|
||||
|
||||
## Usage
|
||||
|
||||
- A walkthrough of how to use the tool is included in the linked Jupyter notebook showing how we can get from suspicious Candy Stores of Oxford Street to several prolific scammers.
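
The steps in the Quickstart notebook can be collected into a single function. This is only a sketch: it assumes the `sugartrail` package is installed and that its calls behave as they do in the notebooks, and `explore_company` is a hypothetical name that is not part of the package.

```python
def explore_company(company_id, api_key, hops=2):
    """Build a network around one company and return it with its map widgets.

    Hypothetical wrapper around the calls used in the Quickstart notebook.
    """
    # Imported lazily so the function can be defined without the package installed.
    from sugartrail import api, base, mapview

    api.basic_auth.username = api_key               # Companies House API key
    network = base.Network(company_id=company_id)   # origin node of the network
    network.perform_hop(hops)                       # expand outward by `hops` hops
    network.run_map_preprocessing()                 # fetch coordinates and company names
    map_data, path_table = mapview.build_map(network)
    return network, map_data, path_table
```

In a notebook, calling `explore_company("11951034", api_key="<your key>")` and displaying `VBox([map_data, HBox([path_table])])` from `ipywidgets` would mirror the final Quickstart cell.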
|
||||
|
||||
545
Tutorial 1 - Get Started.ipynb
Normal file
@@ -0,0 +1,545 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0639ca05",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"*In this tutorial we will walk through the capabilities of the tool in depth.*"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "538c9eb1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Introduction \n",
|
||||
"\n",
"'Sugartrail' was developed to make it easier and faster for researchers to explore connections between companies, persons and addresses within [Companies House](https://www.gov.uk/government/organisations/companies-house). Researchers can build networks of connected companies, persons and addresses based on a defined set of connectivity criteria and then visualise these connections through an [OpenStreetMap interface](https://ipyleaflet.readthedocs.io/en/latest/index.html)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eee8d524",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Prerequisites\n",
|
||||
"\n",
"Sugartrail uses the [Companies House Public Data API](https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference) to gather data on connected companies, persons and addresses. To access this API you will need a key, which you can acquire by registering a [user account](https://developer.company-information.service.gov.uk/get-started/). Once you've acquired the key, insert it below as the string value of `api.basic_auth.username`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "81c37bf3",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sugartrail import api, mapview, base\n",
|
||||
"\n",
|
||||
"api.basic_auth.username = \"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ad4599dc",
|
||||
"metadata": {},
|
||||
"source": [
"Let's make a test request to check that everything works by fetching all the officers of [this company](https://find-and-update.company-information.service.gov.uk/company/12411673). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "51a1dd4f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"company_id = \"12411673\"\n",
|
||||
"api.get_company_officers(company_id)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "29d8dd26",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Initialising Networks \n",
|
||||
"\n",
"To create a network we start from a single company, person or address. Networks are built and stored with the `Network` class. Let's go ahead and create a new network:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "63bc00fa",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network = base.Network()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "aeedf139",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"`Network` accepts either a company ID, officer ID or address string as the initial node. For example, [this company](https://find-and-update.company-information.service.gov.uk/company/12411673): `company_id` = \"12411673\"\n",
|
||||
"\n",
|
||||
"If we wanted to search by address, then `address` = \"513 Tong Street, Flat 5, Bradford, England, BD4 6NA\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f73b17d8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b3caccb6",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For [this officer](https://find-and-update.company-information.service.gov.uk/officers/6WODVBRaegvY3UvEhcQxg0OsPkc/appointments), `officer_id` = \"6WODVBRaegvY3UvEhcQxg0OsPkc\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e21f3c98",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a6198a80",
|
||||
"metadata": {},
|
||||
"source": [
"Let's build the network from `company_id`: "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "31eea99d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.company_id=\"11004735\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7bd5060d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We could also just initialise the network by passing `company_id` as an input: "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9c70f41f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network = base.Network(company_id=\"11004735\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cd0f2a9e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Data about companies, persons and addresses are stored in several attributes within the `Network` class. If we check the `company_ids` property, we will find the entry we just created:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e12f5461",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.company_ids"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "91c14cbb",
|
||||
"metadata": {},
|
||||
"source": [
"Each company is represented by its unique ID (`company_id`), number of hops from the origin company (`n`) and the company, address or person it connects to. As we've only saved the origin company so far, there isn't any information on links or connected nodes. There are also attributes for storing officer IDs (`officer_ids`) and addresses (`addresses`), although they have no information in them yet:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "33ed61e2",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.officer_ids"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d5a52e6a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.addresses"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "72f30427",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Building Networks"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "862f00ef",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now build the network by performing hops that will find new company IDs, officer IDs and addresses connected to the entities already stored within the network. \n",
|
||||
"\n",
|
||||
"There are a finite number of ways that officers, companies and addresses can be connected within Companies House:\n",
|
||||
"\n",
|
||||
"#### Companies \n",
|
||||
"\n",
|
||||
"1. Companies → Officers: companies have officers \n",
|
||||
"2. Companies → Addresses: companies have a history of registered addresses \n",
|
||||
"3. Companies → Addresses: companies have correspondence addresses for their persons of significant control (psc)\n",
|
||||
"\n",
|
||||
"#### Officers \n",
|
||||
"\n",
|
||||
"4. Officers → Companies: officers have appointments (companies they have a role in) \n",
|
||||
"5. Officers → Addresses: officers have correspondence addresses\n",
"6. Officers → Officers: officers may have duplicate entries within Companies House: other officers using the same name and birth date (but different values for `officer_id`)\n",
|
||||
"\n",
|
||||
"#### Addresses \n",
|
||||
"\n",
|
||||
"7. Addresses → Officers: addresses are used as officer correspondence addresses \n",
|
||||
"8. Addresses → Companies: addresses are used as company correspondence addresses \n",
|
||||
"\n",
"To build the network we can use any combination of these connectivity criteria. The above connections are implemented as methods that get called every time we perform a hop: \n",
|
||||
"\n",
|
||||
"1. get_company_officers\n",
|
||||
"2. get_company_address_history\n",
|
||||
"3. get_psc_correspondance_address\n",
"4. get_officer_appointments\n",
"5. get_officer_correspondance_address \n",
"6. get_officer_duplicates \n",
"7. get_officers_at_address\n",
"8. get_companies_at_address\n",
|
||||
"\n",
|
||||
"We can toggle each of these methods via boolean properties of the `Hop` subclass:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "32643a9c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.hop.__dict__"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1802bb34",
|
||||
"metadata": {},
|
||||
"source": [
"We can see the `Hop` subclass contains all of the connections mentioned above, set to `True` by default, so every time we perform a hop the network will use these methods to get data.\n",
|
||||
"\n",
"We also notice that there are some properties setting a \"maxsize\" limit. If the number of results returned by a method exceeds its limit, those results are not stored in the `Network` class properties. This limit matters when building networks: some of these methods can return thousands of results, and if we're not interested in them they make it difficult to visualise meaningful connections within the network (see Tutorial 3 for more on this). \n",
|
||||
"\n",
"Let's go ahead and perform one hop using these default settings and see what addresses, companies and officers are added:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "167cc25c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.perform_hop(1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2486aa17",
|
||||
"metadata": {},
|
||||
"source": [
"Let's now check out `company_ids`, `officer_ids` and `addresses` to see what new entries have been added. There is nothing new in `company_ids`, but this is expected as none of the API methods above connect companies with companies in one hop:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c6ce4047",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.company_ids"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eb5cb2f6",
|
||||
"metadata": {},
|
||||
"source": [
"We can see we now have an officer below in `officer_ids`, and some of the other properties in the table now have values other than None. `node_type` describes the type of node the entry is connected to (Company, Person or Address), `node_id` provides the unique id for that node (`company_id`, `officer_id` or `address`) and `link_type` describes the relationship between the entry and the node."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "947c4cf1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.officer_ids"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a8cf6fa0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can interpret the table above as:\n",
|
||||
"\n",
"There is an officer with ID=`Nd2URspq4bvLy-hwzDZ0_p7FGJw` who is an officer of the company with ID=`11004735`. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7083402a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.addresses"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "264de2dd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can see from the table above that:\n",
|
||||
"\n",
|
||||
"`3rd Floor 13 Charles Ii Street London SW1Y 4QU England` is an address that used to be home to a company (with ID=`11004735`):"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b4828d92",
|
||||
"metadata": {},
|
||||
"source": [
"For reproducibility, each time we perform a hop, the methods and limit configs are stored in `hop_history`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9bb5f542",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.hop_history"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ac1dab27",
|
||||
"metadata": {},
|
||||
"source": [
"Let's perform another two hops: "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f2b0baba",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.perform_hop(2)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cec66fcc",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we can go ahead and visualise this in a map. To do this we need to get a bit more info that isn't present, namely the coordinates for all the addresses mentioned and the company names for each company. We can get this information via `run_map_preprocessing()`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "3be52255",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.run_map_preprocessing()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dfa1b90c",
|
||||
"metadata": {},
|
||||
"source": [
"To see the information added, we can check out the `address_history` and `companies` properties of our class:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b800202c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.address_history"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "37013a7e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"network.companies "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3e3b597d",
|
||||
"metadata": {},
|
||||
"source": [
"We can now visualise all the companies in the network with a UK address through OpenStreetMap:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7256c5f9",
|
||||
"metadata": {
|
||||
"scrolled": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from ipywidgets import VBox, HBox\n",
|
||||
"map_data,path_table = mapview.build_map(network) \n",
|
||||
"hbox = HBox([path_table])\n",
|
||||
"vbox = VBox([map_data, hbox])\n",
|
||||
"vbox"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7e225045",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Each marker represents a company in the network. Green markers represent active companies based at the address, red markers represent active companies no longer based at the address and black markers represent dissolved companies once based at the address. \n",
|
||||
"\n",
|
||||
"Select a marker to display additional information: \n",
|
||||
"- pop-up with the selected company's name and address\n",
|
||||
"- table containing the most efficient paths from the origin to the selected company\n",
|
||||
"- antpaths for each company in the network. Red antpath represents the path through all the historic addresses for the selected company. Black antpath represents the path from the network origin through all the addresses in the path to the selected company as displayed in the table. \n",
|
||||
"\n",
"To read paths from the table we start from the bottom of the table, where we find one or several rows containing our selected company (`Node`) but with differing values for `Node Index`, `Node Type` and `Link`. If we encounter multiple rows containing our selected node, this tells us there are multiple paths of equal length from the origin to the selected node. For example, consider the following table: "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f6674e52",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<img src=\"assets/images/kingdom_table.png\" alt=\"Drawing\" style=\"width: 700px;\"/>\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fd5d9a0d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Pick N Mix London Limited (E) is a 'company at address' for 3rd Floor 13 Charles Ii Street (C) which is a 'historic address' for Kingdom of Sweets Ltd (A).\n",
|
||||
"\n",
|
||||
"Additionally, Pick N Mix London Limited (D) is an appointment of (B) who is an officer of Kingdom of Sweets Ltd (A). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4a6662be",
|
||||
"metadata": {},
|
||||
"source": [
"### Network Persistence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a68e26ca",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The network object can be saved with 'pickle' and reloaded when needed:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "ee8d8c24",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pickle\n",
|
||||
"\n",
|
||||
"with open('assets/networks/kingdom_of_sweets_network.pickle', 'wb') as handle:\n",
|
||||
" pickle.dump(network, handle)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0e7c5578",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"with open('assets/networks/kingdom_of_sweets_network.pickle', 'rb') as handle:\n",
|
||||
" network = pickle.load(handle)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.9.15"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
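The hop-based expansion and the "maxsize" limit described in the tutorial can be illustrated on a toy link table. This is a simplified sketch and not Sugartrail's implementation: the function `expand`, the `toy_links` data and the node labels are all made up for illustration, and a single `maxsize` value stands in for the per-method limits held on `network.hop`.

```python
def expand(links, origin, hops, maxsize=None):
    """Expand outward from `origin` for `hops` hops over a toy link table.

    `links` maps a node to the list of nodes it connects to.
    Returns {node: (n, parent)}, where `n` is the hop at which the node
    was first found and `parent` is the node it was reached from.
    """
    found = {origin: (0, None)}
    frontier = [origin]
    for n in range(1, hops + 1):
        next_frontier = []
        for node in frontier:
            results = links.get(node, [])
            # Mimic the "maxsize" idea: discard result sets that are too large.
            if maxsize is not None and len(results) > maxsize:
                continue
            for neighbour in results:
                if neighbour not in found:
                    found[neighbour] = (n, node)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return found


# Made-up data: one company with one officer and one busy address.
toy_links = {
    "company:A": ["officer:B", "address:C"],
    "officer:B": ["company:D"],
    "address:C": ["company:D", "company:E", "company:F", "company:G"],
}

one_hop = expand(toy_links, "company:A", hops=1)
three_hops = expand(toy_links, "company:A", hops=3)
limited = expand(toy_links, "company:A", hops=3, maxsize=3)
```

As in the tutorial, one hop from a company reaches only officers and addresses; other companies appear from the second hop onwards. With `maxsize=3`, the four companies at the busy address are skipped, so only the company reached through the officer is added.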
|
||||
BIN
.DS_Store → assets/.DS_Store
vendored
BIN
assets/images/domain.png
Normal file
|
After Width: | Height: | Size: 4.0 MiB |
BIN
assets/images/kingdom_table.png
Normal file
|
After Width: | Height: | Size: 89 KiB |
BIN
assets/images/regent.png
Normal file
|
After Width: | Height: | Size: 683 KiB |
BIN
assets/images/regent_storefront.jpeg
Normal file
|
After Width: | Height: | Size: 180 KiB |
BIN
assets/images/scrooge.png
Normal file
|
After Width: | Height: | Size: 486 KiB |
BIN
assets/images/shelton.png
Normal file
|
After Width: | Height: | Size: 673 KiB |
BIN
assets/images/spy.png
Normal file
|
After Width: | Height: | Size: 464 KiB |
BIN
assets/networks/domain_corp_network.pickle
Normal file
BIN
assets/networks/kingdom_of_sweets_network.pickle
Normal file
519
crawler.py
@@ -1,519 +0,0 @@
|
||||
from requests.auth import HTTPBasicAuth
|
||||
import requests
|
||||
import pandas as pd
|
||||
import sys
|
||||
from IPython.display import clear_output
|
||||
import time
|
||||
import numpy as np
|
||||
import collections
|
||||
from datetime import datetime
|
||||
import math
|
||||
# from GoogleNews import GoogleNews
|
||||
import random
|
||||
access_token = ""
|
||||
username = access_token
|
||||
password = ""
|
||||
size = "5000"
|
||||
basic = HTTPBasicAuth(username, password)
|
||||
|
||||
class Ownership_Network:
|
||||
def __init__(self, officer_id=None, company_id=None, address=None):
|
||||
self.addresses = pd.DataFrame(columns=['address','n'])
|
||||
self.officer_ids = pd.DataFrame(columns=['officer_id','n'])
|
||||
self.company_ids = pd.DataFrame(columns=['company_id','n'])
|
||||
self.companies = pd.DataFrame(columns=['company_number','n'])
|
||||
self.officer_id = officer_id
|
||||
self.company_id = company_id
|
||||
self.address = address
|
||||
self.n = 0
|
||||
self.edge = "Origin"
|
||||
self.initialise_dataframe()
|
||||
|
||||
def initialise_dataframe(self):
|
||||
if self.officer_id:
|
||||
self.officer_ids = self.officer_ids.append({'officer_id': self.officer_id, 'name': get_appointments(self.officer_id)[0]['name'], 'n':self.n, 'edge':self.edge, 'node': None, 'node_type': 'Person'}, ignore_index=True)
|
||||
elif self.company_id:
|
||||
self.company_ids = self.company_ids.append({'company_id': self.company_id, 'n':self.n, 'edge':self.edge, 'node': None, 'node_type': 'Company'}, ignore_index=True)
|
||||
company = get_company(self.company_id)
|
||||
company['n'] = self.n
|
||||
company['edge'] = self.edge
|
||||
self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
elif self.address:
|
||||
self.addresses = self.addresses.append({'address': self.address, 'n':self.n, 'edge':self.edge, 'node': None, 'node_type': 'Address'}, ignore_index=True)
|
||||
else:
|
||||
print("no input provided")
|
||||
|
||||
def search_officer_id(self, officer_id):
|
||||
appointments = get_appointments(officer_id)
|
||||
self.node_type = "Person"
|
||||
self.node = officer_id
|
||||
for appointment in appointments:
|
||||
if normalise_address(appointment['address']) not in self.addresses['address'].unique():
|
||||
self.edge = "Appointment Address"
|
||||
self.addresses = self.addresses.append({'address': normalise_address(appointment['address']), 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
if appointment['appointed_to']['company_number'] not in self.company_ids['company_id'].unique():
|
||||
self.edge = "Appointment"
|
||||
self.company_ids = self.company_ids.append({'company_id': appointment['appointed_to']['company_number'], 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
# company = get_company(appointment['appointed_to']['company_number'])
|
||||
# company['n'] = self.n
|
||||
# company['edge'] = self.edge
|
||||
# company['node'] = self.node
|
||||
# company['node_type'] = self.node_type
|
||||
# self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
correspondance_address = get_correspondance_address(officer_id)
|
||||
if normalise_address(correspondance_address) not in self.addresses['address'].unique():
|
||||
self.edge = "Officer Corresponance Address"
|
||||
self.addresses = self.addresses.append({'address': normalise_address(correspondance_address), 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
duplicate_officers = get_duplicate_officers(officer_id)
|
||||
for duplicate in duplicate_officers:
|
||||
self.edge = "Duplicate Officer"
|
||||
if duplicate['links']['self'].split('/')[2] not in self.officer_ids['officer_id'].unique():
|
||||
self.officer_ids = self.officer_ids.append({'officer_id': duplicate['links']['self'].split('/')[2], 'name': duplicate['title'], 'n':self.n, 'edge': self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
|
||||
def normalise_name(name):
|
||||
name_list = name.replace(',','').split(" ")
|
||||
name_list.insert(0, name_list.pop())
|
||||
return ' '.join(name_list)
|
||||
|
||||
def search_company_id(self, company_id):
|
||||
officers = get_officers(company_id)
|
||||
self.node_type = "Company"
|
||||
self.node = company_id
|
||||
if officers:
|
||||
for officer in officers:
|
||||
if normalise_address(officer['address']) not in self.addresses['address'].unique():
|
||||
self.edge = "Officer Corresponance Address"
|
||||
self.addresses = self.addresses.append({'address': normalise_address(officer['address']), 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
if officer['links']['officer']['appointments'].split('/')[2] not in self.officer_ids['officer_id'].unique():
|
||||
self.edge = "Officer"
|
||||
self.officer_ids = self.officer_ids.append({'officer_id': officer['links']['officer']['appointments'].split('/')[2], 'name': normalise_name(officer['name']), 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
psc = get_psc(company_id)
|
||||
if psc:
|
||||
for person in psc:
|
||||
if "address" in person:
|
||||
self.edge = "Person of Significant Control Address"
|
||||
if normalise_address(person['address']) not in self.addresses['address'].unique():
|
||||
self.addresses = self.addresses.append({'address': normalise_address(person['address']), 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
address_history = build_address_history(company_id)
|
||||
for address in address_history:
|
||||
self.edge = "Company Historical Address"
|
||||
if address['address'] not in self.addresses['address'].unique():
|
||||
self.addresses = self.addresses.append({'address': address['address'], 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
|
||||
def search_address(self, address):
|
||||
companies = get_companies_at_address(address)
|
||||
self.node_type = "Address"
|
||||
self.node = address
|
||||
if companies:
|
||||
for company in companies:
|
||||
self.edge = "Company Address"
|
||||
if company['company_number'] not in self.company_ids['company_id'].unique():
|
||||
self.company_ids = self.company_ids.append({'company_id': company['company_number'], 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
# company = get_company(company['company_number'])
|
||||
# if company:
|
||||
# company['n'] = self.n
|
||||
# company['edge'] = self.edge
|
||||
# company['node'] = self.node
|
||||
# company['node_type'] = self.node_type
|
||||
# self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
officers = get_officers_at_location(address)
|
||||
for officer in officers:
|
||||
self.edge = "Officer at Address"
|
||||
if officer['links']['self'].split('/')[2] not in self.officer_ids['officer_id'].unique():
|
||||
self.officer_ids = self.officer_ids.append({'officer_id': officer['links']['self'].split('/')[2], 'name': officer['title'], 'n':self.n, 'edge':self.edge, 'node': self.node, 'node_type': self.node_type}, ignore_index=True)
|
||||
|
||||
def get_company_from_id(self, company_id=None):
|
||||
company_list = []
|
||||
if company_id:
|
||||
if company_id in self.company_ids['company_id'].unique():
|
||||
company_list = [company_id]
|
||||
else:
|
||||
print("add valid company id")
|
||||
else:
|
||||
company_list = self.company_ids['company_id'].unique()
|
||||
for company_id in company_list:
|
||||
if company_id not in self.companies['company_number'].unique():
|
||||
company = get_company(company_id)
|
||||
if company:
|
||||
company['n'] = self.company_ids.loc[self.company_ids['company_id'] == company_id]['n']
|
||||
company['edge'] = self.company_ids.loc[self.company_ids['company_id'] == company_id]['edge']
|
||||
company['node'] = self.company_ids.loc[self.company_ids['company_id'] == company_id]['node']
|
||||
company['node_type'] = self.company_ids.loc[self.company_ids['company_id'] == company_id]['node_type']
|
||||
self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
|
||||
def hop(self, hops):
|
||||
for hop in range(hops):
|
||||
print("hop: " + str(hop+1))
|
||||
self.n += 1
|
||||
selected_addresses = self.addresses.loc[self.addresses['n'] == self.n-1]['address']
|
||||
selected_companies = self.company_ids.loc[self.company_ids['n'] == self.n-1]['company_id']
|
||||
selected_officers = self.officer_ids.loc[self.officer_ids['n'] == self.n-1]['officer_id']
|
||||
for i,address in enumerate(selected_addresses):
|
||||
self.search_address(address)
|
||||
clear_output(wait=True)
|
||||
print("Processed " + str(i+1) + "/" + str(len(selected_addresses)) + " addresses")
|
||||
for j,company in enumerate(selected_companies):
|
||||
self.search_company_id(company)
|
||||
clear_output(wait=True)
|
||||
print("Processed " + str(j+1) + "/" + str(len(selected_companies)) + " companies")
|
||||
for k,officer in enumerate(selected_officers):
|
||||
self.search_officer_id(officer)
|
||||
clear_output(wait=True)
|
||||
print("Processed " + str(k+1) + "/" + str(len(selected_officers)) + " officers")
|
||||
|
||||
def find_path(self, select_company):
|
||||
select_row = self.company_ids.loc[self.company_ids['company_id'] == select_company]
|
||||
path = []
|
||||
self.get_company_from_id(company_id=select_company)
|
||||
backlink = self.companies[self.companies["company_number"] == select_company]['company_name'].item() + " (" + select_row['edge'].item() + ") "
|
||||
path.insert(0, backlink)
|
||||
while True:
|
||||
if select_row['node_type'].item() == "Address":
|
||||
select_row = self.addresses.loc[self.addresses['address'] == select_row['node'].item()]
|
||||
if select_row['edge'].item() == "Origin":
|
||||
path.insert(0, select_row['address'].item() + " ->")
|
||||
break
|
||||
else:
|
||||
backlink = select_row['address'].item() + " (" + select_row['edge'].item() + ") " + "->"
|
||||
path.insert(0, backlink)
|
||||
elif select_row['node_type'].item() == "Company":
|
||||
select_row = self.company_ids.loc[self.company_ids['company_id'] == select_row['node'].item()]
|
||||
self.get_company_from_id(company_id=select_row['company_id'].item())
|
||||
if select_row['edge'].item() == "Origin":
|
||||
path.insert(0,self.companies[self.companies["company_number"] == select_row['company_id'].item()]['company_name'].item()+ " ->")
|
||||
break
|
||||
else:
|
||||
backlink = self.companies[self.companies["company_number"] == select_row['company_id'].item()]['company_name'].item() + " (" + select_row['edge'].item() + ") " + "->"
|
||||
path.insert(0, backlink)
|
||||
elif select_row['node_type'].item() == "Person":
|
||||
select_row = self.officer_ids.loc[self.officer_ids['officer_id'] == select_row['node'].item()]
|
||||
if select_row['edge'].item() == "Origin":
|
||||
path.insert(0, select_row["name"].item() + " ->")
|
||||
break
|
||||
else:
|
||||
backlink = str(select_row['name'].item()) + " (" + str(select_row['edge'].item()) + ") " + "->"
|
||||
path.insert(0, backlink)
|
||||
else:
|
||||
print("error")
|
||||
break
|
||||
print(' '.join(path))
|
||||
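The backlink walk in `find_path` can be hard to follow across three dataframes. Below is a minimal sketch of the same idea on a single toy dictionary, assuming each discovered node stores the node it was reached from (`node`), the relationship used (`edge`) and the hop count (`n`). The names `records` and `trace_path` and all values are hypothetical and not part of the package.

```python
# Toy provenance records, made up for illustration. Each entry points back
# to the node it was discovered from; the starting node has edge "Origin".
records = {
    "Company A": {"n": 0, "edge": "Origin", "node": None},
    "1 High Street": {"n": 1, "edge": "Company Historical Address", "node": "Company A"},
    "Company B": {"n": 2, "edge": "Company Address", "node": "1 High Street"},
}


def trace_path(records, target):
    """Walk backlinks from target to the origin and return the path as a string."""
    parts = []
    current = target
    while True:
        record = records[current]
        if record["edge"] == "Origin":
            # Reached the starting node: it goes first and ends the walk.
            parts.insert(0, current)
            break
        # Prepend this node with the edge that led to it, then step back.
        parts.insert(0, current + " (" + record["edge"] + ")")
        current = record["node"]
    return " -> ".join(parts)


path = trace_path(records, "Company B")
print(path)
```

Because every record holds exactly one backlink, the walk always terminates at the origin as long as the records were built hop by hop.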
|
||||
def get_appointments(officer_id):
|
||||
url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments?size=" + size
|
||||
time.sleep(0.5)
|
||||
response = requests.get(url, auth=basic)
|
||||
# print metadata
|
||||
return response.json()['items']
|
||||
|
||||
def get_correspondance_address(officer_id):
|
||||
url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments?size=" + size
|
||||
time.sleep(0.5)
|
||||
response = requests.get(url, auth=basic)
|
||||
return response.json()['items'][0]['address']
|
||||
|
||||
def get_duplicate_officers(officer_id):
    # fetch the officer's appointments record (name, date of birth, self link)
    url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments?size=5000"
    try:
        response = requests.get(url, auth=basic)
        officer_data = response.json()
        officer_self_link = officer_data['links']['self']
        # "SURNAME, Forenames" -> "Forenames SURNAME"
        name_list = officer_data['name'].replace(',','').split(' ')
        name = " ".join(name_list[1:]) + " " + name_list[0]
        # search officers with the same name; params lets requests escape the query
        url = "https://api.company-information.service.gov.uk/search/officers"
        time.sleep(0.5)
        response = requests.get(url, params={'q': name}, auth=basic)
        # keep officers with the same date of birth as the queried officer,
        # excluding the queried officer itself
        filtered_results = []
        for officer in response.json().get('items', []):
            if 'date_of_birth' in officer and 'date_of_birth' in officer_data:
                if officer['date_of_birth'] == officer_data['date_of_birth'] and officer['links']['self'] != officer_self_link:
                    filtered_results.append(officer)
        # always return a list so callers can iterate over the result
        return filtered_results
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
|
||||
|
||||
def get_psc(company_id):
    url = "https://api.company-information.service.gov.uk/company/" + company_id + "/persons-with-significant-control"
    try:
        time.sleep(0.5)
        response = requests.get(url, auth=basic)
        if response.status_code == 200:
            return response.json()['items']
        else:
            return None
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
|
||||
|
||||
def get_company(company_id):
|
||||
url = "https://api.company-information.service.gov.uk/company/" + company_id
|
||||
try:
|
||||
time.sleep(0.5)
|
||||
response = requests.get(url, auth=basic)
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
else:
|
||||
print(response.status_code)
|
||||
return
|
||||
except requests.exceptions.RequestException as e:
|
||||
raise SystemExit(e)
|
||||
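Most of the request helpers here repeat the same steps: pause, issue a GET, check the status code, return the parsed JSON. The sketch below shows one way that pattern could be factored out. It is an illustration only: `fetch_json`, `FakeResponse` and `fake_get` are hypothetical names, and the `get` callable is injected (it is assumed to have the shape of `requests.get(url, auth=...)`) so the example runs without network access.

```python
import time


def fetch_json(url, get, auth=None, delay=0.5):
    """Sleep briefly, issue a GET, and return the parsed JSON, or None on a non-200 status."""
    time.sleep(delay)  # crude rate limiting, mirroring the time.sleep(0.5) calls
    response = get(url, auth=auth)
    if response.status_code == 200:
        return response.json()
    return None


class FakeResponse:
    """Stand-in for a requests.Response, used only for this demonstration."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def fake_get(url, auth=None):
    # Pretend that only one company number exists.
    if url.endswith("/company/00000000"):
        return FakeResponse(200, {"company_name": "EXAMPLE LTD"})
    return FakeResponse(404, {})


found = fetch_json("https://example.invalid/company/00000000", fake_get, delay=0)
missing = fetch_json("https://example.invalid/company/99999999", fake_get, delay=0)
print(found, missing)
```

With the real library, `requests.get` would be passed as `get` and the module's `basic` object as `auth`.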
|
||||
def get_address_changes(company_id):
    url = "https://api.company-information.service.gov.uk/company/" + str(company_id) + "/filing-history/?category=address"
    try:
        time.sleep(0.5)
        response = requests.get(url, auth=basic)
        # only return the filing history when the request succeeded and it lists items
        if response.status_code == 200:
            if 'items' in response.json():
                return response.json()
        # not found or no address filings: callers must handle None
        return None
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
|
||||
|
||||
def get_company_info(company_id):
    url = "https://api.company-information.service.gov.uk/company/" + str(company_id)
    try:
        time.sleep(0.5)
        response = requests.get(url, auth=basic)
        # return the response body, or None when it is empty
        if response.json():
            return response.json()
        else:
            return None
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
|
||||
|
||||
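def normalise_name(name):
    # Companies House lists officers as "SURNAME, Forenames"; move the surname to the end.
    surname, comma, forenames = name.partition(',')
    if not comma:
        # No comma (e.g. a corporate officer): leave the name unchanged.
        return name
    return forenames.strip() + ' ' + surname.strip()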
|
||||
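Officer names come back in the form "SURNAME, Forenames", so any reordering needs to cope with several forenames and with corporate officers that have no comma at all. The following is a small sketch of comma-based reordering on made-up names. `reorder_name` is a hypothetical stand-alone helper for illustration, not a function of the package.

```python
def reorder_name(name):
    """Turn 'SURNAME, Forenames' into 'Forenames SURNAME'; leave other names untouched."""
    surname, comma, forenames = name.partition(",")
    if not comma:
        # No comma, e.g. a corporate officer: nothing to reorder.
        return name
    return forenames.strip() + " " + surname.strip()


examples = ["SMITH, John", "SMITH, John Paul", "ACME HOLDINGS LIMITED"]
for example in examples:
    print(example, "->", reorder_name(example))
```

Splitting on the comma rather than on spaces keeps multi-word surnames and multiple forenames in their original order.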
|
||||
def process_address_changes(address_changes):
|
||||
# fill in missing new address values:
|
||||
for i in reversed(range(1,len(address_changes['items']))):
|
||||
if 'new_address' not in address_changes['items'][i]['description_values'].keys():
|
||||
if 'old_address' in address_changes['items'][i-1]['description_values'].keys():
|
||||
address_changes['items'][i]['description_values']['new_address'] = address_changes['items'][i-1]['description_values']['old_address']
|
||||
# df = pd.json_normalize(address_changes['items'])
|
||||
return address_changes
|
||||
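A worked example may make the back-fill rule in `process_address_changes` easier to check. The sketch below applies the same rule to two made-up filings, assuming the items are ordered newest first (which the indexing in the surrounding code implies). `fill_new_addresses` and the sample data are hypothetical.

```python
def fill_new_addresses(address_changes):
    """If an older filing lacks 'new_address', copy the next newer filing's 'old_address'."""
    items = address_changes["items"]
    for i in reversed(range(1, len(items))):
        values = items[i]["description_values"]
        newer = items[i - 1]["description_values"]
        if "new_address" not in values and "old_address" in newer:
            values["new_address"] = newer["old_address"]
    return address_changes


# Newest filing first; the older filing is missing its new address.
changes = {"items": [
    {"date": "2020-05-01", "description_values": {"old_address": "2 Mill Lane", "new_address": "3 Dock Road"}},
    {"date": "2018-03-10", "description_values": {"old_address": "1 High Street"}},
]}

filled = fill_new_addresses(changes)
print(filled["items"][1]["description_values"])
```

The address a company moved away from in 2020 is the one it must have moved to in 2018, which is what the rule fills in.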
|
||||
def build_address_history(company_id):
|
||||
company_info = get_company_info(company_id)
|
||||
company_info_subset = {k:company_info[k] for k in ("date_of_creation","date_of_cessation","registered_office_address") if k in company_info}
|
||||
address_changes = get_address_changes(company_id)
|
||||
address_keys = ('start_date','end_date','address')
|
||||
if address_changes and address_changes['items']:
|
||||
address_changes = process_address_changes(address_changes)
|
||||
###
|
||||
addresses = []
|
||||
entry = {}
|
||||
entry["address"] = str(normalise_address(company_info_subset['registered_office_address']))
|
||||
entry["start_date"] = str(address_changes['items'][0]['date'])
|
||||
if 'date_of_cessation' in company_info_subset:
|
||||
entry["end_date"] = str(company_info_subset['date_of_cessation'])
|
||||
else:
|
||||
entry["end_date"] = None
|
||||
addresses.append(entry)
|
||||
|
||||
for i,change in enumerate(address_changes['items']):
|
||||
entry = {}
|
||||
if 'old_address' in change['description_values']:
|
||||
entry["address"] = change['description_values']['old_address']
|
||||
else:
|
||||
entry["address"] = ""
|
||||
if i+1 < len(address_changes['items']):
|
||||
entry["start_date"] = str(address_changes['items'][i+1]['date'])
|
||||
else:
|
||||
entry["start_date"] = company_info_subset['date_of_creation']
|
||||
entry["end_date"] = str(change['date'])
|
||||
addresses.append(entry)
|
||||
return addresses
|
||||
else:
|
||||
address_history = []
|
||||
entry = {}
|
||||
for k, key in enumerate(["date_of_creation","date_of_cessation","registered_office_address"]):
|
||||
if key in company_info:
|
||||
entry[address_keys[k]] = company_info[key]
|
||||
else:
|
||||
entry[address_keys[k]] = None
|
||||
entry['address'] = normalise_address(entry['address'])
|
||||
return [entry]
|
||||
|
||||
def normalise_address(address_dict):
|
||||
address_list = []
|
||||
for key in ['premises','address_line_1', 'locality','postal_code', 'country']:
|
||||
if key in address_dict:
|
||||
address_list.append(address_dict[key])
|
||||
address_string = ' '.join(address_list)
|
||||
return address_string
|
||||
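Since the normalised address string is what gets compared for deduplication, it helps to see exactly which fields end up in it. Here is a minimal sketch with a made-up address. `join_address` is a hypothetical copy of the same logic, written so the example is self-contained.

```python
def join_address(address_dict):
    """Join the selected address fields, in a fixed order, skipping any that are missing."""
    parts = []
    for key in ["premises", "address_line_1", "locality", "postal_code", "country"]:
        if key in address_dict:
            parts.append(address_dict[key])
    return " ".join(parts)


sample = {
    "address_line_1": "High Street",
    "premises": "1",
    "postal_code": "AB1 2CD",
    "locality": "Anytown",
    "region": "Anyshire",  # not in the key list, so it is ignored
}
print(join_address(sample))
```

The output order follows the key list, not the dictionary order, and fields outside the list (such as `region` or `address_line_2`) never contribute to the string.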
|
||||
def get_news(df):
|
||||
company_news = []
|
||||
full_name_news = []
|
||||
short_name_news = []
|
||||
searched = {}
|
||||
for index, row in df.iterrows():
|
||||
time.sleep(random.uniform(0, 1))
|
||||
company_name = row['company_name']
|
||||
full_name = row['name']
|
||||
if type(row["name_elements"]) == dict:
|
||||
short_name = '"' + row["name_elements"]["forename"] + " " + row["name_elements"]["surname"] + '"'
|
||||
else:
|
||||
short_name = '"' + row["name_elements"] + '"'
|
||||
# add a check ...
|
||||
if company_name in searched:
|
||||
company_news.append(searched[company_name])
|
||||
else:
|
||||
searched[company_name] = company_news_check(company_name)
|
||||
company_news.append(searched[company_name])
|
||||
if full_name in searched:
|
||||
full_name_news.append(searched[full_name])
|
||||
else:
|
||||
searched[full_name] = company_news_check(full_name)
|
||||
full_name_news.append(searched[full_name])
|
||||
if short_name in searched:
|
||||
short_name_news.append(searched[short_name])
|
||||
else:
|
||||
searched[short_name] = company_news_check(short_name)
|
||||
short_name_news.append(searched[short_name])
|
||||
progress = str(int(100*index/len(df)))+"%"
|
||||
print(progress)
|
||||
df['company_news'] = company_news
|
||||
df['full_name_news'] = full_name_news
|
||||
df['short_name_news'] = short_name_news
|
||||
return df
|
||||
|
||||
def company_news_check(search_term):
|
||||
time.sleep(random.uniform(0, 0.2))
|
||||
googlenews = GoogleNews(period='10y')
|
||||
news = []
|
||||
googlenews.get_news('"' + str(search_term) + '"')
|
||||
for story in googlenews.results():
|
||||
if story['title'] not in news:
|
||||
news += [story['title']]
|
||||
return news
|
||||
|
||||
def get_locations(companies, address_type: str):
    df = companies
    addresses = []
    postcode = []
    if address_type == "correspondance":
        for address in df['address']:
            address_string_list = []
            for key in ['premises','address_line_1', 'locality', 'country','postal_code']:
                if key in address:
                    address_string_list.append(address[key])
            if 'postal_code' in address:
                postcode.append(address['postal_code'])
            addresses.append(', '.join(address_string_list))
    elif address_type == "registered":
        keys = ["address_line_1","address_line_2","country","locality","postal_code"]
        for link in df['links']:
            url = "https://api.company-information.service.gov.uk" + link['company'] + "/registered-office-address"
            time.sleep(0.5)
            response = requests.get(url, auth=basic)
            data = response.json()
            address = []
            for key in keys:
                if key in data:
                    address.append(data[key])
                    if key == "postal_code":
                        postcode.append(data[key])
            addresses.append(", ".join(address))
    else:
        print("unrecognised address type: should be either correspondance or registered")
        return None
    # count postcodes and addresses, most frequent first
    postcode_frequency = dict(sorted(collections.Counter(postcode).items(), key=lambda item: item[1], reverse=True))
    print(str(len(postcode_frequency)) + " unique postcodes")
    frequency = dict(sorted(collections.Counter(addresses).items(), key=lambda item: item[1], reverse=True))
    print(str(len(frequency)) + " unique " + address_type + " addresses")
    print(frequency)
    return addresses
|
||||
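The frequency tables rely on sorting a `collections.Counter` by count. As a small illustration with made-up postcodes, the sketch below shows the sorted-dictionary idiom next to `Counter.most_common`. It also shows that `key` and `reverse` belong to `sorted`: passed to `dict` directly, they become two extra entries.

```python
import collections

postcodes = ["AB1 2CD", "EF3 4GH", "AB1 2CD", "AB1 2CD", "EF3 4GH", "IJ5 6KL"]
counts = collections.Counter(postcodes)

# Sort (postcode, count) pairs by count, largest first, then rebuild a dict.
by_frequency = dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
print(by_frequency)

# most_common() gives the same ordering with less code.
same_as_most_common = list(by_frequency.items()) == counts.most_common()
print(same_as_most_common)

# Without sorted(), the keyword arguments are stored as dictionary entries.
wrong = dict(counts.items(), key=None, reverse=True)
print(len(by_frequency), len(wrong))
```

The last line prints two different lengths, which is why a unique-postcode count taken from the unsorted form would be off by two.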
|
||||
def remove_company_type(company_name):
|
||||
split_name = company_name.split(" ")
|
||||
if split_name[-1] in ["LIMITED","LTD","LTD.","PLC","LLP","RTM","CIC","CASC"]:
|
||||
return " ".join(split_name[:-1])
|
||||
else:
|
||||
return company_name
|
||||
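The suffix check in `remove_company_type` is an exact, case-sensitive match on the last word. Below is a short sketch of what that means in practice, using made-up company names. `strip_suffix` is a hypothetical copy of the same logic so the example runs on its own.

```python
SUFFIXES = ["LIMITED", "LTD", "LTD.", "PLC", "LLP", "RTM", "CIC", "CASC"]


def strip_suffix(company_name):
    """Drop the final word if it is one of the known company-type suffixes."""
    split_name = company_name.split(" ")
    if split_name[-1] in SUFFIXES:
        return " ".join(split_name[:-1])
    return company_name


print(strip_suffix("EXAMPLE TRADING LIMITED"))
print(strip_suffix("Example Trading Ltd"))
print(strip_suffix("LIMITED EDITION PRINTS"))
```

Only the first name is shortened. The second keeps its suffix because `Ltd` is not upper case, and the third is untouched because `LIMITED` is not the last word.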
|
||||
def year_of_creation(companies):
|
||||
years = [address['date_of_creation'][0:4] for address in companies]
|
||||
frequency = collections.Counter(years)
|
||||
return dict(sorted(frequency.items(), key=lambda item: item[1], reverse=True))
|
||||
|
||||
def age(creation: str, cessation: str):
|
||||
delta = datetime.strptime(cessation, "%Y-%m-%d")-datetime.strptime(creation, "%Y-%m-%d")
|
||||
return math.floor(delta.days/365)
|
||||
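For reference, here is how the whole-years calculation in `age` behaves on two made-up date pairs. `whole_years` is a hypothetical copy of the same arithmetic. Dividing by 365 ignores leap days, so results very close to an anniversary can be off by one.

```python
from datetime import datetime
import math


def whole_years(creation, cessation):
    """Number of complete 365-day years between two YYYY-MM-DD dates."""
    delta = datetime.strptime(cessation, "%Y-%m-%d") - datetime.strptime(creation, "%Y-%m-%d")
    return math.floor(delta.days / 365)


five = whole_years("2010-01-15", "2015-06-30")
zero = whole_years("2019-03-01", "2019-12-31")
print(five, zero)
```

A company that survived less than a year falls in the `0` bucket, which the summary output reports as lasting 0-1 years.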
|
||||
def get_companies_at_address(address):
|
||||
companies = {}
|
||||
companies_summary = {}
|
||||
url = "https://api.company-information.service.gov.uk/advanced-search/companies?location=" + address + "&size=" + "50"
|
||||
time.sleep(0.5)
|
||||
response = requests.get(url, auth=basic)
|
||||
if response.status_code == 200:
|
||||
# this is what we want in a dataframe:
|
||||
return response.json()['items']
|
||||
|
||||
def company_summary(df):
|
||||
registered_companies = len(df)
|
||||
active_companies = df['company_status'].value_counts().get('active')
|
||||
dissolved_companies = df['company_status'].value_counts().get('dissolved')
|
||||
liquidated_companies = df['company_status'].value_counts().get('liquidation')
|
||||
administration_companies = df['company_status'].value_counts().get('administration')
|
||||
recievership_companies = df['company_status'].value_counts().get('receivership')
|
||||
insolvent_companies = df['company_status'].value_counts().get('insolvency-proceedings')
|
||||
active_creation = df.loc[df['company_status'] == 'active']['year_of_creation'].value_counts()[0:3]
|
||||
if len(active_creation) < 3:
|
||||
active = len(active_creation)
|
||||
else:
|
||||
active = 3
|
||||
print(df["address"][0])
|
||||
print(str(active_companies) + " active companies")
|
||||
print(str(len(df)) + " companies registered")
|
||||
for i in range(active):
|
||||
print(str(active_creation[i]) + " active companies created in " + active_creation.keys()[i])
|
||||
# 3 most common periods of company survival in years
|
||||
print(str(dissolved_companies) + " dissolved companies")
|
||||
print(str(liquidated_companies) + " liquidated companies")
|
||||
print(str(administration_companies) + " companies in administration")
|
||||
print(str(recievership_companies) + " companies in receivership")
|
||||
print(str(insolvent_companies) + " companies in insolvency")
|
||||
survival = df['survival_years'].value_counts()
|
||||
if len(survival) > 0:
|
||||
if len(survival) < 3:
|
||||
survive = len(survival)
|
||||
else:
|
||||
survive = 3
|
||||
for i in range(survive):
|
||||
key = int(df['survival_years'].value_counts().keys()[i])
|
||||
print(str(df['survival_years'].value_counts()[key]) + " companies lasted " + str(int(key)) + "-" + str(int(key+1)) + " years")
|
||||
|
||||
def get_officers_at_location(location):
    url = "https://api.company-information.service.gov.uk/search/officers" + "?q=location:" + location
    time.sleep(0.5)
    response = requests.get(url, auth=basic)
    officers = []
    if response.status_code == 200:
        # keep only officers whose address snippet contains every word of the location
        word_list = location.replace(',','').split()
        for officer in response.json()['items']:
            if all(word in officer['address_snippet'] for word in word_list):
                officers.append(officer)
    # return an empty list on failure so callers can always iterate
    return officers
|
||||
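The officer filter keeps a search hit only if every word of the location occurs somewhere in the hit's address snippet. The sketch below isolates that test on made-up addresses. `matches_location` is a hypothetical helper for illustration. Note that the test is a case-sensitive substring match, so a house number such as `1` also matches `21`.

```python
def matches_location(location, address_snippet):
    """True if every word of the location appears as a substring of the snippet."""
    words = location.replace(",", "").split()
    return all(word in address_snippet for word in words)


exact = matches_location("1 High Street, Anytown", "1 High Street, Anytown, Anyshire, AB1 2CD")
substring = matches_location("1 High Street, Anytown", "21 High Street, Anytown, AB1 2CD")
other_town = matches_location("1 High Street, Othertown", "1 High Street, Anytown, AB1 2CD")
print(exact, substring, other_town)
```

The second call returns `True` even though the house numbers differ, so the filter narrows the search results but can still admit near neighbours.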
|
||||
def get_officers(company_id):
|
||||
url = "https://api.company-information.service.gov.uk/company/" + company_id + "/officers"
|
||||
time.sleep(0.5)
|
||||
response = requests.get(url, auth=basic)
|
||||
if response.status_code == 200:
|
||||
return response.json()['items']
|
||||
210
environment.yml
@@ -1,111 +1,115 @@
|
||||
name: candystore
|
||||
channels:
|
||||
- anaconda
|
||||
- defaults
|
||||
dependencies:
|
||||
- appnope=0.1.2=py39hecd8cb5_1001
|
||||
- argon2-cffi=21.3.0=pyhd3eb1b0_0
|
||||
- argon2-cffi-bindings=21.2.0=py39hca72f7f_0
|
||||
- asttokens=2.0.5=pyhd3eb1b0_0
|
||||
- attrs=21.4.0=pyhd3eb1b0_0
|
||||
- backcall=0.2.0=pyhd3eb1b0_0
|
||||
- beautifulsoup4=4.11.1=py39hecd8cb5_0
|
||||
- blas=1.0=mkl
|
||||
- bleach=4.1.0=pyhd3eb1b0_0
|
||||
- bottleneck=1.3.4=py39h67323c0_0
|
||||
- ca-certificates=2022.4.26=hecd8cb5_0
|
||||
- certifi=2022.6.15=py39hecd8cb5_0
|
||||
- cffi=1.15.0=py39hc55c11b_1
|
||||
- debugpy=1.5.1=py39he9d5cce_0
|
||||
- decorator=5.1.1=pyhd3eb1b0_0
|
||||
- defusedxml=0.7.1=pyhd3eb1b0_0
|
||||
- entrypoints=0.4=py39hecd8cb5_0
|
||||
- executing=0.8.3=pyhd3eb1b0_0
|
||||
- icu=58.2=h0a44026_3
|
||||
- intel-openmp=2021.4.0=hecd8cb5_3538
|
||||
- ipykernel=6.9.1=py39hecd8cb5_0
|
||||
- ipython=8.3.0=py39hecd8cb5_0
|
||||
- ipython_genutils=0.2.0=pyhd3eb1b0_1
|
||||
- ipywidgets=7.6.5=pyhd3eb1b0_1
|
||||
- jedi=0.18.1=py39hecd8cb5_1
|
||||
- jinja2=3.0.3=pyhd3eb1b0_0
|
||||
- jpeg=9e=hca72f7f_0
|
||||
- jsonschema=4.4.0=py39hecd8cb5_0
|
||||
- jupyter=1.0.0=py39hecd8cb5_7
|
||||
- jupyter_client=7.2.2=py39hecd8cb5_0
|
||||
- jupyter_console=6.4.3=pyhd3eb1b0_0
|
||||
- jupyter_core=4.10.0=py39hecd8cb5_0
|
||||
- jupyterlab_pygments=0.1.2=py_0
|
||||
- jupyterlab_widgets=1.0.0=pyhd3eb1b0_1
|
||||
- libcxx=12.0.0=h2f01273_0
|
||||
- libffi=3.3=hb1e8313_2
|
||||
- libpng=1.6.37=ha441bb4_0
|
||||
- libsodium=1.0.18=h1de35cc_0
|
||||
- markupsafe=2.1.1=py39hca72f7f_0
|
||||
- matplotlib-inline=0.1.2=pyhd3eb1b0_2
|
||||
- mistune=0.8.4=py39h9ed2024_1000
|
||||
- mkl=2021.4.0=hecd8cb5_637
|
||||
- mkl-service=2.4.0=py39h9ed2024_0
|
||||
- mkl_fft=1.3.1=py39h4ab4a9b_0
|
||||
- mkl_random=1.2.2=py39hb2f4e1b_0
|
||||
- nbclient=0.5.13=py39hecd8cb5_0
|
||||
- nbconvert=6.4.4=py39hecd8cb5_0
|
||||
- nbformat=5.3.0=py39hecd8cb5_0
|
||||
- ca-certificates=2022.10.11=hecd8cb5_0
|
||||
- certifi=2022.12.7=py39hecd8cb5_0
|
||||
- libcxx=14.0.6=h9765a3e_0
|
||||
- libffi=3.4.2=hecd8cb5_6
|
||||
- ncurses=6.3=hca72f7f_3
|
||||
- nest-asyncio=1.5.5=py39hecd8cb5_0
|
||||
- notebook=6.4.11=py39hecd8cb5_0
|
||||
- numexpr=2.8.1=py39h2e5f0a9_2
|
||||
- numpy=1.22.3=py39h2e5f0a9_0
|
||||
- numpy-base=1.22.3=py39h3b1a694_0
|
||||
- openssl=1.1.1o=hca72f7f_0
|
||||
- packaging=21.3=pyhd3eb1b0_0
|
||||
- pandas=1.4.2=py39he9d5cce_0
|
||||
- pandocfilters=1.5.0=pyhd3eb1b0_0
|
||||
- parso=0.8.3=pyhd3eb1b0_0
|
||||
- pexpect=4.8.0=pyhd3eb1b0_3
|
||||
- pickleshare=0.7.5=pyhd3eb1b0_1003
|
||||
- pip=22.1.2=py39hecd8cb5_0
|
||||
- prometheus_client=0.13.1=pyhd3eb1b0_0
|
||||
- prompt-toolkit=3.0.20=pyhd3eb1b0_0
|
||||
- prompt_toolkit=3.0.20=hd3eb1b0_0
|
||||
- ptyprocess=0.7.0=pyhd3eb1b0_2
|
||||
- pure_eval=0.2.2=pyhd3eb1b0_0
|
||||
- pycparser=2.21=pyhd3eb1b0_0
|
||||
- pygments=2.11.2=pyhd3eb1b0_0
|
||||
- pyparsing=3.0.4=pyhd3eb1b0_0
|
||||
- pyqt=5.9.2=py39h23ab428_6
|
||||
- pyrsistent=0.18.0=py39hca72f7f_0
|
||||
- python=3.9.12=hdfd78df_1
|
||||
- python-dateutil=2.8.2=pyhd3eb1b0_0
|
||||
- python-fastjsonschema=2.15.1=pyhd3eb1b0_0
|
||||
- pytz=2022.1=py39hecd8cb5_0
|
||||
- pyzmq=22.3.0=py39he9d5cce_2
|
||||
- qt=5.9.7=h468cd18_1
|
||||
- qtconsole=5.3.0=pyhd3eb1b0_0
|
||||
- qtpy=2.0.1=pyhd3eb1b0_0
|
||||
- readline=8.1.2=hca72f7f_1
|
||||
- send2trash=1.8.0=pyhd3eb1b0_1
|
||||
- setuptools=63.4.1=py39hecd8cb5_0
|
||||
- sip=4.19.13=py39h23ab428_0
|
||||
- six=1.16.0=pyhd3eb1b0_1
|
||||
- soupsieve=2.3.1=pyhd3eb1b0_0
|
||||
- sqlite=3.39.2=h707629a_0
|
||||
- stack_data=0.2.0=pyhd3eb1b0_0
|
||||
- terminado=0.13.1=py39hecd8cb5_0
|
||||
- testpath=0.6.0=py39hecd8cb5_0
|
||||
- openssl=1.1.1s=hca72f7f_0
|
||||
- pip=22.3.1=py39hecd8cb5_0
|
||||
- python=3.9.15=h218abb5_2
|
||||
- readline=8.2=hca72f7f_0
|
||||
- setuptools=65.5.0=py39hecd8cb5_0
|
||||
- sqlite=3.40.0=h880c91c_0
|
||||
- tk=8.6.12=h5d9f67b_0
|
||||
- tornado=6.1=py39h9ed2024_0
|
||||
- traitlets=5.1.1=pyhd3eb1b0_0
|
||||
- typing-extensions=4.1.1=hd3eb1b0_0
|
||||
- typing_extensions=4.1.1=pyh06a4308_0
|
||||
- tzdata=2022a=hda174b7_0
|
||||
- wcwidth=0.2.5=pyhd3eb1b0_0
|
||||
- webencodings=0.5.1=py39hecd8cb5_1
|
||||
- tzdata=2022g=h04d1e81_0
|
||||
- wheel=0.37.1=pyhd3eb1b0_0
|
||||
- widgetsnbextension=3.5.2=py39hecd8cb5_0
|
||||
- xz=5.2.5=hca72f7f_1
|
||||
- zeromq=4.3.4=h23ab428_0
|
||||
- zlib=1.2.12=h4dc903c_2
|
||||
- xz=5.2.8=h6c40b1e_0
|
||||
- zlib=1.2.13=h4dc903c_0
|
||||
- pip:
|
||||
- chwrapper==0.3.0
|
||||
- requests==2.8.1
|
||||
- anyio==3.6.2
|
||||
- appnope==0.1.3
|
||||
- argon2-cffi==21.3.0
|
||||
- argon2-cffi-bindings==21.2.0
|
||||
- arrow==1.2.3
|
||||
- asttokens==2.2.1
|
||||
- attrs==22.2.0
|
||||
- backcall==0.2.0
|
||||
- beautifulsoup4==4.11.1
|
||||
- bleach==5.0.1
|
||||
- branca==0.6.0
|
||||
- cffi==1.15.1
|
||||
- charset-normalizer==2.1.1
|
||||
- comm==0.1.2
|
||||
- debugpy==1.6.4
|
||||
- decorator==5.1.1
|
||||
- defusedxml==0.7.1
|
||||
- entrypoints==0.4
|
||||
- executing==1.2.0
|
||||
- fastjsonschema==2.16.2
|
||||
- fqdn==1.5.1
|
||||
- idna==3.4
|
||||
- importlib-metadata==5.2.0
|
||||
- ipykernel==6.19.4
|
||||
- ipyleaflet==0.17.2
|
||||
- ipython==8.7.0
|
||||
- ipython-genutils==0.2.0
|
||||
- ipywidgets==8.0.4
|
||||
- isoduration==20.11.0
|
||||
- jedi==0.18.2
|
||||
- jinja2==3.1.2
|
||||
- jsonpointer==2.3
|
||||
- jsonschema==4.17.3
|
||||
- jupyter-client==7.4.8
|
||||
- jupyter-core==5.1.1
|
||||
- jupyter-events==0.5.0
|
||||
- jupyter-server==2.0.5
|
||||
- jupyter-server-terminals==0.4.3
|
||||
- jupyterlab-pygments==0.2.2
|
||||
- jupyterlab-widgets==3.0.5
|
||||
- markupsafe==2.1.1
|
||||
- matplotlib-inline==0.1.6
|
||||
- mistune==2.0.4
|
||||
- nbclassic==0.4.8
|
||||
- nbclient==0.7.2
|
||||
- nbconvert==7.2.7
|
||||
- nbformat==5.7.1
|
||||
- nest-asyncio==1.5.6
|
||||
- notebook==6.5.2
|
||||
- notebook-shim==0.2.2
|
||||
- numpy==1.24.0
|
||||
- packaging==22.0
|
||||
- pandas==1.5.2
|
||||
- pandocfilters==1.5.0
|
||||
- parso==0.8.3
|
||||
- pexpect==4.8.0
|
||||
- pickleshare==0.7.5
|
||||
- platformdirs==2.6.0
|
||||
- prometheus-client==0.15.0
|
||||
- prompt-toolkit==3.0.36
|
||||
- psutil==5.9.4
|
||||
- ptyprocess==0.7.0
|
||||
- pure-eval==0.2.2
|
||||
- pycparser==2.21
|
||||
- pygments==2.13.0
|
||||
- pyrsistent==0.19.2
|
||||
- python-dateutil==2.8.2
|
||||
- python-json-logger==2.0.4
|
||||
- pytz==2022.7
|
||||
- pyyaml==6.0
|
||||
- pyzmq==24.0.1
|
||||
- regex==2022.10.31
|
||||
- requests==2.28.1
|
||||
- rfc3339-validator==0.1.4
|
||||
- rfc3986-validator==0.1.1
|
||||
- send2trash==1.8.0
|
||||
- six==1.16.0
|
||||
- sniffio==1.3.0
|
||||
- soupsieve==2.3.2.post1
|
||||
- stack-data==0.6.2
|
||||
- terminado==0.17.1
|
||||
- tinycss2==1.2.1
|
||||
- tornado==6.2
|
||||
- traitlets==5.8.0
|
||||
- traittypes==0.2.1
|
||||
- uri-template==1.2.0
|
||||
- urllib3==1.26.13
|
||||
- wcwidth==0.2.5
|
||||
- webcolors==1.12
|
||||
- webencodings==0.5.1
|
||||
- websocket-client==1.4.2
|
||||
- widgetsnbextension==4.0.5
|
||||
- xyzservices==2022.9.0
|
||||
- zipp==3.11.0
|
||||
|
||||
147
sugartrail.py
@@ -1,147 +0,0 @@
|
||||
from requests.auth import HTTPBasicAuth
|
||||
import requests
|
||||
import pandas as pd
|
||||
import sys
|
||||
from IPython.display import clear_output
|
||||
import time
|
||||
import collections
|
||||
from datetime import datetime
|
||||
import math
|
||||
access_token = ""  # insert your Companies House API key here; never commit a real key
|
||||
username = access_token
|
||||
password = ""
|
||||
size = "5000"
|
||||
basic = HTTPBasicAuth(username, password)
|
||||
|
||||
def get_appointments(officer_id):
|
||||
url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments?size=" + size
|
||||
response = requests.get(url, auth=basic)
|
||||
# print metadata
|
||||
df = pd.DataFrame(response.json()['items'])
|
||||
appointments = len(df)
|
||||
print(str(appointments) + " appointments")
|
||||
print(str(appointments - df["resigned_on"].count()) + " active appointments")
|
||||
return response.json()
|
||||
|
||||
def get_locations(companies, address_type: str):
|
||||
df = pd.DataFrame(companies['items'])
|
||||
if address_type == "correspondance":
|
||||
postcode = [address['postal_code'] for address in df['address']]
|
||||
addresses = [address['premises'] + ", " + address['address_line_1'] + ", " + address['locality'] + ", " + address['country'] + ", " + address['postal_code'] for address in df['address']]
|
||||
elif address_type == "registered":
|
||||
addresses = []
|
||||
keys = ["address_line_1","address_line_2","country","locality","postal_code"]
|
||||
for link in df['links']:
|
||||
url = "https://api.company-information.service.gov.uk" + link['company'] + "/registered-office-address"
|
||||
response = requests.get(url, auth=basic)
|
||||
address = []
|
||||
postcode = []
|
||||
for key in keys:
|
||||
if key in response.json():
|
||||
address += [response.json()[key]]
|
||||
if key == "postal_code":
|
||||
postcode += [response.json()[key]]
|
||||
address = ", ".join(address)
|
||||
addresses += [address]
|
||||
else:
|
||||
print("unrecognised address type: should be either correspondance or registered")
|
||||
return None
|
||||
postcode_frequency = dict(sorted(collections.Counter(postcode).items(), key=lambda item: item[1], reverse=True))
|
||||
print(str(len(postcode_frequency)) + " unique postcodes")
|
||||
frequency = dict(sorted(collections.Counter(addresses).items(), key=lambda item: item[1], reverse=True))
|
||||
print(str(len(frequency)) + " unique " + address_type + " addresses")
|
||||
print(frequency)
|
||||
return addresses
|
||||
|
||||
def year_of_creation(companies):
|
||||
years = [address['date_of_creation'][0:4] for address in companies]
|
||||
frequency = collections.Counter(years)
|
||||
return dict(sorted(frequency.items(), key=lambda item: item[1], reverse=True))
|
||||
|
||||
def age(creation: str, cessation: str):
|
||||
delta = datetime.strptime(cessation, "%Y-%m-%d")-datetime.strptime(creation, "%Y-%m-%d")
|
||||
return math.floor(delta.days/365)
|
||||
|
||||
|
||||
def get_companies(addresses):
|
||||
companies = {}
|
||||
companies_summary = {}
|
||||
for address in addresses:
|
||||
url = "https://api.company-information.service.gov.uk/advanced-search/companies?location=" + address + "&size=" + size
|
||||
response = requests.get(url, auth=basic)
|
||||
if response.status_code == 200:
|
||||
companies[address] = response.json()['items']
|
||||
companies_summary[address] = {}
|
||||
companies_summary[address]["frequency"] = response.json()['hits']
|
||||
all_companies = [address for address in response.json()['items']]
|
||||
active_companies = [address for address in response.json()['items'] if address['company_status'] == 'active']
|
||||
dead_companies = [address for address in response.json()['items'] if address['company_status'] == 'dissolved']
|
||||
companies_summary[address]["active_companies"] = len(active_companies)
|
||||
years = year_of_creation(all_companies)
|
||||
survival_months = [age(address['date_of_creation'],address['date_of_cessation']) for address in dead_companies]
|
||||
survival_frequency = collections.Counter(survival_months)
|
||||
survival_frequency = dict(sorted(survival_frequency.items(), key=lambda item: item[1], reverse=True))
|
||||
active_years = year_of_creation(active_companies)
|
||||
companies_summary[address]["3_years_active"] = {k: active_years[k] for k in list(active_years)[:3]}
|
||||
companies_summary[address]["3_years_all"] = {k: years[k] for k in list(years)[:3]}
|
||||
companies_summary[address]["3_survival"] = {k: survival_frequency[k] for k in list(survival_frequency)[:3]}
|
||||
companies_summary = dict(sorted(companies_summary.items(), key=lambda item: item[1]["frequency"],reverse=True))
|
||||
for i,company in enumerate(companies_summary):
|
||||
print("Index: " + str(i))
|
||||
print(company)
|
||||
print(str(companies_summary[company]['frequency']) + " companies registered or corresponding here, " + str(companies_summary[company]['active_companies']) + " are active.")
|
||||
keys = list(companies_summary[company]['3_years_active'].keys())
|
||||
life_keys = list(companies_summary[company]['3_survival'].keys())
|
||||
for key in keys:
|
||||
print(str(companies_summary[company]['3_years_active'][key]) + " currently active companies registered in " + str(key))
|
||||
for key in life_keys:
|
||||
print(str(companies_summary[company]['3_survival'][key]) + " companies dissolved between years " + str(key) + "-" + str(key+1))
|
||||
print("")
|
||||
|
||||
return {key: companies[key] for key in companies_summary if key in companies}
|
||||
|
||||
def get_officers(company_locations, indices):
|
||||
officers = {}
|
||||
for index in indices:
|
||||
# get businesses at location
|
||||
company_name = list(company_locations.keys())[index]
|
||||
officers[str(company_name)] = []
|
||||
companies = company_locations[company_name]
|
||||
length = len(companies)
|
||||
for i, business in enumerate(companies):
|
||||
company_number = business['company_number']
|
||||
url = "https://api.company-information.service.gov.uk/company/" + company_number + "/officers?size=" + size
|
||||
while True:
|
||||
try:
|
||||
clear_output(wait=True)
|
||||
print("completion: " + str(100*i/length) + ", index:" + str(i))
|
||||
leadership = requests.get(url, auth=basic)
|
||||
print(leadership)
|
||||
if leadership.json():
|
||||
officers[str(company_name)] += [[officer['name'] for officer in leadership.json()['items']]]
|
||||
clear_output(wait=True)
|
||||
time.sleep(0.41)
|
||||
break
|
||||
else:
|
||||
officers[str(company_name)] += [[]]
|
||||
clear_output(wait=True)
|
||||
time.sleep(0.41)
|
||||
break
|
||||
except Exception:  # a bare except would also swallow KeyboardInterrupt inside this retry loop
|
||||
print(sys.exc_info()[0])
|
||||
print("taking a 10 second timeout")
|
||||
time.sleep(10)
|
||||
clear_output(wait=True)
|
||||
for location in list(officers.keys()):
|
||||
directors = []
|
||||
for business in officers[location]:
|
||||
directors += business
|
||||
frequency = collections.Counter(directors)
|
||||
frequency = dict(sorted(frequency.items(), key=lambda item: item[1], reverse=True))
|
||||
print(location)
|
||||
print("-")
|
||||
print("Most prolific officers:")
|
||||
for officer in list(frequency):
|
||||
print(str(officer) + " runs " + str(frequency[str(officer)]) + " businesses")
|
||||
print("")
|
||||
return officers
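The officer tally at the end of `get_officers` flattens the per-company name lists and sorts a `Counter` by value. A minimal sketch of the same idea using `Counter.most_common()` follows; the helper name `rank_officers` and the sample names are assumptions for illustration and are not part of the repository.

```python
from collections import Counter


def rank_officers(officers_by_company):
    """Flatten per-company officer lists and rank names by appearance count."""
    names = [name for company in officers_by_company for name in company]
    # most_common() returns (name, count) pairs sorted by count, highest first.
    return Counter(names).most_common()


# Hypothetical sample data: one inner list per company at a location.
ranking = rank_officers([["SMITH, Jane", "DOE, John"], ["SMITH, Jane"], []])
for name, count in ranking:
    print(f"{name} runs {count} businesses")
```

`most_common()` replaces the manual `dict(sorted(...))` step and keeps the counts paired with the names.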
|
||||
4
sugartrail/__init__.py
Normal file
@@ -0,0 +1,4 @@
|
||||
from . import api
|
||||
from . import base
|
||||
from . import processing
|
||||
from . import mapview
|
||||
87
sugartrail/api.py
Normal file
@@ -0,0 +1,87 @@
|
||||
import requests
|
||||
import time
|
||||
import os
|
||||
|
||||
access_token = ""
|
||||
username = access_token
|
||||
password = ""
|
||||
size = "5000"
|
||||
basic_auth = requests.auth.HTTPBasicAuth(username, password)
|
||||
|
||||
def make_request(url, input, input_type, response_type):
|
||||
time.sleep(0.5)
|
||||
try:
|
||||
response = requests.get(url, auth=basic_auth)
|
||||
response.raise_for_status()
|
||||
# print("here")
|
||||
if response.status_code == 200:
|
||||
return response.json()
|
||||
except requests.exceptions.HTTPError as errh:
|
||||
print(errh, f"{os.linesep}Failed to get {response_type} for {input_type}:", str(input))
|
||||
except requests.exceptions.ConnectionError as errc:
|
||||
print(errc, f"{os.linesep}Failed to get {response_type} for {input_type}:", str(input))
|
||||
except requests.exceptions.Timeout as errt:
|
||||
print(errt, f"{os.linesep}Failed to get {response_type} for {input_type}:", str(input))
|
||||
# RequestException is the base class, so it must come last or the handlers above are unreachable
|
||||
except requests.exceptions.RequestException as err:
|
||||
print(err, f"{os.linesep}Failed to get {response_type} for {input_type}:", str(input))
|
||||
|
||||
def get_company_officers(company_id):
|
||||
url = "https://api.company-information.service.gov.uk/company/" + company_id + "/officers"
|
||||
return make_request(url, company_id, 'company', 'officers')
|
||||
|
||||
def get_psc(company_id):
|
||||
url = "https://api.company-information.service.gov.uk/company/" + company_id + "/persons-with-significant-control"
|
||||
return make_request(url, company_id, 'company', 'psc')
|
||||
|
||||
def get_company(company_id):
|
||||
url = "https://api.company-information.service.gov.uk/company/" + company_id
|
||||
return make_request(url, company_id, 'company', 'company')
|
||||
|
||||
def get_address_changes(company_id):
|
||||
url = "https://api.company-information.service.gov.uk/company/" + str(company_id) + "/filing-history/?category=address"
|
||||
return make_request(url, company_id, 'company', 'address history')
|
||||
|
||||
def get_correspondance_address(officer_id):
|
||||
url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments?size=" + size
|
||||
return make_request(url, officer_id, 'officer', 'correspondence address')
|
||||
|
||||
def get_appointments(officer_id):
|
||||
url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments"
|
||||
# make_request returns None on failure, so fall back to an empty list
|
||||
return (make_request(url, officer_id, 'officer', 'appointments') or {}).get('items', [])
|
||||
|
||||
def get_duplicate_officers(officer_id):
|
||||
url = "https://api.company-information.service.gov.uk/officers/" + officer_id + "/appointments"
|
||||
response = make_request(url, officer_id, 'officer', 'appointments')
|
||||
if response:
|
||||
officer_data = response
|
||||
officer_self_link = response['links']['self']
|
||||
name_list = officer_data['name'].replace(',','').split(' ')
|
||||
name = " ".join(name_list[1:]) + " " + name_list[0]
|
||||
url = "https://api.company-information.service.gov.uk/search/officers?q=" + name
|
||||
response = make_request(url, name, 'officer name', 'officers')
|
||||
filtered_results = []
|
||||
if response and 'items' in response:
|
||||
for officer in response['items']:
|
||||
if 'date_of_birth' in officer.keys() and 'date_of_birth' in officer_data.keys():
|
||||
if officer['date_of_birth'] == officer_data['date_of_birth'] and officer['links']['self'] != officer_self_link:
|
||||
filtered_results.append(officer)
|
||||
return filtered_results
|
||||
else:
|
||||
return []
|
||||
|
||||
def get_companies_at_address(address):
|
||||
url = "https://api.company-information.service.gov.uk/advanced-search/companies?location=" + address + "&size=" + size
|
||||
return make_request(url, address, 'address', 'companies')
|
||||
|
||||
def get_officers_at_address(address):
|
||||
url = "https://api.company-information.service.gov.uk/search/officers?q=location:" + address
|
||||
response = make_request(url, address, 'address', 'officers')
|
||||
officers = []
|
||||
if response and 'items' in response:
|
||||
word_list = address.replace(',','').split()
|
||||
for officer in response['items']:
|
||||
if all(word in officer['address_snippet'] for word in word_list):
|
||||
officers.append(officer)
|
||||
# Always return a list (possibly empty) so callers can safely call len() on the result
|
||||
return officers
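`get_officers_at_address` keeps only the search hits whose `address_snippet` contains every word of the queried address. A minimal, API-free sketch of that filter is below; the function name `filter_by_address_words` and the sample records are assumptions for illustration.

```python
def filter_by_address_words(items, address):
    """Keep items whose 'address_snippet' contains every word of `address`."""
    words = address.replace(',', '').split()
    return [
        item for item in items
        # Plain substring test, matching the check used in the module above.
        if all(word in item.get('address_snippet', '') for word in words)
    ]


# Hypothetical search results.
hits = [
    {"title": "A", "address_snippet": "1 High Street, Leeds, LS1 1AA"},
    {"title": "B", "address_snippet": "11 Low Road, Leeds, LS1 1AA"},
]
matches = filter_by_address_words(hits, "1 High Street, Leeds")
print([m["title"] for m in matches])
```

Because the test is a case-sensitive substring match, a word such as `1` also matches inside `11`; comparing whole tokens would be stricter if false positives become a problem.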
|
||||
334
sugartrail/base.py
Normal file
@@ -0,0 +1,334 @@
|
||||
from sugartrail import api
|
||||
from sugartrail import processing
|
||||
import pandas as pd
|
||||
import IPython
|
||||
import numpy as np
|
||||
import math
|
||||
import warnings
|
||||
from string import ascii_lowercase as alc
|
||||
warnings.simplefilter(action='ignore', category=FutureWarning)
|
||||
pd.set_option('display.max_columns', 500)
|
||||
pd.set_option('display.max_rows', 150)
|
||||
|
||||
class Network:
|
||||
def __init__(self, officer_id=None, company_id=None, address=None):
|
||||
self.addresses = pd.DataFrame(columns=['address','n','link_type','node_type','node','lat','lon'])
|
||||
self.officer_ids = pd.DataFrame(columns=['officer_id','name','n','link_type','node_type','node'])
|
||||
self.company_ids = pd.DataFrame(columns=['company_id','n','link_type','node_type','node',])
|
||||
self.companies = pd.DataFrame(columns=['company_number','n'])
|
||||
self.address_history = pd.DataFrame(columns=['company_number', 'address', 'start_date', 'end_date', 'lat', 'lon'])
|
||||
self._officer_id = officer_id
|
||||
self._company_id = company_id
|
||||
self._address = address
|
||||
self.n = 0
|
||||
self.link_type = None
|
||||
self.initialise_dataframe()
|
||||
self.hop = self.Hop()
|
||||
self.hop_history = pd.DataFrame()
|
||||
self.maxsize_entities = pd.DataFrame(columns=['node','type', 'maxsize_type', 'size'])
|
||||
|
||||
@property
|
||||
def officer_id(self):
|
||||
return self._officer_id
|
||||
|
||||
@officer_id.setter
|
||||
def officer_id(self, new_value):
|
||||
self._officer_id = new_value
|
||||
self._company_id = None
|
||||
self._address = None
|
||||
self.initialise_dataframe()
|
||||
|
||||
@property
|
||||
def company_id(self):
|
||||
return self._company_id
|
||||
|
||||
@company_id.setter
|
||||
def company_id(self, new_value):
|
||||
self._company_id = new_value
|
||||
self._officer_id = None
|
||||
self._address = None
|
||||
self.initialise_dataframe()
|
||||
|
||||
@property
|
||||
def address(self):
|
||||
return self._address
|
||||
|
||||
@address.setter
|
||||
def address(self, new_value):
|
||||
self._address = new_value
|
||||
self._company_id = None
|
||||
self._officer_id = None
|
||||
self.initialise_dataframe()
|
||||
|
||||
def initialise_dataframe(self):
|
||||
self.company_ids = self.company_ids.iloc[0:0]
|
||||
self.officer_ids = self.officer_ids.iloc[0:0]
|
||||
self.addresses = self.addresses.iloc[0:0]
|
||||
if self._officer_id:
|
||||
self.officer_ids = self.officer_ids.append({'officer_id': self._officer_id, 'name': api.get_appointments(self._officer_id)[0]['name'], 'n':self.n, 'link_type': None, 'node_type': None, 'node': None}, ignore_index=True)
|
||||
elif self._company_id:
|
||||
self.company_ids = self.company_ids.append({'company_id': self._company_id, 'n':self.n, 'link_type': None, 'node_type': None, 'node': None}, ignore_index=True)
|
||||
company = api.get_company(self._company_id)
|
||||
company['n'] = self.n
|
||||
company['link_type'] = self.link_type
|
||||
self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
elif self._address:
|
||||
self.addresses = self.addresses.append({'address': self._address, 'n':self.n, 'link_type': None, 'node_type': None, 'node': None,}, ignore_index=True)
|
||||
else:
|
||||
print("No input provided. Please provide either officer_id, company_id or address value as input.")
|
||||
|
||||
def get_company_from_id(self, company_df=None, company_id=None, print_progress=True):
|
||||
company_list = []
|
||||
if company_id:
|
||||
if company_id in self.company_ids['company_id'].unique():
|
||||
company_list = [company_id]
|
||||
else:
|
||||
print("add valid company id")
|
||||
else:
|
||||
company_list = self.company_ids['company_id'].unique()
|
||||
for i, company_id in enumerate(company_list):
|
||||
IPython.display.clear_output(wait=True)
|
||||
if print_progress:
|
||||
print("Processed " + str(i+1) + "/" + str(len(company_list)) + " companies.")
|
||||
if company_id not in self.companies['company_number'].unique():
|
||||
if company_df is not None:
|
||||
try:
|
||||
company = company_df[company_df[" CompanyNumber"] == str(company_id)]["CompanyName"].item()
|
||||
if company:
|
||||
self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
except Exception:
|
||||
try:
|
||||
company = api.get_company(company_id)
|
||||
if company:
|
||||
self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
except Exception:
|
||||
print(f"Failed to get data for {company_id}")
|
||||
else:
|
||||
company = api.get_company(company_id)
|
||||
if company:
|
||||
self.companies = self.companies.append(pd.json_normalize(company), ignore_index=True)
|
||||
|
||||
def run_map_preprocessing(self):
|
||||
self.get_company_from_id()
|
||||
self.get_coords()
|
||||
return
|
||||
|
||||
def get_coords(self):
|
||||
for i, row in self.addresses.iterrows():
|
||||
IPython.display.clear_output(wait=True)
|
||||
print("Processed " + str(i+1) + "/" + str(len(self.addresses)) + " addresses.")
|
||||
if row.isnull()['lat'] and row.isnull()['lon']:
|
||||
coords = processing.get_coords_from_address(row['address'])
|
||||
if coords:
|
||||
self.addresses.loc[i, 'lat'] = coords['lat']
|
||||
self.addresses.loc[i, 'lon'] = coords['lon']
|
||||
else:
|
||||
print("No coords found: " + row['address'])
|
||||
historic_indices = self.address_history.index[self.address_history["address"]==row['address']].tolist()
|
||||
for j in historic_indices:
|
||||
self.address_history.loc[j, "lon"] = self.addresses.loc[i, 'lon']
|
||||
self.address_history.loc[j, "lat"] = self.addresses.loc[i, 'lat']
|
||||
|
||||
def find_path(self, select_company):
|
||||
network_link_type_rows = self.company_ids.loc[self.company_ids['company_id'] == select_company]
|
||||
path = []
|
||||
self.get_company_from_id(company_id=select_company, print_progress=False)
|
||||
for i, row in network_link_type_rows.iterrows():
|
||||
path.insert(0, {'n': row['n'], "type": "Company", "id": select_company, "value": self.companies[self.companies["company_number"] == select_company]['company_name'].item(), "link_type": row['link_type'], "link": row['node']})
|
||||
search_terms = [{'n': row['n']-1, 'node_type':row['node_type'], 'node':row['node']}]
|
||||
for j in range(row['n']-1,-1,-1):
|
||||
for term in search_terms:
|
||||
if term['n'] == j:
|
||||
if term['node_type'] == "Address":
|
||||
select_rows = self.addresses.loc[(self.addresses['address'] == term['node']) & (self.addresses['n'] == j)]
|
||||
for k, select_row in select_rows.iterrows():
|
||||
if select_row['n'] == 0:
|
||||
origin = {'n': j, "type": "Address", "id": select_row['address'], "value": select_row['address'], "link_type": "", "link": ""}
|
||||
if origin not in path:
|
||||
path.insert(0, origin)
|
||||
break
|
||||
else:
|
||||
item = {'n': j, "type": "Address", "id": select_row['address'], "value": select_row['address'], "link_type": select_row['link_type'], "link": select_row['node']}
|
||||
if item not in path:
|
||||
path.insert(0, item)
|
||||
search_terms.append({'n': j-1, 'node_type':select_row['node_type'], 'node':select_row['node']})
|
||||
break
|
||||
elif term['node_type'] == "Company":
|
||||
select_rows = self.company_ids.loc[(self.company_ids['company_id'] == term['node']) & (self.company_ids['n'] == j)]
|
||||
for l, select_row in select_rows.iterrows():
|
||||
self.get_company_from_id(company_id=select_row['company_id'], print_progress=False)
|
||||
if select_row['n'] == 0:
|
||||
origin = {'n': j, "type": "Company", "id": select_row['company_id'], "value": self.companies[self.companies["company_number"] == select_row['company_id']]['company_name'].item(), "link_type": "", "link": ""}
|
||||
if origin not in path:
|
||||
path.insert(0, origin)
|
||||
break
|
||||
else:
|
||||
item = {'n': j, "type": "Company", "id": select_row['company_id'], "value": self.companies[self.companies["company_number"] == select_row['company_id']]['company_name'].item(), "link_type": select_row['link_type'], "link": select_row['node']}
|
||||
if item not in path:
|
||||
path.insert(0, item)
|
||||
search_terms.append({'n': j-1, 'node_type':select_row['node_type'], 'node':select_row['node']})
|
||||
elif term['node_type'] == "Person":
|
||||
select_rows = self.officer_ids.loc[(self.officer_ids['officer_id'] == term['node']) & (self.officer_ids['n'] == j)]
|
||||
for m, select_row in select_rows.iterrows():
|
||||
if select_row['n'] == 0:
|
||||
origin = {'n': j, "type": "Person", "id": select_row["officer_id"], "value": select_row['name'], "link_type": "", "link": ""}
|
||||
if origin not in path:
|
||||
path.insert(0, origin)
|
||||
break
|
||||
else:
|
||||
item = {'n': j, "type": "Person", "id": select_row["officer_id"], "value": str(select_row['name']), "link_type": str(select_row['link_type']), "link": select_row['node']}
|
||||
if item not in path:
|
||||
path.insert(0, item)
|
||||
search_terms.append({'n': j-1, 'node_type':select_row['node_type'], 'node':select_row['node']})
|
||||
break
|
||||
else:
|
||||
print(f"{term['node_type']} is an invalid node_type")
|
||||
break
|
||||
sorted_path = sorted(path, key=lambda d: d['n'])
|
||||
for i in range(len(sorted_path)-1,-1,-1):
|
||||
search_term = sorted_path[i]['link']
|
||||
link_indices = []
|
||||
for j,item in enumerate(sorted_path):
|
||||
if item['id'] == search_term:
|
||||
link_indices.append(alc[j].upper())
|
||||
sorted_path[i]["links_to"] = ','.join(link_indices)
|
||||
sorted_path[i]["node_index"] = alc[i].upper()
|
||||
return sorted_path
|
||||
|
||||
def perform_hop(self, hops):
|
||||
for hop in range(hops):
|
||||
selected_addresses = self.addresses.loc[self.addresses['n'] == self.n]['address']
|
||||
selected_companies = self.company_ids.loc[self.company_ids['n'] == self.n]['company_id']
|
||||
selected_officers = self.officer_ids.loc[self.officer_ids['n'] == self.n]['officer_id']
|
||||
if len(selected_addresses.index) == 0 and len(selected_companies.index) == 0 and len(selected_officers.index) == 0:
|
||||
print("Edge of network reached: no new nodes to expand.")
|
||||
break
|
||||
else:
|
||||
self.n += 1
|
||||
self.hop_history = self.hop_history.append(self.hop.__dict__, ignore_index=True)
|
||||
for i,address in enumerate(selected_addresses):
|
||||
self.hop.search_address(self, address)
|
||||
IPython.display.clear_output(wait=True)
|
||||
print("Hop number: " + str(hop+1))
|
||||
print("Processed " + str(i+1) + "/" + str(len(selected_addresses)) + " addresses.")
|
||||
for j,company in enumerate(selected_companies):
|
||||
self.hop.search_company_id(self,company)
|
||||
IPython.display.clear_output(wait=True)
|
||||
print("Hop number: " + str(hop+1))
|
||||
print("Processed " + str(len(selected_addresses)) + "/" + str(len(selected_addresses)) + " addresses.")
|
||||
print("Processed " + str(j+1) + "/" + str(len(selected_companies)) + " companies.")
|
||||
for k,officer in enumerate(selected_officers):
|
||||
self.hop.search_officer_id(self,officer)
|
||||
IPython.display.clear_output(wait=True)
|
||||
print("Hop number: " + str(hop+1))
|
||||
print("Processed " + str(len(selected_addresses)) + "/" + str(len(selected_addresses)) + " addresses.")
|
||||
print("Processed " + str(len(selected_companies)) + "/" + str(len(selected_companies)) + " companies.")
|
||||
print("Processed " + str(k+1) + "/" + str(len(selected_officers)) + " officers.")
|
||||
|
||||
class Hop:
|
||||
def __init__(self):
|
||||
self.get_company_officers = True
|
||||
self.get_company_address_history = True
|
||||
self.get_psc_correspondance_address = True
|
||||
self.get_officer_appointments = True
|
||||
self.officer_appointments_maxsize = 2000
|
||||
self.get_officer_correspondance_address = True
|
||||
self.get_officer_duplicates = True
|
||||
self.officer_duplicates_maxsize = None
|
||||
self.get_officers_at_address = True
|
||||
self.officers_at_address_maxsize = 1000
|
||||
self.get_companies_at_address = True
|
||||
self.companies_at_address_maxsize = 500
|
||||
|
||||
def search_company_id(self, network, company_id):
|
||||
officers = []
|
||||
if self.get_company_officers:
|
||||
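# Call the API once and reuse the response (each request also sleeps for 0.5 s)
|
||||
company_officers = api.get_company_officers(company_id)
|
||||
if company_officers:
|
||||
officers = company_officers['items']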
|
||||
network.node_type = "Company"
|
||||
network.node = company_id
|
||||
if officers:
|
||||
for officer in officers:
|
||||
if processing.normalise_address(officer['address']) not in network.addresses[network.addresses['n'] < network.n]['address'].unique():
|
||||
network.link_type = "Officer Correspondence Address"
|
||||
network.addresses = network.addresses.append({'address': processing.normalise_address(officer['address']), 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
if officer['links']['officer']['appointments'].split('/')[2] not in network.officer_ids[network.officer_ids['n'] < network.n]['officer_id'].unique():
|
||||
network.link_type = "Officer"
|
||||
network.officer_ids = network.officer_ids.append({'officer_id': officer['links']['officer']['appointments'].split('/')[2], 'name': processing.normalise_name(officer['name']), 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
if self.get_psc_correspondance_address:
|
||||
psc = api.get_psc(company_id)
|
||||
if psc:
|
||||
for person in psc['items']:
|
||||
if "address" in person:
|
||||
network.link_type = "Person of Significant Control Address"
|
||||
if processing.normalise_address(person['address']) not in network.addresses[network.addresses['n'] < network.n]['address'].unique():
|
||||
network.addresses = network.addresses.append({'address': processing.normalise_address(person['address']), 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
if self.get_company_address_history:
|
||||
address_history = processing.build_address_history(company_id)
|
||||
network.address_history = network.address_history.append(address_history, ignore_index=True)
|
||||
for address in address_history:
|
||||
network.link_type = "Historic Address"
|
||||
if address['address'] not in network.addresses[network.addresses['n'] < network.n]['address'].unique():
|
||||
network.addresses = network.addresses.append({'address': address['address'], 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
network.address_history = network.address_history.drop_duplicates().reset_index(drop=True)
|
||||
network.addresses = network.addresses.drop_duplicates().reset_index(drop=True)
|
||||
network.officer_ids = network.officer_ids.drop_duplicates().reset_index(drop=True)
|
||||
|
||||
def search_officer_id(self, network, officer_id):
|
||||
network.node_type = "Person"
|
||||
network.node = officer_id
|
||||
if self.get_officer_appointments:
|
||||
appointments = api.get_appointments(officer_id)
|
||||
if self.officer_appointments_maxsize is None or len(appointments) < int(self.officer_appointments_maxsize):
|
||||
for appointment in appointments:
|
||||
if processing.normalise_address(appointment['address']) not in network.addresses[network.addresses['n'] < network.n]['address'].unique():
|
||||
network.link_type = "Appointment Address"
|
||||
network.addresses = network.addresses.append({'address': processing.normalise_address(appointment['address']), 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
if appointment['appointed_to']['company_number'] not in network.company_ids[network.company_ids['n'] < network.n]['company_id'].unique():
|
||||
network.link_type = "Appointment"
|
||||
network.company_ids = network.company_ids.append({'company_id': appointment['appointed_to']['company_number'], 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
elif len(appointments) >= int(self.officer_appointments_maxsize):
|
||||
network.maxsize_entities = network.maxsize_entities.append({'node':officer_id,'type': 'Officer', 'maxsize_type': 'Appointments', 'size': len(appointments)}, ignore_index=True)
|
||||
if self.get_officer_correspondance_address:
|
||||
correspondance_address = api.get_correspondance_address(officer_id)
|
||||
if correspondance_address:
|
||||
if processing.normalise_address(correspondance_address['items'][0]['address']) not in network.addresses[network.addresses['n'] < network.n]['address'].unique():
|
||||
network.link_type = "Officer Correspondence Address"
|
||||
network.addresses = network.addresses.append({'address': processing.normalise_address(correspondance_address['items'][0]['address']), 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
if self.get_officer_duplicates:
|
||||
duplicate_officers = api.get_duplicate_officers(officer_id)
|
||||
if self.officer_duplicates_maxsize is None or len(duplicate_officers) < int(self.officer_duplicates_maxsize):
|
||||
for duplicate in duplicate_officers:
|
||||
network.link_type = "Duplicate Officer"
|
||||
if duplicate['links']['self'].split('/')[2] not in network.officer_ids[network.officer_ids['n'] < network.n]['officer_id'].unique():
|
||||
network.officer_ids = network.officer_ids.append({'officer_id': duplicate['links']['self'].split('/')[2], 'name': duplicate['title'], 'n':network.n, 'link_type': network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
elif len(duplicate_officers) >= int(self.officer_duplicates_maxsize):
|
||||
network.maxsize_entities = network.maxsize_entities.append({'node':officer_id,'type': 'Officer', 'maxsize_type': 'Duplicates', 'size': len(duplicate_officers)}, ignore_index=True)
|
||||
network.addresses = network.addresses.drop_duplicates().reset_index(drop=True)
|
||||
network.officer_ids = network.officer_ids.drop_duplicates().reset_index(drop=True)
|
||||
network.company_ids = network.company_ids.drop_duplicates().reset_index(drop=True)
|
||||
|
||||
def search_address(self, network, address):
|
||||
network.node_type = "Address"
|
||||
network.node = address
|
||||
if self.get_companies_at_address:
|
||||
companies = api.get_companies_at_address(address)
|
||||
if companies:
|
||||
if self.companies_at_address_maxsize is None or len(companies['items']) < int(self.companies_at_address_maxsize):
|
||||
for company in companies['items']:
|
||||
network.link_type = "Company at Address"
|
||||
if company['company_number'] not in network.company_ids[network.company_ids['n'] < network.n]['company_id'].unique():
|
||||
network.company_ids = network.company_ids.append({'company_id': company['company_number'], 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
elif len(companies['items']) >= int(self.companies_at_address_maxsize):
|
||||
network.maxsize_entities = network.maxsize_entities.append({'node':address,'type': 'Address', 'maxsize_type': 'Companies', 'size': len(companies['items'])},ignore_index=True)
|
||||
if self.get_officers_at_address:
|
||||
officers = api.get_officers_at_address(address)
|
||||
if self.officers_at_address_maxsize is None or len(officers) < int(self.officers_at_address_maxsize):
|
||||
for officer in officers:
|
||||
network.link_type = "Officer at Address"
|
||||
if officer['links']['self'].split('/')[2] not in network.officer_ids[network.officer_ids['n'] < network.n]['officer_id'].unique():
|
||||
network.officer_ids = network.officer_ids.append({'officer_id': officer['links']['self'].split('/')[2], 'name': officer['title'], 'n':network.n, 'link_type':network.link_type, 'node_type': network.node_type, 'node': network.node}, ignore_index=True)
|
||||
elif len(officers) >= int(self.officers_at_address_maxsize):
|
||||
network.maxsize_entities = network.maxsize_entities.append({'node':address,'type': 'Address', 'maxsize_type': 'Officers', 'size': len(officers)},ignore_index=True)
|
||||
network.officer_ids = network.officer_ids.drop_duplicates().reset_index(drop=True)
|
||||
network.company_ids = network.company_ids.drop_duplicates().reset_index(drop=True)
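`perform_hop` expands the network one hop at a time and stores, for every new node, the hop number `n` and the `node` it was reached from; `find_path` later walks those links back to the origin. The sketch below shows the same breadth-first idea on a toy in-memory graph. The names `expand`, `trace_back` and the sample graph are assumptions, and unlike `Network` it keeps only the first parent found for each node.

```python
def expand(graph, origin, hops):
    """Breadth-first expansion recording hop number and parent per node."""
    seen = {origin: {"n": 0, "node": None}}
    frontier = [origin]
    for n in range(1, hops + 1):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen[neighbour] = {"n": n, "node": node}
                    next_frontier.append(neighbour)
        if not next_frontier:
            break  # nothing new was found, so the edge of the network is reached
        frontier = next_frontier
    return seen


def trace_back(seen, target):
    """Follow the stored parent links from `target` back to the origin."""
    path = []
    while target is not None:
        path.insert(0, target)
        target = seen[target]["node"]
    return path


# Hypothetical toy graph: companies, officers and addresses as plain strings.
graph = {
    "company A": ["officer X", "address 1"],
    "officer X": ["company B"],
    "address 1": ["company B", "company C"],
}
seen = expand(graph, "company A", hops=2)
print(trace_back(seen, "company C"))
```

Storing the parent alongside the hop number is what makes the path reconstruction a simple walk rather than a second search.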
|
||||
114
sugartrail/mapview.py
Normal file
@@ -0,0 +1,114 @@
|
||||
from ipywidgets import HTML, Widget, Layout, Output, VBox, HBox, Textarea
|
||||
from ipyleaflet import Map, Marker, MarkerCluster, AwesomeIcon, AntPath, Popup
|
||||
import pandas as pd
|
||||
from datetime import datetime
|
||||
import functools
|
||||
from string import ascii_lowercase as alc
|
||||
|
||||
def build_map(network):
|
||||
Widget.close_all()
|
||||
m, path_table = load_map_data(network)
|
||||
return m, path_table
|
||||
|
||||
def get_address_path(network, company_id):
|
||||
company_address_history = network.address_history.loc[network.address_history['company_number'] == company_id]
|
||||
address_path = []
|
||||
for index, row in company_address_history.iterrows():
|
||||
address_path.insert(0,[row['lat'], row['lon']])
|
||||
return address_path
|
||||
|
||||
def locations_from_origin_path(path, network):
|
||||
locations = []
|
||||
for node in path:
|
||||
if node['type'] == 'Company':
|
||||
last_company_address_row = network.address_history.loc[network.address_history['company_number'] == node['id']].iloc[:1]
|
||||
lat = last_company_address_row['lat'].item()
|
||||
lon = last_company_address_row['lon'].item()
|
||||
locations.append([float(lat),float(lon)])
|
||||
elif node['type'] == 'Address':
|
||||
address_row = network.addresses.loc[network.addresses['address'] == node['value']].iloc[:1]
|
||||
lat = address_row['lat'].item()
|
||||
lon = address_row['lon'].item()
|
||||
locations.append([float(lat),float(lon)])
|
||||
return locations
|
||||
|
||||
def on_button_clicked(address_path, path, location, address_trail, path_table, origin_trail, locations_from_origin, **kwargs):
|
||||
address_trail.locations = address_path
|
||||
locations_from_origin[-1] = location
|
||||
origin_trail.locations = locations_from_origin
|
||||
path_table.value = html_table_generator(path)
|
||||
return
|
||||
|
||||
def html_table_generator(path):
|
||||
table_style = '<style>table {font-family: arial, sans-serif;border-collapse: collapse;}td, th {border: 1px solid #dddddd;text-align: left;padding: 8px;}tr:nth-child(even) {background-color: #dddddd;}</style>'
|
||||
headers = ['Node Index', 'Node', 'Hop', 'Link Type', 'Links To']
|
||||
headers_row = ""
|
||||
for header in headers:
|
||||
headers_row += '<th>' + header + '</th>'
|
||||
nodes = ""
|
||||
for i, node in enumerate(path):
|
||||
nodes += '<tr><td>' + node['node_index'] + '</td><td>' + str(node['value']) + '</td><td>' + str(node['n']) + '</td><td>' + str(node['link_type']) + '</td><td>' + str(node['links_to']) + '</td></tr>'
|
||||
table_html = table_style + '<table><tr>' + headers_row + '</tr>' + nodes + '</table>'
|
||||
return table_html
|
||||
|
||||
def load_map_data(network):
|
||||
address_trail = AntPath(
|
||||
locations=[],
|
||||
dash_array=[1,10],
|
||||
delay=1000,
|
||||
color='#ed2f2f',
|
||||
pulse_color='#FFFFFF'
|
||||
)
|
||||
origin_trail = AntPath(
|
||||
locations=[],
|
||||
dash_array=[1,10],
|
||||
delay=1000,
|
||||
color='#000000',
|
||||
pulse_color='#FFFFFF'
|
||||
)
|
||||
path_table = HTML(
|
||||
value=""
|
||||
)
|
||||
m = Map(center=(50, 0),
|
||||
zoom=5,
|
||||
layout=Layout(width='90%', height='650px'))
|
||||
m.add_layer(address_trail)
|
||||
m.add_layer(origin_trail)
|
||||
marker_cluster = MarkerCluster(
|
||||
center=(50, 0),
|
||||
markers=get_marker_data(network, address_trail, origin_trail, path_table),
|
||||
disable_clustering_at_zoom = 25,
|
||||
max_cluster_radius = 25
|
||||
)
|
||||
m.add_layer(marker_cluster)
|
||||
return m, path_table
|
||||
|
||||
def get_marker_data(network,address_trail, origin_trail, path_table):
|
||||
address_trail=address_trail
|
||||
origin_trail=origin_trail
|
||||
ms = []
|
||||
for index, row in network.address_history.iterrows():
|
||||
path = ""
|
||||
locations_from_origin = ""
|
||||
message = HTML()
|
||||
marker_color = "green"
|
||||
company = network.companies.loc[network.companies['company_number'] == row['company_number']]
|
||||
company_name = company['company_name'].item()
|
||||
company_status = company['company_status'].item()
|
||||
if company_status == "active":
|
||||
if row['end_date'] is not None:
|
||||
marker_color = "red"
|
||||
else:
|
||||
marker_color = "black"
|
||||
address = row['address']
|
||||
path = network.find_path(str(row['company_number']))
|
||||
locations_from_origin = locations_from_origin_path(path, network)
|
||||
message.value = company_name + "<hr>" + address
|
||||
icon = AwesomeIcon(
|
||||
marker_color=marker_color
|
||||
)
|
||||
address_path = get_address_path(network,str(row['company_number']))
|
||||
marker = Marker(icon=icon, location=(row['lat'], row['lon']), draggable=False, popup=message, title="Address")
|
||||
marker.on_click(functools.partial(on_button_clicked, address_path=address_path, address_trail=address_trail, path_table=path_table, origin_trail=origin_trail, path=path, location=(row['lat'], row['lon']), locations_from_origin = locations_from_origin))
|
||||
ms.append(marker)
|
||||
return ms
|
||||
123
sugartrail/processing.py
Normal file
@@ -0,0 +1,123 @@
|
||||
from sugartrail import api
|
||||
import requests
|
||||
import pandas as pd
|
||||
import random
|
||||
import urllib.parse
|
||||
import regex as re
|
||||
|
||||
def infer_postcode(address_string):
|
||||
postcode = re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', address_string)
|
||||
if postcode:
|
||||
return postcode[0]
|
||||
else:
|
||||
return
|
||||
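A quick illustration of what the pattern in `infer_postcode` accepts. This sketch uses the standard-library `re` module rather than the third-party `regex` package imported above, and `find_postcode` is a hypothetical name used only here. The pattern is case-sensitive, so lower-case postcodes are not matched.

```python
import re

# Same pattern as in infer_postcode, compiled once for reuse.
POSTCODE_PATTERN = re.compile(
    r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b'
)


def find_postcode(address_string):
    """Return the first UK-style postcode in the string, or None."""
    match = POSTCODE_PATTERN.search(address_string)
    return match.group(0) if match else None


print(find_postcode("10 Downing Street London SW1A 2AA"))  # SW1A 2AA
print(find_postcode("10 Downing Street London sw1a 2aa"))  # None
```

Upper-casing the address before matching would be one way to accept lower-case input, if that is wanted.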
|
||||
def load_company_data(company_data_filepath):
|
||||
try:
|
||||
company_data = pd.read_csv(company_data_filepath)
|
||||
return company_data
|
||||
except Exception:  # a bare except would also swallow KeyboardInterrupt and SystemExit
|
||||
return
|
||||
|
||||
def get_nearby_postcode(postcode_string):
|
||||
url = "http://api.postcodes.io/postcodes/" + postcode_string[:-1] + "/autocomplete"
|
||||
response = requests.get(url).json()
|
||||
if response.get('result'):  # skips both a missing/None result and an empty candidate list
|
||||
closest_address = {}
|
||||
for postcode in response["result"]:
|
||||
distance = abs(ord(postcode[-1]) - ord(postcode_string[-1]))
|
||||
if closest_address:
|
||||
if distance < closest_address["distance"]:
|
||||
closest_address = {"postcode": postcode, "distance": distance}
|
||||
else:
|
||||
closest_address = {"postcode": postcode, "distance": distance}
|
||||
return closest_address["postcode"]
|
||||
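The selection step in `get_nearby_postcode` can be tried without calling postcodes.io. A minimal sketch, assuming the autocomplete candidates are already in hand; `closest_postcode` and the sample postcodes are illustrative and not part of the module. As in the function above, "distance" is only the gap between the final letters, not a geographic distance.

```python
def closest_postcode(target, candidates):
    """Pick the candidate whose last character is nearest to the target's."""
    if not candidates:
        return None
    # min() keeps the first candidate on ties, like the strict '<' comparison above.
    return min(candidates, key=lambda pc: abs(ord(pc[-1]) - ord(target[-1])))


candidates = ["SW1A 2AB", "SW1A 2AD", "SW1A 2AG"]
print(closest_postcode("SW1A 2AA", candidates))  # SW1A 2AB
print(closest_postcode("SW1A 2AA", []))          # None
```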
|
||||
def get_coords_from_address(address_string):
|
||||
address = urllib.parse.quote(address_string)
|
||||
url = 'https://nominatim.openstreetmap.org/search?q=' + address + '&format=json'  # address is already URL-quoted above
|
||||
response = requests.get(url).json()
|
||||
if response:
|
||||
return {'lat': response[0]['lat'], 'lon': response[0]['lon'], 'address': address_string}
|
||||
else:
|
||||
postcode_string = infer_postcode(address_string)
|
||||
if postcode_string:
|
||||
url = "http://api.postcodes.io/postcodes/" + urllib.parse.quote(postcode_string)
|
||||
response = requests.get(url).json()
|
||||
if str(response['status']) == '200':
|
||||
return {'lat': response['result']['latitude'], 'lon': response['result']['longitude'], 'postcode': postcode_string}
|
||||
else:
|
||||
# try nearby postcode:
|
||||
nearby_postcode = get_nearby_postcode(postcode_string)
|
||||
if nearby_postcode:
|
||||
url = "http://api.postcodes.io/postcodes/" + urllib.parse.quote(nearby_postcode)
|
||||
response = requests.get(url).json()
|
||||
if str(response['status']) == "200":
|
||||
return {'lat': response['result']['latitude'], 'lon': response['result']['longitude'], 'postcode': nearby_postcode}
|
||||
else:
|
||||
print("Geocoding failed for: " + address_string)
|
||||
else:
|
||||
print("No postcode found for: " + address_string)
|
||||
|
||||
def normalise_name(name):
|
||||
name_list = name.replace(',','').split(" ")
|
||||
name_list.append(name_list.pop(0))
|
||||
return ' '.join(name_list)
|
||||
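For reference, `normalise_name` moves the first token (the surname in "SURNAME, Forenames" strings) to the end. The sketch below repeats the same three steps under a hypothetical name so it can be run on its own; the sample name is made up.

```python
def move_surname_last(name):
    """Turn 'SURNAME, Forename Middle' into 'Forename Middle SURNAME'."""
    parts = name.replace(',', '').split(" ")
    parts.append(parts.pop(0))  # rotate the first token to the end
    return ' '.join(parts)


print(move_surname_last("SMITH, John David"))  # John David SMITH
```

A multi-word surname such as "DE SOUZA, Ana" would only have its first word moved, which is a limitation of this approach.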
|
||||
def process_address_changes(address_changes):
|
||||
for i in reversed(range(1,len(address_changes['items']))):
|
||||
if 'new_address' not in address_changes['items'][i]['description_values'].keys():
|
||||
if 'old_address' in address_changes['items'][i-1]['description_values'].keys():
|
||||
address_changes['items'][i]['description_values']['new_address'] = address_changes['items'][i-1]['description_values']['old_address']
|
||||
return address_changes
|
||||
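A small worked example of the back-fill performed by `process_address_changes`, assuming the filing items are ordered newest first. The function is repeated here under a hypothetical name so the snippet runs standalone, and the dates and addresses are made up.

```python
def backfill_new_addresses(address_changes):
    """Fill a missing new_address from the next (newer) filing's old_address."""
    items = address_changes['items']
    for i in reversed(range(1, len(items))):
        values = items[i]['description_values']
        newer_values = items[i - 1]['description_values']
        if 'new_address' not in values and 'old_address' in newer_values:
            values['new_address'] = newer_values['old_address']
    return address_changes


sample = {'items': [
    {'date': '2020-01-01',
     'description_values': {'old_address': 'B Street', 'new_address': 'C Street'}},
    {'date': '2018-01-01',
     'description_values': {'old_address': 'A Street'}},
]}

result = backfill_new_addresses(sample)
print(result['items'][1]['description_values']['new_address'])  # B Street
```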
|
||||
def build_address_history(company_id):
|
||||
company_info = api.get_company(company_id)
|
||||
company_info_subset = {k:company_info[k] for k in ("date_of_creation","date_of_cessation","registered_office_address") if k in company_info}
|
||||
address_changes = api.get_address_changes(company_id)
|
||||
address_keys = ('start_date','end_date','address')
|
||||
if address_changes['items']:
|
||||
address_changes = process_address_changes(address_changes)
|
||||
addresses = []
|
||||
entry = {}
|
||||
entry["company_number"] = str(company_id)
|
||||
entry["address"] = str(normalise_address(company_info_subset['registered_office_address']))
|
||||
entry["start_date"] = str(address_changes['items'][0]['date'])
|
||||
if 'date_of_cessation' in company_info_subset:
|
||||
entry["end_date"] = str(company_info_subset['date_of_cessation'])
|
||||
else:
|
||||
entry["end_date"] = None
|
||||
addresses.append(entry)
|
||||
for i,change in enumerate(address_changes['items']):
|
||||
entry = {}
|
||||
entry["company_number"] = str(company_id)
|
||||
if 'old_address' in change['description_values']:
|
||||
entry["address"] = change['description_values']['old_address']
|
||||
else:
|
||||
entry["address"] = ""
|
||||
if i+1 < len(address_changes['items']):
|
||||
entry["start_date"] = str(address_changes['items'][i+1]['date'])
|
||||
else:
|
||||
entry["start_date"] = company_info_subset.get('date_of_creation')
|
||||
entry["end_date"] = str(change['date'])
|
||||
addresses.append(entry)
|
||||
return addresses
|
||||
else:
|
||||
address_history = []
|
||||
entry = {}
|
||||
for k, key in enumerate(["date_of_creation","date_of_cessation","registered_office_address"]):
|
||||
if key in company_info:
|
||||
entry[address_keys[k]] = company_info[key]
|
||||
else:
|
||||
entry[address_keys[k]] = None
|
||||
entry["company_number"] = str(company_id)
|
||||
entry['address'] = normalise_address(entry['address'])
|
||||
return [entry]
|
||||
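The date bookkeeping in `build_address_history` is easier to follow without the API calls. This is a simplified sketch, not the module's implementation: `assemble_history` and its flat `changes` records (newest first, each with a `date` and optional `old_address`) are assumptions made for illustration, and the sample data is invented.

```python
def assemble_history(company_number, current_address, created, ceased, changes):
    """Build address periods, newest first, from a list of address changes."""
    if not changes:
        # No recorded moves: one period spanning the company's lifetime.
        return [{"company_number": company_number, "address": current_address,
                 "start_date": created, "end_date": ceased}]
    # The current address starts at the most recent change.
    history = [{"company_number": company_number, "address": current_address,
                "start_date": changes[0]["date"], "end_date": ceased}]
    for i, change in enumerate(changes):
        # Each old address starts at the next-older change, or at creation.
        start = changes[i + 1]["date"] if i + 1 < len(changes) else created
        history.append({"company_number": company_number,
                        "address": change.get("old_address", ""),
                        "start_date": start, "end_date": change["date"]})
    return history


changes = [{"date": "2020-01-01", "old_address": "B Street"},
           {"date": "2018-01-01", "old_address": "A Street"}]
history = assemble_history("00000000", "C Street", "2015-01-01", None, changes)
for period in history:
    print(period["address"], period["start_date"], period["end_date"])
```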
|
||||
def normalise_address(address_dict):
|
||||
address_list = []
|
||||
for key in ['premises','address_line_1', 'locality','postal_code', 'country']:
|
||||
if address_dict.get(key):  # skip missing, None, or empty parts so join() cannot fail
|
||||
address_list.append(address_dict[key])
|
||||
address_string = ' '.join(address_list)
|
||||
return address_string
|
||||
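As a usage illustration for `normalise_address`, the sketch below flattens a registered-office dictionary into one string in the same key order. It is written under a hypothetical name so it runs standalone, it skips empty or missing parts (an assumption about the desired behaviour), and the sample address is only an example.

```python
ADDRESS_KEYS = ['premises', 'address_line_1', 'locality', 'postal_code', 'country']


def flatten_address(address_dict):
    """Join the known address parts, in a fixed order, into one string."""
    parts = [str(address_dict[key]) for key in ADDRESS_KEYS if address_dict.get(key)]
    return ' '.join(parts)


office = {
    'address_line_1': 'Downing Street',
    'premises': '10',
    'locality': 'London',
    'postal_code': 'SW1A 2AA',
}
print(flatten_address(office))  # 10 Downing Street London SW1A 2AA
```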