Files
sugartrail/notebooks/.ipynb_checkpoints/001_getting_started-checkpoint.ipynb
2022-12-31 13:51:42 +00:00

546 lines
16 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "0639ca05",
"metadata": {},
"source": [
"*In this tutorial we will walk through the capabilities of the tool in depth.*"
]
},
{
"cell_type": "markdown",
"id": "538c9eb1",
"metadata": {},
"source": [
"### Introduction \n",
"\n",
"'Sugartrail' was developed to make it easier and faster for researchers to explore connections between companies, persons and addresses within [Companies House](https://www.gov.uk/government/organisations/companies-house). Researchers can build networks of connected companies, persons and addresses based on a defined set of connectivity criteria and then visualise these connections through an [OpenStreetMaps interface](https://ipyleaflet.readthedocs.io/en/latest/index.html)."
]
},
{
"cell_type": "markdown",
"id": "eee8d524",
"metadata": {},
"source": [
"### Prerequisites\n",
"\n",
"Sugartrail uses the [Companies House Public Data API](https://developer-specs.company-information.service.gov.uk/companies-house-public-data-api/reference) to gather data on connected companies, persons and addresses. To access this API you will need a key which you can aquire by registering a [user account](https://developer.company-information.service.gov.uk/get-started/). Once you've aquired the key, insert it below as the string value of `api.basic_auth.username`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81c37bf3",
"metadata": {},
"outputs": [],
"source": [
"from sugartrail import api, mapview, base\n",
"from ipywidgets import VBox, HBox\n",
"\n",
"api.basic_auth.username = \"\""
]
},
{
"cell_type": "markdown",
"id": "ad4599dc",
"metadata": {},
"source": [
"Lets make a test request to validate everything works by attempting to get all the officers who work at [this company](https://find-and-update.company-information.service.gov.uk/company/12411673). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51a1dd4f",
"metadata": {},
"outputs": [],
"source": [
"company_id = \"12411673\"\n",
"api.get_company_officers(company_id)"
]
},
{
"cell_type": "markdown",
"id": "29d8dd26",
"metadata": {},
"source": [
"### Initialising Networks \n",
"\n",
"To create a network we start from a single company, person or address. Networks are build and stored with the `Network` class. Lets go ahead and create a new network:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63bc00fa",
"metadata": {},
"outputs": [],
"source": [
"network = base.Network()"
]
},
{
"cell_type": "markdown",
"id": "aeedf139",
"metadata": {},
"source": [
"`Network` accepts either a company ID, officer ID or address string as the initial node. For example, [this company](https://find-and-update.company-information.service.gov.uk/company/12411673): `company_id` = \"12411673\"\n",
"\n",
"If we wanted to search by address, then `address` = \"513 Tong Street, Flat 5, Bradford, England, BD4 6NA\""
]
},
{
"cell_type": "markdown",
"id": "f73b17d8",
"metadata": {},
"source": [
"![title](../assets/images/spy.png)"
]
},
{
"cell_type": "markdown",
"id": "b3caccb6",
"metadata": {},
"source": [
"For [this officer](https://find-and-update.company-information.service.gov.uk/officers/6WODVBRaegvY3UvEhcQxg0OsPkc/appointments), `officer_id` = \"6WODVBRaegvY3UvEhcQxg0OsPkc\""
]
},
{
"cell_type": "markdown",
"id": "e21f3c98",
"metadata": {},
"source": [
"![title](../assets/images/scrooge.png)"
]
},
{
"cell_type": "markdown",
"id": "a6198a80",
"metadata": {},
"source": [
"Lets build the network from `company_id`: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31eea99d",
"metadata": {},
"outputs": [],
"source": [
"network.company_id=\"11004735\""
]
},
{
"cell_type": "markdown",
"id": "7bd5060d",
"metadata": {},
"source": [
"We could also just initialise the network by passing `company_id` as an input: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c70f41f",
"metadata": {},
"outputs": [],
"source": [
"network = base.Network(company_id=\"11004735\")"
]
},
{
"cell_type": "markdown",
"id": "cd0f2a9e",
"metadata": {},
"source": [
"Data about companies, persons and addresses are stored in several attributes within the `Network` class. If we check the `company_ids` property, we will find the entry we just created:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e12f5461",
"metadata": {},
"outputs": [],
"source": [
"network.company_ids"
]
},
{
"cell_type": "markdown",
"id": "91c14cbb",
"metadata": {},
"source": [
"Each company is represented by its unique ID (`company_id`), number of hops from the origin company (`n`) and the company, address or person it connects to. As we've only saved the origin company so far, there isn't any information on links or connected nodes. There are also attributes for storing officer ids (`officer_ids`) and (`addresses`) although they have no information in them yet:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "33ed61e2",
"metadata": {},
"outputs": [],
"source": [
"network.officer_ids"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5a52e6a",
"metadata": {},
"outputs": [],
"source": [
"network.addresses"
]
},
{
"cell_type": "markdown",
"id": "72f30427",
"metadata": {},
"source": [
"### Building Networks"
]
},
{
"cell_type": "markdown",
"id": "862f00ef",
"metadata": {},
"source": [
"We can now build the network by performing hops that will find new company IDs, officer IDs and addresses connected to the entities already stored within the network. \n",
"\n",
"There are a finite number of ways that officers, companies and addresses can be connected within Companies House:\n",
"\n",
"#### Companies \n",
"\n",
"1. Companies → Officers: companies have officers \n",
"2. Companies → Addresses: companies have a history of registered addresses \n",
"3. Companies → Addresses: companies have correspondence addresses for their persons of significant control (psc)\n",
"\n",
"#### Officers \n",
"\n",
"4. Officers → Companies: officers have appointments (companies they have a role in) \n",
"5. Officers → Addresses: officers have correspondence addresses\n",
"6. Officers → Officers: officers may have duplicate enteries within Companies House; other officers using the same name and birth date (but different values for `officer_id`\n",
"\n",
"#### Addresses \n",
"\n",
"7. Addresses → Officers: addresses are used as officer correspondence addresses \n",
"8. Addresses → Companies: addresses are used as company correspondence addresses \n",
"\n",
"To build the network we can use any combination of this connectivity criteria. The above connections are implemented as methods that get called everytime we perform a hop: \n",
"\n",
"1. get_company_officers\n",
"2. get_company_address_history\n",
"3. get_psc_correspondance_address\n",
"3. get_officer_appointments\n",
"4. get_officer_correspondance_address \n",
"5. get_officer_duplicates \n",
"6. get_officers_at_address\n",
"7. get_companies_at_address\n",
"\n",
"We can toggle each of these methods via boolean properties of the `Hop` subclass:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32643a9c",
"metadata": {},
"outputs": [],
"source": [
"network.hop.__dict__"
]
},
{
"cell_type": "markdown",
"id": "1802bb34",
"metadata": {},
"source": [
"We can see the `Hop` subclass contains all of the connections mentioned above set to `True` by default, therefore everytime we perform a hop, the network will use these methods to get data.\n",
"\n",
"We also notice that there are some properties setting a \"maxsize\" limit. These properties ensure that if the number of results returned by the method exceeds this limit then the results will not be stored within the `Network` class properties. This limit is quite important when building networks as some of these methods can return 1000s of results and if we're not interested in these results they can make it difficult to visualise meaningful connections within the network (see Tutorial 3 for more on this). \n",
"\n",
"Lets go ahead and perform one hop using these default settings and see what addresses, companies and officers are added:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "167cc25c",
"metadata": {},
"outputs": [],
"source": [
"network.perform_hop(1)"
]
},
{
"cell_type": "markdown",
"id": "2486aa17",
"metadata": {},
"source": [
"Lets now check out `company_ids`, `officer_ids` and `addresses` to see what new enteries have been added. Nothing new in `company_ids` but this is expected as none of the API methods above connect companies with companies in one hop:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6ce4047",
"metadata": {},
"outputs": [],
"source": [
"network.company_ids"
]
},
{
"cell_type": "markdown",
"id": "eb5cb2f6",
"metadata": {},
"source": [
"We can see we now have an officer below in `officer_ids` and some of the other properties in the table now have values other than None. `node_type` describes what the type of node the company is connected to (Company, Person or Address), `node_id` provides the unique id for the node (`company_id`, `officer_id` or `address`) and `link_type` describes the relationship between the company and the node."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "947c4cf1",
"metadata": {},
"outputs": [],
"source": [
"network.officer_ids"
]
},
{
"cell_type": "markdown",
"id": "a8cf6fa0",
"metadata": {},
"source": [
"We can interpret the table above as:\n",
"\n",
"There is an officer with ID=`Nd2URspq4bvLy-hwzDZ0_p7FGJw` who is an officer to a company with ID=`11004735`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7083402a",
"metadata": {},
"outputs": [],
"source": [
"network.addresses"
]
},
{
"cell_type": "markdown",
"id": "264de2dd",
"metadata": {},
"source": [
"We can see from the table above that:\n",
"\n",
"`3rd Floor 13 Charles Ii Street London SW1Y 4QU England` is an address that used to be home to a company (with ID=`11004735`):"
]
},
{
"cell_type": "markdown",
"id": "b4828d92",
"metadata": {},
"source": [
"For reproducibility, each time we perform a hop, the methods and limit configs are stored in "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9bb5f542",
"metadata": {},
"outputs": [],
"source": [
"network.hop_history"
]
},
{
"cell_type": "markdown",
"id": "ac1dab27",
"metadata": {},
"source": [
"Lets perform another two hops: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f2b0baba",
"metadata": {},
"outputs": [],
"source": [
"network.perform_hop(2)"
]
},
{
"cell_type": "markdown",
"id": "cec66fcc",
"metadata": {},
"source": [
"Now we can go ahead and visualise this in a map. To do this we need to get a bit more info that isn't present, namely the coordinates for all the addresses mentioned and the company names for each company. We can get this information via `run_map_preprocessing()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3be52255",
"metadata": {},
"outputs": [],
"source": [
"network.run_map_preprocessing()"
]
},
{
"cell_type": "markdown",
"id": "dfa1b90c",
"metadata": {},
"source": [
"To see the information added, we can check out `address_history` and `companies` properties of our class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b800202c",
"metadata": {},
"outputs": [],
"source": [
"network.address_history"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "37013a7e",
"metadata": {},
"outputs": [],
"source": [
"network.companies "
]
},
{
"cell_type": "markdown",
"id": "3e3b597d",
"metadata": {},
"source": [
"We can now visualise all the companies in the network with a UK address through OpenStreetMaps:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7256c5f9",
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"map_data, path_table = mapview.build_map(network) \n",
"hbox = HBox([path_table])\n",
"vbox = VBox([map_data, hbox])\n",
"vbox"
]
},
{
"cell_type": "markdown",
"id": "7e225045",
"metadata": {},
"source": [
"Each marker represents a company in the network. Green markers represent active companies based at the address, red markers represent active companies no longer based at the address and black markers represent dissolved companies once based at the address. \n",
"\n",
"Select a marker to display additional information: \n",
"- pop-up with the selected company's name and address\n",
"- table containing the most efficient paths from the origin to the selected company\n",
"- antpaths for each company in the network. Red antpath represents the path through all the historic addresses for the selected company. Black antpath represents the path from the network origin through all the addresses in the path to the selected company as displayed in the table. \n",
"\n",
"To read paths from the table we start from the bottom of the table where we find one or several rows containing our selected company (`Node`) but with differing values for `Node Index`, `Node Type` and `Link`. If we encounter multiple rows containing our selected node, this tells us there are multiple paths of equal length from the selected node (origin) to the origin. For example, consider the following table: "
]
},
{
"cell_type": "markdown",
"id": "f6674e52",
"metadata": {},
"source": [
"<img src=\"../assets/images/kingdom_table.png\" alt=\"Drawing\" style=\"width: 700px;\"/>\n"
]
},
{
"cell_type": "markdown",
"id": "fd5d9a0d",
"metadata": {},
"source": [
"Pick N Mix London Limited (E) is a 'company at address' for 3rd Floor 13 Charles Ii Street (C) which is a 'historic address' for Kingdom of Sweets Ltd (A).\n",
"\n",
"Additionally, Pick N Mix London Limited (D) is an appointment of (B) who is an officer of Kingdom of Sweets Ltd (A). "
]
},
{
"cell_type": "markdown",
"id": "4a6662be",
"metadata": {},
"source": [
"### Network Persistance"
]
},
{
"cell_type": "markdown",
"id": "a68e26ca",
"metadata": {},
"source": [
"The network object can be saved with 'pickle' and reloaded when needed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee8d8c24",
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"\n",
"with open('../assets/networks/kingdom_of_sweets_network.pickle', 'wb') as handle:\n",
" pickle.dump(network, handle)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e7c5578",
"metadata": {},
"outputs": [],
"source": [
"with open('../assets/networks/kingdom_of_sweets_network.pickle', 'rb') as handle:\n",
" network = pickle.load(handle)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}