Files
sugartrail/notebooks/003_virtual_offices.ipynb
2024-05-06 17:52:01 +01:00

547 lines
16 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "b2110da7",
"metadata": {},
"source": [
"*In this tutorial we will investigate addresses with a large number of companies registered via the API and Companies House Data Product download.*"
]
},
{
"cell_type": "markdown",
"id": "25528662",
"metadata": {},
"source": [
"### Busy Addresses and API Limits"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab9e8ee0",
"metadata": {},
"outputs": [],
"source": [
"import sugartrail \n",
"import pandas as pd\n",
"import zipfile\n",
"import os"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19c84477",
"metadata": {},
"outputs": [],
"source": [
"sugartrail.api.basic_auth.username = \"\""
]
},
{
"cell_type": "markdown",
"id": "00c6a5be",
"metadata": {},
"source": [
"When navigating Companies House there are times that we will run into some very popular addresses. For example lets say build a network from [this officer](https://find-and-update.company-information.service.gov.uk/officers/Nd2URspq4bvLy-hwzDZ0_p7FGJw/appointments):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "723f234a",
"metadata": {},
"outputs": [],
"source": [
"officer_id = \"Nd2URspq4bvLy-hwzDZ0_p7FGJw\"\n",
"network = sugartrail.base.Network(officer_id=officer_id)\n",
"network.perform_hop(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56481559",
"metadata": {},
"outputs": [],
"source": [
"# network=sugartrail.base.Network(file=f'{sugartrail.const.networks_path}virtual_offices_a.json')"
]
},
{
"cell_type": "markdown",
"id": "edad561e",
"metadata": {},
"source": [
"Within 2 hops we've got over 60 addresses (although many of them look like duplicate entries):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eea32631",
"metadata": {},
"outputs": [],
"source": [
"pd.DataFrame(network.addresses)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7ce897c0",
"metadata": {},
"outputs": [],
"source": [
"pd.DataFrame(network.addresses)['title'].unique()"
]
},
{
"cell_type": "markdown",
"id": "8c17fff5",
"metadata": {},
"source": [
"If we check out the `maxsize_entities` property of our Network class, we will see a dataframe containing all of the addresses and officers that have exceeded the maxsize limits imposed in the Hop class. In this case, we can see that several of the addresses in the network have over 4800 companies based there."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8b8d3c20",
"metadata": {},
"outputs": [],
"source": [
"pd.DataFrame(network.maxsize_entities)"
]
},
{
"cell_type": "markdown",
"id": "5ad7b443",
"metadata": {},
"source": [
"Because we set a limit of 50 companies on the maxsize of companies returned via `companies_at_address_maxsize`, these companies will not be added to `companies_id`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f94f731",
"metadata": {},
"outputs": [],
"source": [
"network.hop.companies_at_address_maxsize"
]
},
{
"cell_type": "markdown",
"id": "2d4edaf0",
"metadata": {},
"source": [
"If we check `companies_id` we'll notice it hasn't had 4800 companies added to it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e3ef12fe",
"metadata": {},
"outputs": [],
"source": [
"len(network.company_ids)"
]
},
{
"cell_type": "markdown",
"id": "d177f1b5",
"metadata": {},
"source": [
"Including limits are useful to avoid our databases getting clogged up with random companies. \n",
"Although lets pause to briefly explore what address would have thousands of companies registered there?"
]
},
{
"cell_type": "markdown",
"id": "e8644d6b",
"metadata": {},
"source": [
"![title](../assets/images/regent_storefront.jpeg)"
]
},
{
"cell_type": "markdown",
"id": "40354a28",
"metadata": {},
"source": [
"\"3rd Floor, 207, Regent Street\" is a \"virtual office\" run by a company called [Hold Everything](https://www.hold-everything.com/). Businesses can use this address for correspondance/registration for £24 a month:"
]
},
{
"cell_type": "markdown",
"id": "11b08c79",
"metadata": {},
"source": [
"![title](../assets/images/exclusive.png)"
]
},
{
"cell_type": "markdown",
"id": "2c9e85ed",
"metadata": {},
"source": [
"However the large number of companies registered at a single address can lead to many instances of mistaken identity. Just because a company is registered at a virtual office does not mean it has any connection with other companies registered there.:"
]
},
{
"cell_type": "markdown",
"id": "be5e4352",
"metadata": {},
"source": [
"![title](../assets/images/review.png)"
]
},
{
"cell_type": "markdown",
"id": "282ba8ea",
"metadata": {},
"source": [
"Numerous media outlets have reported on fraudulent companies that use virtual offices and incorporation services: \n",
"- Kemp House, 162 City Road | Capital Officer: [Mystery group took millions in furlough funds - Financial Times](https://www.ft.com/content/b3c70369-5170-47ca-b779-fc0898fd29e6)\n",
"- 20-22 Wenlock Road | Made Simple: [Court shuts down companies behind £9m truffle scam - Gov.uk](https://www.gov.uk/government/news/court-shuts-down-companies-behind-9m-truffle-scam)\n",
"- 2 Woodberry Down | A1 Company Services [How A Suburban North London House Is Connected To The Paul Manafort Indictment - Huffington Post](https://www.huffingtonpost.co.uk/entry/manfort-london-connection_uk_59f72f50e4b07fdc5fbf92c7)\n",
"- 29 Harley Street | Formations House [Offshore in central London: the curious case of 29 Harley Street - The Guardian](https://www.theguardian.com/business/2016/apr/19/offshore-central-london-curious-case-29-harley-street)\n",
"- 63-66 Hatton Garden | Valemont Properties Ltd [The Global Laundromat: how did it work and who benefited? - The Guardian](https://www.theguardian.com/world/2017/mar/20/the-global-laundromat-how-did-it-work-and-who-benefited)"
]
},
{
"cell_type": "markdown",
"id": "a85fdcfa",
"metadata": {},
"source": [
"If we wanted to get all companies listed at 207 Regent Street we can adjust our maxsize limits to `None` and attempt to perform a hop again:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "eb0c02d0",
"metadata": {},
"outputs": [],
"source": [
"regent_street_network = sugartrail.base.Network(address='3rd Floor, 207 Regent Street London W1B 3HH England')\n",
"regent_street_network.hop.companies_at_address_maxsize = None\n",
"regent_street_network.hop.officers_at_address_maxsize = None\n",
"regent_street_network.perform_hop(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d245831",
"metadata": {},
"outputs": [],
"source": [
"# regent_street_network = sugartrail.base.Network(file=f'{sugartrail.const.networks_path}virtual_offices_b.json')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3dc0f165",
"metadata": {},
"outputs": [],
"source": [
"pd.DataFrame(regent_street_network.company_ids)"
]
},
{
"cell_type": "markdown",
"id": "cff1061e",
"metadata": {},
"source": [
"Such large networks can still be interesting to analyse. For instance if we perform another hop this will get all the officers for every company at the address. This will take several hours to build as we have lots of companies to analyse, however if we want to save time we could just uncomment and load a pre-made network below: "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ef262359",
"metadata": {},
"outputs": [],
"source": [
"regent_street_network.perform_hop(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4b7616c7",
"metadata": {},
"outputs": [],
"source": [
"# regent_street_network = sugartrail.base.Network(file=f'{sugartrail.const.networks_path}virtual_offices_c.json')"
]
},
{
"cell_type": "markdown",
"id": "d6e330ee",
"metadata": {},
"source": [
"Analysing the most frequently occuring officers running businesses from 207 Regent Street returns some very busy officers and incorporation agents:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e97fa3b",
"metadata": {},
"outputs": [],
"source": [
"pd.DataFrame(regent_street_network.officer_ids)['title'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "d6a22e40",
"metadata": {},
"source": [
"A quick news lookup on two of the officers in the top 5, J. Beardsley of Helve TCS Limited and S. Poppleton reveal these names to be connected to several known instances of fraud:\n",
"- [Fraudster duo jailed for their part in defrauding millions of pounds from over 100 victims - Crown Prosecution Service](https://www.cps.gov.uk/cps/news/fraudster-duo-jailed-their-part-defrauding-millions-pounds-over-100-victims)\n",
"- [Print farming companies struck off - Printweek](https://www.printweek.com/news/article/print-farming-companies-struck-off)\n",
"- [Rogue book publishers slammed shut by the courts - Gov.uk](https://www.gov.uk/government/news/rogue-book-publishers-slammed-shut-by-the-courts)"
]
},
{
"cell_type": "markdown",
"id": "f0699a27",
"metadata": {},
"source": [
"### Busier Addresses and Downloaded Data"
]
},
{
"cell_type": "markdown",
"id": "944525cb",
"metadata": {},
"source": [
"There are situations where some addresses have thousands or even tens of thousands of companies registered. Companies House provides two methods for getting company data, API and data product. We used the API to get the information above which returns all active and dissolved companies registered to the address. We get the same result when we attempt to perform an advanced company search using this address through the website:"
]
},
{
"cell_type": "markdown",
"id": "c307994f",
"metadata": {},
"source": [
"![title](../assets/images/regent.png)"
]
},
{
"cell_type": "markdown",
"id": "517e6aaa",
"metadata": {},
"source": [
"Unfortunately the API is limited to returing 5000 result max. This is fine in our case with 207 Regent Street because we're just under the limit. However there are much bigger fish out there for instance, '75 Shelton Street':"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f40ee11",
"metadata": {},
"outputs": [],
"source": [
"shelton_street_network = sugartrail.base.Network(address=\"71-75, Shelton Street, Covent Garden, London, WC2H 9JQ\")\n",
"shelton_street_network.perform_hop(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "618b2ed1",
"metadata": {},
"outputs": [],
"source": [
"# shelton_street_network = sugartrail.base.Network(file=f'{sugartrail.const.networks_path}virtual_offices_d.json')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd5d1258",
"metadata": {},
"outputs": [],
"source": [
"shelton_street_network.maxsize_entities[0]"
]
},
{
"cell_type": "markdown",
"id": "6f1abb52",
"metadata": {},
"source": [
"We can already see its over 5000 limit for the API. If we check online we can see the number is huge: "
]
},
{
"cell_type": "markdown",
"id": "03b64f03",
"metadata": {},
"source": [
"![title](../assets/images/shelton.png)"
]
},
{
"cell_type": "markdown",
"id": "f9fda7a6",
"metadata": {},
"source": [
"This is where the data product comes in. We can download it in one go and use it to get all of the \"active\" companies. To use the data product:\n",
"1. Download it from [here](http://download.companieshouse.gov.uk/en_output.html) (might take some time as its a pretty large file ~430Mb)\n",
"2. Move it to local directory `assets/company_data/` and unzip the file \n",
"3. Load into a dataframe which we can pass to our network class\n",
"\n",
"Might take a minute to load. How adjust the file string below and attempt to load it into `company_data`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a49500a",
"metadata": {},
"outputs": [],
"source": [
"latest_file = f'https://download.companieshouse.gov.uk/BasicCompanyDataAsOneFile-{sugartrail.utils.get_first_day_of_current_month()}.zip'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ab4d6388",
"metadata": {},
"outputs": [],
"source": [
"company_data_file = sugartrail.utils.download_file(latest_file, sugartrail.const.data_path)"
]
},
{
"cell_type": "markdown",
"id": "2e3ce6c1",
"metadata": {},
"source": [
"Unzip the file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "616f74e2",
"metadata": {},
"outputs": [],
"source": [
"with zipfile.ZipFile(company_data_file, 'r') as zip_ref:\n",
" # Extract all the contents into the directory\n",
" zip_ref.extractall(sugartrail.const.data_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a819845a",
"metadata": {},
"outputs": [],
"source": [
"company_csv_file = company_data_file.replace('.zip','.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d9d0080",
"metadata": {},
"outputs": [],
"source": [
"company_data = pd.read_csv(company_csv_file)"
]
},
{
"cell_type": "markdown",
"id": "2273cf39",
"metadata": {},
"source": [
"Now lets try get every company at the very overcrowded 71-75 Shelton Street address (might take several minutes- can uncomment the cell below to load pre-made network):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3e273ce0",
"metadata": {},
"outputs": [],
"source": [
"shelton_street_network = sugartrail.base.Network(address=\"71-75, Shelton Street, Covent Garden, London, WC2H 9JQ\")\n",
"shelton_street_network.hop.companies_at_address_maxsize = None\n",
"shelton_street_network.hop.officers_at_address_maxsize = None\n",
"shelton_street_network.get_officers_at_address = False\n",
"shelton_street_network.perform_hop(1, company_data = company_data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e348c6e",
"metadata": {},
"outputs": [],
"source": [
"# shelton_street_network = sugartrail.base.Network(file=f'{sugartrail.const.networks_path}virtual_offices_e.json')"
]
},
{
"cell_type": "markdown",
"id": "820a908d",
"metadata": {},
"source": [
"If we check `company_ids` we have over 100000 companies (as of 6/5/2024) that we could build a network from if we had lots of time on our hands:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12acb915",
"metadata": {},
"outputs": [],
"source": [
"len(shelton_street_network.company_ids)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e9207a89",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
}
},
"nbformat": 4,
"nbformat_minor": 5
}