We (our research team) will use the open data from Companies House for academic purposes. To facilitate processing of the company director appointments and company profiles, I tried to develop an importer program. The intention is to store the records in a highly structured database with tables for company profiles, directors, and appointments. However, things didn’t really work out as intended.
Has anyone else struggled with the documentation and inconsistencies in the bulk snapshots and daily updates?
I would more than appreciate it if I could have a quick look at your code, to find out where I made a mistake!
As was said it might be more helpful if you could give examples of:
a) what you wanted to achieve
b) what “didn’t really work as intended”
c) what documentation you were looking at.
Caveat - we don’t - yet - make heavy use of the bulk data but do use this for “investigation”.
Documentation - I’m only aware of (please post if you know more):
The above seems to overlap (same data set?) with the Company URI data - more info on data found in this (again dated - seems to be from 2012).
Officers bulk data - don’t know this as we’re not currently subscribed.
PSC bulk data - last I checked this is newline-separated json objects formatted as per the various PSC resources (see the PSC list resource which covers them all).
Accounts bulk data - we don’t use ATM, there are links to documentation at Companies House
The CH API documentation - which is somewhat patchy in terms of accuracy but you can always search…
…This forum. We do sometimes look at the XML Gateway API forum too (since this should be looking at the same underlying data set).
API
I don’t know whether anyone else wants to post their code, but if you Google about there are some libraries now for CH access (at least via the API) and plenty of sample code on the forum.
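For the PSC bulk data mentioned above, here is a minimal Python sketch of reading a newline-separated JSON file, one object per line. The field names in the sample (company_number, data, kind) and the function name iter_psc_records are only illustrative assumptions, not taken from the CH documentation, so check them against a real file before relying on them.

```python
import io
import json


def iter_psc_records(lines):
    """Yield one dict per non-empty line of a newline-separated JSON stream."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the line number so a bad record is easy to find.
            raise ValueError(f"line {line_no}: invalid JSON ({exc})") from exc


# Made-up sample data standing in for a real snapshot file.
sample = io.StringIO(
    '{"company_number": "00000001", "data": {"kind": "example-kind"}}\n'
    "\n"
    '{"company_number": "SC000002", "data": {"kind": "example-kind"}}\n'
)
records = list(iter_psc_records(sample))
```

With a real file you would pass an open file handle instead of the StringIO object; because the function is a generator, the whole snapshot never has to fit in memory.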
The issue we are facing involves the daily update files for company profiles and company director appointments. The problem is determining whether a row in the update file is an update to an existing data entry, a correction, or a new data entry. The documentation does cover these cases, but for a reason we have not been able to identify, the records we end up with do not match what the online CH application shows.
Do you have, perhaps, other documentation regarding these update files?
Thank you for the links. I was indeed able to find plenty of documentation for the API. However, we have access to the bulk snapshots and update files, which are processed differently from data retrieved via the API.
The code that turns up on Google, however, mainly relates to the API or to the monthly company profiles archive.
I saw your other post, I’ll let that get picked up by others e.g. CH, but I think you should be able to map these all to something in the API constants at GitHub - companieshouse/api-enumerations
So:
APP_DATE_ORIGIN - not sure but the officer appointments data structure in the API differentiates between “appointed_before” and “appointed_on”, according to another field ( “is_pre_1992_appointment”).
Prefix for COMPANY_NUMBER - represented by company_type (and possibly company_subtype) in the API’s main constants file. Mostly this is straightforward with two complications:
The API constants represent a “company” e.g. private / public as one of a range of constants. However the company prefix instead records which country (England / Wales vs. Scotland vs. Ireland - for which there are two prefixes) they were incorporated in, and the England / Wales ones don’t have a specific prefix but just have numbers for all 8 characters.
for ICVC types there are various sub-options (as represented in the API constants).
I’ve listed JSON below - with no guarantees! - which should be correct as of now and will map from prefix to CH constant as far as they go. Working out sub-types is up to you! For UK companies the “00” dummy prefix is what you need.
COMPANY_STATUS should map on to some combination of company_status / company_status_detail in the main api constants.
###JSON for company prefix → API constants###
The const field is one or more API company type constant values, type is either the text which this represents or something meaningful where there’re several, text is full text, ctry is obviously the text for country with jurisdiction (CH have a constant for this too).
(CH limited the types you can upload, this should be .json). companyPrefixToType_20181029.txt (8.8 KB)
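As a rough sketch of how a mapping like this could be used in Python: take the first two characters of the company number as the prefix, fall back to the “00” dummy prefix when the number is all digits, and look the prefix up. This assumes the file is a JSON object keyed by prefix and that prefixes are always two characters. The function names are hypothetical and the entries below are placeholders for illustration only, not the contents of the attached file.

```python
import json


def company_prefix(company_number):
    """Return the two-character prefix, or the "00" dummy prefix
    for all-numeric (England / Wales) company numbers."""
    number = company_number.strip().upper()
    if number.isdigit():
        return "00"
    return number[:2]


def lookup_company_type(company_number, prefix_map):
    """Return the mapping entry for a company number, or None if unknown."""
    return prefix_map.get(company_prefix(company_number))


# Placeholder entries only; the const / type / text values are invented.
SAMPLE_JSON = """
{
  "00": {"const": ["example-const-a", "example-const-b"],
         "type": "example type",
         "text": "Example full text",
         "ctry": "England/Wales"},
  "SC": {"const": ["example-const-a"],
         "type": "example type",
         "text": "Example full text",
         "ctry": "Scotland"}
}
"""
PREFIX_MAP = json.loads(SAMPLE_JSON)

entry = lookup_company_type("SC123456", PREFIX_MAP)
```

Returning None for an unknown prefix, rather than raising, makes it easy to log and count the rows that the mapping does not cover while importing a full snapshot.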