My ridiculous approach to Companies House bulk IXBRL processing using a 350GB SQLite Database - Part 2 : The technical bits


This is a technical post for my fellow geeks.

I built fastukcompanysearch.com, the unofficial Companies House app, which has been quite popular since 2018. I then built convert-ixbrl.co.uk, an API and web search service for finding companies based on their financials, built on Companies House IXBRL files. I wrote about it on Medium: How I built an IXBRL Converter for Companies House to extract financial statements from submitted accounts.

This post is a continuation of the original post about convert-ixbrl.co.uk, this time focused more on the technical implementation.


Long post alert: It’s a bit on the long side as I just didn’t want to leave anything out. I do recommend reading the last section, where I describe a few ways to get your hands on free cloud computing credits worth up to £8K for your own Companies House related or other computing experiments.
(Nope, NOT sponsored and this post has ZERO affiliate links)

#NoAIHere - Oh and I did not use AI to write or edit this. Sorry Claude, you are just a coding tool to me.


Background: What is this post about

Several people have approached me over the past months with specific questions about the database implementation, pipeline processing and the like for convert-ixbrl.co.uk since its launch.

I thought my fellow geeks might find the technical details interesting, so I’ll share them here. If you are building something like this of your own, perhaps you could compare notes. Leave a comment if you think there are areas where I could have done better.

The foundations for Convert-IXBRL.co.uk

The idea is simple.

Companies House makes the accounts XBRL and iXBRL files available as a zip file. Download it, extract the XBRL and iXBRL contents, convert the files and save them in an appropriate database, set up the database for fast lookups and build an interface to query the data over convert-ixbrl.co.uk.

Allow me to walk you through my process. I’ll skip over the most technically challenging part, building an IXBRL converter from scratch, as I’ve described it in the Medium post linked in the first paragraph.

Check it out after reading, or at least skimming, this post as it provides a more complete view of the whole system.

Overview of the Current Tech Stack:

Backend: SQLite, .NET batch processes that make up the data conversion pipelines, a Companies House monitor, .NET APIs, serverless functions, AWS S3, AWS SNS
Frontend: VueJS, HTML5, CSS
Some technologies tried and abandoned: MySQL, PostgreSQL, MeiliSearch, ElasticSearch, DuckDB, DynamoDB and a few more that I have managed to forget about

Building a Batch Processor to Automate the zip file download

Arguably, this was the most straightforward part. My initial implementation was similar to a batch job processor I had built for fastukcompanysearch.com: a job that runs once a day, checks whether any new files have arrived and, if they have, copies them to the server, extracts the zip and puts the contents in a staging area for processing.

I am a .NET developer, so for me this was a console app, set up to run on a schedule using cron.
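In essence, the daily check just compares the remote bulk-file listing against what has already been staged. A minimal sketch of that idea (Python here for brevity; the real job is a .NET console app, and the file names below are made up):

```python
import os

def find_new_archives(remote_names, staging_dir):
    """Return the bulk archive names that exist remotely but are not yet staged."""
    already_staged = set(os.listdir(staging_dir)) if os.path.isdir(staging_dir) else set()
    return [name for name in remote_names if name not in already_staged]

# Each new archive would then be downloaded, unzipped and dropped into the
# staging area for the conversion pipeline to pick up.
```
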

The staging area: 400 Million rows in an SQLite table

For convert-ixbrl.co.uk, the conversion begins when the extracted IXBRL files are converted to an intermediary format in which each XBRL concept and its associated value is stored as a single row in a gigantic table. A row might look like this:

Id | CompanyNumber | XBRLConcept | Value
xxxx | SC123456 | core_netassets | 100000
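Creating and filling that staging table is plain SQL. A sketch using Python's built-in sqlite3 module (the column names come from the sample row above; the index is my assumption about what a company-number lookup would need):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real DB is a single file, hundreds of GB
conn.execute("""
    CREATE TABLE CompanyTable (
        Id            INTEGER PRIMARY KEY,
        CompanyNumber TEXT NOT NULL,
        XBRLConcept   TEXT NOT NULL,
        Value         TEXT
    )
""")
# Without an index, every company lookup scans all ~400 million rows
conn.execute("CREATE INDEX idx_company ON CompanyTable (CompanyNumber)")
conn.execute(
    "INSERT INTO CompanyTable (CompanyNumber, XBRLConcept, Value) VALUES (?, ?, ?)",
    ("SC123456", "core_netassets", "100000"),
)
conn.commit()
```
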

This was the most natural format to store the data in initially, while I was figuring out how best to optimise the database for search. It also meant that the queries were quite simple.

Sample query (representative - don't build queries like this by splicing in user input unless you want to risk SQL injection; use parameterised queries)

Select * From CompanyTable where CompanyNumber = 'SC123456'

This would return a few hundred rows. I would then group the rows by year and, for each year, return the actual values with each concept mapped to a user-friendly string (e.g. core_netassets would become NetAssets) as JSON.
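That grouping step might look like the sketch below (the YearEnd column and the FRIENDLY map are illustrative assumptions; the real mapping covers dozens of concepts):

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical concept-to-friendly-name map; the real one is much larger.
FRIENDLY = {"core_netassets": "NetAssets", "core_profitloss": "ProfitLoss"}

def company_financials_json(conn, company_number):
    """Group a company's concept rows by year and map raw concept names
    to user-friendly ones, returning the result as JSON."""
    rows = conn.execute(
        "SELECT YearEnd, XBRLConcept, Value FROM CompanyTable WHERE CompanyNumber = ?",
        (company_number,),  # parameterised, so user input never touches the SQL text
    ).fetchall()
    by_year = defaultdict(dict)
    for year_end, concept, value in rows:
        by_year[year_end][FRIENDLY.get(concept, concept)] = value
    return json.dumps(by_year, sort_keys=True)
```
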

I initially picked MySQL to implement this. I started to convert and import the data using my converter, but my biggest friction point was database portability as I worked through the bugs. I work across Windows, Mac and Linux machines, and continually exporting and importing a database of that size significantly lengthens the build and debug cycles, making it really painful to work on this application. A large part of what I do is about the development experience; it’s not all about the commercial aspects.

I work across these machines because working on my development projects in a coffee shop is one of the greatest pleasures of my life, and while I prefer Windows (yes, I said it), I have to use a Mac for my iOS mobile apps. But I digress.

PostgreSQL would have had the same issues, so I thought “why not try SQLite?”

This initially seemed ridiculous, as the title suggests. SQLite is typically used for small, local installs of relatively tiny databases, but there was nothing in the documentation that said it couldn’t accommodate a database of this size. Moving to SQLite would mean I could just copy the file across machines, a far quicker operation on modern SSDs.

And it actually worked fine, using SQLite’s WAL mode (I encourage you to read up on it). The inserts, updates and reads were blazingly fast, even at the ridiculous (for SQLite) table sizes I mention.
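Enabling WAL is a one-line pragma. A sketch (the companion pragma values are common pairings, not necessarily what I run in production):

```python
import sqlite3

conn = sqlite3.connect("companies.db")  # WAL needs an on-disk database
# Write-Ahead Logging lets readers keep going while a writer appends to the
# log file, which is what keeps bulk inserts and lookups fast side by side.
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; still safe under WAL
conn.execute("PRAGMA cache_size=-64000")   # roughly a 64 MB page cache
```
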

Well, that kind of works until you need to run other types of queries

So this worked, but it was not enough for the variety of queries convert-ixbrl offers and that users actually run. Users also search based on partial company names. Searching for partial terms across such a large table, even with indexed columns, was just unacceptably slow, squashing my hopes of getting away with a relatively simple implementation like this.

Enter ElasticSearch, then MeiliSearch

Need to run text-based searches across a largish DB? ElasticSearch might be one of the top candidates for many. Just dump everything in and get querying.

The problems were:

1. You need a far beefier machine. Cloud machines can get expensive fast, particularly when you consider that you need geographic redundancy. You ALWAYS need that redundancy, but the costs multiply far quicker with beefy machines. At this stage, I wanted to keep my costs low to make the most of my cloud credits during the build phase.

2. Numeric searches are not straightforward. I’d likely need to run queries like “Find companies that have net assets > 2 million within area X”, and the mixed queries mean I’d need to either tune ElasticSearch just right or combine the results with a traditional DB search.

3. Query building is not intuitive (subjective opinion).

I then tried MeiliSearch, a more modern alternative. The querying was nicer (again, subjective opinion) but it didn’t really fix the advanced querying issue for me. It still needed a beefy machine for optimum performance, and frankly I just did not feel it was battle-tested enough. And of course, using either ElasticSearch or MeiliSearch would have meant my DB backup and restore issue would make a comeback as I built across machines.

Enter FTS

SQLite supports extensions. FTS enables full-text search, using SQL-like syntax, within the same DB where I had the company data.

So that worked. It wasn’t perfect, as it still required writing optimised queries and performing server-side grouping to structure the text-based results and merge them with the numeric results from the existing traditional tables, but at least now I had just one DB until…
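A minimal FTS5 sketch of that setup: a virtual table holds the searchable company names, and a join brings in the numeric columns, so a partial-name match and a numeric filter run in one statement (the table layout and data are illustrative, not my production schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (number TEXT PRIMARY KEY, name TEXT, net_assets INTEGER)")
# FTS5 virtual table holding just the searchable text, keyed back to the main table
conn.execute("CREATE VIRTUAL TABLE company_fts USING fts5(name, number UNINDEXED)")

rows = [("SC123456", "Acme Widget Makers Ltd", 1500000),
        ("SC654321", "Widget Logistics Ltd", 80000)]
conn.executemany("INSERT INTO companies VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO company_fts (name, number) VALUES (?, ?)",
                 [(name, number) for number, name, _ in rows])

# Partial-name text match merged with a numeric filter in a single statement
hits = conn.execute("""
    SELECT c.number, c.name
    FROM company_fts f
    JOIN companies c ON c.number = f.number
    WHERE company_fts MATCH 'widget*' AND c.net_assets > 1000000
""").fetchall()
```
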

Enter Advanced Queries

… I realised I would also need to run advanced queries like

‘Find all companies within 2 miles of the EC1 postcode with net assets of over a million, with officers who are aged 65 or older, that are at least 10 years old and operate within the provided SIC code sectors’

This is a representative example of a real query that is run on convert-ixbrl.co.uk.

Back to staging area

I mentioned the staging area earlier. I now realised that no matter how much I optimised the gigantic table shown a few paragraphs ago, I would struggle to let users search across ANY XBRL concept. Examples include net assets, profit/loss, pension, accounting fee paid and tax, plus dozens of others, and that flexible search was the whole reason behind building this service. This was circa 2021, before ChatGPT went mainstream, and I was building this on the side.

I then had to embark on a trial and error journey. Different table layouts, column layouts, indexing strategies and the like. All with the goal to make the search work efficiently, regardless of the nature of queries.

The process was essentially a repeat of “build a structure, try the queries, find several cases where the structure didn’t work or was too slow, modify it, try again, find a few more cases where it didn’t work or was too slow, rinse and repeat until done”.

I ultimately settled on two databases: one storing the ‘raw’ converted data shown earlier, and another with really wide rows containing one XBRL concept value per column, like so

CompanyNumber | YearEnd | NetAssets | EmployeeCount | … dozens of columns follow

The wide column layout was somewhat ugly, but what really matters is query speed, update/insert speed and ultimately the user experience. You have to remember it is ultimately a product that someone will use, built to deliver value; chasing technical perfection is a futile exercise.
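A sketch of the wide layout and the kind of mixed numeric query it makes cheap (columns, index and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per company per year-end, one concept per column (dozens in the real table)
conn.execute("""
    CREATE TABLE company_wide (
        CompanyNumber TEXT, YearEnd TEXT,
        NetAssets INTEGER, EmployeeCount INTEGER,
        PRIMARY KEY (CompanyNumber, YearEnd)
    )
""")
conn.execute("CREATE INDEX idx_netassets ON company_wide (NetAssets)")
conn.executemany("INSERT INTO company_wide VALUES (?, ?, ?, ?)", [
    ("SC123456", "2023-12-31", 2500000, 40),
    ("SC654321", "2023-12-31", 90000, 5),
])
# Mixed numeric predicates become a plain indexed lookup, instead of first
# grouping hundreds of tall-table rows per company
big = conn.execute(
    "SELECT CompanyNumber FROM company_wide WHERE NetAssets > ? AND EmployeeCount < ?",
    (1000000, 100),
).fetchall()
```
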

It was fun but frustrating at the same time. When something worked, I’d find myself yelling “YES! I knew I was a genius!”, and then ten minutes later, when I discovered that it only worked for a subset of the queries I needed to support and other queries took minutes to run, I would throw my hands up in the air exclaiming “Argh, what is even the point? Will this ever be done?”

Slowly but surely though, over months (remember, I wasn’t working on this full time), things started to come together. One by one I worked through the tables, column types, indexes, SQLite settings and querying logic, tweaking and optimising all the way until all the queries I wished to run worked and returned results within a few hundred milliseconds (with some exceptions depending on the query type).
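Much of that tuning loop was simply asking SQLite what it planned to do. EXPLAIN QUERY PLAN reports whether a query will use an index or fall back to a full table scan, for example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company_wide (CompanyNumber TEXT, NetAssets INTEGER)")
conn.execute("CREATE INDEX idx_netassets ON company_wide (NetAssets)")

# The last column of each plan row describes the chosen strategy, e.g.
# "SEARCH company_wide USING INDEX idx_netassets (NetAssets>?)"
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT CompanyNumber FROM company_wide WHERE NetAssets > ?",
    (1000000,),
).fetchall()
```
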

There is additional processing of the database results in the application layer (.NET). There are unit tests that run as part of the usual CI process, as well as integration tests and a Postman test suite used to test the API in a semi-automatic fashion.

Companies House Monitor

The setup above worked initially but required multiple batch jobs. The basic company data bulk file and the accounts bulk data files need to be merged, as do the PSC bulk file and the officers file, all available from Companies House servers.

Over the next few months, I repurposed a Companies House monitor: a live, always-on listener service which processes the Companies House stream to update the convert-ixbrl data in as close to real time as possible.

Running continuously, it receives the events sent by the Companies House streaming service, saving the data locally, backing it up to S3 (in case the hosting infrastructure fails, and also for future processing) and updating the DB.
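The core of the monitor boils down to "parse one streamed JSON event, upsert it into the DB". A simplified sketch (the event shape and table here are hypothetical; the real stream delivers newline-delimited JSON over a long-lived HTTP connection, with blank keep-alive lines in between):

```python
import json
import sqlite3

def apply_event(conn, raw_line):
    """Apply one streamed change event to the local database."""
    if not raw_line.strip():
        return  # heartbeat/keep-alive line, nothing to do
    event = json.loads(raw_line)
    data = event["data"]  # simplified event shape, for illustration only
    conn.execute(
        """INSERT INTO companies (number, name) VALUES (?, ?)
           ON CONFLICT(number) DO UPDATE SET name = excluded.name""",
        (data["company_number"], data["company_name"]),
    )
    conn.commit()
```
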

Batch processes

The batch processes perform a number of functions beyond the data update. There are data backups, uploads to Snowflake (I make the data available on Snowflake too), certain Salesforce updates (I have a Salesforce LWC app launching Q1 2026), reminders and alerts, and health checks to ensure the system is operating as expected.

Serverless functions

I use serverless functions to carry out continuous health checks. If a part of the system fails, an alert is triggered so that action can be taken to fix the issue. There are also additional alerts that use AWS SNS to send warning emails if any thresholds are breached, for instance consistently high CPU usage.
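The threshold logic itself is tiny; the interesting part is alerting only on sustained breaches rather than one-off spikes. A sketch (the topic ARN, thresholds and metric source are all illustrative, not my production values):

```python
def cpu_threshold_breached(samples, threshold=90.0, sustained=3):
    """True only if the last `sustained` CPU readings are all above threshold."""
    recent = samples[-sustained:]
    return len(recent) == sustained and all(s > threshold for s in recent)

def alert_if_needed(samples, topic_arn):
    """Publish an SNS warning when CPU stays high."""
    if cpu_threshold_breached(samples):
        import boto3  # imported lazily so the check itself needs no AWS dependency
        boto3.client("sns").publish(
            TopicArn=topic_arn,
            Subject="High CPU on convert-ixbrl host",
            Message=f"Sustained CPU above threshold: {samples[-3:]}",
        )
```
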

Web Portal

The data is made available as a giant | delimited text file, over a JSON API and also over a web search interface where dozens of search parameters can be combined to perform very specific searches.

The web portal is built using VueJS and HTML, and it talks to the backend using an API very similar to the publicly available one. Working with JavaScript is my least favourite part of this process (subjective experience; you are welcome to disagree), but VueJS makes it somewhat less painful.

Not that JS is terrible to work with (or perhaps it can be, depending on who you ask); for me it just doesn’t provide as enjoyable a development experience as pure .NET development does. Having a web portal meant I spent many, many hours building, polishing, tweaking and re-tweaking the web interface to make it easy to use without overwhelming the user with 75+ search filters. The options are still there, but the user is initially presented with a much smaller subset of filters which can be expanded to reveal the advanced options. (Check it out when you are done reading this.)

Authentication

The website uses passwordless authentication: just enter your email, get a login code in your inbox and start using the service. I don’t collect or store any card/bank information either; all of that is taken care of by Stripe. I also don’t ask for other personal information like addresses, date of birth or contact number.
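A passwordless flow like this usually reduces to "generate a short-lived code, store only its hash, compare on submission". A sketch of that shape (the 6-digit format and 10-minute validity are assumptions for illustration, not necessarily what the site uses):

```python
import hashlib
import hmac
import secrets
import time

def issue_code():
    """Generate a one-time login code; only its hash is kept server-side."""
    code = f"{secrets.randbelow(1_000_000):06d}"
    record = {
        "hash": hashlib.sha256(code.encode()).hexdigest(),
        "expires": time.time() + 600,  # 10-minute validity (illustrative)
    }
    return code, record  # code goes in the email, record goes in the DB

def verify_code(submitted, record):
    """Constant-time comparison of the submitted code against the stored hash."""
    return time.time() < record["expires"] and hmac.compare_digest(
        hashlib.sha256(submitted.encode()).hexdigest(), record["hash"]
    )
```
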

Getting my hands on free cloud computing credits

AWS, GCP and Azure all offer limited free credits, usually about £200, and these only last a month, so that was not a lot of use to me.

Earlier on, I had enrolled in the AWS Activate program, which came with about £1,000 worth of credits. Sadly, I had already used that up. I instead signed up for GCP for Startups to get another £1,500 in credits and used those up too, switching back to AWS after the beta was over. I’ve found AWS to be generally more reliable, and as most of my setup is scripted, the move was a few hours of work. Whether this is worth it depends on your setup and team size.

An alternative option is OVH Cloud, with a much more generous offering. I applied for their startup programme and within 24 hours received about £8K worth of cloud computing credits. I have rented dedicated servers from them before and have no complaints, either about the machines or the service.

For this project I will stick to AWS but I plan to use OVH credits for a spin-off next year.

All three programmes just require a registered UK limited company and a website to demonstrate you are actually building something.

Putting it all together

The project was delivered across 3–4 years (working on it part time), although when it finally went live, after an eight-month beta period to iron out the remaining bugs, it felt like it took forever.

It has been a really fun project to work on, and it was equally thrilling to see the positive reception and the signups. While there is a commercial aspect to it to pay for a sustainable service, it was a labour of passion.

It’s not ‘done’; no project is ever done. Next on the roadmap is an officer tracker, enabling tracking of officers across companies even when the name, DOB or address varies. The build is complete, and I expect it to be released in the next few weeks.

Hope you found this rather long post interesting. If you did, first click the little red heart (now I feel like a YouTuber asking people to like and comment) and then come and say hello at convert-ixbrl.co.uk.

If there are any areas where you think I could have done better, feel free to leave a comment. If you have specific questions on the tech stack or anything really, feel free to DM me.


Congrats on getting this far!

I’ve been down a similar road.

Can I ask how you overcame some of the issues I found?

  • When companies change their year-end, so the statement is not for 12 months
  • Detecting bad data, such as 50k employees one year and 50 the next, because (for example) the scale (the xbrl attribute) is wrong, even though visually (rendering the xhtml) the report is correct
  • When key facts are revised in later statements, for a period you’ve already got in your db

Thanks Ed1

The year-end has not proven to be a challenge, likely because people either just provide a company number and get the latest financials over the API, or they run searches to find specific types of companies. So it is perhaps a different type of user base in this case. The actual period end doesn’t matter for them as it seems to in your case.

Detecting bad data was one of the most difficult problems to overcome. Sadly, there is far less consistency in how documents are tagged than I would like. Sign detection, scale detection and use of the wrong XBRL concept are all problems that require edge case handling at massive scale. It’s one of the reasons why it took nearly 2 years of work plus a year of free public beta testing to produce results that met the quality bar.
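As one illustration of the kind of edge case handling involved, a crude scale heuristic can flag year-over-year jumps suspiciously close to a power-of-ten tagging factor, like your 50,000-vs-50 employees example (this is a simplified sketch of the idea, not the production detection logic):

```python
def looks_like_scale_error(previous, current, factor=1000, tolerance=0.15):
    """Flag a year-over-year ratio that sits close to a tagging-scale factor
    (10^3 here), e.g. 50,000 employees one year and 50 the next."""
    if not previous or not current:
        return False  # nothing to compare against
    ratio = max(previous, current) / min(previous, current)
    return abs(ratio - factor) / factor < tolerance
```
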

For revised statements, as long as a new ixbrl file is made available, the system updates the data, matching on the year provided.
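In SQLite terms that update is a plain upsert keyed on company and period, sketched below (the table layout is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE company_wide (
    CompanyNumber TEXT, YearEnd TEXT, NetAssets INTEGER,
    PRIMARY KEY (CompanyNumber, YearEnd))""")

def upsert_year(conn, number, year_end, net_assets):
    """A revised filing for an already-stored period simply overwrites that
    row, matching on (company number, year end)."""
    conn.execute(
        """INSERT INTO company_wide VALUES (?, ?, ?)
           ON CONFLICT(CompanyNumber, YearEnd) DO UPDATE SET
               NetAssets = excluded.NetAssets""",
        (number, year_end, net_assets),
    )
    conn.commit()
```
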