My ridiculous approach to Companies House bulk IXBRL processing using a 350GB SQLite Database - Part 2: The technical bits
This is a technical post for my fellow geeks.
I built fastukcompanysearch.com, the unofficial Companies House app, which has been quite popular since 2018. I then built convert-ixbrl.co.uk, an API and web search service for finding companies based on their financials, built from Companies House IXBRL files. I wrote about it on Medium: How I built an IXBRL Converter for Companies House to extract financial statements from submitted accounts.
This post is a continuation of that original post about convert-ixbrl.co.uk, this time focused more on the technical implementation.
Long post alert: It's a bit on the long side as I just didn't want to leave anything out. I do recommend reading the last section, where I describe a few ways to get your hands on free cloud computing credits worth up to £8K for your own Companies House related (or other) computing experiments.
(Nope, NOT sponsored and this post has ZERO affiliate links)
#NoAIHere - Oh and I did not use AI to write or edit this. Sorry Claude, you are just a coding tool to me.
Background: What is this post about
Several people have approached me over the past few months with specific questions about the database implementation, pipeline processing and the like for convert-ixbrl.co.uk since its launch.
I thought my fellow geeks might find the technical details interesting, so I'll share them here. If you are building something like this of your own, perhaps you could compare notes. Leave a comment if you think there are areas where I could have done better.
The foundations for Convert-IXBRL.co.uk
The idea is simple.
Companies House makes the accounts XBRL and iXBRL files available as a zip file. Download it, extract the XBRL and iXBRL contents, convert the files, save them in an appropriate database, set up the database for fast lookups and build an interface to query the data at convert-ixbrl.co.uk.
Allow me to walk you through my process. I'll skip over the most technically challenging part, building an IXBRL converter from scratch, as I've described it in the Medium post linked in the first paragraph.
Check it out after reading, or at least skimming, this post, as it provides a fuller view of the whole system.
Overview of the Current Tech Stack:
Backend: SQLite, .Net batch processes that make up the data conversion pipelines, a Companies House monitor, .Net APIs, serverless functions, AWS S3, AWS SNS
Frontend: VueJS, HTML5, CSS
Some technologies tried and abandoned: MySQL, PostgreSQL, MeiliSearch, ElasticSearch, DuckDB, DynamoDB and a few more that I have managed to forget about
Building a Batch Processor to Automate the zip file download
Arguably, this was the most straightforward part. My initial implementation was similar to a batch job processor I had built for fastukcompanysearch.com, which involved creating a job that runs once a day, checking if any new files had arrived and if they had, copying them to the server, extracting the zip and putting these in a staging area for processing.
I am a .Net developer, so for me this was a console app, set up to run on a schedule using cron.
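For reference, scheduling a console app with cron comes down to a single crontab entry. The paths and times below are hypothetical, not my actual setup:

```shell
# Illustrative crontab entry: run the downloader console app at 03:30
# every day and append its output to a log (paths are hypothetical).
30 3 * * * /usr/bin/dotnet /opt/ixbrl/Downloader.dll >> /var/log/ixbrl-download.log 2>&1
```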
The staging area: 400 Million rows in an SQLite table
For convert-ixbrl.co.uk, the conversion begins when the extracted IXBRL files are converted to an intermediary format in which each XBRL concept and its associated value is stored in a single row in a gigantic table. A row might look like this:
Id   | CompanyNumber | XBRLConcept    | Value
xxxx | SC123456      | core_netassets | 100000
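As a sketch (table and column names are illustrative, not my actual schema), the staging layout and its lookup can be reproduced in a few lines of SQLite, shown here via Python's sqlite3 module:

```python
import sqlite3

# One XBRL concept/value pair per row, as described above.
# Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CompanyFacts (
        Id            INTEGER PRIMARY KEY,
        CompanyNumber TEXT NOT NULL,
        XBRLConcept   TEXT NOT NULL,
        Value         TEXT
    )
""")
# An index on CompanyNumber keeps the per-company lookup fast.
conn.execute("CREATE INDEX idx_company ON CompanyFacts(CompanyNumber)")
conn.execute(
    "INSERT INTO CompanyFacts (CompanyNumber, XBRLConcept, Value) VALUES (?, ?, ?)",
    ("SC123456", "core_netassets", "100000"),
)

rows = conn.execute(
    "SELECT XBRLConcept, Value FROM CompanyFacts WHERE CompanyNumber = ?",
    ("SC123456",),
).fetchall()
print(rows)  # [('core_netassets', '100000')]
```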
This was the most natural format to store the data in initially as I was figuring out how best to optimise the database for search. This meant that the queries were quite simple
Sample query (representative; don't build queries like this from unsanitised input unless you want to risk SQL injection):
Select * From CompanyTable where CompanyNumber = 'SC123456'
This would then return a few hundred rows. I would then need to group the rows by year and, for each year, return the actual values and the concepts, mapped to user-friendly strings (e.g. core_netassets would become NetAssets), as JSON.
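A minimal sketch of that grouping step; the concept-to-label mapping and the row shape here are assumptions for illustration, not the production code:

```python
import json

# Hypothetical mapping from raw XBRL concept names to friendly labels.
FRIENDLY = {"core_netassets": "NetAssets", "core_profitloss": "ProfitLoss"}

def rows_to_json(rows):
    """Group (year_end, concept, value) rows by year, relabelling concepts."""
    grouped = {}
    for year_end, concept, value in rows:
        label = FRIENDLY.get(concept, concept)
        grouped.setdefault(year_end, {})[label] = value
    return json.dumps(grouped, sort_keys=True)

sample = [
    ("2022-12-31", "core_netassets", 100000),
    ("2022-12-31", "core_profitloss", 25000),
    ("2021-12-31", "core_netassets", 80000),
]
print(rows_to_json(sample))
```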
I initially picked MySQL to implement this. I started to convert and import the data using my converter, but my biggest friction point was database portability as I worked through the bugs. I work across Windows, Mac and Linux machines, and continually exporting and importing a database of that size significantly lengthened the build-and-debug cycles, making it really painful to work on this application. A large part of what I do is about the development experience; it's not all about the commercial aspects.
I work across these machines because working on my development projects in a coffee shop is one of the greatest pleasures of my life, and while I prefer Windows (yes, I said it) I have to use a Mac for my iOS mobile apps. But I digress.
PostgreSQL would have had the same issues, so I thought "why not try SQLite?"
This initially seemed ridiculous, as the title suggests. SQLite is typically used for small, local installs of relatively tiny databases, but there was nothing in the documentation saying it couldn't accommodate a database of this size. Moving to SQLite would mean I could just copy and paste the file across machines, a far quicker operation on modern SSDs.
And it actually worked fine, using SQLite's WAL (write-ahead logging) mode. I encourage you to read up on it. Inserts, updates and reads were blazingly fast even at the ridiculous (for SQLite) table sizes I mentioned.
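For reference, enabling WAL is a single pragma. A minimal sketch (WAL only applies to file-backed databases, so this uses a temporary file; the filename is illustrative):

```python
import os
import sqlite3
import tempfile

# WAL mode needs a file-backed database (in-memory DBs ignore it).
path = os.path.join(tempfile.mkdtemp(), "companies.db")
conn = sqlite3.connect(path)

# The pragma returns the resulting journal mode.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal'

# A common companion setting for bulk-load workloads; benchmark
# against your own data before committing to it.
conn.execute("PRAGMA synchronous=NORMAL")
conn.close()
```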
Well, that kind of works until you need to run other types of queries
So this worked, but it was not enough for the variety of queries convert-ixbrl offers, or for what users actually search for. Users also search based on partial company names, and searching for partial terms across such a large table, even with indexed columns, was just unacceptably slow, squashing my hopes of getting away with a relatively simple implementation like this.
Enter ElasticSearch, then MeiliSearch
Need to run text-based searches across a largish database? ElasticSearch is one of the top candidates for many. Just dump everything in and get querying.
The problems are:
1- You need a far beefier machine. Cloud machines can get expensive fast, particularly when you consider that you need geographic redundancy. You ALWAYS need that redundancy, but the costs multiply far quicker with beefy machines. At this stage, I wanted to keep my costs low to make the most of my cloud credits during the build phase.
2- Numeric searches are not straightforward. I'd likely need to run queries like "Find companies that have net assets > 2 million within area X", and such mixed queries mean I'd need to either tune ElasticSearch just right or combine its results with a traditional DB search.
3- Query building is not intuitive (subjective opinion).
I also then tried MeiliSearch, a more modern alternative. The querying was nicer (again, subjective opinion), but it didn't really fix the advanced querying issue for me. It also still needed a beefy machine for optimum performance, and frankly I just did not feel it was battle-tested enough. And of course, using either ElasticSearch or MeiliSearch would have meant my DB backup-and-restore issue making a comeback as I built across machines.
Enter FTS
SQLite supports extensions. The FTS extension enables full-text search within the same database where I had the company data, using SQL-like syntax.
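A minimal FTS5 sketch of the idea (table names and data are illustrative): index company names in the same database and run a prefix search, the kind of partial-term query that was too slow against a regular indexed column:

```python
import sqlite3

# FTS5 virtual table holding company names alongside their numbers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE CompanyNameFts USING fts5(CompanyNumber, Name)")
conn.executemany(
    "INSERT INTO CompanyNameFts (CompanyNumber, Name) VALUES (?, ?)",
    [
        ("SC123456", "Highland Widgets Limited"),
        ("01234567", "Widget Makers of London Ltd"),
        ("07654321", "Acme Trading Ltd"),
    ],
)

# Prefix search: any company with a name token starting with "widget".
hits = conn.execute(
    "SELECT CompanyNumber FROM CompanyNameFts WHERE CompanyNameFts MATCH ?",
    ("widget*",),
).fetchall()
print(sorted(h[0] for h in hits))  # ['01234567', 'SC123456']
```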
So that worked. It wasn't perfect, as it still required writing optimised queries and performing server-side grouping to structure the text-based results and merge them with the numeric results from the existing traditional tables, but at least now I had just one database. Until…
Enter Advanced Queries
…I realised I would also need to run advanced queries like:
"Find all companies within 2 miles of the EC1 postcode, with net assets of over a million, with officers who are aged 65 or older, that are at least 10 years old and operate within the provided SIC code sectors."
This is a representative example of a real query that is run on convert-ixbrl.co.uk.
Back to staging area
I mentioned the staging area earlier. I now realised that no matter how much I optimised the gigantic table shown a few paragraphs ago, I would struggle to let users search across ANY XBRL concept. Examples include net assets, profit/loss, pension, accounting fees paid and tax, plus dozens of others, and flexible search was the whole reason for building this service. This was circa 2021, before ChatGPT went mainstream, and I was building this on the side.
I then had to embark on a trial and error journey. Different table layouts, column layouts, indexing strategies and the like. All with the goal to make the search work efficiently, regardless of the nature of queries.
The process was essentially a repeat of "build a structure, try the queries, find several cases where the structure didn't work or was too slow, modify it, try again, find a few more cases where it didn't work or was too slow, rinse and repeat until done".
I ultimately settled on two databases: one storing the "raw" converted data shown earlier, and another with really wide rows containing one XBRL concept value per column, like so:
CompanyNumber | YearEnd | NetAssets | EmployeeCount | … dozens of columns follow
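A sketch of the wide layout and the kind of mixed numeric filter it makes cheap. Columns, data and thresholds here are illustrative, not my actual schema:

```python
import sqlite3

# Wide table: one XBRL concept per column, one row per company year.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CompanyWide (
        CompanyNumber TEXT,
        YearEnd       TEXT,
        NetAssets     INTEGER,
        EmployeeCount INTEGER
    )
""")
conn.executemany(
    "INSERT INTO CompanyWide VALUES (?, ?, ?, ?)",
    [
        ("SC123456", "2022-12-31", 2_500_000, 40),
        ("01234567", "2022-12-31", 150_000, 5),
    ],
)
# Indexes on the filter columns are what make mixed numeric queries
# fast at scale; this single-column index is a simplified example.
conn.execute("CREATE INDEX idx_netassets ON CompanyWide(NetAssets)")

big = conn.execute(
    "SELECT CompanyNumber FROM CompanyWide WHERE NetAssets > ? AND EmployeeCount >= ?",
    (2_000_000, 10),
).fetchall()
print(big)  # [('SC123456',)]
```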
The wide column layout was somewhat ugly, but what really matters is the query speed, the update/insert speed and ultimately the user experience. You have to remember that it is ultimately a product someone will use, built to deliver value; chasing technical perfection is a futile exercise.
It was fun and frustrating at the same time. When something worked, I'd find myself yelling "YES! I knew I was a genius!", and then ten minutes later, when I discovered that it only worked for a subset of the queries I needed to support and other queries took minutes to run, I would throw my hands up in the air exclaiming "Argh, what is even the point? Will this ever be done?"
Slowly but surely though, over months (remember, I wasn't working on this full time), things started to come together. One by one I worked through the tables, column types, indexes, SQLite setup and querying logic, tweaking and optimising all the way, until all the queries I wished to run worked and returned results within a few hundred milliseconds (with some exceptions depending on the query type).
There is additional processing in the application layer (.Net) following the retrieval of database results. There are unit tests that run as part of the usual CI process, as well as integration tests and a Postman test suite used to test the API in a semi-automatic fashion.
Companies House Monitor
The setup above worked initially but required multiple batch jobs. The basic company data bulk file and the accounts bulk data files need to be merged, along with the PSC bulk file and the officers file, all available from Companies House servers.
Over the next few months, I repurposed a Companies House monitor: a live, always-on listener service which processes the Companies House stream to update the convert-ixbrl data in as close to real time as possible.
Running continuously, it receives the events sent by the Companies House streaming service, saving the data locally, backing it up to S3 (in case the hosting infrastructure fails, and also for future processing) and updating the db.
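A hedged sketch of the event-handling step: the streaming service delivers newline-delimited JSON events, each with a resource kind and a data payload. The field names below follow the public Companies House streaming API documentation, but treat the handler itself as illustrative, not my production code:

```python
import json

def handle_event(raw_line, db_updates):
    """Parse one streamed event line and record the resulting DB update."""
    event = json.loads(raw_line)
    kind = event.get("resource_kind")
    if kind == "company-profile":
        company = event["data"]["company_number"]
        db_updates.append(("company", company))
    # ...other resource kinds (filing history, officers, etc.) would be
    # handled here in the same fashion.
    return event

# Simulated event line, standing in for a line read from the stream.
updates = []
sample = '{"resource_kind": "company-profile", "data": {"company_number": "SC123456"}}'
handle_event(sample, updates)
print(updates)  # [('company', 'SC123456')]
```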
Batch processes
The batch processes perform a number of functions beyond the data update. There are data backups, uploads to Snowflake (I make the data available on Snowflake too), certain Salesforce updates (I have a Salesforce LWC app launching Q1 2026), sending reminders and alerts, and performing certain health checks to ensure the system is operating as expected.
Serverless functions
I use serverless functions to carry out continuous health checks. If a part of the system fails, an alert is triggered so that action can be taken to fix the issue. There are also additional alerts that use AWS SNS to send warning emails if any thresholds are breached, for instance consistently high CPU usage.
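As an illustration of the threshold logic such a health check might run (the actual SNS publish call is omitted to keep the sketch self-contained, and the threshold values are made up):

```python
# Hypothetical threshold for "consistently high CPU usage".
CPU_THRESHOLD = 85.0

def breaches(samples, threshold=CPU_THRESHOLD, min_consecutive=3):
    """True if `min_consecutive` consecutive samples exceed the threshold.

    Requiring a run of high samples avoids alerting on one-off spikes.
    """
    run = 0
    for s in samples:
        run = run + 1 if s > threshold else 0
        if run >= min_consecutive:
            return True
    return False

print(breaches([90, 91, 95, 40]))  # True: three consecutive high samples
print(breaches([90, 40, 95, 96]))  # False: spikes, but no run of three
```

On a breach, the function would publish to an SNS topic, which fans out to the warning emails mentioned above.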
Web Portal
The data is made available as a giant pipe-delimited text file, over a JSON API and also over a web search interface where dozens of search parameters can be combined to perform very specific searches.
The web portal is built using VueJS and HTML, and it talks to the backend using an API very similar to the one available publicly. Working with JavaScript is my least favourite part of this process (subjective experience, you are welcome to disagree), but VueJS makes it somewhat less painful.
Not that JS is terrible to work with (or perhaps it can be, depending on who you ask); for me it just doesn't provide as enjoyable a development experience as pure .Net development does. Having a web portal meant that I have spent many, many hours building, polishing, tweaking and retweaking the web interface to make it easy to use without overwhelming the user with 75+ search filters. The options are still there, but the user is initially presented with a much smaller subset of filters which can be expanded to reveal the advanced options. (Check it out when you are done reading this.)
Authentication
The website uses passwordless authentication: just enter your email, get a login code in your inbox and start using the service. I don't collect or store any card/bank information either; all of that is taken care of by Stripe. I also don't ask for other pieces of personal information like addresses, date of birth, contact number and the like.
Getting my hands on free cloud computing credits
AWS, GCP and Azure all offer limited free credits, usually about £200, and these only last a month, so they were not a lot of use to me.
Earlier on, I had enrolled in the AWS Activate programme, which came with about £1,000 worth of credits. Sadly, I had already used that up. I instead signed up for GCP for Startups to get another £1,500 in credits and used these all up, switching back to AWS after the beta was over. I've found AWS to be generally more reliable, and as most of my setup is scripted, the move was a few hours of work. Whether this is worth it depends on your setup and team size.
An alternative option is OVH Cloud, with a much more generous offering. I applied for their startup programme and within 24 hours received about £8K worth of cloud computing credits. I have rented dedicated servers from them before and have no complaints, either about the machines or the service.
For this project I will stick to AWS but I plan to use OVH credits for a spin-off next year.
All three programmes just require a registered UK limited company and a website to demonstrate you are actually building something.
Putting it all together
The project was delivered across 3 to 4 years (working on it part time), although when it finally went live, after an eight-month beta period to iron out the remaining bugs, it felt like it took forever.
It has been a really fun project to work on, and it was equally thrilling to see the positive reception and the signups. While there is a commercial aspect to make the service sustainable, it was a labour of love.
It's not "done"; no project is ever done. Next on the roadmap is an officer tracker, enabling tracking of officers across companies even when the name, DOB or address varies. The build is complete, and I expect it to be released in the next few weeks.
I hope you found this rather long post interesting. If you did, first click the little red heart (now I feel like a YouTuber asking people to like and comment) and then come and say hello at convert-ixbrl.co.uk.
If there are any areas where you think I could have done better, feel free to leave a comment. If you have specific questions on the tech stack or anything really, feel free to DM me.
