We’re trying to get a local copy of all UK companies by crawling the items from the advanced-search/companies endpoint, 5000 per page, with something similar to:
I don’t have the answer to your specific question, but in general this service isn’t designed for bulk downloads, crawling, scraping, etc. I don’t have the link on me, but I suspect it may be against the terms of use of the API. I’m also not sure the advanced search API is officially live yet, so there may be more documentation to come.
Companies House has some bulk data products and resources that may be a lot more appropriate for what you’re trying to do: Companies House
I’ll give the bulk data product a shot, but it would still be great if I could scrape it through the API (the bulk data updates monthly, so maybe an update could trigger other API calls for officers, etc., with some other logic around it).
It is better to choose a batch filter and reduce the size of each batch. A good approach would be to limit the incorporation date to a range, and to shrink that range if more than 200 results are returned.
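A minimal sketch of that range-shrinking idea, in case it helps. The function name `split_by_incorporation_date` and the `count_hits` callable are hypothetical; `count_hits` stands in for whatever request you make to find out how many companies match a date range. Here it is backed by fake in-memory data, so nothing below calls the real API. The limit of 200 is just the figure mentioned above.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, Tuple


def split_by_incorporation_date(
    start: date,
    end: date,
    count_hits: Callable[[date, date], int],  # hypothetical: returns matches for an inclusive range
    limit: int = 200,
) -> Iterator[Tuple[date, date]]:
    """Yield inclusive date ranges, in order, whose hit count is <= limit."""
    stack = [(start, end)]
    while stack:
        lo, hi = stack.pop()
        if lo == hi or count_hits(lo, hi) <= limit:
            # Either small enough for one request, or a single day
            # that cannot be split any further by date alone.
            yield lo, hi
            continue
        mid = lo + timedelta(days=(hi - lo).days // 2)
        # Push the later half first so the earlier half is handled first.
        stack.append((mid + timedelta(days=1), hi))
        stack.append((lo, mid))


# Fake data for illustration only: 1000 companies spread over 90 days.
INCORPORATIONS = [date(2020, 1, 1) + timedelta(days=i % 90) for i in range(1000)]


def fake_count(lo: date, hi: date) -> int:
    return sum(1 for d in INCORPORATIONS if lo <= d <= hi)


ranges = list(split_by_incorporation_date(date(2020, 1, 1), date(2020, 3, 30), fake_count))
for lo, hi in ranges:
    print(lo, hi, fake_count(lo, hi))
```

Note that a single day with more matches than the limit is yielded as-is, so a second filter would be needed to break that day down further.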
I can see Mark Williams’s point, but my opinion is closer to David Bond’s. I believe that, by design, an API should never return a 500, and that if we stay within the rate limits we should be able to scroll through the full database, not just a part of it.
It’s not an error. I ran into the same issue and am now having to rebuild my logic to suit the API.
The issue here is that your query matches, let’s say, 20,000 results. The API request returns the first 5,000 of those 20,000, and your start_index applies to the 5,000 results that the API returns, not to the entire 20,000 matches.
You are basically telling the API to start your results from number 10,000 when only 5,000 results are returned, hence the 500 error.
What you should be doing is making sure that your query returns 5,000 results or fewer, and not using start_index, as it only sets the starting point within your returned results, not within the total results that match your parameters.
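Roughly what that looks like in code, as a sketch only. The helper names `build_query` and `fetch_batch` are made up, the `incorporated_from` / `incorporated_to` parameter names and the `hits` / `items` response fields are assumptions to check against the API documentation, and `get_json` is a hypothetical callable that you would wire up to your own authenticated HTTP client.

```python
from typing import Any, Callable, Dict, List
from urllib.parse import urlencode

MAX_SIZE = 5000  # page size discussed in this thread


def build_query(incorporated_from: str, incorporated_to: str, size: int = MAX_SIZE) -> str:
    """Build a request path for one batch; deliberately sends no start_index."""
    params = {
        "incorporated_from": incorporated_from,  # assumed parameter name
        "incorporated_to": incorporated_to,      # assumed parameter name
        "size": size,
    }
    return "/advanced-search/companies?" + urlencode(params)


def fetch_batch(
    incorporated_from: str,
    incorporated_to: str,
    get_json: Callable[[str], Dict[str, Any]],  # hypothetical HTTP helper
) -> List[Dict[str, Any]]:
    """Fetch one batch, refusing any batch too large for a single request."""
    body = get_json(build_query(incorporated_from, incorporated_to))
    hits = int(body.get("hits", 0))  # assumed field: total number of matches
    if hits > MAX_SIZE:
        # Do not page with start_index; narrow the filter and try again.
        raise ValueError(f"{hits} matches; narrow the date range below {MAX_SIZE}")
    return body.get("items", [])  # assumed field: the companies returned
```

The caller catches the `ValueError`, narrows the date range, and retries, so every request stays within a single page.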
The above is returning an Internal Server Error. How do start_index and size work? If I get 5000 results the first time, how should I pass the parameters to get the second set of results?
Please help. We are trying to get incremental updates for companies based on a date range. Is there an API available for this?