[Solved] XBRL instead of PDF documents

You’ve said what you need! The solution is to ask for what you want by setting the Accept header when requesting the document content.

Currently filings may be available in:

  1. No data at all - you might see some variant of “unavailable” or “please contact us for this”.
  2. PDF - the “standard”.
  3. XBRL - only a few filings, and as far as I’m aware only when the company itself filed in this format. CH don’t generate this info.

…so you got lucky a few times.

To see what’s available (per the document API docs), look at the “resources” member of the document metadata object, which lists the available formats. (The metadata endpoint can be found in the links in the filing history list / filing history item object.) Example:

For company 00197009, the filing history item https://api.companieshouse.gov.uk/company/00197009/filing-history/MzE5Mzc0OTc3MGFkaXF6a2N4

If you request the metadata from this with:
https://document-api.companieshouse.gov.uk/document/T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8

You get:

{
    "company_number": "00197009",
    ...
    "resources": {
        "application/pdf": {
            "content_length": 26343
        },
        "application/xhtml+xml": {
            "content_length": 20364
        }
    },
    "links": {
        "self": "https://document-api.companieshouse.gov.uk/document/T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8",
        "document": "https://document-api.companieshouse.gov.uk/document/T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8/content"
    }
}
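As a rough sketch of how you might use that metadata, the Python below reads the “resources” member and picks a format. The helper names (`available_formats`, `pick_format`) and the preference order are my own assumptions, not part of the API, and the sample is a trimmed copy of the response above.

```python
import json

# Trimmed copy of the metadata shown above (elided fields left out).
SAMPLE_METADATA = json.loads("""
{
    "company_number": "00197009",
    "resources": {
        "application/pdf": {"content_length": 26343},
        "application/xhtml+xml": {"content_length": 20364}
    }
}
""")


def available_formats(metadata):
    """Return the MIME types listed under "resources", sorted by name."""
    return sorted(metadata.get("resources", {}))


def pick_format(metadata, preferred=("application/xhtml+xml", "application/pdf")):
    """Return the first preferred MIME type the document offers, else None.

    The default preference order (XBRL/XHTML first, then PDF) is an
    assumption for illustration only.
    """
    resources = metadata.get("resources", {})
    for mime_type in preferred:
        if mime_type in resources:
            return mime_type
    return None


if __name__ == "__main__":
    print(available_formats(SAMPLE_METADATA))
    print(pick_format(SAMPLE_METADATA))
```

If `pick_format` returns None, the filing has neither format and you’re in the “no data at all” case.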

I don’t think there’s any “non-determinism” - unless you’ve repeatedly requested the same document endpoint and got different results. Getting different formats is most likely the result of happening to hit accounts where a format other than PDF (e.g. XBRL) is available. It may be that you always get XBRL if you don’t specify a type in “Accept” and both it and PDF are available, or it may be that you get them in the order “they come in” - which may or may not be fixed. Or there may be some other reason. Either way, just set the “Accept” header.
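For what it’s worth, here is a minimal Python sketch of setting the “Accept” header on the content request, using only the standard library. The function name `build_content_request` and the placeholder key are hypothetical, and I’m assuming HTTP basic auth with the API key as the username and an empty password; check that against the API docs. It only builds the request: sending it, following redirects and handling errors are left out.

```python
import base64
import urllib.request


def build_content_request(content_url, mime_type, api_key):
    """Build a GET request for a document's content in one specific format.

    content_url is the "document" link from the metadata; mime_type should be
    one of the keys listed under "resources" (e.g. "application/xhtml+xml").
    """
    # Assumption: basic auth, API key as username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("ascii")).decode("ascii")
    return urllib.request.Request(
        content_url,
        headers={
            "Accept": mime_type,  # ask explicitly for the format you want
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    request = build_content_request(
        "https://document-api.companieshouse.gov.uk/document/"
        "T53BLYf734zxeBWyvna131JtREqLsBgclFME-v6rxI8/content",
        "application/xhtml+xml",
        "YOUR_API_KEY",  # placeholder, not a real key
    )
    print(request.get_header("Accept"))
    # To actually send it: urllib.request.urlopen(request)
```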