Machine-Readable Formats

Text-based standards --- such as JSON, YAML, XML, CSV, and TOML --- that organise data into predictable structures a program can parse, query, and act on without human intervention.

What is it?

Humans communicate in prose --- sentences, paragraphs, stories. Prose is flexible and expressive, but a computer cannot reliably extract meaning from a free-form paragraph the way it can from a row in a spreadsheet or a key-value pair in a configuration file. A machine-readable format is any notation that encodes data according to strict, publicly documented rules so that software can parse it automatically.¹

The idea is not new. Punched cards were a machine-readable format as far back as the 1890 US census, and they remained in widespread use through the 1960s. What has changed is the variety: modern formats range from lightweight text files you can open in any text editor (JSON, YAML, CSV) to highly structured markup languages (XML) and even binary encodings optimised for speed (Protocol Buffers).² Despite their differences, every machine-readable format shares three properties: a defined syntax (rules for how data must be written), deterministic parsing (any compliant parser will produce the same result), and explicit structure (relationships between data points are encoded, not implied).

Why does this matter? Because automation, integration, and AI all depend on machines being able to consume data without guessing. An API returns JSON so the client knows exactly where to find each field. A CI/CD pipeline reads YAML so the build server knows which steps to execute. A knowledge graph stores relationships in RDF or JSON-LD so a search engine can traverse them. Without machine-readable formats, every system would need custom, fragile logic to interpret every other system’s output.³

In plain terms

Machine-readable formats are like standardised shipping containers. No matter what is inside --- electronics, furniture, food --- the container has fixed dimensions, a label in a known position, and a locking mechanism that every crane in the world understands. The container lets the port (the machine) move cargo (data) efficiently without opening it first to figure out what it is.

At a glance

The spectrum from unstructured to fully structured
graph LR
    A[Prose / Free Text] -->|add rules| B[Semi-Structured]
    B -->|add schema| C[Machine-Readable Formats]
    C -->|add constraints| D[Databases / Typed Schemas]
    style C fill:#4a9ede,color:#fff
Key: Data exists on a spectrum. Pure prose has no formal structure. Semi-structured data (like an email with subject, date, and body) has some predictable fields. Machine-readable formats impose consistent syntax and explicit structure. Databases add type constraints, indexing, and query languages on top. This card focuses on the middle layer --- the formats themselves.

How does it work?

1. What makes a format machine-readable

A format qualifies as machine-readable when it satisfies three conditions:¹

Defined syntax --- a published grammar that specifies which characters are allowed and what they mean. JSON uses braces, colons, and commas; YAML uses indentation and dashes; CSV uses commas and newlines.
Deterministic parsing --- any compliant parser, in any programming language, will produce the same internal data structure from the same input. There is no ambiguity about where one value ends and the next begins.
Explicit structure --- relationships between data points are encoded in the format itself (nesting, ordering, key names), not left for the reader to infer from context.
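
The first two conditions can be seen in a few lines of Python. This is a minimal sketch using the standard-library json module; the sample strings and the helper name is_valid_json are made up for illustration.

```python
import json

# Two spellings of the same document: whitespace differs, the data does not.
compact = '{"city":"Lausanne","population":140000}'
spaced = """
{
    "city": "Lausanne",
    "population": 140000
}
"""

# Deterministic parsing: both inputs yield the same dictionary.
same_result = json.loads(compact) == json.loads(spaced)


def is_valid_json(text: str) -> bool:
    """Return True if text follows the JSON grammar, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Defined syntax: a trailing comma breaks the grammar, so the parser
# rejects the input outright instead of guessing what was meant.
trailing_comma_ok = is_valid_json('{"city": "Lausanne",}')
```

Here same_result is True and trailing_comma_ok is False: the parser either returns one well-defined structure or refuses the input.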

Think of it like...

A library catalogue card. The card has fixed fields (title, author, subject, call number) in a fixed order. Any librarian, anywhere in the world, can read it and find the book. A handwritten note that says “that great novel by the woman from Nigeria” conveys the same information to a human, but no catalogue system can act on it.

2. The major formats

Each format was designed for a different primary use case. The table below summarises the most common ones:²³

Format	Structure	Supports comments?	Best for
JSON	Key-value pairs and arrays in braces	No	APIs, data exchange, configuration
YAML	Indentation-based hierarchy	Yes	DevOps config, CI/CD pipelines
XML	Nested opening/closing tags	Yes	Enterprise systems, document markup
CSV	Comma-separated rows and columns	No	Tabular data, spreadsheets, data pipelines
TOML	Section-based key-value pairs	Yes	Application configuration

Concept to explore

See json for a detailed breakdown of JSON syntax, data types, parsing, and real-world examples.

The same data in four formats
JSON:
{
  "city": "Lausanne",
  "country": "Switzerland",
  "population": 140000
}
YAML:
city: Lausanne
country: Switzerland
population: 140000
XML:
<city>
  <name>Lausanne</name>
  <country>Switzerland</country>
  <population>140000</population>
</city>
CSV:
city,country,population
Lausanne,Switzerland,140000
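Four notations, the same facts. Each parser knows exactly where 140000 is and which field it belongs to. One difference remains: JSON and YAML parsers return it as a number, while XML and CSV deliver it as text and leave the conversion to the consuming program.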
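
A rough sketch of how a program might read three of these notations, using only the Python standard library. YAML is left out because Python ships no YAML parser; the variable names are illustrative, and the XML is written on one line for brevity.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

json_text = '{"city": "Lausanne", "country": "Switzerland", "population": 140000}'
xml_text = (
    "<city><name>Lausanne</name><country>Switzerland</country>"
    "<population>140000</population></city>"
)
csv_text = "city,country,population\nLausanne,Switzerland,140000\n"

# JSON: the parser returns a dict, and population is already an int.
from_json = json.loads(json_text)

# XML: element content is text, so the number is converted by hand.
root = ET.fromstring(xml_text)
from_xml = {
    "city": root.findtext("name"),
    "country": root.findtext("country"),
    "population": int(root.findtext("population")),
}

# CSV: every field arrives as a string, so the same conversion is needed.
row = next(csv.DictReader(io.StringIO(csv_text)))
from_csv = {
    "city": row["city"],
    "country": row["country"],
    "population": int(row["population"]),
}

all_equal = from_json == from_xml == from_csv
```

After the explicit int() conversions, all three dictionaries compare equal, so all_equal is True.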

3. Serialisation and deserialisation

Serialisation is the process of converting an in-memory data structure (an object, a dictionary, a list) into a string of text (or, for binary formats, a sequence of bytes) that can be stored in a file or sent over a network. Deserialisation is the reverse --- reading that string back into a usable data structure.²

For example, when a weather API receives a request, it serialises the forecast data into a JSON string, sends it across the internet, and the client deserialises it back into an object it can display on screen. This round-trip is the fundamental operation that machine-readable formats enable.
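
The round-trip can be sketched in Python with the standard-library json module. The forecast values and field names below are invented for illustration, and no real weather API is involved.

```python
import json

# In-memory structure on the "server" side (all values are invented).
forecast = {
    "city": "Lausanne",
    "days": [
        {"date": "2024-05-01", "high_c": 18.5, "rain": False},
        {"date": "2024-05-02", "high_c": 21.0, "rain": True},
    ],
}

payload = json.dumps(forecast)   # serialise: dict -> str
received = json.loads(payload)   # deserialise: str -> dict

round_trip_ok = received == forecast
```

Here round_trip_ok is True because every value maps onto a JSON type. That is not guaranteed for every Python object: a tuple, for instance, comes back as a list, since JSON has only arrays.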

Think of it like...

Flat-pack furniture. The factory has a fully assembled bookshelf (in-memory object). To ship it, they disassemble it into numbered panels and pack it in a flat box with an instruction sheet (serialisation). When it arrives, the buyer reassembles it using the instructions (deserialisation). The instruction sheet is the format specification --- it guarantees the bookshelf comes out the same on the other end.

4. Why different formats exist

If JSON can represent almost anything, why do YAML, XML, CSV, and TOML still exist? Because each format optimises for a different set of trade-offs:³⁴

Human readability vs machine parsability --- YAML’s indentation-based syntax is easier for humans to scan than JSON’s braces, but a misplaced indent can silently change the meaning of a document instead of raising an error. JSON is stricter and less ambiguous for machines.
Simplicity vs expressiveness --- CSV is dead simple for flat, tabular data, but it cannot represent nesting or data types. XML can represent almost anything, including metadata via attributes and namespaces, at the cost of verbosity.
Comments --- JSON does not support comments. YAML, TOML, and XML do. For configuration files that humans edit and maintain, comments are essential for documenting intent.
Ecosystem fit --- Kubernetes manifests are conventionally written in YAML (the API also accepts JSON). Most web APIs have settled on JSON. Enterprise banking systems often require XML. The “best” format is frequently the one the surrounding ecosystem expects.⁴
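
The comments trade-off is easy to demonstrate. The sketch below feeds the same hypothetical setting, a port number with an explanatory comment, to Python's standard JSON and XML parsers; the config content and helper name are illustrative only.

```python
import json
import xml.etree.ElementTree as ET

json_with_comment = """
{
  // port the server listens on
  "port": 8080
}
""".strip()

xml_with_comment = """
<config>
  <!-- port the server listens on -->
  <port>8080</port>
</config>
""".strip()


def parses_as_json(text: str) -> bool:
    """Return True if the standard JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# The JSON grammar has no comment syntax, so the parser rejects the input.
json_ok = parses_as_json(json_with_comment)

# XML has a comment syntax; the default parser skips it and reads the value.
port = int(ET.fromstring(xml_with_comment).findtext("port"))
```

With these inputs json_ok is False and port is 8080. Some tools accept JSON dialects that allow comments, but those are extensions rather than standard JSON.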

Key distinction

No single format wins on every axis. The choice depends on who will read it (human, machine, or both), what it needs to represent (flat table, nested hierarchy, graph), and what the surrounding tools expect.

5. The structured-data spectrum

Data does not divide neatly into “structured” and “unstructured.” It sits on a spectrum:⁵

Unstructured --- prose, images, audio. Rich in meaning but opaque to machines without AI processing.
Semi-structured --- an email has predictable fields (from, to, date, subject) but a free-text body. Some structure, not enough for reliable automation.
Structured (machine-readable formats) --- JSON, YAML, XML, CSV. Explicit syntax, deterministic parsing, ready for pipelines.
Fully constrained --- relational databases with typed schemas, foreign keys, and constraints. The strictest end of the spectrum.

Machine-readable formats occupy the sweet spot: structured enough for automation, flexible enough that a human can create and edit them in a text editor.
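
The semi-structured case is worth a small illustration. In the sketch below, Python's standard email package parses a made-up message: the headers come back as named fields, while the body is returned as one undifferentiated string. The addresses and text are invented.

```python
from email import message_from_string

raw_email = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lausanne figures\n"
    "\n"
    "Hi Bob, the population is roughly 140000, give or take.\n"
)

msg = message_from_string(raw_email)

# The headers are structured: each one is addressable by name.
subject = msg["Subject"]
sender = msg["From"]

# The body is free text: the parser hands it back as a single string,
# and finding the population figure inside it is left to the reader.
body = msg.get_payload()
```

A program can route on sender or subject with certainty, but pulling the number out of body would need pattern matching or language processing, which is exactly the gap the structured formats close.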

Concept to explore

See structured-data-vs-prose for a deeper comparison of when structured formats are the right choice and when prose is genuinely better.

Why do we use it?

Key reasons

1. Automation. Pipelines, scripts, and agents can process machine-readable data without human intervention. A CI/CD system reads a YAML file and knows which tests to run. An API returns JSON and the client renders it on screen. No guessing, no manual parsing.³

2. Interoperability. When two systems agree on a format, they can exchange data seamlessly --- regardless of programming language, operating system, or vendor. JSON parsers are available for Python, JavaScript, Go, Java, and virtually every other modern language.²

3. Reliability. Deterministic parsing means the same input always produces the same output. A well-formed JSON document parses to the same structure in any compliant parser, apart from a few documented edge cases such as duplicate keys and numbers too large for a 64-bit float.

4. Composability. Machine-readable data can be transformed, filtered, merged, and piped between tools. You can convert YAML to JSON, flatten JSON to CSV, or enrich CSV with data from an API --- because every step reads and writes a known format.⁴
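
As one example of composability, flattening JSON to CSV takes only a few lines. This is a sketch that assumes the input is a JSON array of flat objects sharing the same keys; the function name and the second record are illustrative.

```python
import csv
import io
import json

json_text = """
[
  {"city": "Lausanne", "country": "Switzerland", "population": 140000},
  {"city": "Geneva", "country": "Switzerland", "population": 200000}
]
"""


def json_records_to_csv(text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(text)
    # Assumption: every record has the same keys as the first one.
    fieldnames = list(records[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = json_records_to_csv(json_text)
```

The result is a header row followed by one line per record. Nested objects would need a flattening rule first, because CSV has no way to express nesting.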

When do we use it?

When building or consuming an API that needs to send and receive structured data
When writing configuration files for applications, build systems, or infrastructure tools
When storing knowledge or metadata that software needs to traverse (e.g., frontmatter in a knowledge system)
When importing or exporting tabular data between spreadsheets, databases, and analytics tools
When defining schemas or contracts that describe the shape of data other systems will produce or consume
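
For the last case, a contract can be as small as a mapping from field names to expected types. The hand-rolled check below is a sketch only: CITY_CONTRACT and contract_violations are hypothetical names, and real projects would more likely reach for an established schema language such as JSON Schema.

```python
import json

# Hypothetical contract: field name -> expected Python type after parsing.
CITY_CONTRACT = {"city": str, "country": str, "population": int}


def contract_violations(text: str, contract: dict) -> list:
    """Return human-readable problems; an empty list means the data conforms."""
    data = json.loads(text)
    problems = []
    for field, expected in contract.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


good = '{"city": "Lausanne", "country": "Switzerland", "population": 140000}'
bad = '{"city": "Lausanne", "population": "140000"}'

good_problems = contract_violations(good, CITY_CONTRACT)
bad_problems = contract_violations(bad, CITY_CONTRACT)
```

The first document yields no problems; the second is reported as missing country and carrying population as a string rather than a number.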

Rule of thumb

If a human will write the data once and machines will read it many times, use a machine-readable format. If a human will read the data many times, choose a format that balances structure with readability (YAML or TOML over raw XML).

How can I think about it?

The universal power adapter

Imagine travelling the world with a bag of devices --- phone, laptop, camera. Every country has a different wall socket. A machine-readable format is like a universal power adapter.

Your device = the data (a charge of electricity, a piece of information)

The wall socket = the system that needs to receive it (an API, a database, a build server)

The adapter = the format (JSON, YAML, XML) that reshapes the data into the shape the socket expects

Standard prong shapes = the syntax rules (braces, indentation, tags)

Voltage marking on the label = metadata that tells the socket what to expect

Without the adapter, you are stuck hand-wiring a connection for every country. With it, you plug in anywhere. Machine-readable formats are the adapters that let data plug into any system.

The Rosetta Stone

The Rosetta Stone carries the same decree in three scripts: hieroglyphics, Demotic, and Greek. Each script is a “format” that a different audience can parse.

The decree = the underlying data

Hieroglyphics = XML (verbose, rich, ceremonial)

Demotic = YAML (everyday, readable, concise)

Greek = JSON (the lingua franca that the widest audience understands)

The stone itself = the file or network payload carrying the data

The genius of the Rosetta Stone was not the message --- it was encoding the same message in multiple parseable formats so that different readers could consume it. Machine-readable formats do the same thing for software.

Concepts to explore next

Concept	What it covers	Status
json	JSON syntax, data types, parsing, and real-world usage	complete
embeddings	Numerical vector representations of meaning for semantic search and similarity	complete
structured-data-vs-prose	When structured formats beat free text and vice versa	stub

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.

Check your understanding

Test yourself

Explain what makes a data format “machine-readable” and name the three properties every machine-readable format shares.

Name five common machine-readable formats and describe the primary use case for each.

Distinguish between serialisation and deserialisation. Give a concrete example of each.

Interpret this scenario: a team switches their API response format from XML to JSON and sees a 40% reduction in payload size. What property of JSON explains this?

Connect machine-readable formats to knowledge engineering: why does a knowledge system need its metadata in a machine-readable format rather than in prose?

Where this concept fits

Position in the knowledge graph
graph TD
    KE[Knowledge Engineering] --> MRF[Machine-Readable Formats]
    KE --> KG[Knowledge Graphs]
    MRF --> JSON[JSON]
    MRF --> EMB[Embeddings]
    MRF --> SDvP[Structured Data vs Prose]
    style MRF fill:#4a9ede,color:#fff
Related concepts:

apis --- APIs depend on machine-readable formats (typically JSON) to structure requests and responses between client and server

knowledge-graphs --- graph data (nodes, edges, properties) must be serialised into a machine-readable format for storage and exchange

databases --- databases sit at the fully-constrained end of the data-structure spectrum, adding typed schemas and query languages on top of structured formats
