Parse Functions
Render documents into a navigable structure of sections, entities, and relationships — designed to be walked by an LLM agent
Parse functions turn a document into a navigable representation of itself. Instead of returning schema-bound JSON, a Parse function emits page-aware sections, named entities (people, organizations, products, identifiers, datasets, …), and the relationships between them. Output is queryable via the File System API using Unix-shell verbs — ls, cat, grep, head, find, open, xref — so an LLM agent can browse a corpus the way a developer browses source code.
This is the alternative to a RAG pipeline for use cases where the questions aren't known up front. The agent decides what to read next; the platform's job is to keep parsed documents addressable and the entity graph fresh.
When to use
Use a Parse function when you need to:
- Stand up an agent loop over a document corpus without building a chunker, embedder, and vector store
- Extract entities and relationships from documents where the schema isn't known in advance
- Maintain a cross-document memory — one canonical record per real-world thing — across an environment
- Power retrieval that needs to reach beyond a top-K window: grep across the corpus, xref to every section that mentions an entity
If you already know exactly which fields you need out of a document, an Extract function is the simpler tool.
Configuration fields
Required fields
| Field | Type | Description |
|---|---|---|
| functionName | string | Unique identifier for the function (per environment) |
| type | string | Must be "parse" |
Optional fields
| Field | Type | Default | Description |
|---|---|---|---|
| displayName | string | (none) | Human-readable display name |
| tags | string[] | (none) | Tags for organization |
| parseConfig.extractEntities | boolean | true | Extract named entities and relationships in addition to sections |
| parseConfig.linkAcrossDocuments | boolean | true | Link entities across documents in the environment to build a cross-doc memory |
Output structure
Every parse call emits a single Transformation whose JSON has three top-level arrays:
| Array | Always populated | What it contains |
|---|---|---|
| sections | yes | Page-aware chunks of the document: labels, types (heading, paragraph, table, list, …), page numbers, and content |
| entities | only when extractEntities=true | Named entities pulled out of the document, deduped by canonical name within the doc and counted by mention |
| relationships | only when extractEntities=true | Relationships between entities (e.g. Author A affiliated_with Institution B) |
sections is the anchor: it's what cat, head, grep, and xref read against. entities and relationships are the per-document slice of the entity graph; the cross-environment view lives in the Memory tab of the dashboard and is reachable via find / open / xref.
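To make the shape concrete, here is a minimal Python sketch that walks a parse output locally. The field names type, label, page, and content are taken from the table example on this page; the sample sections, the helper name grep_sections, and the idea of searching locally are illustrative assumptions, not the File System API's grep implementation.

```python
import re

# Hypothetical parse output. Only `sections` is populated here,
# as it would be with extractEntities=false.
transformation = {
    "sections": [
        {"type": "heading", "label": "Specifications", "page": 2,
         "content": "Specifications"},
        {"type": "paragraph", "label": "Specifications", "page": 2,
         "content": "The unit has a mass of 12.5 kg."},
        {"type": "paragraph", "label": "Warranty", "page": 3,
         "content": "Coverage lasts two years."},
    ],
    "entities": [],
    "relationships": [],
}


def grep_sections(doc, pattern):
    """Return (page, type, content) for every section whose content matches."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [
        (s["page"], s["type"], s["content"])
        for s in doc.get("sections", [])
        if rx.search(s.get("content", ""))
    ]


hits = grep_sections(transformation, r"mass")
for page, kind, content in hits:
    print(f"p{page} [{kind}] {content}")
```

Because every section carries its page number and type, a match can be reported with enough context for an agent to decide what to read next.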
Table sections
Sections of type: "table" carry both a tab-separated content string (for backward compatibility) and a structured table object so downstream LLM consumers — especially smaller models doing RAG-style retrieval — can address cells by header name without parsing separators.
```json
{
  "type": "table",
  "label": "Specifications",
  "page": 2,
  "content": "Property\tValue\tUnit\nMass\t12.5\tkg\nLength\t3.2\tm",
  "table": {
    "headers": ["Property", "Value", "Unit"],
    "rows": [
      { "Property": "Mass", "Value": "12.5", "Unit": "kg" },
      { "Property": "Length", "Value": "3.2", "Unit": "m" }
    ],
    "cells": [
      ["Mass", "12.5", "kg"],
      ["Length", "3.2", "m"]
    ]
  }
}
```

- headers: the column header texts in left-to-right order. Empty array when the table has no visible header row.
- rows: one keyed dict per data row; keys are the header names. Duplicate headers get a __N suffix on the second and later occurrences ("Price", "Price__2"); headerless tables get synthesized col_1, col_2, … keys. This guarantees no cell is lost when projecting to dicts.
- cells: the raw positional rows, parallel to headers. Use this when your consumer needs the original column ordering regardless of header de-duplication.
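The key rules above can be sketched as a small projection from cells to rows. This is an illustrative reimplementation of the described behavior, not the platform's code: header_keys and project_rows are hypothetical names, rows are assumed to match the header width, and the sketch does not handle a literal header that already looks like a generated key (for example, a real column named Price__2).

```python
def header_keys(headers, width):
    """Build row-dict keys following the de-duplication rules described above."""
    if not headers:
        # Headerless table: synthesize col_1, col_2, ...
        return [f"col_{i}" for i in range(1, width + 1)]
    seen = {}
    keys = []
    for h in headers:
        seen[h] = seen.get(h, 0) + 1
        # The first occurrence keeps its name; later ones get a __N suffix.
        keys.append(h if seen[h] == 1 else f"{h}__{seen[h]}")
    return keys


def project_rows(headers, cells):
    """Turn positional cells into keyed dicts, one dict per data row."""
    width = max((len(row) for row in cells), default=len(headers))
    keys = header_keys(headers, width)
    return [dict(zip(keys, row)) for row in cells]


rows = project_rows(["Item", "Price", "Price"], [["Widget", "4.00", "3.50"]])
print(rows)
```

Without the suffix, the second Price column would overwrite the first when the row is turned into a dict, which is the cell loss the rule is meant to prevent.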
All cells are emitted as strings — consumers cast as needed. Merged cells and multi-row headers are not represented; visually-merged cells are flattened by repeating the value across affected rows.
The table field is present only on type: "table" sections; paragraphs, lists, and other section types never carry it.
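Because every cell is a string and only table sections carry the table field, a consumer typically filters by type, looks a row up by header name, and casts the result. The following is a minimal sketch using the example section above; lookup and to_number are hypothetical helpers, the Intro paragraph is made up for contrast, and Decimal is just one reasonable choice of numeric type.

```python
from decimal import Decimal, InvalidOperation

sections = [
    {"type": "paragraph", "label": "Intro", "page": 1, "content": "Overview."},
    {
        "type": "table",
        "label": "Specifications",
        "page": 2,
        "content": "Property\tValue\tUnit\nMass\t12.5\tkg\nLength\t3.2\tm",
        "table": {
            "headers": ["Property", "Value", "Unit"],
            "rows": [
                {"Property": "Mass", "Value": "12.5", "Unit": "kg"},
                {"Property": "Length", "Value": "3.2", "Unit": "m"},
            ],
            "cells": [["Mass", "12.5", "kg"], ["Length", "3.2", "m"]],
        },
    },
]


def to_number(text):
    """Cast a cell string to Decimal; return None if it is missing or not numeric."""
    if text is None:
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


def lookup(sections, label, key_column, key, value_column):
    """Find one cell by header name in the table section with the given label."""
    for section in sections:
        table = section.get("table")  # absent on non-table sections
        if section.get("type") != "table" or table is None:
            continue
        if section.get("label") != label:
            continue
        for row in table["rows"]:
            if row.get(key_column) == key:
                return row.get(value_column)
    return None


mass = to_number(lookup(sections, "Specifications", "Property", "Mass", "Value"))
print(mass)
```

Addressing the cell as row["Value"] rather than by column index keeps the consumer working if column order changes between documents.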
The two toggles, in detail
extractEntities
When true (the default), each parse call extracts entities and relationships alongside sections, and dedupes entities by canonical name within the document. When false, only sections[] is emitted; entities[] and relationships[] come back empty.
Turning entities off is rarely the right call — they're cheap on top of the section pass and they're what lets grep scope to entities or relationships later. Leave the default unless you have a specific reason to drop them.
linkAcrossDocuments
When true (the default), after each parse the platform runs a cross-document resolver that merges this document's entities with entities seen in earlier documents in the same environment, building one canonical record per real-world thing across the corpus. Surface forms (bem.ai, bem, Brilliant Enterprise Magic, Inc.) collapse onto one entityID.
This toggle:
- Doesn't change the per-call parse output: entities remain attached to the document via entity_mentions
- Is required for the memory-level File System ops (find, open, xref); with linking off, those ops return an empty list and a hint pointing at the toggle
- Requires extractEntities=true (linking has nothing to link otherwise)
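The defaults and the dependency between the two toggles can be checked client-side before a function is created. The sketch below is an assumption-laden pre-flight check, not platform behavior: resolve_parse_config is a hypothetical helper, and this page does not say whether the platform rejects or silently ignores linkAcrossDocuments=true with extractEntities=false, so raising an error is this sketch's own choice.

```python
def resolve_parse_config(function_config):
    """Apply the documented defaults and check the toggle dependency."""
    parse_config = function_config.get("parseConfig") or {}
    resolved = {
        # Both toggles default to true when omitted.
        "extractEntities": parse_config.get("extractEntities", True),
        "linkAcrossDocuments": parse_config.get("linkAcrossDocuments", True),
    }
    if resolved["linkAcrossDocuments"] and not resolved["extractEntities"]:
        # Linking has nothing to link when no entities are extracted.
        raise ValueError("linkAcrossDocuments=true requires extractEntities=true")
    return resolved


print(resolve_parse_config({"functionName": "paper-parser", "type": "parse"}))
```

Note that setting only extractEntities to false leaves linkAcrossDocuments at its default of true, which is why a sections-only configuration sets both toggles explicitly.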
The resolver runs asynchronously after the parse event is dispatched, so the Memory tab is eventually consistent: it can lag a parse by a few seconds while the resolver catches up.
Example: minimal Parse function
```json
{
  "functionName": "paper-parser",
  "type": "parse",
  "displayName": "Research Paper Parser"
}
```

parseConfig is omitted, so both toggles default to true. This is the canonical setup for a "parse-and-navigate" pipeline.
Example: sections only, no memory
```json
{
  "functionName": "draft-parser",
  "type": "parse",
  "parseConfig": {
    "extractEntities": false,
    "linkAcrossDocuments": false
  }
}
```

Use this when you only want the navigable document structure (ls, cat, head, grep) and don't need the entity graph, for example in drafting tools that surface sections to a human reviewer.
Querying parsed output
Parsed documents are read through the File System API at POST /v3/fs. The verbs split into two groups:
- Doc-level ops (ls, cat, head, grep, stat): work on every parsed document, regardless of toggles.
- Memory-level ops (find, open, xref): work on the cross-document entity graph. Require linkAcrossDocuments=true on the parse function that produced the docs.
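The split above can be encoded client-side so an agent skips memory-level calls that are known to come back empty. The sketch below does not call POST /v3/fs; it only classifies verbs. The op names come from the two lists above, while describe_op and its return strings are hypothetical.

```python
DOC_LEVEL_OPS = frozenset({"ls", "cat", "head", "grep", "stat"})
MEMORY_LEVEL_OPS = frozenset({"find", "open", "xref"})


def describe_op(op, link_across_documents=True):
    """Classify a File System verb and say whether it can return results."""
    if op in DOC_LEVEL_OPS:
        # Doc-level ops work on every parsed document, regardless of toggles.
        return "doc-level: available"
    if op in MEMORY_LEVEL_OPS:
        if link_across_documents:
            return "memory-level: available"
        # With linking off, the platform returns an empty list plus a hint.
        return "memory-level: empty unless linkAcrossDocuments=true"
    raise ValueError(f"unknown File System op: {op!r}")


for verb in ("grep", "xref"):
    print(verb, "->", describe_op(verb, link_across_documents=False))
```

An agent loop could consult a check like this when building its tool list, offering find, open, and xref only for corpora parsed with linking enabled.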
For a worked example, see the Parse and Search over Contracts cookbook.
Related
Parse and Search over Contracts
Cookbook: parse a contract, search and reason over it with the File System API
File System API reference
POST /v3/fs — every op, every flag
Create a Function API
API reference for creating functions
Extract Functions
Schema-bound extraction — the alternative when the fields are known up front