Parse Functions
Render documents into a navigable structure of sections, entities, and relationships — designed to be walked by an LLM agent
Parse functions turn a document into a navigable representation of itself. Instead of returning schema-bound JSON, a Parse function emits page-aware sections, named entities (people, organizations, products, identifiers, datasets, …), and the relationships between them. Output is queryable via the File System API using Unix-shell verbs — ls, cat, grep, head, find, open, xref — so an LLM agent can browse a corpus the way a developer browses source code.
This is the alternative to a RAG pipeline for use cases where the questions aren't known up front. The agent decides what to read next; the platform's job is to keep parsed documents addressable and the entity graph fresh.
When to use
Use a Parse function when you need to:
- Stand up an agent loop over a document corpus without building a chunker, embedder, and vector store
- Extract entities and relationships from documents where the schema isn't known in advance
- Maintain a cross-document memory — one canonical record per real-world thing — across an environment
- Power retrieval that needs to reach beyond a top-K window: grep across the corpus, xref to every section that mentions an entity
If you already know exactly which fields you need out of a document, an Extract function is the simpler tool.
Configuration fields
Required fields
| Field | Type | Description |
|---|---|---|
| functionName | string | Unique identifier for the function (per environment) |
| type | string | Must be "parse" |
Optional fields
| Field | Type | Default | Description |
|---|---|---|---|
| displayName | string | (none) | Human-readable display name |
| tags | string[] | (none) | Tags for organization |
| parseConfig.extractEntities | boolean | true | Extract named entities and relationships in addition to sections |
| parseConfig.linkAcrossDocuments | boolean | true | Link entities across documents in the environment to build a cross-doc memory |
Output structure
Every parse call emits a single Transformation whose JSON has three top-level arrays:
| Array | Always populated | What it contains |
|---|---|---|
| sections | yes | Page-aware chunks of the document: labels, types (heading, paragraph, table, list, …), page numbers, and content |
| entities | only when extractEntities=true | Named entities pulled out of the document, deduped by canonical name within the doc and counted by mention |
| relationships | only when extractEntities=true | Relationships between entities (e.g. Author A affiliated_with Institution B) |
sections is the anchor: it's what cat, head, grep, and xref read against. entities and relationships are the per-document slice of the entity graph; the cross-environment view lives in the Memory tab of the dashboard and is reachable via find / open / xref.
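As a rough illustration of the three-array shape described above, the sketch below counts what a parse Transformation contains. Only the array names sections, entities, and relationships come from this page; the function name and the placeholder items are hypothetical, and the fields inside each item are deliberately not inspected because they are not specified here.

```python
from typing import Any


def summarize_parse_output(transformation: dict[str, Any]) -> dict[str, int]:
    """Count the items in each top-level array of a parse Transformation.

    Only the three array names come from the documentation; a missing or
    null array is treated as empty.
    """
    return {
        name: len(transformation.get(name) or [])
        for name in ("sections", "entities", "relationships")
    }


# Hypothetical payload: item contents are placeholders, not real output.
example = {
    "sections": [{"placeholder": "section 1"}, {"placeholder": "section 2"}],
    "entities": [],
    "relationships": [],
}

counts = summarize_parse_output(example)
print(counts)
```

Empty entities and relationships arrays are what a function with extractEntities=false would return, though a document with no recognizable entities would presumably look the same.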
The two toggles, in detail
extractEntities
When true (the default), each parse call extracts entities and relationships alongside sections, and dedupes entities by canonical name within the document. When false, only sections[] is emitted; entities[] and relationships[] come back empty.
Turning entities off is rarely the right call — they're cheap on top of the section pass and they're what lets grep scope to entities or relationships later. Leave the default unless you have a specific reason to drop them.
linkAcrossDocuments
When true (the default), after each parse the platform runs a cross-document resolver that merges this document's entities with entities seen in earlier documents in the same environment, building one canonical record per real-world thing across the corpus. Surface forms (bem.ai, bem, Brilliant Enterprise Magic, Inc.) collapse onto one entityID.
This toggle:
- Doesn't change the per-call parse output: entities remain attached to the document via entity_mentions
- Is required for the memory-level File System ops (find, open, xref); with linking off, those ops return an empty list and a hint pointing at the toggle
- Requires extractEntities=true (linking has nothing to link otherwise)
The resolver runs asynchronously after the parse event is dispatched, so the Memory tab is eventually consistent: it can lag by a few seconds while the resolver catches up.
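Because the resolver lags the parse by a few seconds, a client that queries memory right after parsing may want to retry briefly. The helper below is a minimal sketch of that idea, not a platform API: wait_for_memory, its parameters, and the "entity-123" value are all hypothetical, and lookup stands in for whatever memory-level query you actually run.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_memory(
    lookup: Callable[[], Optional[T]],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
) -> Optional[T]:
    """Call `lookup` until it returns a truthy value or the timeout elapses.

    `lookup` is a caller-supplied placeholder for a memory-level query;
    it is not part of the platform. Returns the first truthy result, or
    None if nothing showed up before the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = lookup()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


# Stand-in lookup that comes back empty twice before the entity appears.
responses = iter([[], [], ["entity-123"]])
found = wait_for_memory(lambda: next(responses), timeout_s=5.0, interval_s=0.01)
print(found)
```

The default timeout is a guess sized to the "few seconds" of lag mentioned above; tune it to what you observe in your environment.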
Example: minimal Parse function
```json
{
  "functionName": "paper-parser",
  "type": "parse",
  "displayName": "Research Paper Parser"
}
```

parseConfig is omitted, so both toggles default to true. This is the canonical setup for a "parse-and-navigate" pipeline.
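To make the default behavior concrete, here is a small client-side sketch that resolves the toggles a function config ends up with and flags the one combination this page rules out (linking without entity extraction). The helper name and the ValueError are assumptions for illustration; this page does not say how the platform itself responds to that combination.

```python
from typing import Any

# Defaults taken from the optional-fields table on this page.
PARSE_CONFIG_DEFAULTS = {"extractEntities": True, "linkAcrossDocuments": True}


def effective_parse_config(function_config: dict[str, Any]) -> dict[str, bool]:
    """Return the toggles a function config resolves to, with defaults applied.

    Raises ValueError for linking without entity extraction. This is a
    local pre-flight check only, not documented platform behavior.
    """
    overrides = function_config.get("parseConfig") or {}
    resolved = {**PARSE_CONFIG_DEFAULTS, **overrides}
    if resolved["linkAcrossDocuments"] and not resolved["extractEntities"]:
        raise ValueError("linkAcrossDocuments=true requires extractEntities=true")
    return resolved


minimal = {"functionName": "paper-parser", "type": "parse"}
print(effective_parse_config(minimal))
```

Note that under this check, setting only extractEntities to false leaves linkAcrossDocuments at its default of true and is rejected, so a sections-only config should turn both toggles off explicitly.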
Example: sections only, no memory
```json
{
  "functionName": "draft-parser",
  "type": "parse",
  "parseConfig": {
    "extractEntities": false,
    "linkAcrossDocuments": false
  }
}
```

Use this when you only want the navigable document structure (ls, cat, head, grep) and don't need the entity graph, for example in drafting tools that surface sections to a human reviewer.
Querying parsed output
Parsed documents are read through the File System API at POST /v3/fs. The verbs split into two groups:
- Doc-level ops (ls, cat, head, grep, stat): work on every parsed document, regardless of toggles.
- Memory-level ops (find, open, xref): work on the cross-document entity graph. Require linkAcrossDocuments=true on the parse function that produced the docs.
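An agent loop can use this split to avoid issuing memory-level ops that will only come back empty. The sketch below encodes the two groups as listed above; the helper names are hypothetical, and nothing here calls the File System API or assumes its request format.

```python
DOC_LEVEL_OPS = frozenset({"ls", "cat", "head", "grep", "stat"})
MEMORY_LEVEL_OPS = frozenset({"find", "open", "xref"})


def op_level(op: str) -> str:
    """Classify a File System verb as 'doc' or 'memory' per the split above."""
    if op in DOC_LEVEL_OPS:
        return "doc"
    if op in MEMORY_LEVEL_OPS:
        return "memory"
    raise ValueError(f"unknown File System op: {op!r}")


def op_is_useful(op: str, link_across_documents: bool = True) -> bool:
    """Return False for memory-level ops when linking is off.

    With linking off, those ops return an empty list and a hint rather
    than results, so a caller can skip them up front.
    """
    return op_level(op) == "doc" or link_across_documents


print(op_is_useful("grep", link_across_documents=False))
print(op_is_useful("xref", link_across_documents=False))
```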
For a worked example, see the Parse and Search over Contracts cookbook.
Related
- Parse and Search over Contracts (cookbook): parse a contract, then search and reason over it with the File System API
- File System API reference: POST /v3/fs, every op, every flag
- Create a Function API: API reference for creating functions
- Extract Functions: schema-bound extraction, the alternative when the fields are known up front