Schema Building Guide
The output schema you define on a pipeline lets bem take a wide variety of inputs and normalize them into your desired output structure. General best practices and conventions for defining JSON Schema go a long way here; the recommendations below come from the bem team and are meant to improve the accuracy and reliability of your outputs.
Provide descriptions for every field
This is the most powerful change you can make to improve the accuracy of your output. bem won't know what you don't tell it (before you provide feedback), so natural language descriptions that state the purpose of the field, how it might appear in your input data, and example values go a long way toward improving how bem semantically analyzes your input.
Ensure you have types set on every field
Typing fields helps bem analyze your input against how you intend to use your output data. In addition to giving you reliably structured output that you can depend on in a variety of contexts, typing can also surface bad input or invalid data coming through your pipeline. Such values are highlighted as “invalid properties” that you can build your own error handling and fault tolerance around.
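As a rough way to check the two recommendations above before uploading a schema, the sketch below walks a schema and lists properties that lack a "type" or a "description". The helper name find_undocumented_fields and the sample schema are hypothetical and are not part of bem; this is only a local lint under those assumptions.

```python
def find_undocumented_fields(schema, path="$"):
    """Return (path, missing_key) pairs for properties lacking 'type' or 'description'."""
    problems = []
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}.{name}"
        for key in ("type", "description"):
            if key not in prop:
                problems.append((prop_path, key))
        # Recurse into nested objects and into array item schemas.
        problems.extend(find_undocumented_fields(prop, prop_path))
        items = prop.get("items")
        if isinstance(items, dict):
            problems.extend(find_undocumented_fields(items, prop_path + "[]"))
    return problems


# Hypothetical schema with two deliberate gaps.
example_schema = {
    "type": "object",
    "properties": {
        "fleetId": {"type": "string", "description": "Unique identifier for the fleet."},
        "tripId": {"type": "string"},  # missing description
        "driver": {
            "type": "object",
            "description": "Driver assigned to the trip.",
            "properties": {"name": {"description": "Name of the driver."}},  # missing type
        },
    },
}

print(find_undocumented_fields(example_schema))
```

Running a check like this in CI is one possible way to keep descriptions and types from drifting as a schema grows.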
Specify which fields you consider required
Similar to typing, marking fields as "required" helps bem focus on identifying context around those fields during the transformation process.
Just as with typing, fields that are "required" but can't be filled in from the input data are highlighted back to you as “invalid properties” that you can build your own error handling and fault tolerance around; we can help you find those needles in the haystack as you sort through a deluge of inputs.
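For building your own handling around missing required fields, a local check along these lines may be a useful starting point. It is a minimal sketch, not how bem reports “invalid properties”: the function name missing_required and the sample data are hypothetical, and treating null values as missing is an assumption (JSON Schema's "required" only checks that the key is present).

```python
def missing_required(schema, data, path="$"):
    """List paths of required properties that are absent or null in data."""
    missing = []
    for name in schema.get("required", []):
        if data.get(name) is None:
            missing.append(f"{path}.{name}")
    # Descend into nested objects that are present in the data.
    for name, sub_schema in schema.get("properties", {}).items():
        value = data.get(name)
        if isinstance(value, dict):
            missing.extend(missing_required(sub_schema, value, f"{path}.{name}"))
    return missing


trip_schema = {
    "type": "object",
    "required": ["fleetId", "driver"],
    "properties": {
        "fleetId": {"type": "string"},
        "driver": {
            "type": "object",
            "required": ["id", "name"],
            "properties": {"id": {"type": "string"}, "name": {"type": "string"}},
        },
    },
}

# Hypothetical transformation output with two gaps.
output = {"driver": {"name": "Sam Lee"}}
print(missing_required(trip_schema, output))
```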
Provide formatting hints for fields if necessary
If a field you want to populate has a fixed format, you can specify the format as a regular expression in the conventional JSON Schema "pattern" field.
We've also seen great results from describing a pattern in natural language in description fields, but if you have more stringent formatting expectations we'd recommend setting a regex "pattern".
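The sketch below combines both approaches on a hypothetical invoice number field: the description states the format in natural language, and "pattern" pins it down as a regex. The field, its format, and the helper matches_pattern are assumptions for illustration. Python's re module is used as a stand-in for the regex dialect JSON Schema specifies, which is close enough for simple patterns like this one.

```python
import re

# Hypothetical property schema: description plus a strict regex pattern.
invoice_number_schema = {
    "type": "string",
    "description": "Invoice number, e.g. 'INV-004217'. Always 'INV-' followed by six digits.",
    "pattern": "^INV-[0-9]{6}$",
}


def matches_pattern(prop_schema, value):
    """Check a value against the schema's pattern.

    JSON Schema patterns are not implicitly anchored, so re.search is used
    and the anchors (^ and $) are written into the pattern itself.
    """
    return re.search(prop_schema["pattern"], value) is not None


print(matches_pattern(invoice_number_schema, "INV-004217"))
print(matches_pattern(invoice_number_schema, "INV-4217"))
```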
For date strings, we currently support only ISO 8601 formatting and do not support regex patterns. As an example, the date '1/01/2024' in a given input will be formatted as '2024-01-01'.
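To make the date example concrete, here is the same conversion done locally with Python's standard library. This only illustrates the target format and is not bem's implementation; the helper name to_iso_date and the month-first reading of the input are assumptions.

```python
from datetime import datetime


def to_iso_date(raw, input_format="%m/%d/%Y"):
    """Parse a date string (month-first by assumption) and return it as ISO 8601."""
    return datetime.strptime(raw, input_format).date().isoformat()


print(to_iso_date("1/01/2024"))  # 2024-01-01
```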
Set enums for fields you know have a certain set of possible values
Enum values help constrain the set of valid values that bem transforms into a given property. At the moment, this isn't a “strict” constraint but generally helps bem understand the intent behind the desired transformation. Think of it as a stronger way to indicate desired output than providing example values in a description.
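Because the enum is not a strict constraint, it may be worth checking outputs against it yourself. The sketch below assumes a hypothetical set of compliance statuses and a hypothetical helper outside_enum; neither comes from bem.

```python
# Hypothetical property schema; the enum values are illustrative only.
compliance_status_schema = {
    "type": "string",
    "description": "Compliance status with hours of service.",
    "enum": ["compliant", "violation", "unknown"],
}


def outside_enum(prop_schema, value):
    """Return True when the schema declares an enum and the value is not in it."""
    allowed = prop_schema.get("enum")
    return allowed is not None and value not in allowed


print(outside_enum(compliance_status_schema, "compliant"))
print(outside_enum(compliance_status_schema, "mostly fine"))
```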
Example Schema
Below is an example schema showcasing the above best practices that can be used to normalize inputs from a variety of commercial vehicle electronic logging device (ELD) providers.
{
"type": "object",
"title": "Fleet Trip Summary",
"$schema": "http://json-schema.org/draft-07/schema#",
"required": ["fleetId", "tripSummary", "compliance", "operationalEfficiency"],
"properties": {
"fleetId": {
"type": "string",
"description": "Unique identifier for the fleet."
},
"compliance": {
"type": "object",
"required": ["hoursOfServiceCompliance", "notes"],
"properties": {
"notes": {
"type": "string",
"description": "Additional notes on compliance."
},
"hoursOfServiceCompliance": {
"type": "string",
"description": "Compliance status with hours of service."
}
}
},
"tripSummary": {
"type": "object",
"required": [
"tripId",
"vehicle",
"driver",
"start",
"end",
"distanceCovered",
"fuelUsage",
"incidents"
],
"properties": {
"end": {
"type": "object",
"required": ["time", "location", "odometerEnd"],
"properties": {
"time": {
"type": "string",
"format": "date-time",
"description": "End time of the trip."
},
"location": {
"type": "string",
"description": "End location of the trip."
},
"odometerEnd": {
"type": "integer",
"description": "Odometer reading at the end of the trip."
}
}
},
"start": {
"type": "object",
"required": ["time", "location", "odometerStart"],
"properties": {
"time": {
"type": "string",
"format": "date-time",
"description": "Start time of the trip."
},
"location": {
"type": "string",
"description": "Start location of the trip."
},
"odometerStart": {
"type": "integer",
"description": "Odometer reading at the start of the trip."
}
}
},
"driver": {
"type": "object",
"required": ["id", "name"],
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the driver."
},
"name": {
"type": "string",
"description": "Name of the driver."
}
}
},
"tripId": {
"type": "string",
"description": "Unique identifier for the trip."
},
"vehicle": {
"type": "object",
"required": ["id", "details"],
"properties": {
"id": {
"type": "string",
"description": "Unique identifier for the vehicle."
},
"details": {
"type": "string",
"description": "Description of the vehicle including make, model, and year."
}
}
},
"fuelUsage": {
"type": "object",
"required": ["totalGallons", "averagePricePerGallon", "totalCost"],
"properties": {
"totalCost": {
"type": "string",
"description": "Total cost of fuel."
},
"totalGallons": {
"type": "number",
"description": "Total gallons of fuel used."
},
"averagePricePerGallon": {
"type": "string",
"description": "Average price per gallon of fuel."
}
}
},
"incidents": {
"type": "array",
"items": {
"type": "object",
"required": ["type", "time", "location", "details"],
"properties": {
"time": {
"type": "string",
"format": "date-time",
"description": "Time of the incident."
},
"type": {
"type": "string",
"description": "Type of incident."
},
"details": {
"type": "string",
"description": "Detailed description of the incident."
},
"location": {
"type": "string",
"description": "Location of the incident."
}
}
}
},
"distanceCovered": {
"type": "string",
"description": "Total distance covered during the trip."
}
}
},
"operationalEfficiency": {
"type": "object",
"required": ["totalEngineHours", "idleTime", "efficiencyRating"],
"properties": {
"idleTime": {
"type": "string",
"description": "Total idle time during the trip."
},
"efficiencyRating": {
"type": "string",
"description": "Efficiency rating of the trip."
},
"totalEngineHours": {
"type": "string",
"description": "Total engine hours for the trip."
}
}
}
}
}
Avoid Positional Schemas
Positional schemas rely on array indices or specific positions to convey meaning, which creates brittle and unreliable data extraction. These patterns should be avoided because:
- Poor LLM Performance: Language models work best with semantic relationships rather than positional dependencies
- Fragile Structure: Small changes in input format can break the entire extraction
- Unclear Intent: Position-based schemas are harder to understand and maintain
- Error Prone: Missing or reordered elements cause incorrect data mapping
{
"type": "object",
"properties": {
"rates": {
"type": "array",
"description": "Shipping rates",
"items": {
"type": "number"
}
},
"weights": {
"type": "array",
"description": "Weights for each shipping rate",
"items": {
"type": "number"
}
}
}
}
Note how "rates" and "weights" are both arrays. This schema attempts to correlate them by position, which is brittle and not recommended.
Preferred: Semantic Object Pattern
{
"type": "object",
"properties": {
"shippingRates": {
"type": "array",
"description": "Shipping rates with weights",
"items": {
"type": "object",
"properties": {
"rate": {
"type": "number",
"description": "Shipping rate"
},
"weight": {
"type": "number",
"description": "Weight"
}
}
}
}
}
}
Note that "rate" and "weight" are directly associated by being in the same object.
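The difference between the two shapes can be sketched in a few lines. The helper pair_rates_with_weights is hypothetical; it shows that parallel arrays can only be joined when their lengths happen to agree, whereas the semantic form carries the association in each item.

```python
def pair_rates_with_weights(rates, weights):
    """Join parallel arrays into semantic objects, refusing mismatched lengths."""
    if len(rates) != len(weights):
        # With positional data a single missing element shifts every later pairing.
        raise ValueError(f"{len(rates)} rates but {len(weights)} weights")
    return [{"rate": rate, "weight": weight} for rate, weight in zip(rates, weights)]


# Positional form: correct only while both arrays stay aligned.
print(pair_rates_with_weights([5.0, 8.5], [1, 5]))

# One weight missing: the positional form can no longer be trusted.
try:
    pair_rates_with_weights([5.0, 8.5, 12.0], [1, 5])
except ValueError as error:
    print("cannot pair:", error)
```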
Defaults
Defaults may be provided using the "default" key. The default value is inserted if no value could be extracted from the input.
{
"type": "object",
"properties": {
"currency": {
"type": "string",
"enum": ["USD","EUR","GBP"],
"default": "USD"
}
}
}
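The default behavior described above can be approximated locally as follows. This is a sketch for top-level properties only, under the assumption that a missing or null value counts as "not extracted"; apply_defaults is a hypothetical helper, not a bem function.

```python
def apply_defaults(schema, extracted):
    """Fill in schema defaults for top-level properties with no extracted value."""
    result = dict(extracted)
    for name, prop in schema.get("properties", {}).items():
        if "default" in prop and result.get(name) is None:
            result[name] = prop["default"]
    return result


currency_schema = {
    "type": "object",
    "properties": {
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"], "default": "USD"}
    },
}

print(apply_defaults(currency_schema, {}))
print(apply_defaults(currency_schema, {"currency": "EUR"}))
```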