Schema Building Guide

Best practices for designing the JSON Schema you give to Extract and Join functions

The outputSchema you give to an Extract or Join function is what bem uses to normalize many different inputs into one consistent shape. The recommendations below come from running these schemas in production — they meaningfully improve accuracy and reduce the volume of follow-up corrections you have to ship.

Provide descriptions for every field

This is the single highest-leverage thing you can do. Until you start providing feedback, bem knows only what the schema tells it, so a one-line natural-language description (the field's purpose, how it might appear in source documents, and an example value) measurably improves extraction quality. Treat field descriptions as prompts, not as JavaDoc.
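
As a rough sketch of what such a description can look like, the fragment below covers purpose, typical appearance in source documents, and an example value in one string. The field name `invoiceNumber`, the label variants, and the example value are hypothetical and chosen only for illustration. The `$comment` key is the standard draft-07 annotation keyword, used here only to label the example.

```json
{
  "type": "object",
  "properties": {
    "invoiceNumber": {
      "$comment": "Hypothetical field, for illustration only.",
      "type": "string",
      "description": "Identifier the vendor assigned to this invoice. Usually printed near the top of the document and labeled 'Invoice #', 'Invoice No.', or 'Inv. Number'. Example: INV-2024-00317."
    }
  }
}
```

Compare this with a description such as "Invoice number.", which restates the field name and gives the extraction nothing extra to anchor on.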

Set a type on every field

Typing helps bem ground its extraction. It also gives you reliably structured output downstream and surfaces bad source data: any field that has a type but couldn't be populated comes back as an "invalid property" that you can build error handling around.

Mark genuinely required fields as required

required tells bem which fields to anchor on. Required fields that can't be populated from the input also surface as "invalid properties," so you can fail fast on documents that are missing critical data instead of silently extracting partial records.
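
A minimal sketch of this, assuming a hypothetical invoice schema: the two fields a record is useless without are listed in `required`, while a field that is often absent from source documents is left optional. All field names here are illustrative, and `$comment` only labels the example.

```json
{
  "$comment": "Hypothetical schema, for illustration only.",
  "type": "object",
  "required": ["invoiceNumber", "totalAmount"],
  "properties": {
    "invoiceNumber": {
      "type": "string",
      "description": "Identifier the vendor assigned to this invoice."
    },
    "totalAmount": {
      "type": "number",
      "description": "Grand total due, including tax. Example: 1284.50."
    },
    "purchaseOrderNumber": {
      "type": "string",
      "description": "Buyer's purchase order reference. Often missing, so it is not listed in required."
    }
  }
}
```

Listing every field as required would blur this signal, since a document missing only an optional value would then be flagged the same way as one missing critical data.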

Provide formatting hints for fields if necessary

If a field you want to populate has a fixed format, you can specify that format as a regular expression in the standard JSON Schema pattern keyword. We've also seen great results from describing the format in natural language in the description field, but if you have stricter formatting requirements we recommend setting a regex pattern.

For date strings, we support only ISO 8601 formatting and do not support regex patterns at the moment. For example, the date '1/01/2024' in a given input will be formatted as '2024-01-01'.
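
The two formatting mechanisms above might be combined as in the sketch below. The field names `vin` and `serviceDate` are hypothetical. The regex encodes the usual 17-character VIN alphabet (no I, O, or Q). The date field relies on the standard JSON Schema `date` format rather than a pattern. Declaring a date-only field this way is an assumption; the example schema on this page uses `date-time` for timestamps.

```json
{
  "$comment": "Hypothetical fields, for illustration only.",
  "type": "object",
  "properties": {
    "vin": {
      "type": "string",
      "pattern": "^[A-HJ-NPR-Z0-9]{17}$",
      "description": "Vehicle identification number: 17 uppercase letters and digits, never I, O, or Q. Example: 1HGCM82633A004352."
    },
    "serviceDate": {
      "type": "string",
      "format": "date",
      "description": "Date the vehicle was last serviced. May appear as '1/01/2024' or 'Jan 1, 2024' in source documents."
    }
  }
}
```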

Set enums for fields you know have a certain set of possible values

Enum values help constrain the set of valid values that bem transforms into a given property. At the moment, this isn't a “strict” constraint but generally helps bem understand the intent behind the desired transformation. Think of it as a stronger way to indicate desired output than providing example values in a description.
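
As an illustration, an incident-type field could list its allowed values in `enum` and use the description to explain how free-text inputs map onto them. The field name and the specific values below are hypothetical, and `$comment` only labels the example.

```json
{
  "type": "object",
  "properties": {
    "incidentType": {
      "$comment": "Hypothetical field and values, for illustration only.",
      "type": "string",
      "enum": ["HARD_BRAKING", "SPEEDING", "COLLISION", "OTHER"],
      "description": "Category of the incident. Source systems may write 'harsh brake', 'over speed limit', or 'crash'. Use OTHER when nothing else fits."
    }
  }
}
```

Because the enum is not strictly enforced at the moment, it is still worth validating the value downstream if your system depends on it.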

Example Schema

Below is an example schema showcasing the above best practices that can be used to normalize inputs from a variety of commercial vehicle electronic logging device (ELD) providers.

{
  "type": "object",
  "title": "Fleet Trip Summary",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "required": ["fleetId", "tripSummary", "compliance", "operationalEfficiency"],
  "properties": {
    "fleetId": {
      "type": "string",
      "description": "Unique identifier for the fleet."
    },
    "compliance": {
      "type": "object",
      "required": ["hoursOfServiceCompliance", "notes"],
      "properties": {
        "notes": {
          "type": "string",
          "description": "Additional notes on compliance."
        },
        "hoursOfServiceCompliance": {
          "type": "string",
          "description": "Compliance status with hours of service."
        }
      }
    },
    "tripSummary": {
      "type": "object",
      "required": [
        "tripId",
        "vehicle",
        "driver",
        "start",
        "end",
        "distanceCovered",
        "fuelUsage",
        "incidents"
      ],
      "properties": {
        "end": {
          "type": "object",
          "required": ["time", "location", "odometerEnd"],
          "properties": {
            "time": {
              "type": "string",
              "format": "date-time",
              "description": "End time of the trip."
            },
            "location": {
              "type": "string",
              "description": "End location of the trip."
            },
            "odometerEnd": {
              "type": "integer",
              "description": "Odometer reading at the end of the trip."
            }
          }
        },
        "start": {
          "type": "object",
          "required": ["time", "location", "odometerStart"],
          "properties": {
            "time": {
              "type": "string",
              "format": "date-time",
              "description": "Start time of the trip."
            },
            "location": {
              "type": "string",
              "description": "Start location of the trip."
            },
            "odometerStart": {
              "type": "integer",
              "description": "Odometer reading at the start of the trip."
            }
          }
        },
        "driver": {
          "type": "object",
          "required": ["id", "name"],
          "properties": {
            "id": {
              "type": "string",
              "description": "Unique identifier for the driver."
            },
            "name": {
              "type": "string",
              "description": "Name of the driver."
            }
          }
        },
        "tripId": {
          "type": "string",
          "description": "Unique identifier for the trip."
        },
        "vehicle": {
          "type": "object",
          "required": ["id", "details"],
          "properties": {
            "id": {
              "type": "string",
              "description": "Unique identifier for the vehicle."
            },
            "details": {
              "type": "string",
              "description": "Description of the vehicle including make, model, and year."
            }
          }
        },
        "fuelUsage": {
          "type": "object",
          "required": ["totalGallons", "averagePricePerGallon", "totalCost"],
          "properties": {
            "totalCost": {
              "type": "string",
              "description": "Total cost of fuel."
            },
            "totalGallons": {
              "type": "number",
              "description": "Total gallons of fuel used."
            },
            "averagePricePerGallon": {
              "type": "string",
              "description": "Average price per gallon of fuel."
            }
          }
        },
        "incidents": {
          "type": "array",
          "items": {
            "type": "object",
            "required": ["type", "time", "location", "details"],
            "properties": {
              "time": {
                "type": "string",
                "format": "date-time",
                "description": "Time of the incident."
              },
              "type": {
                "type": "string",
                "description": "Type of incident."
              },
              "details": {
                "type": "string",
                "description": "Detailed description of the incident."
              },
              "location": {
                "type": "string",
                "description": "Location of the incident."
              }
            }
          }
        },
        "distanceCovered": {
          "type": "string",
          "description": "Total distance covered during the trip."
        }
      }
    },
    "operationalEfficiency": {
      "type": "object",
      "required": ["totalEngineHours", "idleTime", "efficiencyRating"],
      "properties": {
        "idleTime": {
          "type": "string",
          "description": "Total idle time during the trip."
        },
        "efficiencyRating": {
          "type": "string",
          "description": "Efficiency rating of the trip."
        },
        "totalEngineHours": {
          "type": "string",
          "description": "Total engine hours for the trip."
        }
      }
    }
  }
}

Avoid positional schemas

Positional schemas rely on array indices to carry meaning — for example, treating rates[0] as "the rate that goes with the first weight." These are brittle:

  • The model has no semantic anchor — it works best when relationships are explicit
  • Small changes in input layout (one extra row, one missing entry) silently corrupt every downstream pairing
  • The intent isn't readable in the schema itself, so future maintainers can't see what's correlated

{
  "type": "object",
  "properties": {
    "rates": {
      "type": "array",
      "description": "Shipping rates",
      "items": {
        "type": "number"
      }
    },
    "weights": {
      "type": "array",
      "description": "Weights for each shipping rate",
      "items": {
        "type": "number"
      }
    }
  }
}

Note that rates and weights are separate arrays, so the only thing tying a rate to its weight is its position. This is brittle and not recommended.

Preferred: Semantic Object Pattern

{
  "type": "object",
  "properties": {
    "shippingRates": {
      "type": "array",
      "description": "Shipping rates with weights",
      "items": {
        "type": "object",
        "properties": {
          "rate": {
            "type": "number",
            "description": "Shipping rate"
          },
          "weight": {
            "type": "number",
            "description": "Weight"
          }
        }
      }
    }
  }
}

Note that rate and weight are directly associated by being in the same object.
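
To make the difference concrete, here is a hypothetical output instance for the semantic schema above. The numbers are invented for illustration and do not come from a real extraction.

```json
{
  "shippingRates": [
    { "rate": 4.95, "weight": 1 },
    { "rate": 7.50, "weight": 5 },
    { "rate": 12.00, "weight": 10 }
  ]
}
```

If the source table gained an extra row or lost an entry, only the affected object would change. With two parallel arrays, the same change would shift every pairing after that point.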

Defaults

Defaults may be provided using the default key. The default value is inserted when no value can be extracted from the input.

{
  "type": "object",
  "properties": {
    "currency": {
      "type": "string",
      "enum": ["USD","EUR","GBP"],
      "default": "USD"
    }
  }
}
