Accuracy

Label, review, and regression-test your function outputs

The Accuracy page is the hub for improving your functions. It shows every function in your environment with headline quality metrics, and exposes three tools per function — Label, Review, and Regression Testing — for different stages of the quality loop.

Function list

  • Filter by function name — the search bar at the top matches on the function's displayName (substring, case-insensitive).
  • Each row shows aggregate quality counters for the function:
    • Labeled Outputs — how many outputs you've manually labeled or corrected.
    • Total Outputs — how many outputs the function has produced.
    • False Negatives / False Positives / True Positives — confusion-matrix style counts computed against your labels.
  • Hovering a row exposes three action buttons: Label, Review, and Regression Testing — clicking any of them jumps into the corresponding sub-tool for that function.
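The name filter and the per-row counters can be sketched in a few lines of Python. This is only an illustration: the FunctionRow class, its field names, and the sample numbers are hypothetical, and precision and recall are the standard ratios derived from the counters, not values this page is documented to display.

```python
from dataclasses import dataclass


@dataclass
class FunctionRow:
    # Hypothetical shape of one row in the function list.
    display_name: str
    labeled_outputs: int
    total_outputs: int
    false_negatives: int
    false_positives: int
    true_positives: int


def filter_rows(rows, query):
    """Case-insensitive substring match on the display name."""
    needle = query.casefold()
    return [r for r in rows if needle in r.display_name.casefold()]


def precision(row):
    """TP / (TP + FP), or None when there are no positive predictions."""
    denom = row.true_positives + row.false_positives
    return row.true_positives / denom if denom else None


def recall(row):
    """TP / (TP + FN), or None when there are no labeled positives."""
    denom = row.true_positives + row.false_negatives
    return row.true_positives / denom if denom else None


# Made-up sample data for illustration only.
rows = [
    FunctionRow("Invoice Extractor", 40, 200, 5, 10, 25),
    FunctionRow("Receipt Parser", 0, 50, 0, 0, 0),
]

matches = filter_rows(rows, "invoice")
print([r.display_name for r in matches])
print(precision(rows[0]), recall(rows[0]))
```

A function with no labels yet, like the second row, has no meaningful precision or recall, which is why the helpers return None instead of dividing by zero.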

Label

The Label tab is a three-pane correction editor for supplying ground-truth values:

  • Transformations list — dropdown of every transformation for the selected function. Use ⌘ ← / → to move between adjacent ones. A filter button narrows the list by date, confidence, or status.
  • Input preview — renders the original input file (PDF, image, etc.) so you can cross-reference values while correcting.
  • Correction editor — the structured JSON output, with per-field confidence badges in the gutter. Fields with low confidence are highlighted so you know what to focus on.
    • Order matching toggle — when on, the editor snaps your edits to the canonical field order from the output schema; when off, it preserves the model's original ordering.
    • Confirm Output (⌘ ↵) — saves your corrections as a label, which then flows into the function's accuracy metrics and becomes available as regression-test data.
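The effect of the Order matching toggle can be pictured with a small sketch. It assumes a flat output and a schema given as a list of field names; apply_order_matching and the sample data are hypothetical, and placing fields that are missing from the schema at the end is an assumption, not documented behavior.

```python
def apply_order_matching(output, schema_order, enabled=True):
    """Return a copy of `output` whose key order follows the toggle.

    enabled=True  -> keys follow the schema's field order
    enabled=False -> keys keep the model's original order
    """
    if not enabled:
        return dict(output)
    # Fields named in the schema come first, in schema order.
    ordered = {key: output[key] for key in schema_order if key in output}
    # Assumption: fields not in the schema are appended in original order.
    for key, value in output.items():
        if key not in ordered:
            ordered[key] = value
    return ordered


# Made-up sample data for illustration only.
model_output = {"total": 118.5, "vendor": "Acme", "date": "2024-03-01"}
schema_order = ["vendor", "date", "total"]

print(list(apply_order_matching(model_output, schema_order, enabled=True)))
print(list(apply_order_matching(model_output, schema_order, enabled=False)))
```

Only the key order changes; the values are the same in both modes.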

Review

The Review tab is a quality dashboard for the selected function:

  • Margin of Error / Confidence Level — statistical bounds for the metrics shown below. Raise the confidence level or shrink the margin for stricter estimates.
  • Function Version / Evaluation Version — control which function version is being evaluated and which LLM-judge version is doing the grading.
  • Is Regression toggle — filter to only the outputs that are part of a regression-testing run (see below).
  • Dataset Overview — counts of total, labeled, and unlabeled transformations, plus a labeling-progress bar.
  • Model Performance — headline PR-AUC (precision-recall area under curve) for the function, with an explanation of how to read it.
  • Confidence Distribution — breakdown of outputs by confidence bucket (High: 80% and above; Medium: 60% up to but not including 80%; Low: below 60%).
  • Run Review — kicks off a fresh judge pass over the dataset, which repopulates the metrics above.
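Two of the ideas above lend themselves to a short sketch: the confidence buckets and the trade-off between margin of error and confidence level. The bucket thresholds come from the list above. The margin-of-error formula is the textbook normal approximation for a proportion and is an assumption here; the page does not state how the product computes its bounds, and both function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def confidence_bucket(score):
    """Map a confidence score in [0, 1] to the buckets listed above."""
    if score >= 0.80:
        return "High"
    if score >= 0.60:
        return "Medium"
    return "Low"


def margin_of_error(p, n, confidence_level=0.95):
    """Normal-approximation margin of error for a proportion.

    Assumption: standard z * sqrt(p * (1 - p) / n); the product may
    use a different method.
    """
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    return z * sqrt(p * (1 - p) / n)


# Made-up confidence scores for illustration only.
scores = [0.95, 0.80, 0.72, 0.60, 0.41]
distribution = Counter(confidence_bucket(s) for s in scores)
print(dict(distribution))

# Same sample size, higher confidence level: the margin gets wider.
print(margin_of_error(0.5, 100, 0.95))
print(margin_of_error(0.5, 100, 0.99))
```

Under this approximation, halving the margin at a fixed confidence level takes roughly four times as many labeled samples, which is why stricter settings generally call for more labeling.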

Regression Testing

Regression Testing lets you compare two versions of the same function against the same set of inputs, so you can see whether a configuration change improved or regressed quality before promoting it.

  • Baseline Version / Comparison Version — dropdowns to pick the two function versions you want to pit against each other.
  • 1. Run Regression Tests — creates regression transformation samples by re-running the baseline version over historical labeled inputs. Click Run Tests to start.
  • 2. Apply Corrections — once samples exist, apply corrections to them either Automatically (using the existing labels) or Manually (by labeling them yourself in the Label tab).
  • 3. Inspect Results — summarizes Baseline Transformations vs Comparison Transformations so you can see exactly which fields improved or regressed. Use Rerun Comparison to refresh after labeling more samples.
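The comparison in the final step can be approximated as a per-field check of both versions against the label. This is a minimal sketch assuming flat outputs and exact-match scoring; compare_fields and the sample data are hypothetical and do not describe how the product scores fields.

```python
def compare_fields(label, baseline, comparison):
    """Classify each labeled field as improved, regressed, or unchanged."""
    results = {}
    for field, expected in label.items():
        base_ok = baseline.get(field) == expected
        comp_ok = comparison.get(field) == expected
        if comp_ok and not base_ok:
            results[field] = "improved"
        elif base_ok and not comp_ok:
            results[field] = "regressed"
        else:
            # Both versions right, or both wrong.
            results[field] = "unchanged"
    return results


# Made-up sample data for illustration only.
label = {"vendor": "Acme", "date": "2024-03-01", "total": 118.5}
baseline_output = {"vendor": "Acme", "date": "2024-03-01", "total": 118.0}
comparison_output = {"vendor": "ACME Inc", "date": "2024-03-01", "total": 118.5}

report = compare_fields(label, baseline_output, comparison_output)
print(report)
```

A change is a safe promotion candidate under this scheme only when no field lands in the "regressed" group across the sample set.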
