
Comparison Metrics

Understanding the results from the Compare Data step

Quick Reference

Field                    What it means
total_compared           Number of rows in the expected data (calculated from your CSV files)
rows_compared            Rows where we found a match in both expected AND extracted data
rows_matched             Rows that matched perfectly (found in both, zero value differences)
rows_with_differences    Rows found in both datasets that have at least one value mismatch
rows_missing             Rows in expected data with no matching row in extracted data
rows_extra               Rows in extracted data with no matching row in expected data
difference_count         Total number of individual value differences (one row can have multiple)
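
To make the definitions above concrete, here is a minimal sketch of how the seven fields could be derived from two keyed datasets. It is an illustration only, not the actual Compare Data implementation: the function name compare_rows, the dict-of-dicts layout, and the sample keys and values are all assumptions made for this example.

```python
def compare_rows(expected, extracted):
    """Hypothetical helper: both inputs are {row_key: {column: value}} dicts."""
    expected_keys = set(expected)
    extracted_keys = set(extracted)
    shared_keys = expected_keys & extracted_keys  # rows found in both datasets

    rows_matched = 0
    rows_with_differences = 0
    difference_count = 0
    for key in shared_keys:
        exp_row, ext_row = expected[key], extracted[key]
        # Count each expected column whose extracted value differs.
        # (Assumption: only columns present in the expected row are compared.)
        mismatches = sum(1 for col in exp_row if exp_row[col] != ext_row.get(col))
        if mismatches:
            rows_with_differences += 1
            difference_count += mismatches
        else:
            rows_matched += 1

    return {
        "total_compared": len(expected),
        "rows_compared": len(shared_keys),
        "rows_matched": rows_matched,
        "rows_with_differences": rows_with_differences,
        "rows_missing": len(expected_keys - extracted_keys),
        "rows_extra": len(extracted_keys - expected_keys),
        "difference_count": difference_count,
    }


# Made-up sample data for illustration.
expected = {
    "Q1": {"yes": 40, "no": 60},
    "Q2": {"yes": 55, "no": 45},
    "Q3": {"yes": 10, "no": 90},
}
extracted = {
    "Q1": {"yes": 40, "no": 60},  # identical to the expected row
    "Q2": {"yes": 50, "no": 50},  # found, but both values differ
    "Q4": {"yes": 70, "no": 30},  # not present in the expected data
}
metrics = compare_rows(expected, extracted)
```

In this sample, Q1 counts toward rows_matched, Q2 toward rows_with_differences (adding 2 to difference_count), Q3 toward rows_missing, and Q4 toward rows_extra.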

Common Questions

What does "rows_missing" mean?

We calculated an expected row from your raw data, but the AI could not find or extract a matching row from the PDF. This typically means:

  • The question or segment is missing from the PDF entirely
  • The AI failed to extract that particular row
  • The row key didn't match (e.g., question text differs slightly)

What does "rows_extra" mean?

The AI extracted a row from the PDF that we didn't expect based on our calculations. This typically means:

  • The PDF contains data we didn't calculate (e.g., different time period)
  • The AI misread something as a data row
  • Our expected data generation is missing something

What's the difference between "rows_with_differences" and "difference_count"?

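rows_with_differences counts rows; difference_count counts individual values. For example, a single row compared across 5 columns could have 3 mismatched values. That row adds 1 to rows_with_differences and 3 to difference_count.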
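
The single-row case can be sketched as follows. The column names and values are invented for illustration, and the variable names are not taken from the actual comparison code.

```python
# One row compared across 5 columns (hypothetical column names and values).
expected_row = {"q1": 10, "q2": 20, "q3": 30, "q4": 40, "q5": 50}
extracted_row = {"q1": 10, "q2": 25, "q3": 30, "q4": 45, "q5": 55}

# Columns whose extracted value differs from the expected value.
mismatched_columns = [
    col for col in expected_row if expected_row[col] != extracted_row[col]
]

# The row counts once toward rows_with_differences, no matter how many
# of its columns are wrong.
rows_with_differences = 1 if mismatched_columns else 0
# Every mismatched column counts toward difference_count.
difference_count = len(mismatched_columns)
```

Here q2, q4, and q5 differ, so the row contributes 1 to rows_with_differences and 3 to difference_count.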

How the Numbers Add Up

The metrics always satisfy this equation:

rows_matched + rows_with_differences + rows_missing = total_compared

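Every expected row falls into exactly one of these three categories, so the counts always add up to the number of rows you expected. rows_extra is not part of the sum, because extra rows appear only in the extracted data.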

Example

If you see these results:

total_compared: 20
rows_matched: 15
rows_with_differences: 3
rows_missing: 2
difference_count: 7

This means: Out of 20 expected rows, 15 matched perfectly, 3 were found but had value differences (totaling 7 individual mismatches), and 2 couldn't be found in the extracted data at all.
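
If you want to sanity-check a set of results yourself, the equation can be verified in a few lines. This is a sketch using the example values above; the helper name totals_add_up is hypothetical and not part of the product.

```python
def totals_add_up(metrics):
    """Return True if matched + with_differences + missing equals total_compared."""
    accounted_for = (
        metrics["rows_matched"]
        + metrics["rows_with_differences"]
        + metrics["rows_missing"]
    )
    return accounted_for == metrics["total_compared"]


# Values copied from the example results above.
results = {
    "total_compared": 20,
    "rows_matched": 15,
    "rows_with_differences": 3,
    "rows_missing": 2,
    "difference_count": 7,
}

is_consistent = totals_add_up(results)

# Each row with differences has at least one mismatch, so difference_count
# can never be smaller than rows_with_differences.
enough_differences = results["difference_count"] >= results["rows_with_differences"]
```

For the example, 15 + 3 + 2 = 20, so the check passes, and 7 individual differences spread across 3 rows is also consistent.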