Stabilizing LLM-Assisted Résumé–Job Description Matching Through Controlled Evaluation
Companion blog post: Improving an AI Résumé–JD Matching System: What I Fixed, Why It Worked, and What Still Needs Testing
Abstract
Large language models (LLMs) are increasingly used in résumé–job description (JD) matching systems to extract requirements, identify evidence, and score candidate alignment. However, without explicit architectural controls, these systems can exhibit non-deterministic behavior that undermines trust, auditability, and fairness.
This paper presents an empirical case study of an LLM-assisted résumé–JD gap analyzer that initially produced unstable and inflated scores under identical inputs. We identify two primary failure modes—criteria drift and evidence misattribution—and demonstrate how freezing evaluation artifacts and enforcing evidence provenance restore deterministic behavior. Results show that architectural constraints, rather than model size, are the primary determinant of evaluation reliability.
1. Introduction
Automated résumé evaluation systems are often treated as decision aids, yet they are rarely evaluated with the rigor applied to other measurement systems. In practice, many AI-assisted matching tools combine multiple LLM stages—requirement extraction, semantic matching, and scoring—without controlling how variability propagates across the pipeline.
This work examines a résumé–JD matching system designed to surface evidence-based alignment signals rather than make hiring decisions. Despite this limited scope, early testing revealed significant score variance for identical inputs, raising concerns about reproducibility and trustworthiness.
The goal of this study is not to optimize predictive accuracy, but to stabilize measurement. We treat the system as an experimental apparatus and apply basic scientific controls to identify and eliminate sources of variance.
2. System Overview
The system evaluates a candidate résumé against a job description using three sequential stages:
- Requirement Extraction – Generate a list of job requirements from the JD
- Evidence Matching – Identify résumé evidence for each requirement
- Scoring – Compute an aggregate match score
The system intentionally avoids:
- Ranking candidates
- Probabilistic scoring
- Weighted requirements
Instead, it operates on binary evidence checks.
3. Evaluation Model
3.1 Job Requirements as Measurement Criteria
Each job description is converted into a finite set of requirements. Each requirement represents a binary proposition:
"The résumé contains evidence satisfying this requirement."
The number of extracted requirements defines the measurement resolution of the system. This count directly determines the denominator of the final score.
3.2 Scoring Function
The match score is computed as:
Score = (Number of matched requirements ÷ Total requirements) × 100
This formulation was chosen deliberately to make the system:
- Transparent
- Auditable
- Easy to reason about
No probabilistic thresholds or confidence scores are introduced at the scoring stage.
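The scoring rule above can be expressed as a small pure function. This is a minimal sketch, not the system's actual code; the function name and guard clauses are illustrative:

```python
def match_score(matched: int, total: int) -> float:
    """Deterministic match score: matched requirements over total, as a percentage.

    No probabilistic thresholds or confidence weights are applied at this stage.
    """
    if total <= 0:
        raise ValueError("the requirements artifact must be non-empty")
    if not 0 <= matched <= total:
        raise ValueError("matched count must lie between 0 and total")
    return matched / total * 100
```

Because the function is pure, identical inputs always yield identical scores; all variability must therefore enter through the numerator or denominator, which is exactly where the failure modes in Section 4 appear.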
4. Observed Failure Modes
4.1 Criteria Drift (Unstable Denominator)
In the initial design, job requirements were extracted independently on each run. Repeated evaluations of the same JD produced between 15 and 18 requirements, depending on model phrasing and semantic grouping.
This produced scores ranging from 12% to 27% for identical résumés and JDs.
Importantly, this variance did not originate from résumé interpretation but from shifting evaluation criteria. The system effectively changed the test each time it was run.
4.2 Evidence Misattribution (Inflated Numerator)
After introducing frozen requirements, the system was evaluated using a smaller LLM to reduce cost and rate-limit pressure. Under this configuration, match scores jumped to 81%.
Manual inspection revealed that the model was echoing job description language and returning it as résumé evidence. Because the system only verified that evidence fields were non-empty—not that they originated from the résumé—these hallucinated matches were accepted.
This misattribution occurred primarily with the smaller model. The larger model, using the same frozen artifact, continued to produce scores in the 31–38% range.
5. Corrective Controls
To address these failure modes, two architectural controls were introduced.
5.1 Frozen Job Requirements Artifacts
Each job description is now processed once to produce a Job Requirements Artifact, which includes:
- A fixed list of requirements
- A version identifier
- A content hash
All subsequent evaluations reuse this artifact. This freezes the denominator and eliminates criteria drift.
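A minimal sketch of artifact creation follows. The function name, normalization rule, and field names are assumptions for illustration; the paper specifies only that the artifact carries a fixed list, a version identifier, and a content hash:

```python
import hashlib
import json

def freeze_requirements(requirements: list[str], version: str = "v1") -> dict:
    """Build a frozen Job Requirements Artifact.

    Requirements are normalized and de-duplicated once; the content hash
    makes any later mutation of the list detectable.
    """
    normalized = sorted({" ".join(r.split()).lower() for r in requirements})
    payload = json.dumps(normalized, ensure_ascii=False)
    return {
        "version": version,
        "requirements": normalized,
        "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
```

Sorting before hashing makes the hash independent of extraction order, so two runs that extract the same requirements in different order produce the same artifact.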
5.2 Evidence Provenance Validation
A strict provenance rule was added:
All evidence must appear verbatim in the résumé text.
Any quoted evidence not found in the résumé is invalidated and excluded from scoring. This prevents job description text or model-generated paraphrases from inflating the numerator.
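The provenance rule reduces to a single substring check. In this sketch, whitespace is collapsed before comparison, an assumption made here to tolerate PDF line wrapping; all other characters must match exactly:

```python
def evidence_in_resume(quote: str, resume_text: str) -> bool:
    """Return True only if the quoted evidence appears verbatim in the résumé.

    Whitespace is collapsed on both sides before comparison; no paraphrase
    or semantic similarity is accepted.
    """
    def collapse(s: str) -> str:
        return " ".join(s.split())
    return collapse(quote) in collapse(resume_text)
```

Job description text echoed by the model fails this check unless it happens to appear verbatim in the résumé, which is what blocks the misattribution failure mode.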
6. Experimental Validation
6.1 Repeatability Testing
To validate the controls, a repeatability harness was introduced:
- Same résumé
- Same job description
- Same frozen requirements artifact
- Ten consecutive runs
After applying both controls:
- All ten runs produced the same validated match score
- The same set of validated matches was returned
- Results were identical across both small (8B) and large (70B) models
- The final validated score for the evaluated role was 20%.
Raw LLM outputs continued to vary, but post-validation results were deterministic.
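The repeatability harness can be sketched as follows. The interface is hypothetical; it assumes the pipeline is wrapped in a zero-argument callable that returns a hashable validated result:

```python
def is_repeatable(evaluate, runs: int = 10) -> bool:
    """Call the full pipeline `runs` times with identical inputs and
    check that the validated output never changes.

    `evaluate` is a zero-argument callable returning a hashable result,
    e.g. (validated_score, frozenset of matched requirement IDs).
    """
    return len({evaluate() for _ in range(runs)}) == 1
```

Only the validated output is compared; raw LLM outputs are allowed to vary between runs, matching the acceptance criteria in Appendix F.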
6.2 Summary of Results
Phase and score range:
- Pre-controls: 12–27%
- Frozen requirements (70B): 31–38%
- Misattribution phase (8B): 81%
- Post-controls (validated): 20%
7. Discussion
7.1 Model Size vs. System Design
This study demonstrates that model size alone does not guarantee reliability. Smaller models are more prone to shortcut behaviors, but these behaviors only become problematic when system-level guardrails are absent.
Once constraints were enforced, both small and large models behaved consistently.
7.2 LLMs as Signal Generators, Not Judges
The system performs best when LLMs are treated as signal extractors rather than evaluators. Deterministic scoring must be handled by explicit logic external to the model.
7.3 Implications for AI-Assisted Evaluation Systems
Any AI system used to evaluate people—whether for hiring, admissions, or eligibility—must prioritize:
- Stable criteria
- Verifiable evidence
- Repeatable outcomes
Without these properties, outputs may appear intelligent while remaining scientifically unsound.
8. Threats to Validity
Several limitations remain:
- Requirement quality may vary across industries and JD writing styles
- Résumé formatting and OCR artifacts may affect evidence detection
- Human alignment studies have not yet been conducted at scale
These represent areas for future work rather than flaws that undermine the present results.
9. Conclusion
This work shows that instability in LLM-assisted evaluation systems often arises from architectural design choices rather than model capability. By freezing evaluation criteria and enforcing evidence provenance, deterministic and auditable behavior can be restored.
The central lesson is simple:
Trustworthy AI systems are built through controls, not confidence scores.
Figures
Figure 1 — Evaluation Pipeline Overview
Figure 1. End-to-end résumé–job description evaluation pipeline with control points highlighted.
Job Description (JD)
│
▼
┌─────────────────────────┐
│ Requirement Extraction │
│ (LLM-assisted, once) │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Job Requirements │
│ Artifact (Frozen) │
│ - Requirement list │
│ - Version │
│ - Hash │
└─────────────────────────┘
│
│ reused across runs
▼
┌─────────────────────────┐
│ Evidence Matching │
│ (LLM-assisted) │
│ - Propose evidence │
│ - Quote candidate text │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Evidence Validation │◄─── CONTROL POINT
│ - Verbatim résumé check │
│ - Invalid quote removal │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Deterministic Scoring │
│ matched ÷ total × 100 │
└─────────────────────────┘
│
▼
Match Score + Gaps
Key design principle: All nondeterminism is isolated to proposal stages. Acceptance and scoring are deterministic.
Figure 2 — Match Score Calculation (Numerator / Denominator)
Figure 2. Deterministic scoring model used for résumé–JD alignment.
Total Job Requirements (Denominator)
┌─────────────────────────────────────┐
│ REQ-01 │ Python experience │
│ REQ-02 │ Production systems │
│ REQ-03 │ Regulated environments │
│ ... (26 total) │
│ REQ-26 │ Cross-functional work │
└─────────────────────────────────────┘
▲
│ fixed, frozen
│
Matched Requirements (Numerator)
┌─────────────────────────────────────┐
│ REQ-01 ✔ Evidence present │
│ REQ-02 ✖ No evidence │
│ REQ-03 ✔ Evidence present │
│ ... │
│ REQ-26 ✖ No evidence │
└─────────────────────────────────────┘
Score = (Number of ✔) ÷ 26 × 100
Important properties: The denominator is fixed by the artifact. The numerator can only increase with verified résumé evidence. No weighting, interpolation, or probabilistic adjustment.
Figure 3 — Failure Mode Illustration (Pre-Fix vs Post-Fix)
Figure 3. How instability arises without controls and how it is eliminated.
PRE-FIX (Uncontrolled)
- Run 1: 15 requirements → 2 matches → 13%
- Run 2: 18 requirements → 2 matches → 11%
- Run 3: 15 requirements → 4 matches → 27%
POST-FIX (Controlled)
- Run 1–10: 26 requirements → 5 matches → 20%
Interpretation: The earlier variance was caused by changing measurement criteria, not résumé interpretation.
Methods Appendix (For Reviewers)
Appendix A — Experimental Setup
A.1 Inputs
- Job Descriptions: User-provided JDs (PDF or text)
- Résumés: User-provided résumés (PDF or text)
- Models Evaluated: Large model (70B), Small model (8B)
Each experiment fixes: JD content, Résumé content, Requirements artifact version.
Appendix B — Requirement Extraction Protocol
A single extraction pass is performed per JD. Extracted requirements are normalized, de-duplicated, and serialized into an artifact. The artifact is assigned a content hash and version identifier. All subsequent runs reuse this artifact without modification. This ensures a fixed denominator across evaluations.
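Before any run reuses a stored artifact, its integrity can be verified against the recorded hash. This sketch assumes the artifact layout used in Section 5.1 (a `requirements` list and a `hash` field); the function name is illustrative:

```python
import hashlib
import json

def artifact_is_intact(artifact: dict) -> bool:
    """Recompute the content hash of a stored artifact and compare it to
    the recorded one, so a mutated requirements list is rejected before
    any evaluation reuses it.
    """
    payload = json.dumps(artifact["requirements"], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest() == artifact["hash"]
```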
Appendix C — Evidence Matching Protocol
For each requirement the model proposes one or more evidence quotes. Quotes are treated as hypotheses, not facts. No evidence is accepted at this stage.
Appendix D — Evidence Provenance Validation
Each proposed quote must satisfy all of the following:
- Appears verbatim in the résumé text
- Meets a minimum length threshold
- Is associated with the correct requirement ID
If any condition fails, the requirement is marked matched = false and all associated evidence is discarded. This validation step is deterministic and model-agnostic.
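The three conditions combine into one deterministic validator. In this sketch, MIN_QUOTE_LEN is an assumed value and the proposal field names are hypothetical; the paper specifies only that a minimum length threshold exists:

```python
MIN_QUOTE_LEN = 12  # assumed threshold; the paper does not fix a value

def accept_quote(proposal: dict, expected_req_id: str, resume_text: str) -> bool:
    """Apply the three deterministic provenance conditions:
    verbatim presence, minimum length, and correct requirement ID.
    """
    quote = proposal.get("quote", "")
    return (
        quote in resume_text              # condition 1: verbatim in résumé
        and len(quote) >= MIN_QUOTE_LEN   # condition 2: minimum length
        and proposal.get("requirement_id") == expected_req_id  # condition 3
    )
```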
Appendix E — Scoring Function
The final score is computed as:
Score = (Validated matched requirements ÷ Total requirements) × 100
Where: Total requirements = size of the frozen artifact; Matched requirements = count of validated matches only.
Appendix F — Repeatability Testing
F.1 Protocol: Same JD, same résumé, same requirements artifact, ten consecutive runs.
F.2 Acceptance Criteria: Identical validated match score across runs; identical set of validated requirement IDs; zero invalid evidence accepted. Raw LLM output variability is permitted; validated output variability is not.
Appendix G — Threats to Validity
JD writing quality may affect requirement granularity. Résumé formatting and OCR artifacts may reduce evidence recall. Human alignment studies have not yet been conducted. These limitations are acknowledged and scoped for future work.
Reviewer Note
This system does not claim predictive validity for hiring outcomes. It is explicitly scoped as a measurement and signal-extraction system, not a decision-making tool.
