Stabilizing LLM-Assisted Résumé–Job Description Matching Through Controlled Evaluation
Companion blog post: Improving an AI Résumé–JD Matching System: What I Fixed, Why It Worked, and What Still Needs Testing
Abstract
Large language models (LLMs) are increasingly used in résumé–job description (JD) matching systems to extract requirements, identify evidence, and score candidate alignment. However, without explicit architectural controls, these systems can exhibit non-deterministic behavior that undermines trust, auditability, and fairness.
This paper presents an empirical case study of an LLM-assisted résumé–JD gap analyzer that initially produced unstable and inflated scores under identical inputs. We identify two primary failure modes—criteria drift and evidence misattribution—and demonstrate how freezing evaluation artifacts and enforcing evidence provenance restore deterministic behavior. Results show that architectural constraints, rather than model size, are the primary determinant of evaluation reliability.
1. Introduction
Automated résumé evaluation systems are often treated as decision aids, yet they are rarely evaluated with the rigor applied to other measurement systems. In practice, many AI-assisted matching tools combine multiple LLM stages—requirement extraction, semantic matching, and scoring—without controlling how variability propagates across the pipeline.
This work examines a résumé–JD matching system designed to surface evidence-based alignment signals rather than make hiring decisions. Despite this limited scope, early testing revealed significant score variance for identical inputs, raising concerns about reproducibility and trustworthiness.
The goal of this study is not to optimize predictive accuracy, but to stabilize measurement. We treat the system as an experimental apparatus and apply basic scientific controls to identify and eliminate sources of variance.
2. System Overview
The system evaluates a candidate résumé against a job description using three sequential stages:
- Requirement Extraction – Generate a list of job requirements from the JD
- Evidence Matching – Identify résumé evidence for each requirement
- Scoring – Compute an aggregate match score
The system intentionally avoids:
- Ranking candidates
- Probabilistic scoring
- Weighted requirements
Instead, it operates on binary evidence checks.
3. Evaluation Model
3.1 Job Requirements as Measurement Criteria
Each job description is converted into a finite set of requirements. Each requirement represents a binary proposition:
"The résumé contains evidence satisfying this requirement."
The number of extracted requirements defines the measurement resolution of the system. This count directly determines the denominator of the final score.
3.2 Scoring Function
The match score is computed as:
Score = (Number of matched requirements ÷ Total requirements) × 100
This formulation was chosen deliberately to make the system:
- Transparent
- Auditable
- Easy to reason about
No probabilistic thresholds or confidence scores are introduced at the scoring stage.
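The scoring rule above can be expressed as a small pure function. This is a minimal sketch, not the system's actual code; the function name and guard clauses are illustrative:

```python
def match_score(matched: int, total: int) -> float:
    """Deterministic match score: matched requirements over total, as a percentage.

    No probabilistic thresholds or confidence weights are applied at this stage.
    """
    if total <= 0:
        raise ValueError("the requirements artifact must be non-empty")
    if not 0 <= matched <= total:
        raise ValueError("matched count must lie between 0 and total")
    return matched / total * 100
```

Because the function is pure, identical inputs always yield identical scores; all variability must therefore enter through the numerator or denominator, which is exactly where the failure modes in Section 4 appear.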
4. Observed Failure Modes
4.1 Criteria Drift (Unstable Denominator)
In the initial design, job requirements were extracted independently on each run. Repeated evaluations of the same JD produced between 15 and 18 requirements, depending on model phrasing and semantic grouping.
This produced scores ranging from 12% to 27% for identical résumés and JDs.
Importantly, this variance did not originate from résumé interpretation but from shifting evaluation criteria. The system effectively changed the test each time it was run.
4.2 Evidence Misattribution (Inflated Numerator)
After introducing frozen requirements, the system was evaluated using a smaller LLM to reduce cost and rate-limit pressure. Under this configuration, match scores jumped to 81%.
Manual inspection revealed that the model was echoing job description language and returning it as résumé evidence. Because the system only verified that evidence fields were non-empty—not that they originated from the résumé—these hallucinated matches were accepted.
This misattribution occurred primarily with the smaller model. The larger model, using the same frozen artifact, continued to produce scores in the 31–38% range.
5. Corrective Controls
To address these failure modes, two architectural controls were introduced.
5.1 Frozen Job Requirements Artifacts
Each job description is now processed once to produce a Job Requirements Artifact, which includes:
- A fixed list of requirements
- A version identifier
- A content hash
All subsequent evaluations reuse this artifact. This freezes the denominator and eliminates criteria drift.
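A minimal sketch of artifact creation follows. The function name, normalization rule, and field names are assumptions for illustration; the paper specifies only that the artifact carries a fixed list, a version identifier, and a content hash:

```python
import hashlib
import json

def freeze_requirements(requirements: list[str], version: str = "v1") -> dict:
    """Build a frozen Job Requirements Artifact.

    Requirements are normalized and de-duplicated once; the content hash
    makes any later mutation of the list detectable.
    """
    normalized = sorted({" ".join(r.split()).lower() for r in requirements})
    payload = json.dumps(normalized, ensure_ascii=False)
    return {
        "version": version,
        "requirements": normalized,
        "hash": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
```

Sorting before hashing makes the hash independent of extraction order, so two runs that extract the same requirements in different order produce the same artifact.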
5.2 Evidence Provenance Validation
A strict provenance rule was added:
All evidence must appear verbatim in the résumé text.
Any quoted evidence not found in the résumé is invalidated and excluded from scoring. This prevents job description text or model-generated paraphrases from inflating the numerator.
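The provenance rule reduces to a single substring check. In this sketch, whitespace is collapsed before comparison, an assumption made here to tolerate PDF line wrapping; all other characters must match exactly:

```python
def evidence_in_resume(quote: str, resume_text: str) -> bool:
    """Return True only if the quoted evidence appears verbatim in the résumé.

    Whitespace is collapsed on both sides before comparison; no paraphrase
    or semantic similarity is accepted.
    """
    def collapse(s: str) -> str:
        return " ".join(s.split())
    return collapse(quote) in collapse(resume_text)
```

Job description text echoed by the model fails this check unless it happens to appear verbatim in the résumé, which is what blocks the misattribution failure mode.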
6. Experimental Validation
6.1 Repeatability Testing
To validate the controls, a repeatability harness was introduced:
- Same résumé
- Same job description
- Same frozen requirements artifact
- Ten consecutive runs
After applying both controls:
- All ten runs produced the same validated match score
- The same set of validated matches was returned
- Results were identical across both small (8B) and large (70B) models
- The final validated score for the evaluated role was 20%.
Raw LLM outputs continued to vary, but post-validation results were deterministic.
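The repeatability harness can be sketched as follows. The interface is hypothetical; it assumes the pipeline is wrapped in a zero-argument callable that returns a hashable validated result:

```python
def is_repeatable(evaluate, runs: int = 10) -> bool:
    """Call the full pipeline `runs` times with identical inputs and
    check that the validated output never changes.

    `evaluate` is a zero-argument callable returning a hashable result,
    e.g. (validated_score, frozenset of matched requirement IDs).
    """
    return len({evaluate() for _ in range(runs)}) == 1
```

Only the validated output is compared; raw LLM outputs are allowed to vary between runs, matching the acceptance criteria in Appendix F.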
6.2 Summary of Results
Phase and score range:
- Pre-controls: 12–27%
- Frozen requirements (70B): 31–38%
- Misattribution phase (8B): 81%
- Post-controls (validated): 20%
7. Discussion
7.1 Model Size vs. System Design
This study demonstrates that model size alone does not guarantee reliability. Smaller models are more prone to shortcut behaviors, but these behaviors only become problematic when system-level guardrails are absent.
Once constraints were enforced, both small and large models behaved consistently.
7.2 LLMs as Signal Generators, Not Judges
The system performs best when LLMs are treated as signal extractors rather than evaluators. Deterministic scoring must be handled by explicit logic external to the model.
7.3 Implications for AI-Assisted Evaluation Systems
Any AI system used to evaluate people—whether for hiring, admissions, or eligibility—must prioritize:
- Stable criteria
- Verifiable evidence
- Repeatable outcomes
Without these properties, outputs may appear intelligent while remaining scientifically unsound.
8. Threats to Validity
Several limitations remain:
- Requirement quality may vary across industries and JD writing styles
- Résumé formatting and OCR artifacts may affect evidence detection
- Human alignment studies have not yet been conducted at scale
These represent areas for future work rather than flaws that undermine the present results.
9. Conclusion
This work shows that instability in LLM-assisted evaluation systems often arises from architectural design choices rather than model capability. By freezing evaluation criteria and enforcing evidence provenance, deterministic and auditable behavior can be restored.
The central lesson is simple:
Trustworthy AI systems are built through controls, not confidence scores.
Figures
Figure 1 — Evaluation Pipeline Overview
Figure 1. End-to-end résumé–job description evaluation pipeline with control points highlighted.
Job Description (JD)
│
▼
┌─────────────────────────┐
│ Requirement Extraction │
│ (LLM-assisted, once) │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Job Requirements │
│ Artifact (Frozen) │
│ - Requirement list │
│ - Version │
│ - Hash │
└─────────────────────────┘
│
│ reused across runs
▼
┌─────────────────────────┐
│ Evidence Matching │
│ (LLM-assisted) │
│ - Propose evidence │
│ - Quote candidate text │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Evidence Validation │◄─── CONTROL POINT
│ - Verbatim résumé check │
│ - Invalid quote removal │
└─────────────────────────┘
│
▼
┌─────────────────────────┐
│ Deterministic Scoring │
│ matched ÷ total × 100 │
└─────────────────────────┘
│
▼
Match Score + Gaps
Key design principle: All nondeterminism is isolated to proposal stages. Acceptance and scoring are deterministic.
Figure 2 — Match Score Calculation (Numerator / Denominator)
Figure 2. Deterministic scoring model used for résumé–JD alignment.
Total Job Requirements (Denominator)
┌─────────────────────────────────────┐
│ REQ-01 │ Python experience │
│ REQ-02 │ Production systems │
│ REQ-03 │ Regulated environments │
│ ... (26 total) │
│ REQ-26 │ Cross-functional work │
└─────────────────────────────────────┘
▲
│ fixed, frozen
│
Matched Requirements (Numerator)
┌─────────────────────────────────────┐
│ REQ-01 ✔ Evidence present │
│ REQ-02 ✖ No evidence │
│ REQ-03 ✔ Evidence present │
│ ... │
│ REQ-26 ✖ No evidence │
└─────────────────────────────────────┘
Score = (Number of ✔) ÷ 26 × 100
Important properties: The denominator is fixed by the artifact. The numerator can only increase with verified résumé evidence. No weighting, interpolation, or probabilistic adjustment.
Figure 3 — Failure Mode Illustration (Pre-Fix vs Post-Fix)
Figure 3. How instability arises without controls and how it is eliminated.
PRE-FIX (Uncontrolled)
- Run 1: 15 requirements → 2 matches → 13%
- Run 2: 18 requirements → 2 matches → 11%
- Run 3: 15 requirements → 4 matches → 27%
POST-FIX (Controlled)
- Run 1–10: 26 requirements → 5 matches → 20%
Interpretation: The earlier variance was caused by changing measurement criteria, not résumé interpretation.
Methods Appendix (For Reviewers)
Appendix A — Experimental Setup
A.1 Inputs
- Job Descriptions: User-provided JDs (PDF or text)
- Résumés: User-provided résumés (PDF or text)
- Models Evaluated: Large model (70B), Small model (8B)
Each experiment fixes: JD content, Résumé content, Requirements artifact version.
Appendix B — Requirement Extraction Protocol
A single extraction pass is performed per JD. Extracted requirements are normalized, de-duplicated, and serialized into an artifact. The artifact is assigned a content hash and version identifier. All subsequent runs reuse this artifact without modification. This ensures a fixed denominator across evaluations.
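Before any run reuses a stored artifact, its integrity can be verified against the recorded hash. This sketch assumes the artifact layout used in Section 5.1 (a `requirements` list and a `hash` field); the function name is illustrative:

```python
import hashlib
import json

def artifact_is_intact(artifact: dict) -> bool:
    """Recompute the content hash of a stored artifact and compare it to
    the recorded one, so a mutated requirements list is rejected before
    any evaluation reuses it.
    """
    payload = json.dumps(artifact["requirements"], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest() == artifact["hash"]
```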
Appendix C — Evidence Matching Protocol
For each requirement the model proposes one or more evidence quotes. Quotes are treated as hypotheses, not facts. No evidence is accepted at this stage.
Appendix D — Evidence Provenance Validation
Each proposed quote must satisfy all of the following:
- Appears verbatim in the résumé text
- Meets a minimum length threshold
- Is associated with the correct requirement ID
If any condition fails, the requirement is marked matched = false and all associated evidence is discarded. This validation step is deterministic and model-agnostic.
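The three conditions combine into one deterministic validator. In this sketch, MIN_QUOTE_LEN is an assumed value and the proposal field names are hypothetical; the paper specifies only that a minimum length threshold exists:

```python
MIN_QUOTE_LEN = 12  # assumed threshold; the paper does not fix a value

def accept_quote(proposal: dict, expected_req_id: str, resume_text: str) -> bool:
    """Apply the three deterministic provenance conditions:
    verbatim presence, minimum length, and correct requirement ID.
    """
    quote = proposal.get("quote", "")
    return (
        quote in resume_text              # condition 1: verbatim in résumé
        and len(quote) >= MIN_QUOTE_LEN   # condition 2: minimum length
        and proposal.get("requirement_id") == expected_req_id  # condition 3
    )
```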
Appendix E — Scoring Function
The final score is computed as:
Score = (Validated matched requirements ÷ Total requirements) × 100
Where: Total requirements = size of the frozen artifact; Matched requirements = count of validated matches only.
Appendix F — Repeatability Testing
F.1 Protocol: Same JD, same résumé, same requirements artifact, ten consecutive runs.
F.2 Acceptance Criteria: Identical validated match score across runs; identical set of validated requirement IDs; zero invalid evidence accepted. Raw LLM output variability is permitted; validated output variability is not.
Appendix G — Threats to Validity
JD writing quality may affect requirement granularity. Résumé formatting and OCR artifacts may reduce evidence recall. Human alignment studies have not yet been conducted. These limitations are acknowledged and scoped for future work.
Reviewer Note
This system does not claim predictive validity for hiring outcomes. It is explicitly scoped as a measurement and signal-extraction system, not a decision-making tool.
