Science / Explainable AI
No Black Boxes. No Hidden Logic.
When someone asks "why did this candidate get this score?" — you have an answer. Every LayersRank evaluation traces from final score back to specific evidence. See exactly what the models saw, how they weighted it, and why the number is what it is.
The Black Box Problem
Most AI hiring tools work like this:
Candidate data goes in. A number comes out. Nobody knows what happened in between.
The vendor might say “our proprietary algorithm” or “machine learning model” or “neural network trained on millions of data points.” But ask them to explain why Candidate A scored 74 and Candidate B scored 71, and they can’t tell you. Not won’t — can’t. The model is opaque even to its creators.
This creates serious problems.
Legal Risk
Employment decisions must be defensible. When a rejected candidate files a complaint, you need to explain the basis for the decision. “Our AI said no” is not a defense. Courts and regulators want to know what criteria were applied and why this candidate didn’t meet them.
Bias Concealment
Black-box models can encode biases invisibly. A model trained on historical hiring data might learn that certain names, schools, or speech patterns correlate with past decisions — and perpetuate those patterns without anyone knowing. You can’t audit what you can’t see.
No Path to Improvement
When a black-box model makes mistakes, you can’t fix them. You don’t know why it made the decision, so you don’t know what to change. Should you add more training data? Change a feature? Adjust a weight? Without visibility, improvement becomes trial and error.
Candidate Distrust
Candidates increasingly ask how they were evaluated. “An AI scored you” without further explanation feels arbitrary and unfair — especially for candidates who were rejected. Providing meaningful feedback requires understanding what the evaluation measured.
The LayersRank Approach
LayersRank is explainable by design, not as an afterthought.
We don’t use end-to-end neural networks that consume raw data and produce scores. We use a structured pipeline where each step is interpretable:
Response Capture
Candidate answers are transcribed and stored
Component Scoring
Multiple interpretable models score specific aspects
Aggregation
Component scores combine via transparent weighted formulas
Uncertainty Quantification
Fuzzy logic produces confidence levels
Dimension Rollup
Question scores aggregate to dimension scores
Final Score
Dimension scores aggregate to overall assessment
At every step, inputs and outputs are visible. The logic connecting them is documented. The whole chain is auditable.
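For readers who think in code, here is a minimal sketch of that pipeline shape. Every function name, stub value, and formula below is an illustrative assumption rather than the production implementation; the point is simply that each stage is an ordinary, inspectable function.

```python
# Sketch of the six-stage pipeline above. Names, stub scores, and formulas are
# illustrative assumptions, not the production implementation.

def evaluate(transcripts: list[str]) -> dict:
    # 1. Response capture: transcripts arrive already transcribed and stored.
    # 2. Component scoring: stubbed interpretable models, each scoring one aspect (0-1).
    component_scores = [
        {"semantic": 0.87, "lexical": 0.81, "llm": 0.86, "relevance": 0.89}
        for _ in transcripts
    ]
    # 3. Aggregation: transparent mean of component scores, scaled to 0-100.
    question_scores = [100 * sum(c.values()) / len(c) for c in component_scores]
    # 4. Uncertainty quantification: confidence falls as the component models disagree.
    confidences = []
    for c in component_scores:
        mean = sum(c.values()) / len(c)
        spread = (sum((v - mean) ** 2 for v in c.values()) / len(c)) ** 0.5
        confidences.append(round(100 * (1 - spread)))
    # 5-6. Dimension rollup and final score: weighted sums of question scores.
    weights = [1 / len(transcripts)] * len(transcripts)
    final = sum(s * w for s, w in zip(question_scores, weights))
    return {"final_score": round(final),
            "question_scores": question_scores,
            "confidences": confidences}

print(evaluate(["transcript for Q4 ..."]))
```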
Complete Walkthrough
Tracing a Score: Complete Example
Let’s trace through exactly how a candidate score is derived.
Candidate
Priya — Senior Backend Engineer
Final Score
Technical: 82, 91% confidence
How did we get there?
Level 1
Dimension Score
The Technical dimension score (82) aggregates from individual question scores:
| Question | Type | Weight | Score | Confidence |
|---|---|---|---|---|
| Q4: System Design | Video | 30% | 85 | 94% |
| Q5: Debugging | Video | 25% | 81 | 89% |
| Q6: Technical Depth | Text | 25% | 79 | 88% |
| Q7: Trade-offs | Text | 20% | 83 | 93% |
Weighted Calculation
(85 × 0.30) + (81 × 0.25) + (79 × 0.25) + (83 × 0.20) = 25.5 + 20.25 + 19.75 + 16.6 = 82.1 → 82
Confidence: min(94, 89, 88, 93), adjusted upward for multiple confirming signals = 91%
Audit Point: You can see exactly which questions contributed and how much weight each carried.
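The same rollup takes a few lines of Python, using the weights, scores, and confidences from the table above. The upward confidence adjustment for agreeing signals is a separate rule not reproduced here.

```python
# Level 1 rollup, reproduced directly from the table above.
questions = [
    # (question, weight, score, confidence)
    ("Q4: System Design",   0.30, 85, 94),
    ("Q5: Debugging",       0.25, 81, 89),
    ("Q6: Technical Depth", 0.25, 79, 88),
    ("Q7: Trade-offs",      0.20, 83, 93),
]

dimension_score = sum(w * s for _, w, s, _ in questions)
floor_confidence = min(c for _, _, _, c in questions)

print(round(dimension_score))   # 82
print(floor_confidence)         # 88, adjusted upward to 91 when signals agree
```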
Level 2
Question Score
Let’s drill into Q4: System Design, which scored 85.
The Question
“Walk through how you’d design a notification service handling 10 million daily users. Consider delivery guarantees, scale, and failure scenarios.”
The Response (summarized)
Candidate proposed multi-tier architecture with separate ingestion, processing, and delivery layers. Discussed WebSocket for real-time vs. batch for email. Addressed failure modes with dead-letter queues. Quantified throughput estimates.
Model Evaluations
| Model | Score | Rationale |
|---|---|---|
| Semantic Similarity | 0.87 | High match with reference strong answers on architecture patterns |
| Lexical Alignment | 0.81 | Appropriate terminology (dead-letter queue, horizontal sharding, etc.) |
| LLM Reasoning | 0.86 | Clear logical structure, unprompted failure consideration, quantified reasoning |
| Relevance | 0.89 | Directly addressed all three prompt components |
Agreement Analysis
Scores: [0.87, 0.81, 0.86, 0.89]
Std Dev: 0.03 (low)
Refusal (R): 0.12
Models agree strongly
Score Derivation
Aggregate signal: 0.86
Scaled to 0-100: 86
Adjusted for confidence: 85
Confidence: 94%
Audit Point: You can see each model’s contribution and why they agreed.
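A sketch of that agreement check, using the four model scores above. The mean and standard deviation come straight from the worked example; the final confidence adjustment from 86 to 85 is applied by a rule not reproduced here.

```python
# Level 2 agreement analysis for Q4.
model_scores = {"semantic": 0.87, "lexical": 0.81, "llm_reasoning": 0.86, "relevance": 0.89}

values = list(model_scores.values())
mean = sum(values) / len(values)                                        # 0.8575 -> "0.86"
std_dev = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5   # ~0.03

print(round(mean, 2), round(std_dev, 2))   # 0.86 0.03 -- low spread, models agree strongly
print(round(100 * mean))                   # 86, before the confidence-based adjustment to 85
```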
Level 3
Model Rationale
Let’s drill into why the LLM Reasoning model scored 8.6/10.
Model Prompt (simplified)
“Evaluate this response to a system design question. Score 1–10 on: logical structure, depth of analysis, consideration of trade-offs, handling of failure scenarios.”
Model Output
Logical Structure
9/10. Response follows a clear progression: requirements clarification → high-level architecture → component details → scale considerations → failure handling. Easy to follow.
Depth of Analysis
8/10. Good depth on delivery-layer trade-offs. Quantified throughput estimates show practical experience. Could have gone deeper on data model and consistency guarantees.
Trade-off Consideration
9/10. Explicitly discussed WebSocket vs. batch trade-offs. Acknowledged the latency vs. reliability tension. Unprompted consideration of eventual consistency.
Failure Scenarios
8/10. Mentioned dead-letter queues and retry logic. Could have addressed cascading failures or circuit breakers. Good but not exceptional.
Overall
8.6/10. Strong response demonstrating practical system design experience.
Audit Point: You can see exactly what the model evaluated and why it gave each sub-score.
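One way the four sub-scores roll up to 8.6 is an unequal rubric weighting. The weights in this sketch are purely illustrative assumptions, not the production rubric; an equal-weight average of (9, 8, 9, 8) would give 8.5.

```python
# Hypothetical rubric weighting -- weights are assumptions for illustration only.
rubric = {
    "logical_structure": (9, 0.30),
    "depth_of_analysis": (8, 0.20),
    "trade_offs":        (9, 0.30),
    "failure_scenarios": (8, 0.20),
}
overall = sum(score * weight for score, weight in rubric.values())
print(round(overall, 1))   # 8.6
```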
Level 4
Reference Comparisons
The Semantic Similarity model (0.87) compares the response against reference responses. Which references?
Reference Set for System Design Questions
- 15 curated strong responses from validated high-performers
- Embedding vectors stored for each reference
- New responses compared via cosine similarity to reference set
- Score = average similarity to top-5 closest references
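A sketch of that comparison step: cosine similarity against stored reference embeddings, averaged over the five closest matches. The embedding dimensions and reference vectors below are random placeholders, not the curated production references.

```python
# Reference-comparison sketch: mean cosine similarity to the top-5 closest references.
import numpy as np

def semantic_similarity_score(response_vec: np.ndarray, reference_vecs: np.ndarray) -> float:
    """Score = mean cosine similarity to the 5 closest reference responses."""
    response_vec = response_vec / np.linalg.norm(response_vec)
    reference_vecs = reference_vecs / np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    similarities = reference_vecs @ response_vec     # cosine similarity per reference
    top5 = np.sort(similarities)[-5:]
    return float(top5.mean())

# Placeholder usage: 15 references, 768-dim embeddings (values are random, so the
# printed score is meaningless except to show the mechanics).
rng = np.random.default_rng(0)
references = rng.normal(size=(15, 768))
response = rng.normal(size=768)
print(round(semantic_similarity_score(response, references), 2))
```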
Specific Match Analysis
Candidate response was most similar to:
Reference #7
Also proposed tiered architecture with similar component breakdown
Reference #3
Also emphasized failure handling with queue-based recovery
Reference #11
Also quantified scale estimates
The 0.87 score reflects strong alignment with known-good responses.
Audit Point: You can see what “good” looks like and how the candidate compared.
Impact
What Explainability Enables
Compliant Decision-Making
Documented criteria for each role. Consistent application — every candidate gets the same questions. Traceable decisions linking every score to specific evidence. This shifts the legal conversation from “can you prove you didn’t discriminate?” to “here’s exactly how every decision was made.”
Meaningful Candidate Feedback
Instead of “Unfortunately, you weren’t selected,” you can provide: “Your technical assessment showed strong system design thinking (85th percentile) but our behavioral evaluation identified concerns about stakeholder management (62nd percentile).” Candidates appreciate specific feedback. It reflects well on your employer brand.
Continuous Improvement
Questions that don’t differentiate candidates can be replaced. Models that disagree with human judgment can be recalibrated. Scoring weights can be adjusted based on what actually predicts success. Black boxes don’t improve. Transparent systems do.
Hiring Manager Trust
Hiring managers often distrust AI recommendations because they can’t understand them. With LayersRank, a skeptical hiring manager can drill into any score, see the candidate’s actual response, and form their own judgment. This builds trust through transparency rather than demanding blind faith.
Audit Trail Structure
Every LayersRank assessment generates a complete audit trail:
Assessment Metadata
- Candidate identifier (anonymized)
- Role template used
- Questions administered
- Completion & processing timestamps
Response Data
- Full text/transcript for each response
- Video files (per your data policy)
- Response duration
- Behavioral signals (typing patterns, pauses)
Scoring Data
- Individual model scores per response
- Model rationales (for LLM models)
- Agreement metrics
- Fuzzy components (T, F, R)
Aggregation Data
- Question-to-dimension aggregation
- Dimension weights applied
- Final score calculation
- Confidence aggregation
Decision Data
- Threshold comparisons
- Verdict determination
- Any human overrides
- Final recommendation
All of this is queryable via API, exportable for compliance review, and retained according to your data retention policy.
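As a sketch of what one record might look like in code, the structure below mirrors the five blocks above. Field names and types are illustrative assumptions, not the documented API schema.

```python
# Illustrative shape of a single audit-trail record; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    # Assessment metadata
    candidate_id: str                  # anonymized identifier
    role_template: str
    questions: list[str]
    completed_at: str
    processed_at: str
    # Response data
    transcripts: dict[str, str] = field(default_factory=dict)                # question -> transcript
    # Scoring data
    model_scores: dict[str, dict[str, float]] = field(default_factory=dict)  # question -> model -> score
    model_rationales: dict[str, str] = field(default_factory=dict)
    fuzzy_components: dict[str, tuple[float, float, float]] = field(default_factory=dict)  # (T, F, R)
    # Aggregation data
    dimension_weights: dict[str, float] = field(default_factory=dict)
    dimension_scores: dict[str, float] = field(default_factory=dict)
    final_score: float = 0.0
    confidence: float = 0.0
    # Decision data
    verdict: str = ""
    human_override: str | None = None
```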
Explainability vs. Interpretability
Technical distinction worth noting:
Interpretability
You can understand how a model works in general. “This model uses decision trees based on these features with these splits.”
Explainability
You can understand why a model produced a specific output. “This candidate scored 74 because of these specific factors in their responses.”
LayersRank provides both:
- Interpretable architecture: The pipeline is documented, the aggregation formulas are known, the model types are understood
- Explainable outputs: Every individual score traces to specific evidence for that candidate
Many AI systems are interpretable (you know how they work in theory) but not explainable (you can’t trace a specific decision). LayersRank is both.
Frequently Asked Questions
Can candidates see their explanations?
You control this. Some organizations share detailed feedback with candidates. Others provide summary feedback. Others provide none. The explanation exists regardless — you decide who sees it.
How much storage does full audit logging require?
Approximately 50-100KB per assessment for text data. Video storage is additional if retained. At 10,000 assessments/year, that's roughly 500MB-1GB of audit data annually.
Can explanations be used against us in litigation?
Consult your legal team, but generally: documented consistent processes are protective in litigation. "We evaluated every candidate using these specific criteria" is a strong defense. The risk is usually in NOT having documentation, not in having it.
What if we disagree with a model's reasoning?
Flag it. We investigate disagreements between model reasoning and human judgment. Sometimes the model is wrong — we improve it. Sometimes it caught something humans missed — that's valuable. Continuous feedback improves the system.
How do you handle explanations for rejected candidates who request them?
We recommend having a process for candidate feedback requests. LayersRank provides the data; your team decides what to share and how to frame it. We can provide guidance on candidate communication best practices.
Decisions You Can Explain and Defend
See what complete audit trails look like. Download a sample assessment with full explanation at every level.