LayersRank


No Black Boxes. No Hidden Logic.

When someone asks "why did this candidate get this score?" — you have an answer. Every LayersRank evaluation traces from final score back to specific evidence. See exactly what the models saw, how they weighted it, and why the number is what it is.

The Black Box Problem

Most AI hiring tools work like this:

Candidate data goes in. A number comes out. Nobody knows what happened in between.

The vendor might say “our proprietary algorithm” or “machine learning model” or “neural network trained on millions of data points.” But ask them to explain why Candidate A scored 74 and Candidate B scored 71, and they can’t tell you. Not won’t — can’t. The model is opaque even to its creators.

This creates serious problems.

Legal Risk

Employment decisions must be defensible. When a rejected candidate files a complaint, you need to explain the basis for the decision. “Our AI said no” is not a defense. Courts and regulators want to know what criteria were applied and why this candidate didn’t meet them.

Bias Concealment

Black-box models can encode biases invisibly. A model trained on historical hiring data might learn that certain names, schools, or speech patterns correlate with past decisions — and perpetuate those patterns without anyone knowing. You can’t audit what you can’t see.

No Path to Improvement

When a black-box model makes mistakes, you can’t fix them. You don’t know why it made the decision, so you don’t know what to change. Should you add more training data? Change a feature? Adjust a weight? Without visibility, improvement becomes trial and error.

Candidate Distrust

Candidates increasingly ask how they were evaluated. “An AI scored you” without further explanation feels arbitrary and unfair — especially for candidates who were rejected. Providing meaningful feedback requires understanding what the evaluation measured.

The LayersRank Approach

LayersRank is explainable by design, not as an afterthought.

We don’t use end-to-end neural networks that consume raw data and produce scores. We use a structured pipeline where each step is interpretable:

1. Response Capture: Candidate answers are transcribed and stored.
2. Component Scoring: Multiple interpretable models score specific aspects.
3. Aggregation: Component scores combine via transparent weighted formulas.
4. Uncertainty Quantification: Fuzzy logic produces confidence levels.
5. Dimension Rollup: Question scores aggregate to dimension scores.
6. Final Score: Dimension scores aggregate to an overall assessment.

At every step, inputs and outputs are visible. The logic connecting them is documented. The whole chain is auditable.
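To make that pipeline concrete, here is a minimal sketch in Python of how a staged, auditable evaluation record can be structured so that each stage keeps its inputs and outputs. The class and field names are illustrative assumptions, not LayersRank's actual schema.

```python
# Minimal sketch (assumed names, not the actual LayersRank schema): each
# stage keeps its inputs and outputs so the chain from final score back
# to individual responses stays auditable.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentScore:              # step 2: one interpretable model, one aspect
    model: str                     # e.g. "semantic_similarity"
    score: float                   # 0-1 signal from that model
    rationale: str                 # human-readable reason for the score

@dataclass
class QuestionScore:               # steps 1-4: one response, several models
    question_id: str
    transcript: str                # step 1: captured response text
    components: List[ComponentScore] = field(default_factory=list)
    score: float = 0.0             # step 3: aggregated 0-100 score
    confidence: float = 0.0        # step 4: fuzzy-logic confidence level

@dataclass
class DimensionScore:              # step 5: weighted rollup of question scores
    dimension: str                 # e.g. "Technical"
    weights: List[float] = field(default_factory=list)
    questions: List[QuestionScore] = field(default_factory=list)
    score: float = 0.0
    confidence: float = 0.0

@dataclass
class Assessment:                  # step 6: overall result, still traceable
    candidate_id: str
    dimensions: List[DimensionScore] = field(default_factory=list)
    final_score: float = 0.0
```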

Tracing a Score: A Complete Walkthrough

Let’s trace through exactly how a candidate score is derived.

Candidate: Priya — Senior Backend Engineer

Final Score: Technical 82 (91% confidence)

How did we get there?

Level 1: Dimension Score

The Technical dimension score (82) aggregates from individual question scores:

Question              Type    Weight    Score    Confidence
Q4: System Design     Video   30%       85       94%
Q5: Debugging         Video   25%       81       89%
Q6: Technical Depth   Text    25%       79       88%
Q7: Trade-offs        Text    20%       83       93%

Weighted Calculation

(85 × 0.30) + (81 × 0.25) + (79 × 0.25) + (83 × 0.20)
= 25.5 + 20.25 + 19.75 + 16.6
= 82.1 → 82

Confidence: min(94, 89, 88, 93) = 88
  adjusted upward for multiple confirming signals
  = 91%

Audit Point: You can see exactly which questions contributed and how much weight each carried.
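As a sketch, that rollup fits in a few lines of code. The confidence adjustment shown here, a small uplift per additional confirming question, is an illustrative assumption rather than the exact LayersRank rule.

```python
# Sketch of the dimension rollup above; the confidence uplift is assumed.
def rollup(questions):
    """questions: list of (weight, score, confidence) tuples."""
    score = sum(w * s for w, s, _ in questions)      # weighted mean
    base_conf = min(c for _, _, c in questions)      # most cautious signal: 88
    # Assumed adjustment: nudge confidence up when several questions agree.
    confidence = min(base_conf + (len(questions) - 1), 100)
    return round(score), confidence

technical = [
    (0.30, 85, 94),  # Q4: System Design
    (0.25, 81, 89),  # Q5: Debugging
    (0.25, 79, 88),  # Q6: Technical Depth
    (0.20, 83, 93),  # Q7: Trade-offs
]
print(rollup(technical))  # (82, 91)
```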

Level 2: Question Score

Let’s drill into Q4: System Design, which scored 85.

The Question

“Walk through how you’d design a notification service handling 10 million daily users. Consider delivery guarantees, scale, and failure scenarios.”

The Response (summarized)

Candidate proposed multi-tier architecture with separate ingestion, processing, and delivery layers. Discussed WebSocket for real-time vs. batch for email. Addressed failure modes with dead-letter queues. Quantified throughput estimates.

Model Evaluations

Model                 Score   Rationale
Semantic Similarity   0.87    High similarity to strong reference answers on architecture patterns
Lexical Alignment     0.81    Appropriate terminology (dead-letter queue, horizontal sharding, etc.)
LLM Reasoning         0.86    Clear logical structure, unprompted failure consideration, quantified reasoning
Relevance             0.89    Directly addressed all three prompt components

Agreement Analysis

Scores: [0.87, 0.81, 0.86, 0.89]

Std Dev: 0.03 (low)

Refusal (R): 0.12

Models agree strongly

Score Derivation

Aggregate signal: 0.86

Scaled to 0-100: 86

Adjusted for confidence: 85

Confidence: 94%

Audit Point: You can see each model’s contribution and why they agreed.
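A minimal sketch of the agreement check and score derivation for Q4, using the four model signals above. The refusal-based point adjustment is an assumption for illustration, and the 94% confidence is produced by the fuzzy uncertainty step, which is not reproduced here.

```python
from statistics import mean, pstdev

# Q4 component signals from the table above.
model_scores = {
    "semantic_similarity": 0.87,
    "lexical_alignment":   0.81,
    "llm_reasoning":       0.86,
    "relevance":           0.89,
}
refusal = 0.12  # fuzzy R component: how much signal the models withheld

signals = list(model_scores.values())
spread = pstdev(signals)             # ~0.03 -> low: models agree strongly
aggregate = round(mean(signals), 2)  # 0.86  -> aggregate signal
raw_score = round(aggregate * 100)   # 86    -> scaled to 0-100

# Assumed adjustment: trim a point when the refusal signal is non-trivial.
final_score = raw_score - (1 if refusal > 0.1 else 0)
print(spread, aggregate, final_score)  # ~0.03, 0.86, 85
```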

Level 3: Model Rationale

Let’s drill into why the LLM Reasoning model scored 8.6/10.

Model Prompt (simplified)

“Evaluate this response to a system design question. Score 1–10 on: logical structure, depth of analysis, consideration of trade-offs, handling of failure scenarios.”

Model Output

Logical Structure: 9/10
Response follows clear progression: requirements clarification → high-level architecture → component details → scale considerations → failure handling. Easy to follow.

Depth of Analysis: 8/10
Good depth on delivery layer trade-offs. Quantified throughput estimates show practical experience. Could have gone deeper on data model and consistency guarantees.

Trade-off Consideration: 9/10
Explicitly discussed WebSocket vs. batch trade-offs. Acknowledged latency vs. reliability tension. Unprompted consideration of eventual consistency.

Failure Scenarios: 8/10
Mentioned dead-letter queues and retry logic. Could have addressed cascading failures or circuit breakers. Good but not exceptional.

Overall: 8.6/10
Strong response demonstrating practical system design experience.

Audit Point: You can see exactly what the model evaluated and why it gave each sub-score.
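One way the four sub-scores can combine into the 8.6 overall is a weighted rubric. The criterion weights below are assumptions chosen for illustration; a uniform average of 9, 8, 9, 8 would give 8.5.

```python
# Illustrative weighted rubric rollup (the weights are assumed, not LayersRank's).
rubric = {
    "logical_structure":       (9, 0.30),
    "depth_of_analysis":       (8, 0.20),
    "trade_off_consideration": (9, 0.30),
    "failure_scenarios":       (8, 0.20),
}
overall = sum(score * weight for score, weight in rubric.values())
print(round(overall, 1))  # 8.6
```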

Level 4: Reference Comparisons

The Semantic Similarity model (0.87) compares against reference responses. What references?

Reference Set for System Design Questions

  • 15 curated strong responses from validated high-performers
  • Embedding vectors stored for each reference
  • New responses compared via cosine similarity to reference set
  • Score = average similarity to top-5 closest references
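As a sketch, that comparison reduces to cosine similarity against precomputed reference embeddings, averaged over the closest five. The function below assumes embedding vectors are already available; embedding generation itself is out of scope here.

```python
import numpy as np

def semantic_similarity_score(response_vec, reference_vecs, top_k=5):
    """Average cosine similarity of a response to its top_k closest references."""
    refs = np.asarray(reference_vecs, dtype=float)   # (n_references, dim)
    resp = np.asarray(response_vec, dtype=float)     # (dim,)
    sims = refs @ resp / (np.linalg.norm(refs, axis=1) * np.linalg.norm(resp))
    top = np.sort(sims)[-top_k:]                     # five closest references
    return float(top.mean())                         # e.g. ~0.87 for Priya's Q4
```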

Specific Match Analysis

Candidate response was most similar to:

0.91

Reference #7

Also proposed tiered architecture with similar component breakdown

0.88

Reference #3

Also emphasized failure handling with queue-based recovery

0.86

Reference #11

Also quantified scale estimates

The 0.87 score reflects strong alignment with known-good responses.

Audit Point: You can see what “good” looks like and how the candidate compared.

What Explainability Enables

Compliant Decision-Making

Documented criteria for each role. Consistent application — every candidate gets the same questions. Traceable decisions linking every score to specific evidence. This shifts the legal conversation from “can you prove you didn’t discriminate?” to “here’s exactly how every decision was made.”

Meaningful Candidate Feedback

Instead of “Unfortunately, you weren’t selected,” you can provide: “Your technical assessment showed strong system design thinking (85th percentile) but our behavioral evaluation identified concerns about stakeholder management (62nd percentile).” Candidates appreciate specific feedback. It reflects well on your employer brand.

Continuous Improvement

Questions that don’t differentiate candidates can be replaced. Models that disagree with human judgment can be recalibrated. Scoring weights can be adjusted based on what actually predicts success. Black boxes don’t improve. Transparent systems do.

Hiring Manager Trust

Hiring managers often distrust AI recommendations because they can’t understand them. With LayersRank, a skeptical hiring manager can drill into any score, see the candidate’s actual response, and form their own judgment. This builds trust through transparency rather than demanding blind faith.

Audit Trail Structure

Every LayersRank assessment generates a complete audit trail:

Assessment Metadata

  • Candidate identifier (anonymized)
  • Role template used
  • Questions administered
  • Completion & processing timestamps

Response Data

  • Full text/transcript for each response
  • Video files (per your data policy)
  • Response duration
  • Behavioral signals (typing patterns, pauses)

Scoring Data

  • Individual model scores per response
  • Model rationales (for LLM models)
  • Agreement metrics
  • Fuzzy components (T, F, R)

Aggregation Data

  • Question-to-dimension aggregation
  • Dimension weights applied
  • Final score calculation
  • Confidence aggregation

Decision Data

  • Threshold comparisons
  • Verdict determination
  • Any human overrides
  • Final recommendation

All of this is queryable via API, exportable for compliance review, and retained according to your data retention policy.
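As an illustration, an exported audit record might look like the following. The field names and placeholder values are assumptions, not the actual export schema.

```python
# Illustrative audit record shape; "..." marks elided values.
audit_record = {
    "assessment": {
        "candidate_id": "anon-7f3c",                 # anonymized identifier
        "role_template": "senior-backend-engineer",
        "questions": ["Q4", "Q5", "Q6", "Q7"],
        "timestamps": {"completed": ..., "processed": ...},
    },
    "responses": {
        "Q4": {"transcript": ..., "duration_s": ..., "behavioral_signals": ...},
    },
    "scoring": {
        "Q4": {
            "models": {"semantic_similarity": 0.87, "lexical_alignment": 0.81,
                       "llm_reasoning": 0.86, "relevance": 0.89},
            "fuzzy": {"T": ..., "F": ..., "R": 0.12},
        },
    },
    "aggregation": {
        "Technical": {"weights": {"Q4": 0.30, "Q5": 0.25, "Q6": 0.25, "Q7": 0.20},
                      "score": 82, "confidence": 91},
    },
    "decision": {"threshold": ..., "verdict": ..., "human_override": None},
}
```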

Explainability vs. Interpretability

A technical distinction worth noting:

Interpretability

You can understand how a model works in general. “This model uses decision trees based on these features with these splits.”

Explainability

You can understand why a model produced a specific output. “This candidate scored 74 because of these specific factors in their responses.”

LayersRank provides both:

  • Interpretable architecture: The pipeline is documented, the aggregation formulas are known, the model types are understood
  • Explainable outputs: Every individual score traces to specific evidence for that candidate

Many AI systems are interpretable (you know how they work in theory) but not explainable (you can’t trace a specific decision). LayersRank is both.

Frequently Asked Questions

Can candidates see their explanations?

You control this. Some organizations share detailed feedback with candidates. Others provide summary feedback. Others provide none. The explanation exists regardless — you decide who sees it.

How much storage does full audit logging require?

Approximately 50-100KB per assessment for text data. Video storage is additional if retained. At 10,000 assessments/year, that's roughly 500MB-1GB of audit data annually.

Can explanations be used against us in litigation?

Consult your legal team, but generally: documented consistent processes are protective in litigation. "We evaluated every candidate using these specific criteria" is a strong defense. The risk is usually in NOT having documentation, not in having it.

What if we disagree with a model's reasoning?

Flag it. We investigate disagreements between model reasoning and human judgment. Sometimes the model is wrong — we improve it. Sometimes it caught something humans missed — that's valuable. Continuous feedback improves the system.

How do you handle explanations for rejected candidates who request them?

We recommend having a process for candidate feedback requests. LayersRank provides the data; your team decides what to share and how to frame it. We can provide guidance on candidate communication best practices.

Decisions You Can Explain and Defend

See what complete audit trails look like. Download a sample assessment with full explanation at every level.