Research

SLM Reasoning Layers

A new architecture for trustworthy AI — small language models as deterministic verification agents for large model systems.

Kapil Chandwani, ANRAK AI · March 2026

~100ms
Verification latency
50x
Cheaper than LLM-as-judge
7
Cop types supported
95%+
Target accuracy per cop
Abstract

Large Language Models hallucinate. They fabricate facts, ignore constraints, and confidently present fiction as truth. Current approaches — prompt engineering, RLHF, constitutional AI, RAG — all ask the model to police itself. This is fundamentally flawed.

We propose Small Language Models (500M–7B parameters) deployed as independent, specialized verification layers that monitor larger models in real time. These SLMs are trained on narrow tasks with high reliability, incapable of the creative deception that makes large models untrustworthy, and fast enough to operate at inference time.

This is a new class of neurosymbolic AI where the “symbolic” layer is itself a neural network — but one small enough and deterministic enough to function as a reliable reasoning primitive.

The Problem

LLMs lie in five predictable ways

These are not edge cases. They are systematic, emergent properties of how large neural networks process and generate language.

01

Tool Use Fabrication

Claims to have used a tool or searched a database when it hasn't

02

Source Attribution

"Based on the document..." — then generates content not in the source

03

Retroactive Reasoning

Generates the answer first, then constructs reasoning to justify it

04

Confident Uncertainty

Presents uncertain info with the same conviction as known facts

05

Constraint Performance

Performs compliance rather than achieving it — finds creative workarounds

Architecture

The SLM Cop Framework

Multiple small models run in parallel between your primary model and the end user. Each checks one property. A deterministic verdict engine aggregates their outputs.

User Input (query or prompt)
  ↓
Primary Model (8B–70B+ generates response)
  ↓
SLM Cop Squad — runs in parallel (~100ms):
  Grounding (1B model) · Domain (1B model) · Consistency (1B model) · Reasoning (3B model)
  ↓
Verdict Engine (deterministic aggregation)
  ↓
All pass → serve response
Any fail → regenerate with feedback
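The verdict engine needs no ML at all: it is an AND over the cops' outputs. A minimal sketch, assuming judgments arrive as dicts shaped like {"cop": ..., "pass": ..., "reason": ...} — an illustrative shape, not a fixed ANRAK schema:

```python
def aggregate_verdicts(judgments):
    """Deterministic aggregation: serve only if every cop passes.

    `judgments` is a list of dicts like
    {"cop": "grounding", "pass": bool, "reason": str} (assumed shape).
    """
    failures = [j for j in judgments if not j["pass"]]
    if not failures:
        return {"action": "serve"}
    # Any fail: hand the cops' reasons back to the primary model.
    feedback = "; ".join(f"{j['cop']}: {j['reason']}" for j in failures)
    return {"action": "regenerate", "feedback": feedback}
```

Because this step is plain code rather than another model call, the same judgments always produce the same verdict.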
Key Insight

Less capability, more reliability

Small models are better judges because they lack the capacity for creative deception. They can't construct elaborate lies — they can only check the one thing they were trained on.

🔒

Can't lie convincingly

A 500M model lacks the representational capacity to construct multi-layered, contextually appropriate deceptions. It checks a property and reports.

🎯

No sycophancy training

Never trained on conversations or human preferences. Has no concept of "pleasing the user." Trained only on (input, judgment) pairs.

🛡️

Resistant to prompt injection

Doesn't process input as "instructions" — processes it as features for classification. No instruction-following pathway to exploit.

Deterministic by specialization

All 500M parameters dedicated to one task. Converges to near-deterministic behavior — the same input produces the same output.

The Honesty Spectrum

500M–1B: 95%
1B–3B: 90%
3B–7B: 85%
7B–13B: 65%
30B–70B: 40%
70B+: 25%

Reliability on narrow verification tasks
Cop Taxonomy

Seven specialized verification agents

Each cop answers exactly one question. This specificity is what makes them reliable.

📋

Grounding Cop

1B–3B

Is every claim traceable to the provided context? Catches hallucinated facts, wrong attributions, subtle distortions.

🚧

Domain Constraint Cop

1B–3B

Does the response violate domain rules? Catches medical advice from receptionists, legal opinions from chatbots.

🔄

Consistency Cop

1B–3B

Does this contradict anything said before? Catches 'We close at 5' followed by 'Open until 8.'

🧠

Reasoning Cop

3B–7B

Does the chain-of-thought support the conclusion? Catches logical jumps and circular reasoning.

🔧

Tool Use Cop

500M–1B

Did the model actually call the tools it claims? Does the response match tool outputs?

📏

Instruction Cop

500M–1B

Does the response follow all system prompt constraints? Formatting, tone, length, behavior.

⚙️

Custom Cop

Any

Your own verification logic for domain-specific checks not covered by built-in types.
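Because each cop answers exactly one question, the six built-in cops above can be declared as plain data. A sketch — the class and field names are illustrative, mirroring the taxonomy rather than any fixed ANRAK schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CopSpec:
    name: str      # which property this cop checks
    question: str  # the single question it answers
    size: str      # suggested parameter range

COP_SQUAD = [
    CopSpec("grounding",   "Is every claim traceable to the provided context?", "1B–3B"),
    CopSpec("domain",      "Does the response violate domain rules?",           "1B–3B"),
    CopSpec("consistency", "Does this contradict anything said before?",        "1B–3B"),
    CopSpec("reasoning",   "Does the chain-of-thought support the conclusion?", "3B–7B"),
    CopSpec("tool_use",    "Did the model actually call the tools it claims?",  "500M–1B"),
    CopSpec("instruction", "Does the response follow all system prompt constraints?", "500M–1B"),
]
```

A custom cop is just one more entry in this list, backed by your own fine-tuned model.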

Dual Deployment

Guards training data and production responses

📊

At Generation Time

Cops verify each sample during dataset creation. Failed samples are regenerated with the cop's feedback — so your training data is clean before it enters the pipeline.

1. Teacher generates sample
2. Cops verify against rules + KB
3. Failed? Regenerate with feedback
4. Passed? Include in dataset
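The generation-time loop above can be sketched as follows. `generate_sample` and `run_cops` are hypothetical stand-ins for your teacher-model and cop calls, passed in as parameters:

```python
def build_clean_dataset(prompts, generate_sample, run_cops, max_attempts=3):
    """Keep only samples that pass every cop; regenerate failures with feedback."""
    dataset = []
    for prompt in prompts:
        feedback = None
        for _ in range(max_attempts):
            sample = generate_sample(prompt, feedback)  # teacher model call
            judgments = run_cops(sample)                # verify vs rules + KB
            failed = [j for j in judgments if not j["pass"]]
            if not failed:
                dataset.append(sample)                  # passed: include it
                break
            # Failed: retry with the cops' critique attached.
            feedback = "; ".join(j["reason"] for j in failed)
    return dataset
```

Samples that still fail after `max_attempts` are simply dropped, so nothing unverified enters the training pipeline.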
🚀

At Inference Time

In production, cops check every response before it reaches the user. If rejected, the model regenerates with the cop's critique. Critical failures return a safe fallback.

1. Model generates response
2. Cops verify in parallel (~100ms)
3. Failed? Up to 2 regeneration attempts
4. Verified response served to user
Comparison

How SLM Cops compare

Approach | Limitation | SLM Cops Advantage
Guardrails (NeMo, etc.) | Relies on expensive LLM calls for judgment | Fine-tuned small models — cheap, reliable, fast
Constitutional AI | Model evaluates itself — same biases | Independent external models — no correlation
Process Reward Models | Reasoning steps only, training time only | Any verifiable property, training + inference
Mixture of Agents | Large models collaborating — expensive | Small models verifying — 50x cheaper
Regex / Rule-based | Semantically blind — misses meaning | Understands 9 AM = 9:00 AM, catches subtlety
Implementation

How cops connect to any LLM

The cop system is a verification loop — not a modification to the LLM itself. It works with any model, any API, any framework.

1

Call LLM

Claude, GPT, or your fine-tuned model generates a response

2

Cops verify

Small models check the response in parallel (~100ms total)

All cops pass → serve the response to the user
Any cop fails → feed the cop's reason back to the LLM, regenerate, and loop back to Step 1

The cop is just another LLM call

A cop model is a small language model (1B–3B parameters) running on any inference server — Ollama locally, a GPU server, or a cloud endpoint. You send it the primary model's response plus context, and it returns a JSON judgment. That's it.

The entire pattern
def get_verified_response(messages, context, max_retries=2):
    response = call_primary_llm(messages)          # Claude, GPT, your model
    for _ in range(max_retries):
        failed = [j for j in (call_cop_model(cop, response, context)
                              for cop in cop_squad)
                  if not j["pass"]]
        if not failed:
            return response                        # all cops passed: serve it
        reasons = "; ".join(j["reason"] for j in failed)
        messages += [
            {"role": "assistant", "content": response},
            {"role": "user", "content": f"Rejected: {reasons}. Regenerate."},
        ]
        response = call_primary_llm(messages)      # regenerate with cop feedback
    return response                                # retries exhausted: serve or fall back

Where cops run

💻
Ollama (local)
Free, private, ~50ms. Run cops on your laptop alongside your app.
☁️
ANRAK API
Deploy cops on ANRAK, call via API. Same infrastructure as your primary model.
🖥️
Any GPU server
vLLM, TGI, or llama.cpp. Self-hosted on any cloud provider.
Serverless (Modal, Replicate)
Pay per call. Scales to zero when idle.

Build it with any tool

The orchestration is simple. The value is in the trained cop models — small models fine-tuned to reliably detect hallucinations, rule violations, and inconsistencies in your specific domain.

🔄

n8n / Make

Webhook → HTTP node (LLM) → HTTP nodes (cops in parallel) → IF node (verdict) → loop on failure. Visual, no code.

🔗

LangChain / CrewAI

Primary agent generates, cop agents verify. Orchestrator manages the feedback loop. Each cop is a tool or agent.

🐍

Raw Python / cURL

40 lines of code. Call LLM API, call cop API, check JSON, loop. Works in any language, any framework.
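For the raw-Python option, a single cop call is one HTTP request. A minimal sketch assuming an Ollama-style /api/chat endpoint; the cop model name and prompt wording are illustrative, and the transport is injectable so you can swap in any client:

```python
import json
from urllib.request import Request, urlopen

def http_post_json(url, payload):
    """POST a JSON payload and return the decoded JSON reply (stdlib only)."""
    req = Request(url, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode())

def ask_cop(response_text, context, model="grounding-cop-1b",
            url="http://localhost:11434/api/chat", post=http_post_json):
    """Send the primary model's response + context to a cop; return its verdict.

    Assumes a cop fine-tuned to answer with {"pass": ..., "reason": ...} JSON.
    """
    prompt = (f"CONTEXT:\n{context}\n\nRESPONSE:\n{response_text}\n\n"
              'Reply with JSON only: {"pass": true|false, "reason": "..."}')
    reply = post(url, {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,   # single reply, not a token stream
        "format": "json",  # Ollama: constrain output to valid JSON
    })
    return json.loads(reply["message"]["content"])
```

Call the LLM, call `ask_cop` for each cop, check `["pass"]`, loop on failure — the same pattern works from cURL, TypeScript, or any other language that can POST JSON.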

“A genius who sometimes lies needs a simple, honest cop. Not a smarter genius.”

The path to trustworthy AI runs through building smaller, less capable models and ensuring they behave — then using them to police the large ones.

Build trustworthy AI today

Train your own verification models on ANRAK AI. Your domain expertise becomes an executable guardrail.