Prompt injection attacks
Malicious instructions hidden in user inputs, retrieved documents, or emails that hijack model behaviour, override system prompts, and exfiltrate sensitive data.
A purpose-trained model that validates every LLM input before it reaches your stack. Industry-leading detection accuracy. Zero data leaves your infrastructure.
Large language models do not separate instructions from data — by design, any untrusted text reaching the model can become a command. Static rules and keyword filters cannot close this gap; new vectors are catalogued weekly.
Malicious instructions hidden in user inputs, retrieved documents, or emails that hijack model behaviour, override system prompts, and exfiltrate sensitive data.
Adversarial prompts crafted to bypass content policies, extract system instructions, or force agents to execute actions outside their intended scope.
Crafted inputs that extract PII, trade secrets, or internal system context from your enterprise AI — triggering compliance violations and potential breaches.
Filters and allow-lists cannot enumerate a growing attack surface. The only defensible answer is a model trained on the full corpus of known injection patterns.
Bastion is a self-hosted prompt-injection scanner that validates every input before it reaches your LLM, agent, or tool-calling workflow.
Pull the model artefact, run it on CPU or GPU within your own infrastructure. No outbound calls, no telemetry, fully air-gappable.
Route untrusted text — user messages, emails, documents, web content, RAG context — through Bastion before it enters your model. Typically 5–10 ms per check.
Your team has a benchmarked, audit-ready answer for security review, procurement, and regulators: a documented, measurable control on LLM01.
# Host Bastion as a microservice within your environment import json, urllib.request req = urllib.request.Request( "https://bastion.internal/protect", data=json.dumps({"prompt": user_input}).encode(), headers={"Content-Type": "application/json"}, method="POST" ) with urllib.request.urlopen(req) as resp: if json.loads(resp.read())["risk"] >= 0.5: raise ValueError("Prompt blocked")
# Embed the SDK directly in your application from bastion_prompt_protection import Guard guard = Guard() result = guard.protect(user_input) if result.is_attack: raise ValueError("Prompt blocked")
# Run inference directly with the ONNX model weights import json import numpy as np import onnxruntime from tokenizers import Tokenizer MODEL_DIR = "binary-bastion-prompt-protection-deberta-v3-xsmall-v1" session = onnxruntime.InferenceSession(f"{MODEL_DIR}/onnx/model_quantized.onnx") tokenizer = Tokenizer.from_file(f"{MODEL_DIR}/tokenizer.json") temperature = json.loads(open(f"{MODEL_DIR}/temperature.json").read())["temperature"] enc = tokenizer.encode("Ignore previous instructions") logits = session.run(None, { "input_ids": np.array([enc.ids], dtype=np.int64), "attention_mask": np.array([enc.attention_mask], dtype=np.int64), })[0][0] / temperature shifted = logits - logits.max() risk = float(np.exp(shifted)[1] / np.exp(shifted).sum())
Average AUC of 0.984 across four public prompt-injection benchmarks — the highest among detectors evaluated. Methodology and raw results are published on HuggingFace.
| Detector | Average | xTRam1 | S-Labs | JBB | rogue |
|---|---|---|---|---|---|
| bastion-prompt-protection | 0.984 | 0.997 | 0.996 | 0.970 | 0.972 |
| hlyn judge | 0.950 | 0.995 | 0.891 | 0.934 | 0.980 |
| protectai v2 | 0.850 | 0.992 | 0.978 | 0.600 | 0.830 |
| deepset injection | 0.766 | 0.666 | 0.961 | 0.649 | 0.787 |
| meta prompt-guard | 0.299 | 0.186 | 0.362 | 0.332 | 0.314 |
No tokens. No transactions. No outbound connections. A model artefact you own, deploy, and control end-to-end.
Deploy as a model artefact and run air-gapped on your own infrastructure. Bastion never calls out, never sees your prompts, never depends on a remote service.
5–10 ms per prompt on standard server CPUs. Sits inline in any production workflow — chat, agents, RAG pipelines, document ingestion — without adding user-visible latency.
Raw model weights, a 10-line Python script, the open-source SDK, or the full private microservice. Choose the integration that fits your stack and review process.
Every classification returns a calibrated probability score, not a binary flag. Tune thresholds per use case. Supports EU AI Act human-oversight and logging requirements out of the box.
Five reasons enterprise teams choose Bastion over generic content moderation or in-house heuristics.
Not a ruleset. Not a generic safety classifier. A model that has seen the full prompt-injection attack surface and detects over 98% of attempts before they reach your LLM.
Top-ranked across publicly accessible detectors on HuggingFace. Any engineer can re-run the benchmarks against their own data. No marketing claims behind closed doors.
Deployed as a model artefact inside your environment. No per-request fees. No prompts shared with any third party. No outbound network dependency.
Optimised for standard server CPUs. Runs inline in your pipeline without adding user-visible latency or requiring dedicated GPU capacity.
Governed under EU law. Directly applicable for GDPR, EU AI Act, and high-stakes deployments in finance, healthcare, and public sector.
* DETECTION RATE AS OF MAY 2026
No per-request fees. No metered token costs. Volume and group-wide pricing on request.
Need an offline evaluation, a security questionnaire, or a Soft-letter for procurement? Get in touch.
If your security or procurement team needs documentation we don't list here, just ask — we'll send a complete pack.
Tell us about your stack and your security posture. We'll come back within one business day with a tailored deployment plan, the documents your procurement team needs, and a path to a proof-of-value in your environment.