Research system for MCP skill cold-start

ReliaSkill

Pre-deployment validation, repair, and gating for reliable MCP-style tool-use skills.

ReliaSkill converts raw MCP-like schemas and sparse documentation into compact skill artifacts with explicit use boundaries, schema-faithful examples, validation reports, repair traces, reliability scores, and deployability decisions before downstream LLM agents see them.

290 MCP-like tools
2,900 controls
5,800 routing examples
Hero diagram (raw_mcp to skill artifact):

raw_mcp input (usage boundary missing):
{
  "tool": "write_file",
  "args": {
    "path": "string",
    "content": "string"
  }
}

Governed reliability layer: ToolIR++, Validate, Repair, Gate.

Skill artifact (scored, auditable): purpose summary, when to use, when not to use, argument template, schema-faithful examples.

Decision: DEPLOY, REPAIR, or REJECT.

Problem

Raw schemas are interfaces, not reliable agent skills

MCP schemas tell a model what arguments exist. They usually do not specify when the tool should fire, when it should abstain, or how an agent should interpret adjacent requests that look plausible but are out of scope.

01. Missing use boundary

Sparse docs leave trigger conditions implicit, so a model has to infer the policy from a thin interface.

02. Unsupported argument

Fluent generated skills may introduce fields that the underlying schema never accepted.

03. Over-triggering

Agents can call a tool for adjacent requests where explanation, search, or abstention is the safer behavior.

04. Invalid activation

Side-effect tools need explicit boundaries before deployment, not trust after one generated prompt.

Diagram explaining why raw MCP schemas fail: missing use boundaries, sparse documentation, over-triggering, and invented arguments.
Raw schemas list arguments, but not the conditions for safe and correct tool use.

Pipeline

A governed representation layer before tool exposure

ReliaSkill treats generated skills as candidates. Each candidate moves through normalization, generation, validation, behavior tests, targeted repair, and a final deployment gate.

01. ToolIR++ Normalization (IR)

Preserves the original schema and adds provenance, complexity, ambiguity, side-effect, and safety metadata.

Status: normalized
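
A minimal sketch of what a normalized record could look like, assuming the flat raw_mcp shape shown on this page. The ToolIR class, its field names, and the SIDE_EFFECT_HINTS keyword list are hypothetical illustrations, not the project's actual ToolIR++ definition.

```python
from copy import deepcopy
from dataclasses import dataclass, field

# Hypothetical keyword list; the real side-effect detection is not described here.
SIDE_EFFECT_HINTS = ("write", "delete", "update", "create", "send")


@dataclass
class ToolIR:
    """Illustrative normalized record; all field names are assumptions."""
    name: str
    original_schema: dict      # kept verbatim so later checks can refer back to it
    provenance: str
    num_args: int              # crude complexity proxy
    has_side_effects: bool
    ambiguity_notes: list = field(default_factory=list)


def normalize(raw: dict, provenance: str) -> ToolIR:
    """Wrap a raw tool schema with metadata without altering the schema itself."""
    name = raw["tool"]
    notes = []
    if not raw.get("description"):
        notes.append("no description: use boundary must be inferred")
    return ToolIR(
        name=name,
        original_schema=deepcopy(raw),
        provenance=provenance,
        num_args=len(raw.get("args", {})),
        has_side_effects=any(hint in name.lower() for hint in SIDE_EFFECT_HINTS),
        ambiguity_notes=notes,
    )


raw_mcp = {"tool": "write_file", "args": {"path": "string", "content": "string"}}
ir = normalize(raw_mcp, provenance="example")
print(ir.has_side_effects, ir.num_args, ir.ambiguity_notes)
```

Copying the schema rather than rewriting it keeps the original interface available as the reference for later faithfulness checks.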
02. Compact Skill Generation (SK)

Creates purpose, use boundaries, non-use boundaries, argument templates, and examples.

Status: candidate
03. Structural Validation (VA)

Checks unsupported arguments, required fields, enum values, examples, contradictions, and compactness.

Status: inspected
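
Three of these checks (unsupported arguments, required fields, enum values) can be sketched as follows. This assumes a JSON-Schema-style input with properties, required, and enum keys; validate_example and the sample schema, including its mode field, are hypothetical and not the repository's validator.

```python
def validate_example(schema: dict, call_args: dict) -> list[str]:
    """Return a list of issues; an empty list means the example is schema-faithful."""
    props = schema.get("properties", {})
    issues = []
    # Arguments the schema never declared.
    for name in call_args:
        if name not in props:
            issues.append(f"unsupported argument: {name}")
    # Required fields the example left out.
    for name in schema.get("required", []):
        if name not in call_args:
            issues.append(f"missing required field: {name}")
    # Values outside a declared enum.
    for name, value in call_args.items():
        allowed = props.get(name, {}).get("enum")
        if allowed is not None and value not in allowed:
            issues.append(f"invalid enum value for {name}: {value!r}")
    return issues


schema = {
    "properties": {
        "path": {"type": "string"},
        "content": {"type": "string"},
        "mode": {"type": "string", "enum": ["overwrite", "append"]},
    },
    "required": ["path", "content"],
}
good = {"path": "notes.txt", "content": "hi", "mode": "append"}
bad = {"path": "notes.txt", "encoding": "utf-8", "mode": "truncate"}

print(validate_example(schema, good))
print(validate_example(schema, bad))
```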
04. Behavior-Grounded Evaluation (BE)

Runs positive controls and adjacent negative controls to measure utility and over-triggering risk.

Status: tested
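
As a rough illustration of how the two control sets yield one utility number and one risk number, the sketch below scores a toy keyword predictor. The behavior_report function, its metric names, and the sample requests are assumptions made for this example.

```python
def behavior_report(predict, positives, negatives):
    """predict(request) returns a call dict, or None to abstain."""
    hits = sum(predict(req) == gold for req, gold in positives)
    fires = sum(predict(req) is not None for req in negatives)
    return {
        "positive_exact_match": hits / len(positives),
        "over_trigger_rate": fires / len(negatives),
    }


GOLD = {"tool": "write_file", "args": {"path": "notes.txt", "content": "hello"}}


def toy_predict(request: str):
    # Deliberately naive: fires on any request that mentions saving or writing.
    if "save" in request or "write" in request:
        return GOLD
    return None


positives = [
    ("save hello to notes.txt", GOLD),
    ("write hello into notes.txt", GOLD),
]
negatives = [
    "explain how write_file works",  # explanation, not action
    "read notes.txt",                # read, not write
]

report = behavior_report(toy_predict, positives, negatives)
print(report)
```

The toy predictor handles both positive controls but also fires on the explanation request, which is the kind of over-triggering that adjacent negative controls are meant to expose.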
05. Targeted Repair (RP)

Patches localized failing sections instead of defaulting to full skill regeneration.

Status: patched
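
One way to picture section-level patching, assuming a skill is a plain dictionary of named sections. The targeted_repair function, the section names, and the fix callables are hypothetical; they show the idea of localized patches plus a trace, not the repository's repair logic.

```python
def targeted_repair(skill: dict, fixes: dict) -> tuple[dict, list]:
    """Patch only the sections named in `fixes`; return the new skill and a repair trace."""
    patched = dict(skill)  # shallow copy; untouched sections are carried over as-is
    trace = []
    for section, fix in fixes.items():
        before = patched.get(section)
        patched[section] = fix(before)
        trace.append({"section": section, "before": before, "after": patched[section]})
    return patched, trace


skill = {
    "purpose": "Write text content to a file path.",
    "when_not_to_use": [],
    "example_args": {"path": "notes.txt", "content": "hi", "encoding": "utf-8"},
}
fixes = {
    # Drop the argument the schema never accepted.
    "example_args": lambda args: {k: v for k, v in args.items() if k in {"path", "content"}},
    # Add the missing non-use boundary.
    "when_not_to_use": lambda items: items + ["Requests that only ask to read or explain a file."],
}

repaired, trace = targeted_repair(skill, fixes)
print(repaired["example_args"])
print([step["section"] for step in trace])
```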
06. Deployment Gating (GT)

Outputs DEPLOY, REPAIR, or REJECT using explicit reliability evidence and repair traces.

Status: gated
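
A rule-based gate of this kind could look like the sketch below. The gate function, its inputs, and the thresholds (min_match, max_over_trigger) are placeholder assumptions, not values reported by the project.

```python
def gate(structural_issues: int, positive_match: float, over_trigger: float,
         repairs_left: int, min_match: float = 0.6,
         max_over_trigger: float = 0.1) -> str:
    """Map reliability evidence to a decision. Thresholds are illustrative only."""
    passes = (
        structural_issues == 0
        and positive_match >= min_match
        and over_trigger <= max_over_trigger
    )
    if passes:
        return "DEPLOY"
    # A failing candidate goes back for repair while repair budget remains.
    if repairs_left > 0:
        return "REPAIR"
    return "REJECT"


print(gate(0, 0.90, 0.05, repairs_left=2))
print(gate(1, 0.90, 0.05, repairs_left=2))
print(gate(0, 0.30, 0.50, repairs_left=0))
```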
Six-stage ReliaSkill pipeline: ToolIR++ normalization, compact skill generation, structural validation, behavior-grounded evaluation, targeted repair, and deployment gating.
Generated skills are candidates, not trusted artifacts by default.

Artifact

The generated skill is an inspected package, not a trusted prompt

ReliaSkill packages a compact agent-facing representation with machine-checkable evidence about schema faithfulness, behavior controls, repair history, and deployability.

SKILL.md candidate under review

Agent-facing content

  • Purpose summary
  • When-to-use guidance
  • When-not-to-use guidance
  • Canonical argument template
  • Schema-faithful examples

Reliability evidence

  • Validation report
  • Behavior report
  • Repair trace
  • Reliability score
  • Deployment decision (DEPLOY, REPAIR, or REJECT)
Anatomy of a ReliaSkill artifact with purpose summary, use boundaries, argument template, examples, validation report, behavior report, repair trace, reliability score, and deployment decision.
A ReliaSkill artifact packages agent-facing guidance with inspection evidence.
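
The two halves of the artifact can be sketched as a single record. This is an illustration only: the SkillArtifact class, its field names, and the agent_view helper are assumptions modeled on the lists above, not the repository's on-disk format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SkillArtifact:
    # Agent-facing content
    purpose: str
    when_to_use: list
    when_not_to_use: list
    argument_template: dict
    examples: list
    # Reliability evidence
    validation_report: dict = field(default_factory=dict)
    behavior_report: dict = field(default_factory=dict)
    repair_trace: list = field(default_factory=list)
    reliability_score: float = 0.0
    decision: Optional[str] = None  # set to DEPLOY, REPAIR, or REJECT by the gate

    def agent_view(self) -> dict:
        """Return only the agent-facing fields (a design assumption of this sketch)."""
        keys = ("purpose", "when_to_use", "when_not_to_use",
                "argument_template", "examples")
        return {k: getattr(self, k) for k in keys}


artifact = SkillArtifact(
    purpose="Write text content to a file path.",
    when_to_use=["The user asks to create or overwrite a file with given content."],
    when_not_to_use=["The user asks to read, search, or explain a file."],
    argument_template={"path": "<string>", "content": "<string>"},
    examples=[{"path": "notes.txt", "content": "hello"}],
)
serialized = json.dumps(asdict(artifact), indent=2)
print(sorted(artifact.agent_view()))
```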

Evaluation

Utility and risk are measured together

ReliaSkill evaluates whether a representation helps models produce the correct call, select the right hidden tool, and abstain on adjacent negative controls.

Structured-call prediction

Checks whether the predicted tool call matches the gold call and whether arguments are parseable and schema-faithful.
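
A minimal sketch of the parseability and exact-match checks, assuming the predictor emits its call as a JSON object. The exact_call_match function and its output keys are hypothetical names for this example.

```python
import json


def exact_call_match(predicted_text: str, gold: dict) -> dict:
    """Parse the model output and compare it to the gold call."""
    try:
        pred = json.loads(predicted_text)
    except json.JSONDecodeError:
        return {"parseable": False, "exact_match": False}
    is_object = isinstance(pred, dict)
    # Dict comparison ignores key order, so only content differences count.
    return {"parseable": is_object, "exact_match": is_object and pred == gold}


gold = {"tool": "write_file", "args": {"path": "notes.txt", "content": "hi"}}
reordered = '{"args": {"content": "hi", "path": "notes.txt"}, "tool": "write_file"}'
malformed = "write_file(path=notes.txt)"

print(exact_call_match(reordered, gold))
print(exact_call_match(malformed, gold))
```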

Positive controls

Exercise intended use cases with gold tools and gold arguments across difficulty tiers.

Adjacent negative controls

Test abstention on near-miss, explanation-versus-action, read-versus-write, and missing-information cases.

Hidden-tool routing

Measures tool selection and joint route-plus-argument correctness among candidate tools with hard distractors.
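
The difference between tool selection and joint correctness can be sketched as two counters over the same records. The routing_metrics function, the record layout, and the sample tool names are assumptions for illustration.

```python
def routing_metrics(records):
    """records: non-empty list of (pred_tool, pred_args, gold_tool, gold_args)."""
    n = tool_ok = joint_ok = 0
    for pred_tool, pred_args, gold_tool, gold_args in records:
        n += 1
        right_tool = pred_tool == gold_tool
        tool_ok += right_tool
        # Joint credit requires the right tool and the right arguments.
        joint_ok += right_tool and pred_args == gold_args
    return {"tool_accuracy": tool_ok / n, "joint_accuracy": joint_ok / n}


records = [
    ("write_file", {"path": "a.txt", "content": "x"}, "write_file", {"path": "a.txt", "content": "x"}),
    ("write_file", {"path": "b.txt", "content": "y"}, "write_file", {"path": "b.txt", "content": "y"}),
    ("write_file", {"path": "c.txt"}, "write_file", {"path": "c.txt", "content": "z"}),
    ("append_file", {"path": "d.txt", "content": "w"}, "write_file", {"path": "d.txt", "content": "w"}),
]
metrics = routing_metrics(records)
print(metrics)
```

In this toy set the third record selects the right tool with incomplete arguments and the fourth picks a distractor, so joint accuracy falls below tool accuracy.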

Utility: exact call match

Does the representation help the model assemble the right call?

Risk: negative-control abstention

Does the representation avoid activating on adjacent out-of-scope requests?

Evaluation protocol diagram covering structured-call prediction, hidden-tool routing, and positive plus adjacent negative controls.
The evaluation protocol treats tool-use utility and abstention reliability as coupled requirements.

Results

Reported results across seven predictors

The evaluation compares five tool-facing representations while holding the downstream predictor fixed within each comparison. Boundary-first is the primary ReliaSkill rendering, with verbose docs reported as a close secondary variant.

Structured-call Exact Match (%)

Condition                    Llama3.2-1B  Qwen2.5-1.5B  Gemma2-2B  Phi-3.5-mini  Qwen2.5-7B  Llama3.1-8B  Gemma2-9B   Mean
raw_mcp                            33.42         38.58      43.80         33.97       53.02        39.93      48.88  41.66
generated_skill_base               37.76         34.31      43.93         42.44       63.32        56.81      56.75  47.90
curated_schema_reference           38.71         37.29      30.37         32.61       53.83        37.69      46.78  39.61
skill_prompt_boundary_first        52.81         61.63      52.07         63.73       70.37        58.37      63.39  60.34
skill_prompt_verbose_docs          52.20         56.14      52.54         62.17       67.86        57.90      60.07  58.41

Boundary-first reaches a 60.34% seven-model mean, a 44.8% relative improvement over raw MCP exposure.
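
The seven-model means and the relative improvement can be recomputed from the per-model scores in the table. The snippet below is only an arithmetic check; the variable names are arbitrary.

```python
scores = {
    "raw_mcp": [33.42, 38.58, 43.80, 33.97, 53.02, 39.93, 48.88],
    "generated_skill_base": [37.76, 34.31, 43.93, 42.44, 63.32, 56.81, 56.75],
    "curated_schema_reference": [38.71, 37.29, 30.37, 32.61, 53.83, 37.69, 46.78],
    "skill_prompt_boundary_first": [52.81, 61.63, 52.07, 63.73, 70.37, 58.37, 63.39],
    "skill_prompt_verbose_docs": [52.20, 56.14, 52.54, 62.17, 67.86, 57.90, 60.07],
}

# Mean exact match per condition, rounded as in the table.
means = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

# Relative improvement of boundary-first over raw MCP exposure, in percent.
base = means["raw_mcp"]
rel = round((means["skill_prompt_boundary_first"] - base) / base * 100, 1)

print(means)
print(rel)
```

The absolute gain is 60.34 - 41.66 = 18.68 points, and 18.68 / 41.66 is about 0.448, which matches the reported 44.8% relative improvement.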

Reported result highlights comparing Raw MCP, generated skills, boundary-first, and verbose-doc skill prompts.
Reported highlights: seven-model means, Qwen2.5-7B spotlight, and the reliability-component ablation.

Implementation

Research code for reliable tool representation experiments

The repository includes parsing, generation, validation, controls, repair, gating, routing, conversion, live sandbox, and analysis components.

  • Parsing: MCP/tool schema parsing and normalization
  • ToolIR++: Reliability metadata and schema-complexity features
  • Generation: Prompt-template and compactness variants
  • Validation: Structural artifact checks
  • Controls: Positive and adjacent negative controls
  • Repair: Targeted patching and regeneration baselines
  • Gating: Rule-based reliability scoring and decisions
  • Routing: Hidden-tool candidate evaluation
  • Converters: BFCL/API-style and ToolBench-style utilities
  • Sandbox: Filesystem, SQLite, and git-like live subset
  • Analysis: Slice analysis and scientific comparison extraction
  • Reproduction: Saved logs and cached table regeneration

Quick start

Run the static research harness locally

The commands below are copied from the README and use the repository's existing scripts. They are written for Windows PowerShell (note Activate.ps1 and the backslash paths).

Install

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Reliability pipeline

python scripts\run_reliability_pipeline.py --config configs\experiment.reliability.heuristic.sample.json

Benchmark evaluation

python scripts\run_benchmark_eval.py
python scripts\run_routing_eval.py

Tests

python -m unittest discover -s tests -v

GitHub Pages

Static by design

This showcase lives in docs/ and can be served by GitHub Pages from the /docs folder of the main branch. It uses plain HTML, CSS, and JavaScript, with no backend and no build step.

docs/index.html docs/styles.css docs/script.js