Research system for MCP skill cold-start

ReliaSkill

Pre-deployment validation, repair, and gating for reliable MCP-style tool-use skills.

ReliaSkill converts raw MCP-like schemas and sparse documentation into compact skill artifacts with explicit use boundaries, schema-faithful examples, validation reports, repair traces, reliability scores, and deployability decisions before downstream LLM agents see them.


290 MCP-like tools

2,900 controls

5,800 routing examples


raw_mcp

{
  "tool": "write_file",
  "args": {
    "path": "string",
    "content": "string"
  }
}

Annotation: usage boundary missing.

Governed reliability layer: ToolIR++, Validate, Repair, Gate.

Skill artifact: purpose summary, when to use, when not to use, argument template, schema-faithful examples.

Output: an auditable score and a decision of DEPLOY, REPAIR, or REJECT.

Problem

Raw schemas are interfaces, not reliable agent skills

MCP schemas tell a model what arguments exist. They usually do not specify when the tool should fire, when it should abstain, or how an agent should interpret adjacent requests that look plausible but are out of scope.

Missing use boundary

Sparse docs leave trigger conditions implicit, so a model has to infer the policy from a thin interface.

Unsupported argument

Fluent generated skills may introduce fields that the underlying schema never accepted.

Over-triggering

Agents can call a tool for adjacent requests where explanation, search, or abstention is the safer behavior.

Invalid activation

Tools with side effects need explicit boundaries before deployment; a single plausible generated prompt is not grounds for trust.

Figure: why raw MCP schemas fail (missing use boundaries, sparse documentation, over-triggering, and invented arguments). Raw schemas list arguments, but not the conditions for safe and correct tool use.

Pipeline

A governed representation layer before tool exposure

ReliaSkill treats generated skills as candidates. Each candidate moves through normalization, generation, validation, behavior tests, targeted repair, and a final deployment gate.

ToolIR++ Normalization

Preserves the original schema and adds provenance, complexity, ambiguity, side-effect, and safety metadata.

normalized
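As a rough illustration of this stage, the sketch below wraps a raw schema in a record that keeps the original untouched and attaches a few metadata fields. The class name `ToolIR`, its field names, and the verb-prefix heuristic for side effects are assumptions made for this example, not the repository's actual data model.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical verb prefixes used to guess whether a tool mutates state.
SIDE_EFFECT_PREFIXES = ("write", "delete", "update", "create", "send")


@dataclass
class ToolIR:
    """Illustrative normalized record: raw schema plus reliability metadata."""
    original_schema: dict      # preserved verbatim
    provenance: str            # where the schema came from
    complexity: int            # here: number of declared arguments
    has_side_effects: bool     # heuristic guess, not ground truth
    ambiguity_notes: list = field(default_factory=list)


def normalize(raw: dict, provenance: str) -> ToolIR:
    """Wrap a raw schema without modifying it and attach simple metadata."""
    notes = []
    if not raw.get("description"):
        notes.append("no description: trigger conditions are implicit")
    return ToolIR(
        original_schema=copy.deepcopy(raw),
        provenance=provenance,
        complexity=len(raw.get("args", {})),
        has_side_effects=raw.get("tool", "").startswith(SIDE_EFFECT_PREFIXES),
        ambiguity_notes=notes,
    )


raw_schema = {"tool": "write_file", "args": {"path": "string", "content": "string"}}
ir = normalize(raw_schema, provenance="example")
```

Copying the schema rather than editing it in place means later validation can still compare generated content against the original interface.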

Compact Skill Generation

Creates purpose, use boundaries, non-use boundaries, argument templates, and examples.

candidate

Structural Validation

Checks unsupported arguments, required fields, enum values, examples, contradictions, and compactness.

inspected
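Three of the checks named above (unsupported arguments, required fields, and enum values) could look roughly like the following. This is a minimal sketch that assumes the tool's input schema uses JSON-Schema-style `properties`, `required`, and `enum` keys; the function name `validate_example` and the sample schema are hypothetical.

```python
def validate_example(schema: dict, example_args: dict) -> list:
    """Return a list of issues; an empty list means the example passed."""
    issues = []
    properties = schema.get("properties", {})
    # Unsupported arguments: fields the schema never declared.
    for name in example_args:
        if name not in properties:
            issues.append(f"unsupported argument: {name}")
    # Required fields that the example omits.
    for name in schema.get("required", []):
        if name not in example_args:
            issues.append(f"missing required field: {name}")
    # Enum values outside the declared set.
    for name, value in example_args.items():
        allowed = properties.get(name, {}).get("enum")
        if allowed is not None and value not in allowed:
            issues.append(f"invalid enum value for {name}: {value!r}")
    return issues


# Hypothetical schema for a write_file-like tool.
schema = {
    "properties": {
        "path": {"type": "string"},
        "content": {"type": "string"},
        "mode": {"type": "string", "enum": ["overwrite", "append"]},
    },
    "required": ["path", "content"],
}
bad_example = {"path": "a.txt", "mode": "truncate", "encoding": "utf-8"}
issues = validate_example(schema, bad_example)
```

Here `issues` reports the invented `encoding` field, the missing `content` field, and the out-of-range `mode` value, which is the kind of localized evidence a later repair step can act on.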

Behavior-Grounded Evaluation

Runs positive controls and adjacent negative controls to measure utility and over-triggering risk.

tested
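One way to picture this stage is a function that runs a predictor over both control sets and reports a pass rate and an over-trigger rate. The sketch below is illustrative only: `evaluate_controls`, the metric names, and the keyword-based `toy_predict` stand-in for an LLM are assumptions, not the project's harness.

```python
def evaluate_controls(predict, positive_controls, negative_controls):
    """Score a predictor on positive and adjacent negative controls.

    `predict` maps a request string to a tool-call dict, or None to abstain.
    """
    hits = sum(1 for request, gold in positive_controls if predict(request) == gold)
    fires = sum(1 for request in negative_controls if predict(request) is not None)
    return {
        "positive_pass_rate": hits / len(positive_controls),
        "over_trigger_rate": fires / len(negative_controls),
    }


def toy_predict(request: str):
    # Deliberately naive stand-in for a model: fires whenever "file" is mentioned.
    if "file" in request:
        return {"tool": "write_file", "args": {"path": "notes.txt", "content": "hi"}}
    return None


positives = [
    ("save 'hi' to the file notes.txt",
     {"tool": "write_file", "args": {"path": "notes.txt", "content": "hi"}}),
]
negatives = ["explain how file permissions work", "what is the weather today"]
report = evaluate_controls(toy_predict, positives, negatives)
```

The toy predictor passes its positive control but also fires on the explanation request, so half of the negative controls count as over-triggering.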

Targeted Repair

Patches localized failing sections instead of defaulting to full skill regeneration.

patched
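The idea of patching only the failing sections can be sketched as a small dispatcher that applies a section-specific fix and records what it did. The names `repair_sections` and `drop_unsupported_args`, and the shape of the trace entries, are hypothetical.

```python
def repair_sections(skill: dict, failing_sections: list, patchers: dict):
    """Patch only the sections that failed; leave everything else untouched."""
    repaired = dict(skill)
    trace = []
    for section in failing_sections:
        patch = patchers.get(section)
        if patch is None:
            trace.append({"section": section, "action": "no_patcher"})
            continue
        repaired[section] = patch(skill[section])
        trace.append({"section": section, "action": "patched"})
    return repaired, trace


def drop_unsupported_args(examples, allowed=("path", "content")):
    # Hypothetical patcher: strip arguments the schema does not declare.
    return [{k: v for k, v in ex.items() if k in allowed} for ex in examples]


skill = {
    "purpose": "Write text content to a file path.",
    "examples": [{"path": "a.txt", "content": "hi", "encoding": "utf-8"}],
}
repaired, trace = repair_sections(
    skill, ["examples"], {"examples": drop_unsupported_args}
)
```

Because the untouched sections are carried over as they were, the trace stays short and each change can be attributed to one validation failure.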

Deployment Gating

Outputs DEPLOY, REPAIR, or REJECT using explicit reliability evidence and repair traces.

gated
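A rule-based gate over this kind of evidence might look like the sketch below. The thresholds (0.8, 0.05), the repair budget, and the function signature are assumptions chosen for illustration; the page does not state the actual rules.

```python
def gate(validation_issues: int, positive_pass_rate: float,
         over_trigger_rate: float, repair_attempts: int,
         max_repairs: int = 2) -> str:
    """Return DEPLOY, REPAIR, or REJECT from explicit evidence.

    All thresholds here are illustrative assumptions.
    """
    clean = (
        validation_issues == 0
        and positive_pass_rate >= 0.8
        and over_trigger_rate <= 0.05
    )
    if clean:
        return "DEPLOY"
    # Not clean: allow another repair pass while the budget lasts.
    if repair_attempts < max_repairs:
        return "REPAIR"
    return "REJECT"


decisions = [
    gate(0, 0.9, 0.0, repair_attempts=0),
    gate(1, 0.9, 0.0, repair_attempts=0),
    gate(1, 0.9, 0.0, repair_attempts=2),
]
```

Keeping the rule explicit makes each decision reproducible from the validation and behavior reports alone.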

Figure: six-stage ReliaSkill pipeline (ToolIR++ normalization, compact skill generation, structural validation, behavior-grounded evaluation, targeted repair, and deployment gating). Generated skills are candidates, not trusted artifacts by default.

Artifact

The generated skill is an inspected package, not a trusted prompt

ReliaSkill packages a compact agent-facing representation with machine-checkable evidence about schema faithfulness, behavior controls, repair history, and deployability.

SKILL.md: candidate under review

Agent-facing content

Purpose summary
When-to-use guidance
When-not-to-use guidance
Canonical argument template
Schema-faithful examples

Reliability evidence

Validation report
Behavior report
Repair trace
Reliability score
Deployment decision

Possible decisions: DEPLOY, REPAIR, or REJECT
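The split between agent-facing content and reliability evidence could be modeled as a single record with two views. This is a sketch under assumptions: the class `SkillArtifact`, its field names, and the conservative default decision are invented for illustration and may not match the repository's artifact format.

```python
import json
from dataclasses import asdict, dataclass, field

AGENT_FIELDS = ("purpose", "when_to_use", "when_not_to_use",
                "argument_template", "examples")


@dataclass
class SkillArtifact:
    # Agent-facing content
    purpose: str
    when_to_use: list
    when_not_to_use: list
    argument_template: dict
    examples: list
    # Reliability evidence
    validation_report: dict = field(default_factory=dict)
    behavior_report: dict = field(default_factory=dict)
    repair_trace: list = field(default_factory=list)
    reliability_score: float = 0.0
    decision: str = "REPAIR"   # assumed default: never DEPLOY without evidence

    def agent_view(self) -> dict:
        """Only the fields a downstream agent would see."""
        return {name: getattr(self, name) for name in AGENT_FIELDS}

    def to_json(self) -> str:
        """Full package, including evidence, for auditing."""
        return json.dumps(asdict(self), indent=2)


artifact = SkillArtifact(
    purpose="Write text content to a file path.",
    when_to_use=["the user asks to save or overwrite a file"],
    when_not_to_use=["the user only asks how file writing works"],
    argument_template={"path": "<string>", "content": "<string>"},
    examples=[{"path": "notes.txt", "content": "hi"}],
    decision="DEPLOY",
)
```

`artifact.agent_view()` omits the reports, trace, score, and decision, while `artifact.to_json()` keeps them for inspection.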

Figure: anatomy of a ReliaSkill artifact (purpose summary, use boundaries, argument template, examples, validation report, behavior report, repair trace, reliability score, and deployment decision). A ReliaSkill artifact packages agent-facing guidance with inspection evidence.

Evaluation

Utility and risk are measured together

ReliaSkill evaluates whether a representation helps models produce the correct call, select the right hidden tool, and abstain on adjacent negative controls.

Structured-call prediction

Checks whether the predicted tool call matches the gold call and whether arguments are parseable and schema-faithful.
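A per-example scorer for this check might separate the three questions: is the output parseable, are its arguments schema-faithful, and does it match the gold call exactly. The sketch assumes predictions arrive as JSON text with `tool` and `args` keys; `score_call` and that call format are assumptions for illustration.

```python
import json


def score_call(predicted_text: str, gold_call: dict, allowed_args: set) -> dict:
    """Score one prediction for parseability, schema faithfulness, and exact match."""
    failed = {"parseable": False, "schema_faithful": False, "exact_match": False}
    try:
        predicted = json.loads(predicted_text)
    except json.JSONDecodeError:
        return failed
    # Treat anything that is not an object with an "args" object as unparseable.
    args = predicted.get("args") if isinstance(predicted, dict) else None
    if not isinstance(args, dict):
        return failed
    return {
        "parseable": True,
        "schema_faithful": set(args) <= allowed_args,
        "exact_match": predicted == gold_call,
    }


gold = {"tool": "write_file", "args": {"path": "a.txt", "content": "hi"}}
allowed = {"path", "content"}

exact = score_call(json.dumps(gold), gold, allowed)
invented = score_call(
    '{"tool": "write_file", "args": {"path": "a.txt", "content": "hi", "mode": "w"}}',
    gold, allowed,
)
garbled = score_call("write_file(a.txt)", gold, allowed)
```

Reporting the three flags separately distinguishes a call that is wrong because it invented an argument from one that could not be parsed at all.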

Positive controls

Exercise intended use cases with gold tools and gold arguments across difficulty tiers.

Adjacent negative controls

Test abstention on near-miss, explanation-versus-action, read-versus-write, and missing-information cases.

Hidden-tool routing

Measures tool selection and joint route-plus-argument correctness among candidate tools with hard distractors.
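The difference between selection accuracy and joint correctness can be shown with a small scorer. In this sketch a prediction counts toward the joint metric only if both the tool and the arguments match; the function name and the dict format of calls are assumptions, and the distractor tool `append_file` is made up.

```python
def routing_metrics(predictions: list, golds: list) -> dict:
    """Selection accuracy and joint (tool + arguments) exact match, as fractions."""
    n = len(golds)
    selected = sum(p["tool"] == g["tool"] for p, g in zip(predictions, golds))
    joint = sum(
        p["tool"] == g["tool"] and p["args"] == g["args"]
        for p, g in zip(predictions, golds)
    )
    return {"selection_accuracy": selected / n, "joint_exact": joint / n}


gold_call = {"tool": "write_file", "args": {"path": "a.txt", "content": "hi"}}
golds = [gold_call] * 4
predictions = [
    gold_call,                                                         # fully correct
    gold_call,                                                         # fully correct
    {"tool": "write_file", "args": {"path": "b.txt", "content": "hi"}},  # right tool, wrong args
    {"tool": "append_file", "args": {"path": "a.txt", "content": "hi"}}, # hard distractor chosen
]
metrics = routing_metrics(predictions, golds)
```

Joint exact match can never exceed selection accuracy, since choosing the wrong tool fails both.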

Utility: Exact call match

Does the representation help the model assemble the right call?

Risk: Negative-control abstention

Does the representation avoid activating on adjacent out-of-scope requests?

Figure: evaluation protocol covering structured-call prediction, hidden-tool routing, and positive plus adjacent negative controls. The protocol treats tool-use utility and abstention reliability as coupled requirements.

Results

Reported results across seven predictors

The evaluation compares five tool-facing representations while holding the downstream predictor fixed within each comparison. Boundary-first is the primary ReliaSkill rendering, with verbose docs reported as a close secondary variant.

Structured-call Exact Match (%)

Condition	Llama3.2-1B	Qwen2.5-1.5B	Gemma2-2B	Phi-3.5-mini	Qwen2.5-7B	Llama3.1-8B	Gemma2-9B	Mean
`raw_mcp`	33.42	38.58	43.80	33.97	53.02	39.93	48.88	41.66
`generated_skill_base`	37.76	34.31	43.93	42.44	63.32	56.81	56.75	47.90
`curated_schema_reference`	38.71	37.29	30.37	32.61	53.83	37.69	46.78	39.61
`skill_prompt_boundary_first`	52.81	61.63	52.07	63.73	70.37	58.37	63.39	60.34
`skill_prompt_verbose_docs`	52.20	56.14	52.54	62.17	67.86	57.90	60.07	58.41

Boundary-first reaches a 60.34% seven-model mean, a 44.8% relative improvement over raw MCP exposure.
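The mean and the relative improvement can be rechecked from the per-model rows of the table above. The short calculation below uses only those reported numbers; the variable names are arbitrary.

```python
# Per-model Structured-call Exact Match (%) copied from the table rows.
raw_mcp = [33.42, 38.58, 43.80, 33.97, 53.02, 39.93, 48.88]
boundary_first = [52.81, 61.63, 52.07, 63.73, 70.37, 58.37, 63.39]

raw_mean = round(sum(raw_mcp) / len(raw_mcp), 2)
bf_mean = round(sum(boundary_first) / len(boundary_first), 2)

# Absolute gain is in percentage points; relative gain is a percentage of the raw mean.
absolute_gain = round(bf_mean - raw_mean, 2)
relative_gain = round((bf_mean - raw_mean) / raw_mean * 100, 1)

print(raw_mean, bf_mean, absolute_gain, relative_gain)
```

The 44.8% figure is a relative improvement; the corresponding absolute gain is 18.68 percentage points.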

Hidden-tool Routing Joint Exact (%)

Condition	Llama3.2-1B	Qwen2.5-1.5B	Gemma2-2B	Phi-3.5-mini	Qwen2.5-7B	Llama3.1-8B	Gemma2-9B	Mean
`raw_mcp`	22.24	14.44	21.02	25.22	34.31	27.86	34.03	25.59
`generated_skill_base`	26.24	17.02	26.98	31.32	42.24	39.39	38.71	31.70
`curated_schema_reference`	24.00	13.69	15.19	23.73	32.20	26.64	33.69	24.16
`skill_prompt_boundary_first`	37.76	31.53	31.80	45.29	45.42	42.03	44.07	39.70
`skill_prompt_verbose_docs`	35.19	33.02	34.03	42.71	44.75	40.07	41.29	38.72

Boundary-first reaches a 39.70% routing Joint Exact mean, while verbose docs is close at 38.72% and wins for Qwen2.5-1.5B and Gemma2-2B.

Qwen2.5-7B component ablation

System	Joint EM	Argument Validity	Selection Accuracy
Full ReliaSkill	21.12%	52.78%	31.39%
w/o Repair	20.41%	53.05%	27.39%
w/o Validation	18.85%	50.07%	27.36%
w/o Examples	15.73%	41.22%	25.80%

In this ablation, Full ReliaSkill raises Joint Exact Match from 17.15% under raw MCP to 21.12%, and Argument Validity from 43.66% to 52.78%.

Supported takeaways

skill_prompt_boundary_first is the primary ReliaSkill variant in the paper and has the best seven-model mean on both main metrics.

Representation matters

Raw MCP exposure trails every generated-skill condition on average (only the curated schema reference scores lower), and generated skills improve results before further rendering choices are applied.

Rendering matters

Boundary-first and verbose-doc variants share underlying skill content; their differences isolate prompt rendering policy.
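To make "same content, different rendering" concrete, the sketch below renders one skill under two layouts that differ only in section order. Everything here is an assumption for illustration: the page does not say how the boundary-first and verbose-doc prompts are actually laid out, and `purpose_first` is a made-up contrast layout rather than the verbose-doc variant.

```python
SECTION_ORDER = {
    # Hypothetical layouts: identical sections, different ordering.
    "boundary_first": ("when_to_use", "when_not_to_use", "purpose", "argument_template"),
    "purpose_first": ("purpose", "argument_template", "when_to_use", "when_not_to_use"),
}

LABELS = {
    "purpose": "Purpose",
    "when_to_use": "Use when",
    "when_not_to_use": "Do not use when",
    "argument_template": "Arguments",
}


def render(skill: dict, policy: str) -> str:
    """Render one skill under a named layout policy without changing its content."""
    lines = [f"{LABELS[key]}: {skill[key]}" for key in SECTION_ORDER[policy]]
    return "\n".join(lines)


skill = {
    "purpose": "Write text content to a file path.",
    "when_to_use": "the user asks to save or overwrite a file",
    "when_not_to_use": "the user only asks how file writing works",
    "argument_template": '{"path": "<string>", "content": "<string>"}',
}
boundary_prompt = render(skill, "boundary_first")
contrast_prompt = render(skill, "purpose_first")
```

Because both prompts contain exactly the same lines, any difference in downstream behavior between them would be attributable to layout alone.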

Safety framing

The absence of observed harmful activations on held-out negative controls is a benchmark result, not a deployment guarantee.

Figure: reported result highlights comparing raw MCP, generated skills, boundary-first, and verbose-doc skill prompts, covering seven-model means, a Qwen2.5-7B spotlight, and the reliability-component ablation.

Implementation

Research code for reliable tool representation experiments

The repository includes parsing, generation, validation, controls, repair, gating, routing, conversion, live sandbox, and analysis components.

Parsing: MCP/tool schema parsing and normalization

ToolIR++: Reliability metadata and schema-complexity features

Generation: Prompt-template and compactness variants

Validation: Structural artifact checks

Controls: Positive and adjacent negative controls

Repair: Targeted patching and regeneration baselines

Gating: Rule-based reliability scoring and decisions

Routing: Hidden-tool candidate evaluation

Converters: BFCL/API-style and ToolBench-style utilities

Sandbox: Filesystem, SQLite, and git-like live subset

Analysis: Slice analysis and scientific comparison extraction

Reproduction: Saved logs and cached table regeneration

Quick start

Run the static research harness locally

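The commands below are copied from the README and use the repository's existing scripts. They are written for Windows PowerShell; on macOS or Linux, activate the environment with source .venv/bin/activate and use forward slashes in the script paths.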

Install

python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Reliability pipeline

python scripts\run_reliability_pipeline.py --config configs\experiment.reliability.heuristic.sample.json

Benchmark evaluation

python scripts\run_benchmark_eval.py
python scripts\run_routing_eval.py

Tests

python -m unittest discover -s tests -v

GitHub Pages

Static by design

This showcase lives in docs/ and can be served by GitHub Pages from the /docs folder of the main branch. It uses plain HTML, CSS, and JavaScript, with no backend and no build step.

docs/index.html docs/styles.css docs/script.js
