Orchestra-Research/AI-Research-SKILLs

Add Semantic Routing or Mixture-of-Models skill to Emerging Techniques

Open

#23 opened on February 9, 2026

enhancement, help wanted


Issue Description

Overview

Add a new skill for Semantic Routing or Mixture-of-Models (vLLM Semantic Router) to the 19-emerging-techniques category. Semantic Routing provides system-level intelligence for Mixture-of-Models (MoM) through a signal-driven decision engine and a plugin chain architecture, covering intelligent LLM routing, security, and optimization.

What is Semantic Routing?

Semantic Routing is an intelligent routing layer that uses signal-driven decisions and plugin chains to:

  1. Route queries intelligently across multiple specialized models (math → Qwen-Math, code → DeepSeek-Coder)
  2. Optimize costs by using smaller models for simple tasks, larger models for complex ones
  3. Secure LLM systems with built-in jailbreak, PII, and hallucination detection
  4. Reduce latency through semantic caching (10-100× speedup)
  5. Enable model collaboration through Mixture-of-Models (MoM) architecture

Key Features

Signal-Driven Decision Engine:

  • 10 signal types: keyword, embedding, domain/MMLU, fact_check, user_feedback, preference, language, latency (TPOT/TTFT), context, complexity
  • Flexible combination: AND/OR operators for complex routing logic
  • Multi-signal fusion: Combine signals for higher accuracy than single classifiers
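
To make the AND/OR combination concrete, here is a minimal Python sketch. It is not the vLLM Semantic Router implementation or its configuration schema: the `Signal` alias, the `keyword`, `all_of`, and `any_of` helpers, the first-match-wins policy, and the model names are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a signal is any predicate over the query text.
Signal = Callable[[str], bool]


def keyword(*words: str) -> Signal:
    """Signal that fires when any of the given words occurs in the query."""
    lowered = [w.lower() for w in words]
    return lambda query: any(w in query.lower() for w in lowered)


def all_of(*signals: Signal) -> Signal:
    """AND: fires only when every child signal fires."""
    return lambda query: all(s(query) for s in signals)


def any_of(*signals: Signal) -> Signal:
    """OR: fires when at least one child signal fires."""
    return lambda query: any(s(query) for s in signals)


# Rules are checked in order and the first match wins (an assumed policy).
# Model names are placeholders, not real deployment targets.
RULES: List[Tuple[Signal, str]] = [
    (any_of(keyword("integral", "derivative"), keyword("theorem")), "math-model"),
    (all_of(keyword("python", "rust"), keyword("bug", "error")), "code-model"),
]
DEFAULT_MODEL = "general-model"


def route(query: str) -> str:
    """Return the model name of the first rule whose signal fires."""
    for signal, model in RULES:
        if signal(query):
            return model
    return DEFAULT_MODEL


print(route("What is the derivative of x^2?"))  # math-model
print(route("Fix this Python error"))           # code-model
print(route("Tell me about Python"))            # general-model
```

The third query shows why AND matters: the language keyword alone is not enough to send a general question to the code model.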

Plugin Chain Architecture:

  • semantic-cache - 10-100× latency reduction for similar queries
  • jailbreak - Adversarial prompt detection and blocking
  • pii - Personally identifiable information detection
  • system_prompt - Dynamic system prompt injection per route
  • header_mutation - HTTP header manipulation for routing control
  • hallucination - Token-level hallucination detection during generation

Model Training: https://huggingface.co/llm-semantic-router

Why This Belongs in Emerging Techniques

  1. Novel approach: System-level intelligence for MoM (vs. model-level MoE)
  2. Production-ready: Used in real-world vLLM deployments
  3. Research-backed: NeurIPS 2025 MLForSys paper, ICLR 2026 RouterArena #1 ranking
  4. Cost-effective: 80-90% cost reduction vs. always using largest model
  5. Active development: Regular releases, bi-weekly community meetings, AMD partnership
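
The size of the cost reduction in item 4 depends on how much traffic the small model absorbs and on the price gap between models. The sketch below works through that arithmetic under simplifying assumptions: every request uses the same number of tokens, and the 90% traffic share and 1:20 price ratio are made-up inputs, not measurements from the project.

```python
def blended_cost_reduction(small_share: float, price_ratio: float) -> float:
    """Fractional saving versus sending all traffic to the large model.

    small_share: fraction of requests routed to the small model (0 to 1).
    price_ratio: small-model price divided by large-model price.
    Assumes equal token counts per request (a simplification).
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    # Cost per request, normalized so the large model costs 1.0.
    blended = small_share * price_ratio + (1.0 - small_share) * 1.0
    return 1.0 - blended


# Hypothetical inputs: 90% of traffic on a model priced at 1/20 of the large one.
print(round(blended_cost_reduction(0.9, 0.05), 3))  # 0.855
```

With these inputs the saving is about 85%. With only half the traffic on the small model it would be under 50%, so the headline figure should be read together with the traffic mix.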

Proposed Skill Structure

19-emerging-techniques/semantic-routing/
├── SKILL.md                    # 200-500 lines main guidance
├── references/
│   ├── README.md              # Architecture overview
│   ├── signals.md             # 10 signal types deep dive
│   ├── plugins.md             # Plugin chain architecture
│   ├── training.md            # ModernBERT + LoRA training guide
│   ├── deployment.md          # Docker/Kubernetes deployment
│   ├── api.md                 # API reference
│   └── issues.md              # Common issues and solutions
└── examples/
    ├── basic-routing.yaml     # Simple keyword routing
    ├── multi-signal.yaml      # Complex signal combination
    └── production-stack.yaml  # Full production setup

Content Outline

SKILL.md (200-500 lines):

  1. When to Use

    • Multi-model collaboration scenarios
    • Cost optimization needs
    • Security requirements (jailbreak/PII/hallucination)
    • Semantic caching for latency reduction
  2. Quick Start

    pip install vllm-sr
    vllm-sr serve
    
  3. Core Concepts

    • Mixture of Models (MoM) vs. Mixture of Experts (MoE)
    • Signal-Driven Decisions (10 signal types overview)
    • Plugin Chain Architecture (6 plugins overview)
  4. Two Complete Workflows with Checklists

    • Workflow 1: Basic Multi-Model Routing

      • Define signals (keyword + domain)
      • Configure decision rules (AND/OR)
      • Set model mappings
      • Test routing decisions
      • Validate routing accuracy
    • Workflow 2: Production Deployment

      • Configure security plugins (jailbreak + PII)
      • Enable semantic cache
      • Set up monitoring metrics
      • Configure multiple backend models
      • Load testing
      • Deploy to Kubernetes
  5. When to Use vs Alternatives

    • vs. LiteLLM (simple routing only)
    • vs. LangChain Router (slow LLM-based routing)
    • vs. Hand-written if-else (hard to maintain)
  6. Common Issues

    • Signal conflicts resolution
    • Inaccurate routing decisions
    • High latency troubleshooting
    • Low cache hit rate optimization
    • Model loading failures
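
The "Test routing decisions" and "Validate routing accuracy" steps in Workflow 1 could be sketched as a small harness that runs a router over labeled queries and reports accuracy plus the most common misroutes. The `routing_accuracy` function, the `toy_router` stand-in, and the labeled examples are hypothetical. In practice the router would be a call into the deployed service.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def routing_accuracy(
    router: Callable[[str], str],
    labeled: Iterable[Tuple[str, str]],
) -> Tuple[float, Counter]:
    """Return (accuracy, misroutes), where misroutes counts (expected, actual) pairs."""
    misroutes: Counter = Counter()
    correct = 0
    total = 0
    for query, expected in labeled:
        total += 1
        actual = router(query)
        if actual == expected:
            correct += 1
        else:
            misroutes[(expected, actual)] += 1
    if total == 0:
        raise ValueError("no labeled examples provided")
    return correct / total, misroutes


def toy_router(query: str) -> str:
    # Placeholder rule: anything containing a digit goes to the math model.
    return "math-model" if any(ch.isdigit() for ch in query) else "general-model"


EXAMPLES = [
    ("What is 2 + 2?", "math-model"),
    ("Write a haiku about rain", "general-model"),
    ("Prove the theorem", "math-model"),  # no digits, so the toy router misses it
]

accuracy, misroutes = routing_accuracy(toy_router, EXAMPLES)
print(misroutes.most_common(1))  # [(('math-model', 'general-model'), 1)]
```

Tracking (expected, actual) pairs rather than a single accuracy number points directly at which signal or rule needs adjusting.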

references/ (300KB+ target):

  • signals.md: Detailed documentation of all 10 signal types with configuration examples, latency comparison, use cases, and combination strategies
  • plugins.md: Deep dive into 6 plugins, plugin development guide, execution order
  • training.md: Why ModernBERT, 4 classifier models, LoRA training methodology, datasets, performance metrics
  • deployment.md: Docker Compose, Kubernetes + Helm, production configuration, performance tuning, observability
  • api.md: OpenAI-compatible API, routing API, classification API, configuration API
  • issues.md: Real GitHub issues, common errors and solutions, debugging methods

examples/:

  • basic-routing.yaml: Simple keyword-based routing
  • multi-signal.yaml: Multi-signal combination (keyword + domain + embedding)
  • production-stack.yaml: Full production config with plugins, monitoring, multiple models

Key Highlights to Emphasize

Why Use Semantic Router?

  • Cost optimization: Use Llama-3-8B for simple queries, GPT-4 for complex ones
  • Quality improvement: Route math to Qwen-Math, code to DeepSeek-Coder
  • Security built-in: Jailbreak, PII, hallucination detection
  • Performance boost: 10-100× latency reduction via semantic cache
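
The semantic cache behind the latency claim can be sketched as a lookup that returns a stored response when a new query is similar enough to an earlier one. This is a toy illustration under stated assumptions: the `SemanticCache` class and its `get`/`put` methods are hypothetical, the bag-of-words `embed` function stands in for a real embedding model, and the 0.8 threshold is arbitrary.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real cache would use a sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by embedding similarity."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the best-matching cached response, or None below the threshold."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.8)
cache.put("what is the capital of France", "Paris")
print(cache.get("What is the capital of france"))   # Paris
print(cache.get("how do I sort a list in Python"))  # None
```

On a hit the backend model call is skipped entirely, which is where the latency saving would come from. The threshold trades hit rate against the risk of returning an answer to a different question.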

Core Advantages:

  1. Multi-signal fusion: 10 signals combined > single classifier
  2. Low latency: keyword 1 ms, embedding 10-50 ms, domain 50-100 ms
  3. Extensible: Plugin architecture for custom signals and processing
  4. Production-ready: Kubernetes-native, Prometheus metrics, OpenTelemetry tracing

Resources

Acceptance Criteria

  • SKILL.md with proper YAML frontmatter (name: semantic-routing)
  • 200-500 lines of focused guidance in SKILL.md
  • 300KB+ reference documentation from official sources
  • At least 2 complete workflows with checklists
  • Code examples with language tags (yaml, bash, python)
  • "When to use vs alternatives" section
  • Common issues and solutions section
  • References one level deep from SKILL.md (no nested references)
  • Examples directory with 3 runnable configuration files

Related Skills

  • 12-inference-serving/vllm - vLLM inference engine (backend for semantic router)
  • 14-agents/langchain - Agent frameworks that can benefit from intelligent routing
  • 15-rag - RAG systems that benefit from semantic caching and routing
  • 16-prompt-engineering/dspy - Prompt optimization with routing decisions

Labels: enhancement, new-skill, emerging-techniques, documentation
