Add Semantic Routing or Mixture-of-Models skill to Emerging Techniques · Orchestra-Research/AI-Research-SKILLs#23


enhancement, help wanted

Repository metrics

Stars: (8,430 stars)
PR merge metrics: (average time to merge: 42d 1h) (6 PRs merged in the last 30 days)

Issue Description

Overview

Add a new skill for Semantic Routing or Mixture-of-Models (vLLM Semantic Router) to the 19-emerging-techniques category. Semantic Routing provides system-level intelligence for Mixture-of-Models (MoM) through a signal-driven decision engine and a plugin chain architecture that together handle intelligent LLM routing, security, and optimization.

What is Semantic Routing?

Semantic Routing is an intelligent routing layer that uses signal-driven decisions and plugin chains to:

Route queries intelligently across multiple specialized models (math → Qwen-Math, code → DeepSeek-Coder)
Optimize costs by using smaller models for simple tasks, larger models for complex ones
Secure LLM systems with built-in jailbreak, PII, and hallucination detection
Reduce latency through semantic caching (10-100× speedup)
Enable model collaboration through Mixture-of-Models (MoM) architecture
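
To make the first bullet concrete, here is a minimal sketch of domain-to-model routing. It is not the vLLM Semantic Router API: the model names come from the list above, while the keyword lists, `DEFAULT_MODEL`, `classify`, and `route` are hypothetical names chosen for illustration. A real router would use trained classifiers rather than substring matching.

```python
from typing import Optional

# Domain -> model mapping, using the model names mentioned in this issue.
ROUTES = {
    "math": "Qwen-Math",
    "code": "DeepSeek-Coder",
}

# Placeholder fallback; not a real model name.
DEFAULT_MODEL = "general-model"

# Toy keyword lists (assumption): stand-ins for a trained domain classifier.
KEYWORDS = {
    "math": ("integral", "derivative", "equation"),
    "code": ("python", "function", "compile", "bug"),
}


def classify(query: str) -> Optional[str]:
    """Return the first domain whose keywords appear in the query, else None."""
    text = query.lower()
    for domain, words in KEYWORDS.items():
        if any(word in text for word in words):
            return domain
    return None


def route(query: str) -> str:
    """Map a query to a model name, falling back to the default model."""
    return ROUTES.get(classify(query), DEFAULT_MODEL)
```

Calling `route("Solve this equation for x")` selects the math model, and a query that matches no keyword list falls through to the default.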

Key Features

Signal-Driven Decision Engine:

10 signal types: keyword, embedding, domain/MMLU, fact_check, user_feedback, preference, language, latency (TPOT/TTFT), context, complexity
Flexible combination: AND/OR operators for complex routing logic
Multi-signal fusion: Combine signals for higher accuracy than single classifiers
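
The AND/OR combination described above can be sketched as composable predicates over a set of already-evaluated signals. This is an illustration only: the signal keys such as `domain:math` and the helpers `signal`, `all_of`, and `any_of` are assumed names, not the project's configuration syntax.

```python
from typing import Callable, Dict

# A snapshot of evaluated signals for one request: name -> fired or not.
Signals = Dict[str, bool]
Rule = Callable[[Signals], bool]


def signal(name: str) -> Rule:
    """Rule that is true when the named signal fired (missing means False)."""
    return lambda signals: signals.get(name, False)


def all_of(*rules: Rule) -> Rule:
    """AND operator: every sub-rule must hold."""
    return lambda signals: all(rule(signals) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """OR operator: at least one sub-rule must hold."""
    return lambda signals: any(rule(signals) for rule in rules)


# Hypothetical rule: math domain AND (a "prove" keyword OR high complexity).
use_reasoning_model = all_of(
    signal("domain:math"),
    any_of(signal("keyword:prove"), signal("complexity:high")),
)
```

Because each combinator returns another rule, nested conditions stay readable and individual signals can be tested in isolation.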

Plugin Chain Architecture:

semantic-cache - 10-100× latency reduction for similar queries
jailbreak - Adversarial prompt detection and blocking
pii - Personally identifiable information detection
system_prompt - Dynamic system prompt injection per route
header_mutation - HTTP header manipulation for routing control
hallucination - Token-level hallucination detection during generation
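
As a rough illustration of the plugin chain idea, the sketch below passes a request through an ordered list of plugins and stops at the first one that blocks it. Everything here is an assumption made for the example: the `Request` fields, the phrase-matching jailbreak check, the `x-selected-route` header name, and the short-circuit behavior. The real plugins rely on trained detectors and may order or combine steps differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    prompt: str
    headers: Dict[str, str] = field(default_factory=dict)
    system_prompt: str = ""
    blocked_reason: Optional[str] = None


Plugin = Callable[[Request], Request]


def jailbreak_plugin(req: Request) -> Request:
    """Toy check (assumption): block on one well-known injection phrase."""
    if "ignore previous instructions" in req.prompt.lower():
        req.blocked_reason = "jailbreak"
    return req


def system_prompt_plugin(req: Request) -> Request:
    """Inject a per-route system prompt (fixed string for this sketch)."""
    req.system_prompt = "You are a careful assistant."
    return req


def header_mutation_plugin(req: Request) -> Request:
    """Record the routing decision in a hypothetical header."""
    req.headers["x-selected-route"] = "default"
    return req


def run_chain(req: Request, plugins: List[Plugin]) -> Request:
    """Run plugins in order; stop as soon as one blocks the request."""
    for plugin in plugins:
        req = plugin(req)
        if req.blocked_reason is not None:
            break
    return req


CHAIN: List[Plugin] = [jailbreak_plugin, system_prompt_plugin, header_mutation_plugin]
```

Placing the security check first means a blocked request never reaches the later plugins, which is one plausible reason the execution order matters.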

Model Training: https://huggingface.co/llm-semantic-router

Why This Belongs in Emerging Techniques

Novel approach: System-level intelligence for MoM (vs. model-level MoE)
Production-ready: Used in real-world vLLM deployments
Research-backed: NeurIPS 2025 MLForSys paper, ICLR 2026 RouterArena #1 ranking
Cost-effective: 80-90% cost reduction vs. always using largest model
Active development: Regular releases, bi-weekly community meetings, AMD partnership

Proposed Skill Structure

19-emerging-techniques/semantic-routing/
├── SKILL.md                    # 200-500 lines main guidance
├── references/
│   ├── README.md              # Architecture overview
│   ├── signals.md             # 10 signal types deep dive
│   ├── plugins.md             # Plugin chain architecture
│   ├── training.md            # ModernBERT + LoRA training guide
│   ├── deployment.md          # Docker/Kubernetes deployment
│   ├── api.md                 # API reference
│   └── issues.md              # Common issues and solutions
└── examples/
    ├── basic-routing.yaml     # Simple keyword routing
    ├── multi-signal.yaml      # Complex signal combination
    └── production-stack.yaml  # Full production setup

Content Outline

SKILL.md (200-500 lines):

When to Use
- Multi-model collaboration scenarios
- Cost optimization needs
- Security requirements (jailbreak/PII/hallucination)
- Semantic caching for latency reduction
Quick Start
```bash
pip install vllm-sr
vllm-sr serve
```
Core Concepts
- Mixture of Models (MoM) vs. Mixture of Experts (MoE)
- Signal-Driven Decisions (10 signal types overview)
- Plugin Chain Architecture (6 plugins overview)
Two Complete Workflows with Checklists
- Workflow 1: Basic Multi-Model Routing
  - Define signals (keyword + domain)
  - Configure decision rules (AND/OR)
  - Set model mappings
  - Test routing decisions
  - Validate routing accuracy
- Workflow 2: Production Deployment
  - Configure security plugins (jailbreak + PII)
  - Enable semantic cache
  - Set up monitoring metrics
  - Configure multiple backend models
  - Load testing
  - Deploy to Kubernetes
When to Use vs Alternatives
- vs. LiteLLM (simple routing only)
- vs. LangChain Router (slow LLM-based routing)
- vs. Hand-written if-else (hard to maintain)
Common Issues
- Signal conflicts resolution
- Inaccurate routing decisions
- High latency troubleshooting
- Low cache hit rate optimization
- Model loading failures

references/ (300KB+ target):

signals.md: Detailed documentation of all 10 signal types with configuration examples, latency comparison, use cases, and combination strategies
plugins.md: Deep dive into 6 plugins, plugin development guide, execution order
training.md: Why ModernBERT, 4 classifier models, LoRA training methodology, datasets, performance metrics
deployment.md: Docker Compose, Kubernetes + Helm, production configuration, performance tuning, observability
api.md: OpenAI-compatible API, routing API, classification API, configuration API
issues.md: Real GitHub issues, common errors and solutions, debugging methods

examples/:

basic-routing.yaml: Simple keyword-based routing
multi-signal.yaml: Multi-signal combination (keyword + domain + embedding)
production-stack.yaml: Full production config with plugins, monitoring, multiple models

Key Highlights to Emphasize

Why Use Semantic Router?

Cost optimization: Use Llama-3-8B for simple queries, GPT-4 for complex ones
Quality improvement: Route math to Qwen-Math, code to DeepSeek-Coder
Security built-in: Jailbreak, PII, hallucination detection
Performance boost: 10-100× latency reduction via semantic cache
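
The semantic cache behind the latency claim can be sketched as a nearest-neighbor lookup over query embeddings. This is a simplified illustration under stated assumptions: embeddings are supplied by the caller as plain float lists, the lookup is a linear scan, and the 0.9 similarity threshold and the `SemanticCache` class name are arbitrary choices for the example, not values taken from the project.

```python
import math
from typing import List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when a new query is close enough to an old one."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: List[float], answer: str) -> None:
        self.entries.append((list(embedding), answer))

    def get(self, embedding: List[float]) -> Optional[str]:
        # Linear scan for the most similar cached embedding.
        best_score, best_answer = 0.0, None
        for cached, answer in self.entries:
            score = cosine(embedding, cached)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

A hit skips model inference entirely, which is where the speedup comes from. The threshold trades hit rate against the risk of returning an answer to a question that only looks similar.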

Core Advantages:

Multi-signal fusion: 10 signals combined > single classifier
Low latency: keyword 1ms, embedding 10-50ms, domain 50-100ms
Extensible: Plugin architecture for custom signals and processing
Production-ready: Kubernetes-native, Prometheus metrics, OpenTelemetry tracing

Resources

GitHub: https://github.com/vllm-project/semantic-router (513 source files)
Documentation: https://vllm-semantic-router.com (24,000+ lines)
Paper: When to Reason: Semantic Router for vLLM (NeurIPS 2025)
Blog: https://blog.vllm.ai/2025/09/11/semantic-router.html
Community: vLLM Slack #semantic-router channel

Acceptance Criteria

SKILL.md with proper YAML frontmatter (name: semantic-routing)
200-500 lines of focused guidance in SKILL.md
300KB+ reference documentation from official sources
At least 2 complete workflows with checklists
Code examples with language tags (```yaml, ```bash, ```python)
"When to use vs alternatives" section
Common issues and solutions section
References one level deep from SKILL.md (no nested references)
Examples directory with 3 runnable configuration files
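
Several of these criteria are mechanical enough to check with a script before review. The sketch below is a hypothetical helper, not part of the repository's tooling: `check_skill` and `frontmatter` are made-up names, the frontmatter is parsed with simple line matching rather than a YAML library, and only the line count, the `name` field, and the three example files are verified.

```python
from pathlib import Path
from typing import List

# Example files named in the proposed skill structure.
REQUIRED_EXAMPLES = (
    "basic-routing.yaml",
    "multi-signal.yaml",
    "production-stack.yaml",
)


def frontmatter(lines: List[str]) -> List[str]:
    """Return the lines between the opening and closing '---', or []."""
    if not lines or lines[0].strip() != "---":
        return []
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return lines[1:i]
    return []


def check_skill(root: Path) -> List[str]:
    """Return a list of problems found; an empty list means all checks pass."""
    skill = root / "SKILL.md"
    if not skill.is_file():
        return ["SKILL.md is missing"]

    problems: List[str] = []
    lines = skill.read_text(encoding="utf-8").splitlines()

    if not 200 <= len(lines) <= 500:
        problems.append(f"SKILL.md has {len(lines)} lines, expected 200-500")

    names = [l.strip() for l in frontmatter(lines) if l.strip().startswith("name:")]
    if names != ["name: semantic-routing"]:
        problems.append("frontmatter must contain 'name: semantic-routing'")

    for filename in REQUIRED_EXAMPLES:
        if not (root / "examples" / filename).is_file():
            problems.append(f"examples/{filename} is missing")

    return problems
```

Running `check_skill(Path("19-emerging-techniques/semantic-routing"))` from the repository root would list any unmet checks. The remaining criteria, such as workflow completeness, still need a human reviewer.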

Related Skills

12-inference-serving/vllm - vLLM inference engine (backend for semantic router)
14-agents/langchain - Agent frameworks that can benefit from intelligent routing
15-rag - RAG systems that benefit from semantic caching and routing
16-prompt-engineering/dspy - Prompt optimization with routing decisions

Labels: enhancement, new-skill, emerging-techniques, documentation

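Contributor Guide

Approach: Study the vLLM Semantic Router repository and documentation. Understand the concepts of signal-driven routing and the plugin architecture. Create SKILL.md and the reference files following the structure outlined in the issue. Meet all acceptance criteria.
Tech stack: Python
Area: AI, documentation
Issue type: Documentation
Difficulty: 2
Estimated time: 1-2 days
Activity: Active
Clarity: Clear
Prerequisites: Python, Git
Beginner friendliness: 60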