AI Summary
A skill for adapting and optimizing Hugging Face or custom LLM models to run efficiently on vLLM with Ascend NPU support, enabling developers to validate and deploy models with deterministic testing and single-commit delivery.
Install
Copy this and paste it into Claude Code, Cursor, or any AI assistant:
I want to install the "vllm-ascend-model-adapter" skill in my project. Please run this command in my terminal:

```shell
# Install skill into your project (6 files)
SKILL_DIR=".claude/skills/vllm-ascend-model-adapter"
RAW="https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/.agents/skills/vllm-ascend-model-adapter"
mkdir -p "$SKILL_DIR/references" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/SKILL.md" "$RAW/SKILL.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/deliverables.md" "$RAW/references/deliverables.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/fp8-on-npu-lessons.md" "$RAW/references/fp8-on-npu-lessons.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/multimodal-ep-aclgraph-lessons.md" "$RAW/references/multimodal-ep-aclgraph-lessons.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/troubleshooting.md" "$RAW/references/troubleshooting.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/workflow-checklist.md" "$RAW/references/workflow-checklist.md"
```

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.
Description
Adapt and debug existing or new models for vLLM on Ascend NPU. Implement in /vllm-workspace/vllm and /vllm-workspace/vllm-ascend, validate via direct vllm serve from /workspace, and deliver one signed commit in the current repo.
Overview
Adapt Hugging Face or local models to run on vllm-ascend with minimal changes, deterministic validation, and single-commit delivery. This skill is for both already-supported models and new architectures not yet registered in vLLM.
6) Validate inference and features
• Send GET /v1/models first.
• Send at least one OpenAI-compatible text request.
• For multimodal models, require at least one text+image request.
• Validate architecture registration and the loader path with logs (no unresolved architecture, no fatal missing-key errors).
• Feature-first validation: try the EP + ACLGraph path first; use the eager path as fallback/isolation.
• If startup succeeds but the first request crashes (false-ready), treat it as a runtime failure and continue root-cause isolation.
• For torch._dynamo + interpolate + NPU contiguous failures on VL paths, try TORCHDYNAMO_DISABLE=1 as a diagnostic/stability fallback.
• For multimodal processor API mismatches (for example, a skip_tensor_conversion signature mismatch), use text-only isolation (--limit-mm-per-prompt with image/video/audio set to 0) to separate processor issues from core weight-loading issues.
• Capacity baseline by default (single machine): max-model-len=128k + max-num-seqs=16.
• Then expand concurrency (e.g., 32/64) if requested or feasible.
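The first two checks above can be sketched as a dry run that prints the requests instead of sending them, so no live server is needed. The port matches the skill's default of 8000; the model name is a hypothetical placeholder, not something the skill prescribes:

```shell
# Dry-run sketch of the first validation requests (printed, not executed).
# BASE uses the skill's default port 8000; MODEL is a hypothetical name --
# in practice, use whatever GET /v1/models reports.
BASE="http://127.0.0.1:8000"
MODEL="my-adapted-model"

# 1) Readiness check: list registered models before sending inference traffic.
printf 'curl -s %s/v1/models\n' "$BASE"

# 2) One OpenAI-compatible text request against the chat completions endpoint.
PAYLOAD='{"model": "'"$MODEL"'", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 16}'
printf "curl -s %s/v1/chat/completions -H 'Content-Type: application/json' -d '%s'\n" "$BASE" "$PAYLOAD"
```

For the multimodal check, the same payload shape gains an image entry in the message content; for text-only isolation, the serve side is started with --limit-mm-per-prompt set to zero for each modality.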
Read order
• Start with references/workflow-checklist.md.
• Read references/multimodal-ep-aclgraph-lessons.md (feature-first checklist).
• If startup/inference fails, read references/troubleshooting.md.
• If the checkpoint is fp8-on-NPU, read references/fp8-on-npu-lessons.md.
• Before handoff, read references/deliverables.md.
Hard constraints
• Never upgrade transformers.
• Primary implementation roots are fixed by the Dockerfile:
  • /vllm-workspace/vllm
  • /vllm-workspace/vllm-ascend
• Start vllm serve from /workspace with a direct command by default.
• Default API port is 8000 unless the user explicitly asks otherwise.
• Feature-first default: try to validate ACLGraph / EP / flashcomm1 / MTP / multimodal out of the box.
• --enable-expert-parallel and flashcomm1 checks are MoE-only; for non-MoE models, mark them as not applicable with evidence.
• If any feature cannot be enabled, keep evidence and explain the reason in the final report.
• Do not rely on PYTHONPATH=<modified-src>:$PYTHONPATH unless a debugging fallback is strictly needed.
• Keep code changes minimal and focused on the target model.
• The final deliverable must be one single signed commit in the current working repo (git commit -sm ...).
• Keep final docs in Chinese and compact.
• Dummy-first is encouraged for speed, but dummy weights are NOT fully equivalent to real weights.
• Never sign off an adaptation on dummy-only evidence; the real-weight gate is mandatory.
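Under these constraints, a launch could look like the sketch below. The flag names are standard vllm serve CLI options, but the model path and tensor-parallel size are hypothetical placeholders, not values the skill mandates:

```shell
# Hedged sketch of a serve command under the constraints above.
# Port 8000 is the skill default; 131072 tokens is the 128k capacity
# baseline; max-num-seqs 16 is the baseline concurrency (expand to
# 32/64 later). --enable-expert-parallel is MoE-only: omit for dense
# models. /path/to/model and the TP size are placeholders.
cd /workspace
vllm serve /path/to/model \
  --port 8000 \
  --max-model-len 131072 \
  --max-num-seqs 16 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel
```

For eager-path isolation, adding vLLM's --enforce-eager flag disables graph capture so failures can be separated from the ACLGraph path.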