AI Summary
A skill for adapting and optimizing Hugging Face or custom LLM models to run efficiently on vLLM with Ascend NPU support, enabling developers to validate and deploy models with deterministic testing and single-commit delivery.
Install
Copy this and paste it into Claude Code, Cursor, or any AI assistant:
I want to install the "vllm-ascend-model-adapter" skill in my project. Please run this command in my terminal:

```shell
# Install skill into your project (6 files)
SKILL_DIR=".claude/skills/vllm-ascend-model-adapter"
RAW="https://raw.githubusercontent.com/vllm-project/vllm-ascend/main/.agents/skills/vllm-ascend-model-adapter"
mkdir -p "$SKILL_DIR/references" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/SKILL.md" "$RAW/SKILL.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/deliverables.md" "$RAW/references/deliverables.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/fp8-on-npu-lessons.md" "$RAW/references/fp8-on-npu-lessons.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/multimodal-ep-aclgraph-lessons.md" "$RAW/references/multimodal-ep-aclgraph-lessons.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/troubleshooting.md" "$RAW/references/troubleshooting.md" && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o "$SKILL_DIR/references/workflow-checklist.md" "$RAW/references/workflow-checklist.md"
```

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.
Description
Adapt and debug existing or new models for vLLM on Ascend NPU. Implement in /vllm-workspace/vllm and /vllm-workspace/vllm-ascend, validate via direct vllm serve from /workspace, and deliver one signed commit in the current repo.
Overview
Adapt Hugging Face or local models to run on vllm-ascend with minimal changes, deterministic validation, and single-commit delivery. This skill is for both already-supported models and new architectures not yet registered in vLLM.
6) Validate inference and features
• Send GET /v1/models first.
• Send at least one OpenAI-compatible text request.
• For multimodal models, require at least one text+image request.
• Validate architecture registration and the loader path with logs (no unresolved architecture, no fatal missing-key errors).
• Feature-first validation: try the EP + ACLGraph path first; use the eager path as fallback/isolation.
• If startup succeeds but the first request crashes (false-ready), treat it as a runtime failure and continue root-cause isolation.
• For torch._dynamo + interpolate + NPU contiguous failures on VL paths, try TORCHDYNAMO_DISABLE=1 as a diagnostic/stability fallback.
• For multimodal processor API mismatches (for example, a skip_tensor_conversion signature mismatch), use text-only isolation (--limit-mm-per-prompt with image/video/audio set to 0) to separate processor issues from core weight-loading issues.
• Capacity baseline by default (single machine): max-model-len=128k + max-num-seqs=16.
• Then expand concurrency (e.g., 32/64) if requested or feasible.
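The first two checks above can be sketched as a dry run that prints the requests instead of sending them, so no live server is needed. The port matches the skill's default of 8000; the model name is a hypothetical placeholder, not something the skill prescribes:

```shell
# Dry-run sketch of the first validation requests (printed, not executed).
# BASE uses the skill's default port 8000; MODEL is a hypothetical name --
# in practice, use whatever GET /v1/models reports.
BASE="http://127.0.0.1:8000"
MODEL="my-adapted-model"

# 1) Readiness check: list registered models before sending inference traffic.
printf 'curl -s %s/v1/models\n' "$BASE"

# 2) One OpenAI-compatible text request against the chat completions endpoint.
PAYLOAD='{"model": "'"$MODEL"'", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 16}'
printf "curl -s %s/v1/chat/completions -H 'Content-Type: application/json' -d '%s'\n" "$BASE" "$PAYLOAD"
```

For the multimodal check, the same payload shape gains an image entry in the message content; for text-only isolation, the serve side is started with --limit-mm-per-prompt set to zero for each modality.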
Read order
• Start with references/workflow-checklist.md.
• Read references/multimodal-ep-aclgraph-lessons.md (feature-first checklist).
• If startup/inference fails, read references/troubleshooting.md.
• If the checkpoint is fp8-on-NPU, read references/fp8-on-npu-lessons.md.
• Before handoff, read references/deliverables.md.
Hard constraints
• Never upgrade transformers.
• Primary implementation roots are fixed by the Dockerfile:
  • /vllm-workspace/vllm
  • /vllm-workspace/vllm-ascend
• Start vllm serve from /workspace with a direct command by default.
• Default API port is 8000 unless the user explicitly asks otherwise.
• Feature-first default: try to validate ACLGraph / EP / flashcomm1 / MTP / multimodal out of the box.
• --enable-expert-parallel and flashcomm1 checks are MoE-only; for non-MoE models, mark them as not applicable with evidence.
• If any feature cannot be enabled, keep evidence and explain the reason in the final report.
• Do not rely on PYTHONPATH=<modified-src>:$PYTHONPATH unless a debugging fallback is strictly needed.
• Keep code changes minimal and focused on the target model.
• The final deliverable must be one single signed commit in the current working repo (git commit -sm ...).
• Keep final docs in Chinese and compact.
• Dummy-first is encouraged for speed, but dummy weights are NOT fully equivalent to real weights.
• Never sign off an adaptation on dummy-only evidence; the real-weight gate is mandatory.
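Under these constraints, a launch could look like the sketch below. The flag names are standard vllm serve CLI options, but the model path and tensor-parallel size are hypothetical placeholders, not values the skill mandates:

```shell
# Hedged sketch of a serve command under the constraints above.
# Port 8000 is the skill default; 131072 tokens is the 128k capacity
# baseline; max-num-seqs 16 is the baseline concurrency (expand to
# 32/64 later). --enable-expert-parallel is MoE-only: omit for dense
# models. /path/to/model and the TP size are placeholders.
cd /workspace
vllm serve /path/to/model \
  --port 8000 \
  --max-model-len 131072 \
  --max-num-seqs 16 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel
```

For eager-path isolation, adding vLLM's --enforce-eager flag disables graph capture so failures can be separated from the ACLGraph path.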