Install
Copy this and paste it into Claude Code, Cursor, or any AI assistant:
I want to install the "hqq-quantization" skill in my project. Please run this command in my terminal:

```bash
# Install skill into your project (3 files)
mkdir -p .claude/skills/hqq && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/hqq/SKILL.md "https://raw.githubusercontent.com/Orchestra-Research/AI-Research-SKILLs/main/10-optimization/hqq/SKILL.md" && \
mkdir -p .claude/skills/hqq/references && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/hqq/references/advanced-usage.md "https://raw.githubusercontent.com/Orchestra-Research/AI-Research-SKILLs/main/10-optimization/hqq/references/advanced-usage.md" && \
mkdir -p .claude/skills/hqq/references && \
curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/hqq/references/troubleshooting.md "https://raw.githubusercontent.com/Orchestra-Research/AI-Research-SKILLs/main/10-optimization/hqq/references/troubleshooting.md"
```

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.
Description
Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
HQQ - Half-Quadratic Quantization
Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
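HQQ takes its name from the half-quadratic splitting it uses to fit the quantization parameters. The block below summarizes that formulation as the HQQ authors describe it. The notation is ours, and details such as the exact loss (an ℓp norm with p < 1) and keeping the scale s fixed at its min-max value while only the zero-point z is optimized should be checked against the hqq source. The objective involves only the weights W, so no input activations, and therefore no calibration data, are needed.

```latex
% Affine group-wise quantization with scale s, zero-point z, b = nbits
W_q = Q_{z}(W) = \mathrm{clamp}\!\left(\mathrm{round}\!\left(W / s + z\right),\, 0,\, 2^{b}-1\right),
\qquad Q_{z}^{-1}(W_q) = s\,(W_q - z)

% Calibration-free objective: a sparsity-promoting loss on the weights only
\arg\min_{z}\; \phi\!\left(W - Q_{z}^{-1}(Q_{z}(W))\right), \qquad \phi(x) = \lVert x \rVert_p^p,\; p < 1

% Half-quadratic splitting with an auxiliary error variable W_e and penalty \beta
\arg\min_{z,\,W_e}\; \phi(W_e) + \frac{\beta}{2}\left\lVert W_e - \left(W - Q_{z}^{-1}(Q_{z}(W))\right)\right\rVert_2^2

% Alternating updates, with W_q^{(t)} = Q_{z^{(t)}}(W) and \beta \leftarrow \kappa\beta (\kappa > 1) each round
W_e^{(t+1)} = \mathrm{shrink}_{\ell_p}\!\left(W - Q_{z^{(t)}}^{-1}(W_q^{(t)}),\, \beta\right),
\qquad \mathrm{shrink}_{\ell_p}(x, \beta) = \mathrm{sign}(x)\,\max\!\left(|x| - \frac{|x|^{p-1}}{\beta},\, 0\right)

z^{(t+1)} = \left\langle W_q^{(t)} - \frac{W - W_e^{(t+1)}}{s} \right\rangle \quad \text{(mean over the group)}
```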
When to use HQQ
Use HQQ when:
• Quantizing models without calibration data (no dataset needed)
• You need fast quantization (minutes vs. hours for GPTQ/AWQ)
• Deploying with vLLM or HuggingFace Transformers
• Fine-tuning quantized models with LoRA/PEFT
• Experimenting with extreme quantization (2-bit, 1-bit)

Key advantages:
• No calibration: quantize any model without sample data
• Multiple backends: PyTorch, ATEN, TorchAO, Marlin, BitBlas for optimized inference
• Flexible precision: 8/4/3/2/1-bit with configurable group sizes
• Framework integration: native HuggingFace and vLLM support
• PEFT compatible: fine-tune quantized models with LoRA

Use alternatives instead:
• AWQ: you need calibration-based accuracy for production serving
• GPTQ: calibration data is available and you want maximum accuracy
• bitsandbytes: you want simple 8-bit/4-bit quantization without custom backends
• llama.cpp/GGUF: CPU inference or Apple Silicon deployment
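To show what `nbits` and group size control, here is a minimal, dependency-free sketch of calibration-free group-wise quantization with HQQ-style zero-point refinement. Each group gets a min-max scale, and then an ℓp shrinkage step on the quantization error alternates with a closed-form zero-point update. All names (`hqq_group`, `hqq_quantize`, and so on) are hypothetical and are not the hqq library's API. The defaults (p = 0.7, beta = 10, kappa = 1.01, 20 iterations) are assumptions chosen to be illustrative. Keeping the best zero-point seen so far is a choice made for this sketch.

```python
import math
import random


def lp_shrink(x, beta, p):
    """Generalized soft-thresholding: proximal step for the |x|^p penalty (p < 1)."""
    if x == 0.0:
        return 0.0  # |x| ** (p - 1) is undefined at 0; the shrinkage keeps 0
    magnitude = abs(x) - abs(x) ** (p - 1) / beta
    return math.copysign(max(magnitude, 0.0), x)


def quantize(group, s, z, qmax):
    """W_q = clamp(round(W / s + z), 0, qmax)."""
    return [min(max(round(w / s + z), 0), qmax) for w in group]


def dequantize(wq, s, z):
    """W_r = s * (W_q - z)."""
    return [s * (q - z) for q in wq]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def hqq_group(group, nbits=4, p=0.7, beta=10.0, kappa=1.01, iters=20):
    """Quantize one group: min-max init, then refine only the zero-point."""
    qmax = 2 ** nbits - 1
    lo, hi = min(group), max(group)
    s = (hi - lo) / qmax if hi > lo else 1.0  # scale stays fixed
    z = -lo / s                               # min-max zero-point as the start
    best_z = z
    best_err = mean_abs_error(group, dequantize(quantize(group, s, z, qmax), s, z))
    for _ in range(iters):
        wq = quantize(group, s, z, qmax)
        w_r = dequantize(wq, s, z)
        # Step 1: proximal (shrinkage) update of the auxiliary error W_e
        w_e = [lp_shrink(w - r, beta, p) for w, r in zip(group, w_r)]
        # Step 2: closed-form zero-point update, averaged over the group
        z = sum(q - (w - e) / s for q, w, e in zip(wq, group, w_e)) / len(group)
        beta *= kappa  # tighten the coupling between W_e and the true error
        err = mean_abs_error(group, dequantize(quantize(group, s, z, qmax), s, z))
        if err < best_err:
            best_z, best_err = z, err
        else:
            break  # stop once the reconstruction error no longer improves
    return quantize(group, s, best_z, qmax), s, best_z


def hqq_quantize(weights, nbits=4, group_size=64):
    """Split a flat weight list into groups and quantize each independently."""
    return [hqq_group(weights[i:i + group_size], nbits)
            for i in range(0, len(weights), group_size)]


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(256)]
    groups = hqq_quantize(weights, nbits=4, group_size=64)
    restored = [x for wq, s, z in groups for x in dequantize(wq, s, z)]
    print(f"{len(groups)} groups, mean |W - W_r| = {mean_abs_error(weights, restored):.6f}")
```

Each group stores its own s and z. If both are kept as 16-bit floats, that costs 32/g extra bits per weight for group size g. For example, 4-bit weights with g = 64 take about 4.5 bits per weight. Smaller groups follow local weight ranges more closely, at a higher storage cost.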
Installation
```bash
pip install hqq

# With specific backend
pip install hqq[torch]      # PyTorch backend
pip install hqq[torchao]    # TorchAO int4 backend
pip install hqq[bitblas]    # BitBlas backend
pip install hqq[marlin]     # Marlin backend
```
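After installing, a quick standard-library check can confirm that hqq is present and which optional backend packages Python can find. The helpers `installed_version` and `importable` are hypothetical and not part of hqq. The probed module names (`torch`, `torchao`, `bitblas`) are assumptions about what each extra installs. Marlin is left out because its import name is not certain.

```python
import importlib.util
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def importable(module_names):
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    print("hqq:", installed_version("hqq") or "not installed")
    # Assumed import names for the optional backends (hypothetical mapping)
    for name, found in importable(["torch", "torchao", "bitblas"]).items():
        print(f"{name:>8}: {'found' if found else 'missing'}")
```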