Skill

hqq-quantization

by Orchestra-Research


Install

Copy this and paste it into Claude Code, Cursor, or any AI assistant:

I want to install the "hqq-quantization" skill in my project.

Please run this command in my terminal:
# Install skill into your project (3 files)
mkdir -p .claude/skills/hqq && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/hqq/SKILL.md "https://raw.githubusercontent.com/Orchestra-Research/AI-Research-SKILLs/main/10-optimization/hqq/SKILL.md" && mkdir -p .claude/skills/hqq/references && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/hqq/references/advanced-usage.md "https://raw.githubusercontent.com/Orchestra-Research/AI-Research-SKILLs/main/10-optimization/hqq/references/advanced-usage.md" && mkdir -p .claude/skills/hqq/references && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/hqq/references/troubleshooting.md "https://raw.githubusercontent.com/Orchestra-Research/AI-Research-SKILLs/main/10-optimization/hqq/references/troubleshooting.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.

Description

Half-Quadratic Quantization for LLMs without calibration data. Use when quantizing models to 4/3/2-bit precision without needing calibration datasets, for fast quantization workflows, or when deploying with vLLM or HuggingFace Transformers.
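No calibration data is needed because weight-only quantization derives each group's scale and zero point from the weights alone. The sketch below is plain min-max (asymmetric) group quantization in pure Python. It illustrates the idea and is not HQQ's own code. The names `quantize_group`, `dequantize_group` and `quantize_weights` are hypothetical. Treating min-max statistics as HQQ's starting point before it refines the zero point is also an assumption.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, zero point)


def quantize_group(w: List[float], nbits: int) -> Group:
    """Min-max (asymmetric) quantization of one group using only the weights."""
    qmax = 2 ** nbits - 1
    lo, hi = min(w), max(w)
    # A constant group would give hi == lo; use scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = -lo / scale  # maps the group minimum to code 0 (kept as a float)
    codes = [min(max(round(x / scale + zero), 0), qmax) for x in w]
    return codes, scale, zero


def dequantize_group(codes: List[int], scale: float, zero: float) -> List[float]:
    """Map integer codes back to approximate real weights."""
    return [(c - zero) * scale for c in codes]


def quantize_weights(w: List[float], nbits: int = 4, group_size: int = 64) -> List[Group]:
    """Split a flat weight vector into groups and quantize each independently."""
    return [quantize_group(w[i:i + group_size], nbits) for i in range(0, len(w), group_size)]


if __name__ == "__main__":
    weights = [math.sin(0.7 * i) * (1 + i % 5) for i in range(256)]
    groups = quantize_weights(weights, nbits=3, group_size=64)
    restored = [v for codes, s, z in groups for v in dequantize_group(codes, s, z)]
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{len(groups)} groups, max abs error {max_err:.4f}")
```

Lower `nbits` widens each quantization step (`scale`). Smaller `group_size` narrows each group's range, which reduces error.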

HQQ - Half-Quadratic Quantization

Fast, calibration-free weight quantization supporting 8/4/3/2/1-bit precision with multiple optimized backends.
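The "half-quadratic" name refers to how HQQ fits quantization parameters without data. As its authors describe it, HQQ minimizes a sparsity-promoting l_p norm (p < 1) of the weight reconstruction error. An auxiliary error variable splits the problem so that each alternating step has a closed form: a generalized soft-thresholding step, then a mean for the zero point. The pure-Python sketch below refines the zero point of one group while holding the scale fixed. The function names are hypothetical, and the defaults (`p=0.7`, `beta=10`, `kappa=1.01`, 20 iterations) are assumptions. Keeping the best zero seen so far is a simplification. This is an illustration, not the library's implementation.

```python
import math
from typing import List, Tuple


def shrink_lp(x: float, beta: float, p: float) -> float:
    """Generalized soft-thresholding used as the proximal step for |x|^p, p < 1."""
    ax = abs(x)
    if ax == 0.0:
        return 0.0  # avoid 0 ** (p - 1), which is a division by zero
    return math.copysign(max(ax - ax ** (p - 1.0) / beta, 0.0), x)


def group_error(w: List[float], nbits: int, scale: float, zero: float) -> float:
    """Mean absolute reconstruction error of one group for a given zero point."""
    qmax = 2 ** nbits - 1
    codes = [min(max(round(x / scale + zero), 0), qmax) for x in w]
    return sum(abs(x - (c - zero) * scale) for x, c in zip(w, codes)) / len(w)


def refine_zero(w: List[float], nbits: int, p: float = 0.7, beta: float = 10.0,
                kappa: float = 1.01, iters: int = 20) -> Tuple[float, float, float]:
    """Alternate shrinkage and zero-point updates; return (scale, zero, error)."""
    qmax = 2 ** nbits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = -lo / scale  # min-max initialization
    best_zero, best_err = zero, group_error(w, nbits, scale, zero)
    for _ in range(iters):
        codes = [min(max(round(x / scale + zero), 0), qmax) for x in w]
        restored = [(c - zero) * scale for c in codes]
        # Sparse error estimate: shrink the residual W - W_r.
        w_e = [shrink_lp(x - r, beta, p) for x, r in zip(w, restored)]
        # Closed-form least-squares zero update given the codes and the sparse error.
        zero = sum(c - (x - e) / scale for c, x, e in zip(codes, w, w_e)) / len(w)
        beta *= kappa  # tighten the coupling between the two subproblems
        err = group_error(w, nbits, scale, zero)
        if err < best_err:
            best_zero, best_err = zero, err
        else:
            break  # stop once the error no longer improves
    return scale, best_zero, best_err


if __name__ == "__main__":
    group = [math.sin(1.3 * i) + 0.05 * (i % 7) for i in range(64)]
    s, z, e = refine_zero(group, nbits=3)
    e0 = group_error(group, 3, s, -min(group) / s)
    print(f"min-max error {e0:.4f} -> refined error {e:.4f}")
```

The loop uses only the weights themselves, which is why no calibration set is needed. Each iteration costs a pass over the group, so quantization stays fast.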

When to use HQQ

Use HQQ when:
• You need to quantize a model without calibration data (no dataset required)
• You need fast quantization (minutes, versus hours for GPTQ/AWQ)
• You are deploying with vLLM or HuggingFace Transformers
• You want to fine-tune quantized models with LoRA/PEFT
• You are experimenting with extreme quantization (2-bit, 1-bit)

Key advantages:
• No calibration: quantize a model without any sample data
• Multiple backends: PyTorch, ATEN, TorchAO, Marlin, and BitBlas for optimized inference
• Flexible precision: 8/4/3/2/1-bit with configurable group sizes
• Framework integration: native HuggingFace and vLLM support
• PEFT compatible: fine-tune quantized models with LoRA

Use alternatives instead:
• AWQ: when you want calibration-based accuracy for production serving
• GPTQ: when you need maximum accuracy and calibration data is available
• bitsandbytes: when you want simple 8-bit/4-bit quantization without custom backends
• llama.cpp/GGUF: for CPU inference or Apple Silicon deployment
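The group-size trade-off can be estimated with simple arithmetic. Each group stores its own scale and zero point, so smaller groups add more metadata per weight. The back-of-the-envelope sketch below assumes one 16-bit scale and one 16-bit zero point per group and ignores unquantized layers such as embeddings. HQQ may store or compress this metadata differently. The helper names are hypothetical.

```python
def effective_bits(nbits: int, group_size: int, meta_bits: int = 16) -> float:
    """Average stored bits per weight: the code itself plus a share of
    one scale and one zero point (meta_bits each) per group."""
    return nbits + 2 * meta_bits / group_size


def quantized_size_gb(num_params: float, nbits: int, group_size: int,
                      meta_bits: int = 16) -> float:
    """Approximate storage in gigabytes (1e9 bytes) for the quantized weights."""
    return num_params * effective_bits(nbits, group_size, meta_bits) / 8 / 1e9


if __name__ == "__main__":
    for nbits, group_size in [(8, 128), (4, 64), (3, 64), (2, 16)]:
        bits = effective_bits(nbits, group_size)
        size = quantized_size_gb(7e9, nbits, group_size)
        print(f"{nbits}-bit, group {group_size}: {bits:.3f} bits/weight, "
              f"~{size:.2f} GB for 7B params")
```

Under these assumptions, 4-bit with group size 64 works out to 4 + 32/64 = 4.5 bits per weight. A 2-bit setting with group size 16 comes to 2 + 32/16 = 4.0 bits per weight. At extreme bit widths, very small groups can therefore cancel much of the size saving.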

Installation

```bash
pip install hqq

# With specific backend
pip install hqq[torch]      # PyTorch backend
pip install hqq[torchao]    # TorchAO int4 backend
pip install hqq[bitblas]    # BitBlas backend
pip install hqq[marlin]     # Marlin backend
```
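After installation, one integration path named in the description is HuggingFace Transformers, which can quantize a model with HQQ at load time. This minimal sketch assumes a recent `transformers` release that provides `HqqConfig`, plus the `hqq` package, PyTorch, and a CUDA GPU. `load_hqq_model` and `hqq_installed` are hypothetical helper names, and you supply the model ID.

```python
import importlib.util


def hqq_installed() -> bool:
    """Return True if the hqq package can be found, without importing it."""
    return importlib.util.find_spec("hqq") is not None


def load_hqq_model(model_id: str, nbits: int = 4, group_size: int = 64):
    """Load a causal LM and quantize its linear layers with HQQ on the fly.

    Assumes transformers (with HqqConfig), hqq, torch, and a CUDA device.
    Heavy imports happen inside the function so this module loads without them.
    """
    import torch
    from transformers import AutoModelForCausalLM, HqqConfig

    quant_config = HqqConfig(nbits=nbits, group_size=group_size)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,
    )


if __name__ == "__main__":
    print("hqq installed:", hqq_installed())
```

Call `load_hqq_model("<your-model-id>")` only on a machine that meets the assumptions above. No calibration dataset is passed at any point.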


MIT License
