local-vault

by genli-ai

AI Summary

Turn a folder of raw files into a Markdown vault that an LLM can grep, and then answer questions over that vault responsibly.

Install

Copy this and paste it into Claude Code, Cursor, or any AI assistant:

I want to install the "local-vault" skill in my project.

Please run this command in my terminal:
# Install skill into your project
mkdir -p .claude/skills/local-vault && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/local-vault/SKILL.md "https://raw.githubusercontent.com/genli-ai/market-research-skills/main/skills/local-vault/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.

Description

Build and query a local Markdown knowledge base ("vault"). TWO functions: (1) CONVERT raw files (PDF, Word/docx, PowerPoint/pptx, Excel/xlsx, csv/tsv, images, html, md/txt, json/yaml/code, audio/video) into clean Markdown with retrieval-friendly frontmatter; local-first (pandoc / python-pptx / openpyxl / pymupdf4llm / whisper), with cloud OCR (MinerU) only as a fallback. (2) ANSWER questions over the resulting vault with retrieval discipline: self-monitor coverage, flag missing/lossy content, and propose Maps-of-Content (MOCs). Triggers: "build/sync my local knowledge base", "convert these files to markdown for AI", "整理我的资料库" (organize my document library), "把文件转成 md 给 AI 读" (convert files to md for AI to read), "本地知识库" (local knowledge base), "读我的本地 vault 回答" (answer from my local vault), "这个主题我的资料里怎么说" (what do my documents say about this topic). Not for: one-off web research, or files that are already in a single doc you can read directly.

local-vault

Turn a folder of raw files into a Markdown vault that an LLM can grep, and then answer questions over that vault responsibly.

Mental model: SOURCE = raw files (source of truth). VAULT = one .md per source file, carrying retrieval frontmatter (abstract / tags / synonyms) + a source backlink. The vault is the layer the LLM reads; the raw files are where the user goes to verify.

There are two distinct jobs; figure out which the user wants:

• A. Convert / sync: they dropped files in and want them in the vault → run the pipeline (scripts/sync.py).
• B. Retrieve / answer: they want answers from an existing vault → follow the Retrieval & feedback protocol below. Do not run the pipeline for this.
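To make the mental model concrete, here is a minimal sketch of what one vault note could look like when rendered. The `abstract`, `tags`, and `synonyms` fields come from the description above; the function name `render_note` and the `source` key used for the backlink are hypothetical, not taken from scripts/sync.py.

```python
import json
from pathlib import Path


def render_note(source: Path, body: str, abstract: str,
                tags: list, synonyms: list) -> str:
    """Return Markdown text: frontmatter, then the converted body."""
    front = {
        "abstract": abstract,
        "tags": tags,
        "synonyms": synonyms,
        "source": source.as_posix(),  # hypothetical key name for the source backlink
    }
    # json.dumps output is also valid YAML flow syntax, so no YAML library is needed.
    header = "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in front.items()
    )
    return f"---\n{header}\n---\n\n{body.strip()}\n"


if __name__ == "__main__":
    print(render_note(Path("/data/raw/q3-report.pdf"), "# Q3 report\n",
                      "Quarterly revenue summary", ["finance"], ["Q3", "third quarter"]))
```

Keeping the retrieval fields at the top of each file is what lets a plain grep over the vault hit the abstract, tags, or synonyms before the body is ever read.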

One-time setup (do this for the user if not already done)

• Python deps (user-level, no venv): `python3 -m pip install --user requests python-dotenv pypdf pymupdf4llm openpyxl python-pptx`
• pandoc (for docx/rtf/odt/epub): brew install pandoc (macOS) / distro pkg.
• ffmpeg (only for audio/video transcription): brew install ffmpeg (macOS) / distro pkg. The whisper engine is auto-selected by platform: mlx-whisper on Apple Silicon (GPU), faster-whisper elsewhere (cross-platform CPU/CUDA). It is auto-installed after the user consents at the first-run prompt (no manual pip needed). On that first run with audio/video present, the tool shows the model-size options (tiny ~75 MB / small ~480 MB / turbo ~1.6 GB / large-v3 ~3 GB) and lets the user pick or skip; the choice is saved to .env (KB_WHISPER_MODEL) so it never re-asks. Fully local: no token/quota; the model downloads once, then runs offline.
• claude CLI on PATH: the pipeline shells out to claude -p for frontmatter enrichment and PPT-image OCR. If absent, those steps are skipped (not fatal).
• Configure paths, two ways:
  • Guided (recommended for the user): just run python3 scripts/sync.py in a terminal. On first run (when paths aren't configured yet) it launches an interactive wizard: it asks for the raw-files folder + the vault folder (+ optional MinerU token), creates them, writes scripts/.env, and prints how to use the tool. Then they re-run to convert.
  • Manual: copy scripts/.env.example → scripts/.env and set KB_SOURCE_DIR (raw files) and KB_TARGET_DIR (the Markdown vault), both absolute. MINERU_TOKEN is optional (only for legacy .doc/.ppt, .html, scanned PDFs, images; get one at https://mineru.net).
  • When you (Claude) run the setup for the user, prefer the manual path: ask them for the two folders, then write scripts/.env directly (the wizard only fires on an interactive TTY, which a claude -p subprocess is not).
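The manual path above can be sketched as a small helper that checks both folders are absolute, creates them, and writes the .env file. The variable names KB_SOURCE_DIR, KB_TARGET_DIR, and MINERU_TOKEN are the ones named in the setup notes; the function `write_env` is a hypothetical illustration and does not reproduce what the wizard in scripts/sync.py actually does.

```python
from pathlib import Path
from typing import Optional


def write_env(env_path: Path, source_dir: Path, target_dir: Path,
              mineru_token: Optional[str] = None) -> str:
    """Write a minimal .env file and return its text."""
    for name, folder in (("KB_SOURCE_DIR", source_dir), ("KB_TARGET_DIR", target_dir)):
        if not folder.is_absolute():
            raise ValueError(f"{name} must be an absolute path, got {folder}")
    # Create both folders up front, as the guided wizard is described as doing.
    source_dir.mkdir(parents=True, exist_ok=True)
    target_dir.mkdir(parents=True, exist_ok=True)

    lines = [f"KB_SOURCE_DIR={source_dir}", f"KB_TARGET_DIR={target_dir}"]
    if mineru_token:  # optional: only needed for the MinerU fallback
        lines.append(f"MINERU_TOKEN={mineru_token}")
    text = "\n".join(lines) + "\n"

    env_path.parent.mkdir(parents=True, exist_ok=True)
    env_path.write_text(text, encoding="utf-8")
    return text
```

For example, `write_env(Path("scripts/.env"), Path.home() / "kb" / "raw", Path.home() / "kb" / "vault")` would leave the token out entirely, which is fine for the local conversion paths.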

Run it

`python3 scripts/sync.py`

On macOS, the first run (wizard or any normal run) also drops a clickable sync.command into the knowledge-base root, i.e. the parent of the SOURCE folder, with the absolute path to sync.py baked in. Tool and data live apart: under /plugin install the script sits in ~/.claude/plugins/cache/…, far from the data folders, so a relative launcher can't work. After that the daily loop is: drop files into SOURCE → double-click sync.command → read the .md in VAULT.

The launcher is idempotent; a stale auto-generated copy left in the SOURCE folder by an older version is removed automatically (a user-written one is never touched). If a different sync.command already exists at the root, an interactive terminal prompts update / skip; non-interactively, our own out-of-date launcher self-heals silently while a user-customized one is left alone.

First terminal run with no config → the setup wizard (above). Once .env exists:

• Incremental: only files in SOURCE without a matching .md in VAULT are processed. To force a re-convert, delete that .md first, then re-run.
• No MinerU token needed for the local paths (xlsx/csv/docx/pptx/md/txt/code + digital PDF). Token is validated lazily, only when a file actually needs MinerU.
• Orphan staging: if a source file is deleted, its tool-generated .md, together with its attachments/<stem>/ images, is moved to an orphaned/<date>/ folder (never hard-deleted, since the user may have added notes), and the now-empty attachments/ is pruned. User-written .md (no converter marker) is never touched.
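The incremental and orphan rules above can be sketched as a planning step that only compares file names. This is an illustration under stated assumptions, not the logic in scripts/sync.py: it assumes a flat layout where `report.pdf` maps to `report.md`, and the marker string `converter:` is a hypothetical stand-in for whatever the tool writes to identify its own notes.

```python
from pathlib import Path

MARKER = "converter:"  # hypothetical marker identifying tool-generated notes


def plan_sync(source: Path, vault: Path):
    """Return (source files still to convert, tool-generated notes whose source is gone)."""
    # Assumption: notes are matched to sources by file stem, ignoring hidden files.
    sources = {p.stem: p for p in source.iterdir()
               if p.is_file() and not p.name.startswith(".")}
    notes = {p.stem: p for p in vault.glob("*.md")}

    # Incremental: convert only sources that have no matching note yet.
    pending = sorted(p for stem, p in sources.items() if stem not in notes)

    # Orphans: notes with no source left, but only if the tool wrote them.
    orphans = sorted(
        p for stem, p in notes.items()
        if stem not in sources and MARKER in p.read_text(encoding="utf-8")
    )
    return pending, orphans
```

A caller would then convert everything in `pending` and move each path in `orphans` (plus its attachments folder) into a dated orphaned/ folder rather than deleting it. Matching by stem alone would collide for two sources such as `a.pdf` and `a.docx`; the sketch ignores that case.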

Routing (which tool per file type)

| Type | Tool | Notes |
|---|---|---|
| .xlsx | openpyxl dual-read | per sheet: value grid (with A/B/C + row coords) + formulas list |
| .csv / .tsv | csv → Markdown table | truncates past CSV_MAX_ROWS |
| .pdf (digital) | pymupdf4llm | local, fast, no quota; if PYMUPDF4LLM_WRITE_IMAGES (default on), images ≥ PYMUPDF4LLM_IMAGE_SIZE_LIMIT (12% of page) → attachments/, then filtered by min-bytes + de-dup. If pymupdf4llm crashes (e.g. missing-font), a local plain-text pass is tried before MinerU |
| .pdf (scanned) | MinerU vlm (fallback) | triggered when chars/page is too low |
| .docx/.rtf/.odt/.epub | pandoc | images extracted to attachments/ |
| .html/.htm | pandoc (local) | style/class/id attrs + layout div/section/span stripped first, so only content survives; tables kept lossless. No MinerU/token needed |
| .pptx | python-pptx | title/body/tables/charts/notes + images; smart OCR (see below) |
| .md/.markdown/.txt | passthrough | copied verbatim; only frontmatter added, body untouched |
| .json/.yaml/.py/… | code passthrough | wrapped in a fenced code block + frontmatter |
| audio .mp3/.m4a/.wav/… + video .mp4/.mov/.m4v | whisper (local; engine auto-selected: mlx-whisper on Apple Silicon, else faster-whisper) | speech-to-text, no token/quota; first run asks which model (shows sizes) + auto-installs the engine on consent (a model already cached on this machine is reused without re-asking); per-segment [mm:ss] timestamps + detected language; video = audio-track only (ffmpeg pulls it from the container). Needs ffmpeg; best on clear speech, since songs/music transcribe poorly |
| legacy .doc/.ppt, images | MinerU (cloud) | local libs can't read these |
| anything else (numbers/pages/zip/…) | skipped | reported at the end with a fix hint, never silently dropped |
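The routing table can be read as a lookup keyed on file extension, with one content-based override for scanned PDFs. The sketch below is a simplified illustration, not the dispatcher in scripts/sync.py: the function name `route`, the tool labels, the `min_chars_per_page` threshold of 50, and the two image extensions are all assumptions (the table only says "too low" and "images").

```python
from pathlib import Path
from typing import Optional

ROUTES = {
    ".xlsx": "openpyxl",
    ".csv": "csv-table", ".tsv": "csv-table",
    ".pdf": "pymupdf4llm",
    ".docx": "pandoc", ".rtf": "pandoc", ".odt": "pandoc", ".epub": "pandoc",
    ".html": "pandoc", ".htm": "pandoc",
    ".pptx": "python-pptx",
    ".md": "passthrough", ".markdown": "passthrough", ".txt": "passthrough",
    ".json": "code", ".yaml": "code", ".py": "code",
    ".mp3": "whisper", ".m4a": "whisper", ".wav": "whisper",
    ".mp4": "whisper", ".mov": "whisper", ".m4v": "whisper",
    ".doc": "mineru", ".ppt": "mineru",
    ".png": "mineru", ".jpg": "mineru",  # assumed image extensions
}


def route(path: Path, chars_per_page: Optional[float] = None,
          min_chars_per_page: float = 50) -> str:
    """Pick a converter label for a file; unknown types are reported as skipped."""
    ext = path.suffix.lower()
    # A PDF with almost no extractable text is treated as scanned and sent to OCR.
    if ext == ".pdf" and chars_per_page is not None and chars_per_page < min_chars_per_page:
        return "mineru"
    return ROUTES.get(ext, "skipped")


if __name__ == "__main__":
    for name in ("deck.PPTX", "budget.xlsx", "notes.pages"):
        print(name, "->", route(Path(name)))
```

Returning an explicit "skipped" label rather than None makes it easy to collect unsupported files and report them at the end of a run, in line with the last row of the table.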

MIT License

Works With

Claude Code