Skill

metadata-extraction

by kreuzberg-dev

Install

Copy this and paste it into Claude Code, Cursor, or any AI assistant:

I want to install the "metadata-extraction" skill in my project.

Please run this command in my terminal:
# Install skill into your project
mkdir -p .claude/skills/metadata-extraction && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/metadata-extraction/SKILL.md "https://raw.githubusercontent.com/kreuzberg-dev/html-to-markdown/main/.codex/skills/metadata-extraction/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.

Description

Metadata Extraction for html-to-markdown

Overview

The html-to-markdown library provides comprehensive, single-pass metadata extraction during HTML-to-Markdown conversion. This enables content analysis, SEO optimization, document indexing, and structured data processing without extra parsing passes.

Use Cases

• Table of contents generation: Build a TOC from headers and IDs
• Document outline: Create a hierarchical structure
• SEO analysis: Verify H1 presence and hierarchy correctness
• Navigation: Generate internal anchor links
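The first and third use cases can be sketched in a few lines. The `Header` struct below is a hypothetical, simplified stand-in for whatever header record the library returns (its real type and field names may differ), and `build_toc` and `has_single_h1` are illustrative helpers, not part of the crate.

```rust
/// Hypothetical, simplified header record; the crate's real type may differ.
#[derive(Debug, Clone)]
pub struct Header {
    pub level: u8,          // 1..=6 for h1..h6
    pub text: String,       // visible heading text
    pub id: Option<String>, // value of the element's `id` attribute, if any
}

/// Render headers as a nested Markdown list.
/// Headers that carry an id become internal anchor links.
pub fn build_toc(headers: &[Header]) -> String {
    // Indent relative to the shallowest level present so the list starts at column 0.
    let min_level = headers.iter().map(|h| h.level).min().unwrap_or(1);
    let mut out = String::new();
    for h in headers {
        let indent = "  ".repeat(usize::from(h.level - min_level));
        match &h.id {
            Some(id) => out.push_str(&format!("{}- [{}](#{})\n", indent, h.text, id)),
            None => out.push_str(&format!("{}- {}\n", indent, h.text)),
        }
    }
    out
}

/// Simple SEO check: the document has exactly one h1.
pub fn has_single_h1(headers: &[Header]) -> bool {
    headers.iter().filter(|h| h.level == 1).count() == 1
}
```

Because both helpers take a plain slice, they can run on the collected headers after conversion without touching the HTML again.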

Single-Pass Collection

Metadata extraction uses the same MetadataCollector pattern as inline image collection:

```rust
// From lib.rs line 445
let metadata_collector = Rc::new(RefCell::new(metadata::MetadataCollector::new(metadata_cfg)));

// Passed to converter during tree walk
let markdown = converter::convert_html_with_metadata(
    normalized_html.as_ref(),
    &options,
    Rc::clone(&metadata_collector)
)?;

// After conversion, recover and return metadata
let metadata = metadata_collector.finish();
```

Key Benefits:

• Zero overhead when disabled: The entire module can be compiled out via feature flags
• Single tree traversal: No separate metadata extraction pass
• Memory efficient: Pre-allocated buffers (typical: 32 headers, 64 links, 16 images)
• Configurable granularity: Extract only the metadata types you need
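To make the shared-collector idea concrete, here is a self-contained toy version of the pattern. Everything in it (`Collector`, `Node`, `convert`, `convert_with_metadata`) is hypothetical and far simpler than the library's internals; it only illustrates how a single walk can emit Markdown and record metadata through an `Rc<RefCell<...>>` handle, and how ownership can be recovered afterwards with `Rc::try_unwrap`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a metadata collector (not the crate's type).
#[derive(Debug)]
pub struct Collector {
    pub headers: Vec<(u8, String)>,
    pub links: Vec<String>,
}

impl Collector {
    /// Pre-allocate buffers, mirroring the typical sizes quoted above.
    pub fn new() -> Self {
        Collector { headers: Vec::with_capacity(32), links: Vec::with_capacity(64) }
    }
}

/// Toy document node; a real converter walks a parsed HTML tree instead.
pub enum Node {
    Heading(u8, String),
    Link { text: String, href: String },
    Text(String),
}

/// One traversal: emit Markdown and record metadata at the same time.
pub fn convert(nodes: &[Node], collector: Rc<RefCell<Collector>>) -> String {
    let mut md = String::new();
    for node in nodes {
        match node {
            Node::Heading(level, text) => {
                collector.borrow_mut().headers.push((*level, text.clone()));
                md.push_str(&"#".repeat(usize::from(*level)));
                md.push(' ');
                md.push_str(text);
                md.push('\n');
            }
            Node::Link { text, href } => {
                collector.borrow_mut().links.push(href.clone());
                md.push_str(&format!("[{}]({})", text, href));
            }
            Node::Text(t) => md.push_str(t),
        }
    }
    md
}

/// Share the collector with the walker, then take sole ownership back.
pub fn convert_with_metadata(nodes: &[Node]) -> (String, Collector) {
    let collector = Rc::new(RefCell::new(Collector::new()));
    let md = convert(nodes, Rc::clone(&collector));
    // The clone handed to `convert` was dropped when it returned,
    // so only one strong reference remains and `try_unwrap` succeeds.
    let collector = Rc::try_unwrap(collector)
        .expect("collector is still shared")
        .into_inner();
    (md, collector)
}
```

The `Rc<RefCell<...>>` wrapper lets the walker mutate the collector without threading a `&mut` borrow through every call, at the cost of a runtime borrow check.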

MetadataConfig Structure

Located in /crates/html-to-markdown/src/metadata.rs:

```rust
pub struct MetadataConfig {
    pub extract_document: bool,          // <head> meta tags, title, etc.
    pub extract_headers: bool,           // h1-h6 with hierarchy
    pub extract_links: bool,             // All hyperlinks with classification
    pub extract_images: bool,            // All images with dimensions
    pub extract_structured_data: bool,   // JSON-LD, Microdata, RDFa
    pub max_structured_data_size: usize, // Prevent memory exhaustion
}
```
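As a rough illustration of configurable granularity, the sketch below enables header extraction only, which is all a table-of-contents use case needs. To keep it compilable on its own it declares a local mirror of the struct shown above; the helpers `none`, `any_enabled`, and `toc_only` are hypothetical and not part of the crate, and the real type may provide its own defaults or constructors.

```rust
/// Local mirror of the fields listed above so this example compiles standalone.
/// In real code, use the crate's own `MetadataConfig` instead.
#[derive(Debug, Clone, PartialEq)]
pub struct MetadataConfig {
    pub extract_document: bool,
    pub extract_headers: bool,
    pub extract_links: bool,
    pub extract_images: bool,
    pub extract_structured_data: bool,
    pub max_structured_data_size: usize,
}

impl MetadataConfig {
    /// Hypothetical helper: every extraction type switched off.
    pub fn none() -> Self {
        MetadataConfig {
            extract_document: false,
            extract_headers: false,
            extract_links: false,
            extract_images: false,
            extract_structured_data: false,
            // No structured data is collected, so no size budget is needed.
            max_structured_data_size: 0,
        }
    }

    /// Hypothetical helper: true if at least one metadata type is requested.
    pub fn any_enabled(&self) -> bool {
        self.extract_document
            || self.extract_headers
            || self.extract_links
            || self.extract_images
            || self.extract_structured_data
    }
}

/// Configuration for a table-of-contents use case: headers only.
pub fn toc_only() -> MetadataConfig {
    MetadataConfig {
        extract_headers: true,
        ..MetadataConfig::none()
    }
}
```

Struct update syntax (`..MetadataConfig::none()`) keeps the call site short and makes the single enabled flag easy to spot.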

MIT License

Works With

Claude Code