How do I install deduplication?

deduplication is a Skill hosted on GitHub at https://github.com/dadbodgeoff/drift. Visit the ImAiFox page at https://imaifox.com/boosters/dadbodgeoff-drift-deduplication for the AI-ready install prompt you can copy directly into Claude Code, Cursor, or Windsurf.

How popular is deduplication?

deduplication has 757 GitHub stars and 59 forks. The repository has not had recent commits.

Is deduplication free?

Yes — deduplication is open source and free to use. The source code is publicly available on GitHub at https://github.com/dadbodgeoff/drift.

Skill

deduplication

Name: deduplication
Author: dadbodgeoff

by dadbodgeoff

AI Summary

A deduplication skill that intelligently groups and selects canonical versions of duplicate events across multiple data sources using reputation scoring and hash-based matching. Ideal for developers building data aggregation systems, search engines, or multi-source content platforms.

Install

Copy this and paste it into Claude Code, Cursor, or any AI assistant:

I want to install the "deduplication" skill in my project.

Please run this command in my terminal:
# Install skill into the correct directory
mkdir -p .claude/skills/deduplication && curl --retry 3 --retry-delay 2 --retry-all-errors -o .claude/skills/deduplication/SKILL.md "https://raw.githubusercontent.com/dadbodgeoff/drift/main/drift v1 depreciated/skills/deduplication/SKILL.md"

Then restart Claude Code (or reload the window in Cursor) so the skill is picked up.

ai-tools cli code-quality csharp java mcp mcp-server model-context-protocol pattern-detection php python typescript vscode-extension

Description

Event deduplication with canonical selection, reputation scoring, and hash-based grouping for multi-source data aggregation. Handles both ID-based and content-based deduplication.

Event Deduplication

Canonical selection with reputation scoring and hash-based grouping for multi-source data.

When to Use This Skill

• Aggregating data from multiple sources (news, events, products) • Same content appears from different outlets/sources • Need to pick the "best" version from duplicates • Tracking deduplication metrics for optimization

Core Concepts

Simple URL deduplication isn't enough. Production needs: • Grouping by semantic similarity (same story, different outlets) • Canonical selection (pick the "best" version) • Reputation scoring (prefer authoritative sources) • Both ID-based and content-based deduplication Two modes: • ID-based: When sources have unique IDs, keep the "best" version when IDs collide • Content-based: Group by semantic similarity, select canonical from each group

TypeScript

`typescript import { createHash } from 'crypto'; interface DeduplicationResult<T> { items: T[]; originalCount: number; dedupedCount: number; reductionPercent: number; duplicateGroups?: number; } // ============================================ // ID-Based Deduplication // ============================================ function deduplicateById<T extends { id: string }>( items: T[], preferFn: (existing: T, candidate: T) => T ): DeduplicationResult<T> { const seen = new Map<string, T>(); for (const item of items) { const existing = seen.get(item.id); if (existing) { seen.set(item.id, preferFn(existing, item)); } else { seen.set(item.id, item); } } const dedupedItems = Array.from(seen.values()); const reductionPercent = items.length > 0 ? Math.round((1 - dedupedItems.length / items.length) * 100) : 0; return { items: dedupedItems, originalCount: items.length, dedupedCount: dedupedItems.length, reductionPercent, }; } // ============================================ // Content-Based Deduplication // ============================================ interface Article { title: string; url: string; domain: string; publishedAt: string; tone?: number; } /** • Generate deduplication key from content • Groups by: normalized title + source country + date */ function generateDedupKey(article: Article): string { const normalizedTitle = article.title .toLowerCase() .replace(/[^\w\s]/g, '') .trim() .slice(0, 50); const dateStr = article.publishedAt?.slice(0, 10).replace(/-/g, '') || 'unknown'; return ${normalizedTitle}|${dateStr}; } /** • Generate unique ID from URL */ function generateEventId(url: string): string { return createHash('md5').update(url).digest('hex').slice(0, 12); } /** • Source reputation scoring */ function getReputationScore(domain: string): number { // Tier 1: Wire services and major international const tier1 = ['reuters.com', 'apnews.com', 'bbc.com', 'bbc.co.uk', 'aljazeera.com', 'france24.com', 'dw.com']; if (tier1.some(r => domain.includes(r))) return 100; // Tier 2: Major newspapers const tier2 = ['nytimes.com', 'washingtonpost.com', 'theguardian.com', 'ft.com', 'economist.com', 'wsj.com']; if (tier2.some(r => domain.includes(r))) return 75; // Tier 3: Regional/national const tier3 = ['cnn.com', 'foxnews.com', 'nbcnews.com', 'abcnews.go.com']; if (tier3.some(r => domain.includes(r))) return 50; return 10; } /** • Select canonical article from duplicate group */ function selectCanonical<T extends Article>( group: { item: T; source: string }[] ): { item: T; source: string } { return group.reduce((best, current) => { const bestScore = getReputationScore(best.item.domain) + Math.abs(best.item.tone || 0); const currentScore = getReputationScore(current.item.domain) + Math.abs(current.item.tone || 0); return currentScore > bestScore ? current : best; }); } /** • Deduplicate articles from multiple sources */ function deduplicateArticles<T extends Article>( sourceResults: { sourceName: string; articles: T[] }[] ): DeduplicationResult<T & { source: string }> { const groups = new Map<string, { item: T; source: string }[]>(); let totalArticles = 0; // Group articles by dedup key for (const { sourceName, articles } of sourceResults) { for (const article of articles) { totalArticles++; const key = generateDedupKey(article); if (!groups.has(key)) { groups.set(key, []); } groups.get(key)!.push({ item: article, source: sourceName }); } } // Select canonical article from each group const items: (T & { source: string })[] = []; for (const group of groups.values()) { const canonical = selectCanonical(group); items.push({ ...canonical.item, source: canonical.source }); } const reductionPercent = totalArticles > 0 ? Math.round((1 - items.length / totalArticles) * 100) : 0; console.log([Dedup] ${totalArticles} → ${items.length} (${reductionPercent}% reduction)); return { items, originalCount: totalArticles, dedupedCount: items.length, reductionPercent, duplicateGroups: groups.size, }; } `

Discussion

0/2000

Loading comments...

Health Signals

MaintenanceCommitted 3mo ago

◐ Stale

Adoption100+ stars on GitHub

757 ★ · Growing

DocsREADME + description

Well-documented

GitHub Signals

Stars757

Forks59

Issues13

Updated3mo ago

View on GitHub

No License

My Fox Den

Community Rating

Works With

Claude Code

Related Skills

Openclaw MCP Server

MCP Server

everything-claude-code

Plugin

receiving-code-review

Skill

systematic-debugging

Skill

View all Skills →