About
Every language is a universe of thought.
A LingHacks VII edition for keeping them alive.
A language dies every two weeks. By 2100, UNESCO estimates half of the world’s ~7,000 languages will be extinct — each taking with it centuries of irreplaceable knowledge, oral history, and cultural identity. The resources to preserve these languages exist, but they’re scattered across obscure PDFs, YouTube videos, academic papers, and dictionary websites. LangSafe deploys AI agents that autonomously discover, extract, and cross-reference these scattered fragments into a unified, searchable archive. This LingHacks build adds community review and lesson generation so preservation can become revitalization.
At a glance
How it works
Discover
Autonomous agents scour the web for dictionaries, grammars, recordings, and academic papers in endangered languages.
Extract
AI-powered extraction pulls vocabulary, grammar patterns, and audio from diverse sources into structured archives.
Cross-Reference
Intelligent verification links entries across sources, validating accuracy and building comprehensive language records.
The pipeline
Discovery
Featherless-powered agents plan dynamic queries across 6 tiers, combining priority archives, verified public resource patterns, and optional SERP APIs to generate up to 24 targeted discovery paths per language.
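As a rough illustration of tiered planning with a cap, the sketch below expands query templates tier by tier and stops at 24 paths. The tier contents, the templates, and the `plan_discovery_paths` name are hypothetical; the description above only states that there are six tiers and at most 24 paths per language.

```python
MAX_PATHS = 24  # cap stated in the pipeline description

# Hypothetical templates, one list per tier, highest priority first.
TIER_TEMPLATES = [
    ["{lang} language archive", "{lang} language documentation corpus"],
    ["{lang} dictionary", "{lang} wordlist", "{lang} lexicon pdf"],
    ["{lang} grammar", "{lang} grammar sketch pdf"],
    ["{lang} language recordings", "{lang} oral history audio"],
    ["{lang} language academic paper", "{lang} phonology"],
    ["{lang} language lessons", "learn {lang} language"],
]


def plan_discovery_paths(language, alt_names=(), max_paths=MAX_PATHS):
    """Expand templates in tier order and stop once the cap is reached."""
    names = [language, *alt_names]
    seen = set()
    paths = []
    for tier, templates in enumerate(TIER_TEMPLATES, start=1):
        for template in templates:
            for name in names:
                query = template.format(lang=name)
                key = query.casefold()
                if key in seen:
                    continue  # skip duplicate queries
                seen.add(key)
                paths.append({"tier": tier, "query": query})
                if len(paths) == max_paths:
                    return paths
    return paths
```

Filling the budget in priority order means lower tiers are the first to be cut when a language has many alternate names. A real planner might instead reserve a share of the budget for each tier.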
Crawl
Each source is fetched through a 3-tier cascade, tried in order: specialized crawlers first, then BrightData Web Unlocker for protected content, and finally a Stagehand headless browser.
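The cascade can be sketched as a simple ordered fallback. This is a minimal sketch, not the project's implementation: `fetch_with_cascade` and the tier names are hypothetical, and each fetcher is a plain callable standing in for a real crawler, the Web Unlocker, or the headless browser.

```python
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def fetch_with_cascade(url: str, fetchers: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, content) from the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            content = fetch(url)
        except Exception as exc:  # a failing tier should not stop the cascade
            errors.append(f"{name}: {exc}")
            continue
        if content:
            return name, content
        errors.append(f"{name}: empty response")
    raise RuntimeError(f"all tiers failed for {url}: " + "; ".join(errors))


# Stub fetchers for illustration only.
def blocked(url):
    raise PermissionError("403")


def empty(url):
    return ""


def rendered(url):
    return "<html>ok</html>"


tier, html = fetch_with_cascade(
    "https://example.org/dictionary",
    [("specialized", blocked), ("web_unlocker", empty), ("headless_browser", rendered)],
)
```

Returning the tier name alongside the content makes it possible to record which tier succeeded for each source.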
Extraction
Featherless processes each source in a schema-guided tool loop, extracting structured vocabulary entries, grammar patterns, IPA transcriptions, and conjugations.
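One way to picture the schema-guided step is a typed record plus a validator that rejects incomplete tool output. The field names and the `parse_entry` helper below are assumptions for illustration; the actual schema is not given here.

```python
from dataclasses import dataclass, field, fields


@dataclass
class VocabularyEntry:
    # Hypothetical schema; the real field set may differ.
    headword: str
    definition: str
    source_url: str
    part_of_speech: str = ""
    ipa: str = ""
    conjugations: dict = field(default_factory=dict)


REQUIRED = ("headword", "definition", "source_url")


def parse_entry(raw: dict) -> VocabularyEntry:
    """Validate one raw tool-call payload against the schema."""
    missing = [
        key for key in REQUIRED
        if not isinstance(raw.get(key), str) or not raw[key].strip()
    ]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    allowed = {f.name for f in fields(VocabularyEntry)}
    cleaned = {}
    for key, value in raw.items():
        if key not in allowed:
            continue  # drop fields the schema does not define
        cleaned[key] = value.strip() if isinstance(value, str) else value
    return VocabularyEntry(**cleaned)


entry = parse_entry({
    "headword": " wai ",
    "definition": "water",
    "source_url": "https://example.org/hawaiian-dictionary",
    "model_note": "not part of the schema",
})
```

Rejecting payloads at this boundary keeps malformed model output from reaching the archive, at the cost of discarding partially useful entries.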
Cross-Reference
A second Featherless agent searches for duplicate entries across sources, merging definitions and calculating reliability scores.
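The merge step might look roughly like the sketch below, which groups entries by normalized headword and scores each group by the share of sources that attest it. Both the normalization and the scoring formula are assumptions made for illustration, not the project's actual method.

```python
import unicodedata
from collections import defaultdict


def normalize(headword: str) -> str:
    """Unicode-normalize and case-fold so spelling variants group together."""
    return unicodedata.normalize("NFC", headword.strip()).casefold()


def cross_reference(entries, total_sources):
    """Merge duplicate headwords and attach a simple reliability score."""
    if total_sources <= 0:
        raise ValueError("total_sources must be positive")
    grouped = defaultdict(list)
    for entry in entries:
        grouped[normalize(entry["headword"])].append(entry)

    merged = []
    for headword, group in grouped.items():
        definitions = []
        for entry in group:
            if entry["definition"] not in definitions:
                definitions.append(entry["definition"])  # keep first-seen order
        sources = sorted({entry["source_url"] for entry in group})
        merged.append({
            "headword": headword,
            "definitions": definitions,
            "sources": sources,
            # Assumed score: fraction of crawled sources attesting the headword.
            "reliability": round(len(sources) / total_sources, 2),
        })
    return merged


records = cross_reference(
    [
        {"headword": "Wai", "definition": "water", "source_url": "https://example.org/a"},
        {"headword": "wai", "definition": "fresh water", "source_url": "https://example.org/b"},
        {"headword": "moana", "definition": "ocean", "source_url": "https://example.org/a"},
    ],
    total_sources=2,
)
```

Case folding is a blunt instrument: in orthographies where case or diacritics distinguish words, a language-specific normalizer would be needed to avoid merging distinct entries.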
Archive
All data flows into Elasticsearch with Jina AI embeddings for semantic search, reranking, and knowledge graph generation.
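For the semantic search part, a minimal sketch of the request bodies involved is shown below, using Elasticsearch's `dense_vector` field type and top-level `knn` search option. The field names, the helper functions, and the 1024-dimension default are assumptions; the dimension must match whichever embedding model is actually used.

```python
EMBEDDING_DIMS = 1024  # assumption: set to the output size of the embedding model


def build_index_mapping(dims=EMBEDDING_DIMS):
    """Index body with a vector field alongside the text fields."""
    return {
        "mappings": {
            "properties": {
                "language": {"type": "keyword"},
                "headword": {"type": "text"},
                "definitions": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def build_knn_query(query_vector, language, k=10):
    """Search body for approximate nearest-neighbour lookup within one language."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": max(k * 10, 100),
            "filter": {"term": {"language": language}},
        },
        "_source": ["headword", "definitions"],
    }


mapping = build_index_mapping()
query = build_knn_query([0.0] * EMBEDDING_DIMS, language="haw", k=5)
```

These functions only build dictionaries, so they can be passed to whichever Elasticsearch client the project uses. Reranking and knowledge graph generation would happen on the hits this query returns.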
Revitalize
Community reviewers validate entries, flag sensitive material, and generate classroom-ready lesson packs from the archive.
Data sources
Glottolog
The world's most comprehensive catalog of languages, with data on 5,352 endangered languages including geographic coordinates, endangerment status, and language family classification.
Endangered Languages Project
A collaborative platform documenting the world's endangered languages, providing endangerment assessments and preservation resources.
Community Sources
Dictionaries, academic papers, YouTube content, government archives, and wiki resources discovered autonomously by our AI agents.
Built with