@cocoindex-io

CocoIndex

Real-time data transformation framework for AI

Enterprise corpus — codebase, Slack, meeting notes, and documentation — flowing continuously through the CocoIndex incremental sync engine into a production AI agent with always-fresh context. Only the Δ (delta) is reprocessed on every change.

Your agents deserve fresh context.

Star us ❤️ → Star CocoIndex on GitHub  ·  cocoindex.io homepage  ·  CocoIndex documentation  ·  Join the CocoIndex Discord

CocoIndex turns codebases, meeting notes, inboxes, Slack, PDFs, and videos into live, continuously fresh context for your AI agents and LLM apps to reason over effectively — with minimal incremental processing. Get your production AI agent ready in 10 minutes with reliable, continuously fresh data — no stale batches, no context gap.

Incremental · only the delta  ·  Any scale · parallel by default  ·  Declarative · Python, 5 min



Built with CocoIndex ❤️

CocoIndex-code is the flagship MCP server for AI coding agents: an AST-aware, incremental semantic code index that keeps live call graphs, symbols, vectors, and chunks fresh on every commit. It delivers 70% fewer tokens per turn, 80-90% cache hits on re-index, and sub-second freshness, and it supports Python, TypeScript, Rust, and Go. Features include Δ-only incremental processing, semantic search by meaning (not grep), call graphs and blast-radius analysis, and a global repo view for spotting duplicates and understanding architecture. Use it to build coding agents (generate, refactor) and code-review agents (catch, approve). With one install, Claude Code, Cursor, and other MCP-aware agents see your whole repository instantly.

See all 20+ examples · updated every week →



React — for data engineering

The CocoIndex mental model is Target = F(Source): a persistent-state-driven dataflow where you declare the desired target state and the engine keeps it in sync with the latest source data and code, forever, at low latency and low cost.

When either side changes, CocoIndex tracks per-row provenance so that the Δ propagates at minimum cost. A source change re-syncs only the affected target rows; a code change re-runs only the rows whose outputs depend on the changed code.
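The behavior described above can be sketched in plain Python. This is a toy model, not the CocoIndex API: `IncrementalSync`, `fingerprint`, and `code_version` are hypothetical names, and with a single transform a code change invalidates every row, which is coarser than the per-output dependency tracking the text describes.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a source row changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class IncrementalSync:
    """Toy engine that keeps target == {key: transform(source[key])}."""

    def __init__(self, transform: Callable[[str], str], code_version: str) -> None:
        self.transform = transform
        self.code_version = code_version  # assumed label for the transform's code
        self.target: Dict[str, str] = {}
        # Provenance per target row: (source fingerprint, code version).
        self.provenance: Dict[str, Tuple[str, str]] = {}

    def sync(self, source: Dict[str, str]) -> List[str]:
        """Bring the target in line with `source`; return the recomputed keys."""
        recomputed: List[str] = []
        for key, value in source.items():
            stamp = (fingerprint(value), self.code_version)
            if self.provenance.get(key) != stamp:
                self.target[key] = self.transform(value)
                self.provenance[key] = stamp
                recomputed.append(key)
        # Rows that disappeared from the source are dropped from the target.
        for key in list(self.target):
            if key not in source:
                del self.target[key]
                del self.provenance[key]
        return recomputed


engine = IncrementalSync(str.upper, code_version="v1")
source = {"a": "hello", "b": "world"}
first = engine.sync(source)    # both rows are new

source["b"] = "there"
second = engine.sync(source)   # only row "b" changed

engine.transform, engine.code_version = str.title, "v2"
third = engine.sync(source)    # the code changed, so dependent rows re-run
```

Storing the stamp next to each target row is what lets the engine skip unchanged work: a row is recomputed only when its source fingerprint or the code version it was built with differs from what was recorded.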

See the React ↔ CocoIndex mental model →



Incremental engine for long-horizon agents

Data transformation for any engineer, designed for AI workloads —
with a smart incremental engine for always-fresh, explainable data.

Learn the concept

CocoIndex's Python-native transformation flows connect 8 source categories through the incremental engine to 6 target stores. Only the Δ is reprocessed: an unchanged source hits the cache, while a changed source re-runs split() and re-embeds only the Δ.
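A minimal sketch of that cache behavior at chunk level, in plain Python rather than the CocoIndex API. The paragraph-based `split`, the hash-derived `fake_embed`, and the `ChunkCache` class are all assumptions made for illustration; a real flow would call an actual embedding model.

```python
import hashlib
from typing import Dict, List


def split(text: str) -> List[str]:
    """Hypothetical splitter: one chunk per blank-line-separated paragraph."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def fake_embed(chunk: str) -> List[float]:
    """Stand-in for an embedding model: four numbers derived from a hash."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]


class ChunkCache:
    """Embeds a document chunk by chunk, reusing vectors for unchanged chunks."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}
        self.embed_calls = 0  # counts how often the (expensive) embed step ran

    def embed_document(self, text: str) -> List[List[float]]:
        result = []
        for chunk in split(text):
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            if key not in self.vectors:
                # Cache miss: this chunk is part of the delta.
                self.vectors[key] = fake_embed(chunk)
                self.embed_calls += 1
            result.append(self.vectors[key])
        return result


cache = ChunkCache()
doc_v1 = "Intro paragraph.\n\nInstall steps.\n\nFAQ."
cache.embed_document(doc_v1)
calls_after_v1 = cache.embed_calls   # every chunk is new

doc_v2 = "Intro paragraph.\n\nInstall steps, revised.\n\nFAQ."
vectors_v2 = cache.embed_document(doc_v2)
calls_after_v2 = cache.embed_calls   # only the edited paragraph is re-embedded
```

Keying the cache on chunk content rather than on the whole document is the design choice that keeps the cost proportional to the edit instead of to the document size.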



Why incremental?

Your agents are only as good as the data they see.
Batch pipelines drift stale. CocoIndex stays live — and only runs the Δ.

Sub-second freshness, 10× cheaper at scale, explainable by default, and a production-grade Rust core with retries, back-off, dead-letter queues, and no-data-loss guarantees.



What can you build?


Working starters from the examples tree: clone, plug in your source, ship.

Real-time code index — walk a git repo, AST-chunk source files, embed with sentence-transformers, upsert to pgvector / LanceDB, incremental on every commit.

PDF → RAG index — ingest PDFs from local, S3, or GDrive, extract + chunk text, embed chunks, upsert to pgvector / LanceDB. Classic retrieval-augmented-generation stack, incremental.

HN trending topics — pull Hacker News threads via Algolia, recursively parse comments, LLM-extract topics with Gemini 2.5 Flash, rank by weighted hit count, store in Postgres. Incremental.

Conversation → knowledge graph — LLM extracts people, topics, decisions, action items from transcripts and upserts into Neo4j / Kuzu. Live graph, incremental.

Multi-repo summarization — walk N git repos, extract structure, LLM-summarize per-repo + a rolled-up org summary, refresh on every push.

Structured extraction — BAML / DSPy typed schema extraction from forms, PDFs, intakes, invoices into Postgres / warehouse. Incremental.

Podcast → knowledge graph — transcribe YouTube / Spotify audio with speaker diarization, LLM-extract speakers and statements, resolve entities across episodes, store in SurrealDB / Neo4j.

CSV → Kafka live — watch a folder of CSV files, publish each row as a JSON message to a Kafka topic via CocoIndex's Kafka target connector. Incremental, sub-second, no producer loop.
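To give a feel for the AST-chunking step named in the real-time code index starter above, here is a rough sketch that uses only Python's standard `ast` module. It is a stand-in under stated assumptions: it handles Python source only and keeps just top-level functions and classes, and `ast_chunks` and `SAMPLE` are hypothetical names. It does not reproduce the parser or chunking rules the starter itself uses.

```python
import ast
from typing import List, Tuple


def ast_chunks(source: str) -> List[Tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks: List[Tuple[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment slices the original text using the node's
            # recorded start and end positions (Python 3.8+).
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks


SAMPLE = '''import os

def load(path):
    return open(path).read()

class Index:
    def add(self, doc):
        pass
'''

chunks = ast_chunks(SAMPLE)
names = [name for name, _ in chunks]
```

Chunking along syntactic boundaries means an edit inside one function changes one chunk, so only that chunk needs to be re-embedded on the next commit.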


Share what you build

Building something with CocoIndex? We want to see it.
Tag @cocoindex_io on X or drop a link in #showcase on Discord. We'll boost it. 🥥


Community

Join the CocoIndex Discord community: live chat with maintainers and users, showcase your projects, get help building RAG pipelines and knowledge graphs.

Subscribe to the CocoIndex YouTube channel: video tutorials, live demos, architecture deep dives, and AI agent recipes.

Read the CocoIndex blog: engineering deep dives, release notes, RAG and knowledge graph tutorials, and case studies.

Follow @cocoindex_io on X (formerly Twitter) for release notes, demos, launches, and AI data pipeline updates.

📝 Contributing guide  ·  🐛 good first issues  ·  💬 Say hi on Discord


Apache 2.0 · © CocoIndex contributors 🥥

Popular repositories

  1. cocoindex (Public)

    Incremental engine for long-horizon agents. 🌟 Star if you like it!

    Python · 7.1k stars · 506 forks

  2. cocoindex-code (Public)

    A super lightweight embedded code search engine CLI (AST-based) that just works. Saves 70% of tokens and improves speed for coding agents. 🌟 Star if you like it!

    Python · 1.5k stars · 105 forks

  3. cocoindex-claude (Public)

    ✨ CocoIndex Claude Code Skill ✨

    58 stars · 7 forks

  4. realtime-codebase-indexing (Public)

    Build a codebase index with tree-sitter. Works with large codebases and can be updated in near real time with incremental processing: only what has changed is reprocessed.

    Python · 49 stars · 3 forks

  5. meeting-notes-knowledge-graph (Public)

    Build a meeting knowledge graph from Google Drive using LLM extraction and a graph database, with automatic continuous updates.

    Python · 31 stars · 1 fork

  6. patient-intake-extraction (Public)

    Patient intake form extraction using an LLM.

    Python · 15 stars · 2 forks
