repolex

From Weekend Hack to 173 Languages

Rob Kunkle

x/twitter: @lux
linkedin: linkedin.com/in/robkunkle

ASIMOV DevLabs — Circuit Launch, Oakland
March 2026

How It Started

"I need a tool that lets an agent actually understand a codebase—not just grep through it."

How Agents See Code Today

Current Approach                   With Structured Graphs
grep -r "def " *.py                ?func a python:function_definition
String matching, regex, glob       Typed nodes with relationships
No understanding of structure      AST + LSP + dataflow
Context window = search results    Precise graph queries
Hopes the right file is nearby     Traverses cross-file dependencies

Markdown, grep, glob, regex—these are approximations. Graphs are the actual structure.

So we built it...

1 week. Claude + Python.

We built a parser that turned Python repos into RDF graphs, queryable with SPARQL.

.py files → Python AST + inspect → RDF triples → SPARQL
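The prototype's pipeline can be sketched with Python's stdlib `ast` module. The predicate names (`python:function_definition`, `python:text`, `python:startRow`) mirror the queries on the next slide, but the triple-emitting helper itself is illustrative, not the actual repolex code.

```python
import ast

def file_to_triples(source: str, file_iri: str):
    """Parse Python source and emit (subject, predicate, object)
    triples roughly in the shape the early prototype produced."""
    triples = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            subj = f"{file_iri}#func-{node.lineno}"
            triples.append((subj, "rdf:type", "python:function_definition"))
            triples.append((subj, "python:text", node.name))
            triples.append((subj, "python:startRow", node.lineno))
    return triples

triples = file_to_triples("def hello():\n    pass\n", "file:///demo.py")
```

Dump those triples into any RDF store and the SPARQL query on the next slide works as-is.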

It Worked!

SELECT ?func ?name ?startRow
WHERE {
  ?func a python:function_definition ;
        python:text ?name ;
        python:startRow ?startRow .
}
ORDER BY ?startRow

An agent could ask structured questions about code and get precise answers. No hallucination. No guessing file locations.

Agents are extremely adept at using SPARQL for generalized queries.

Find all functions, trace imports, map class hierarchies—all as graph queries.

What This Unlocked

Dead code & uncalled functions

Query for functions that exist but are never referenced anywhere. Instant cleanup list.

Cross-repo comparison

"Compare our repo to our competitor's." Same graph schema, same queries, side by side.

Great documentation

Agents can generate accurate docs because they have the full structural picture, not just string matches.

General grounded analysis

Everything grounded in actual code facts.

I even asked my friend Dan to try it ...

"Sorry, our codebase is Ruby."

And just like that, the real project began.

The Journey Begins

I knew this was a powerful idea. Store codebases as RDF, give agents SPARQL, and they get a clear view of the full repo.

But how could I make this readily available to everyone?

Not just Python. Not just my repos. A universal code intelligence layer that any agent could use against any codebase, regardless of the language.

The Competitive Analysis (ca. 6 Months Ago)

Is anyone else even doing this already?

It turned out: no, nobody was using SPARQL for code understanding.

Instead I found:

  • VS Code plugins — various tooling for working on a local repo. Tightly coupled to one IDE, not composable.
  • RAG / embedding tools — quick similarity search over code. Good for finding related snippets, but no structural understanding.
  • MCP tools (e.g. Serena) — leverage LSP for code navigation, but a pain to install and use. Not pre-computed or distributable.
  • Context aggregators (e.g. Context7) — pull info from the web to give agents more context. Often rely on LLM aggregation, not grounded in structure.
  • And interestingly... a defunct Java project called "Web of Code" that sought to link all Java code into a single semantic graph.

Problems I Didn't Want to Solve

Not another coding assistant

Cursor, Copilot, Claude Code—that space is covered.

Not realtime live graphs

IDE plugins that parse on every keystroke? That's an LSP's job.

Not another free MCP plugin

How would I even benefit from this?

Not a molecule graph view of code

That no one actually ever uses.

Problems I Did Want to Solve

Look at a repo, at a given commit, and give an agent the full picture.

Agent superpowers

  • Seeing structure across files
  • Seeing dead code / unused files
  • Seeing which functions call a given function
  • Tracing control flow and dataflow throughout the repo

User ergonomics

  • No complicated installs
  • As fast as possible

Universal / Cross-repo

  • Works with any language
  • Dependency code included in analysis

Built for LLMs

  • Tooling developed with LLMs as the primary user
  • Reduce the number of tool calls
  • Reduce token expense

Move to Tree-sitter

173 languages

Tree-sitter gives us fast, incremental AST parsing for basically every language that matters. One parser to rule them all.

Great, 173 languages! But... how do you organize the ontology for 173 different AST schemas?

Python has function_definition, Ruby has method, Go has function_declaration—same concept, different names.

Ontology-Driven Development

The ontology isn't documentation—it's the source of truth that drives everything:

  • Parser field types (terminal vs. structural)
  • RDFS reasoning rules
  • SPARQL prefixes
  • Cross-language unification
  • Auto-injected queries

Change the ontology, and the parser, materializer, and query tool all adapt. Zero code changes.

What the Ontology Looks Like

Language-Specific (Python)

python:function_definition
    a owl:Class ;
    rdfs:subClassOf ts-core:Node ;
    rdfs:subClassOf repolex:function_definition .

python:name
    a owl:DatatypeProperty ;
    python:isTerminalField true .

Cross-Language (Repolex)

repolex:function_definition
    a owl:Class ;
    rdfs:subClassOf ts-core:Node .

# Ruby, Go, Java all map here too:
ruby:method
    rdfs:subClassOf repolex:function_definition .
go:function_declaration
    rdfs:subClassOf repolex:function_definition .

RDFS reasoning materializes the cross-language types at parse time. Query once, get results across all languages.
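What that materialization does can be sketched with a toy subclass map. The real reasoner runs over the OWL/RDFS ontology, not a Python dict; the class names below come from the ontology snippets above.

```python
# Toy rdfs:subClassOf edges, mirroring the ontology snippets above.
SUBCLASS_OF = {
    "python:function_definition": ["repolex:function_definition"],
    "ruby:method": ["repolex:function_definition"],
    "go:function_declaration": ["repolex:function_definition"],
    "repolex:function_definition": ["ts-core:Node"],
}

def superclasses(cls: str) -> set:
    """Transitive closure of rdfs:subClassOf for one class."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def materialize(triples):
    """For every (s, rdf:type, C), also assert (s, rdf:type, Super).
    Afterwards, one query for repolex:function_definition matches
    functions from every language."""
    out = set(triples)
    for s, p, o in triples:
        if p == "rdf:type":
            for sup in superclasses(o):
                out.add((s, "rdf:type", sup))
    return out
```

After materialization, `?f a repolex:function_definition` matches the Ruby method and the Go declaration alike, with no per-language UNION in the query.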

The Big-AST Graph Problem

Great! We can now parse 173 different languages into our AST graph... but it takes a long time, and this graph is... BIG. I wonder what happens when I make a single code change...

Making It Faster: Mirroring Git's Internals

Key insight: Git already content-addresses files. Same content = same blob hash, regardless of filename or location.

Diagram: commits abc123 and def456 each carry a filetree mapping blob_hash → path; the blobs themselves (e.g. blob/e7f2a1, blob/3bc9d4, blob/a1f8e2) hold the pre-parsed ASTs and are shared between commits.

Content-addressed: parse once, reuse forever. Typical repo: 80-90% of blobs unchanged between commits.

Stored as gzipped N-Quads. Append-only commits + content-addressed blobs = efficient incremental updates.
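Git's blob addressing, which repolex mirrors, is just a SHA-1 over a `blob <len>\0` header plus the file content. This stdlib sketch shows why identical files collapse to one parse regardless of path; the cache dict stands in for the real AST-graph store.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Content-address a file exactly the way `git hash-object` does."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Illustrative parse cache: same bytes -> same hash -> reuse the stored graph.
ast_cache = {}

def parse_once(content: bytes):
    key = git_blob_hash(content)
    if key not in ast_cache:
        ast_cache[key] = f"parsed:{key[:8]}"  # stand-in for the real AST graph
    return ast_cache[key]
```

Rename a file, move it, or touch an unrelated commit: the bytes hash the same, so the parsed graph is reused.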

Adding Semantics: LSP via Multilspy

Tree-sitter alone (syntax)

# We see:
import jmespath

# Tree-sitter gives us:
#   import_statement
#     dotted_name: "jmespath"
#
# But where does jmespath
# come from? No idea.

+ LSP resolution (semantics)

# Now we know:
?import a repolex:lsp.import_statement ;
  repolex:scm.resolution_node ?node .

?node ts-core:sourceFile
  "/site-packages/jmespath/__init__.py" ;
  ts-core:startRow 1 .

SCM queries (.scm files) capture resolution nodes from tree-sitter. Multilspy resolves them to actual definitions.

Currently supporting 11 languages via Microsoft's multilspy. Enables call graphs, cross-file navigation.

The Graph Layers

AST

Structural syntax from tree-sitter. Every node, field, relationship in the source code.

?func a python:function_definition

LSP

Semantic resolution via multilspy. Where things are defined, what calls what.

?node a repolex:lsp.call_site

Dataflow

How data moves through the code. Variable assignments, return values, parameter passing.

?assign repolex:flowsTo ?usage

Control Flow

Execution paths, branches, loops. Which code runs under what conditions.

?block repolex:branchesTo ?target

Each layer is a separate named graph. Compose them for richer queries.

Solving the Dependency Problem: Composable Pre-Parsed Graphs

The dependency question: your code imports libraries. Those libraries import other libraries. How do you query across all of them?

your-app (AST + LSP graphs) + flask (pre-parsed graph) + sqlalchemy (pre-parsed graph) → unified SPARQL endpoint: query across everything.

We use deps.dev to resolve dependencies to their actual GitHub repos, and address all code with the org/repo structure. Load multiple graphs into one store. Named graphs keep them organized. SPARQL queries span them all.
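The org/repo addressing can be sketched as follows. The IRI scheme here is hypothetical, chosen only to illustrate one-named-graph-per-repo-per-commit-per-layer; the actual repolex graph naming may differ.

```python
def graph_iri(org: str, repo: str, commit: str, layer: str = "ast") -> str:
    """Hypothetical named-graph IRI: one graph per org/repo/commit/layer."""
    return f"http://example.org/graph/{org}/{repo}/{commit}/{layer}"

# Your app plus its deps.dev-resolved dependencies, loaded side by side
# into one store; SPARQL GRAPH clauses then span all of them.
graphs = [
    graph_iri("acme", "your-app", "abc123"),
    graph_iri("pallets", "flask", "def456"),
    graph_iri("sqlalchemy", "sqlalchemy", "0a1b2c"),
]
```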

What You Can Ask

Dead Code Detection

SELECT ?func ?name
WHERE {
  ?func a repolex:function_definition ;
        ts-core:text ?name .
  FILTER NOT EXISTS {
    ?call repolex:lsp.call_site ?func
  }
}

Import Map

SELECT ?file ?module
WHERE {
  ?node a repolex:lsp.import_statement ;
    repolex:scm.resolution_node ?res .
  ?res ts-core:sourceFile ?file .
  ?node ts-core:text ?module .
}

Architecture Overview

SELECT ?name ?method (COUNT(?call) AS ?n)
WHERE {
  ?class a repolex:class_definition ;
         ts-core:text ?name .
  ?method a repolex:function_definition .
  ?call repolex:lsp.call_site ?method .
} GROUP BY ?name ?method

Cross-Repo Dependencies

SELECT ?yourFunc ?libFunc ?lib
WHERE {
  GRAPH ?yourGraph {
    ?call repolex:lsp.call_site ?libFunc .
    ?call ts-core:sourceFile ?yourFunc . }
  GRAPH ?libGraph {
    ?libFunc a repolex:function_definition . }
}

Distribution on a Budget: repolex-forx

No servers. No cloud compute bills. Just GitHub.

GitHub Action triggered on push → repolex parse (incremental, content-addressed) → .nq.gz files committed to the repo or a GitHub Release → git clone or git pull → instant graphs. Infrastructure cost: $0.

Did you know? If your repos and org are public:

  • Up to 20 GitHub Actions running at once
  • That's 20 fast servers, running 24/7 (comparable to parsing on an M2 Ultra locally)
  • Release assets up to 2GB each, 100MB per repo file
  • Up to 100,000 repos per org
  • This is enough to parse all of open source

github.com/orgs/repolex-forx/repositories

ASIMOV Integration

Repolex fits into ASIMOV as a perception layer—structured knowledge that feeds into ASIMOV's intelligence architecture.

Diagram: the ASIMOV layers (Reasoning; Memory & Identity; Knowledge Graphs) sit on top of Repolex as the perception layer (AST + LSP + Dataflow + CFG).

Verifiable, local-first knowledge. No cloud dependency. The graphs carry provenance—you can trace every triple back to a specific commit, file, and line.

Where We Are

Done

Tree-sitter parsing for 173 languages • Ontology-driven development • Content-addressed blob storage • Git history graphs • RDFS cross-language reasoning • lexq query tool with JSON-LD compaction • SCM query capture • LSP integration (11 languages via multilspy) • Dataflow analysis • Control flow graphs • Slimmer AST graphs

In Progress

LLM query ergonomics • Premade CONSTRUCT queries • repolex-forx GitHub Actions pipeline fine-tuning • ASIMOV module integration

Beyond Code

We've built something that turns Git repositories into composable knowledge graphs.

But Git repos aren't just code.

Documentation repos

Config repos (IaC, K8s)

Data repos

Research paper repos

Legal document repos

Anything versioned in Git

Git repos as context graphs. Pre-parsed, composable, queryable. A public knowledge layer for AI.

repolex.ai

github.com/repolex-ai

Rob Kunkle

x/twitter: @lux
linkedin: linkedin.com/in/robkunkle
rob.kunkle@gmail.com

RDF SPARQL tree-sitter knowledge graphs