repolex

From Weekend Hack to 173 Languages

Rob Kunkle

x/twitter: @lux
linkedin: linkedin.com/in/robkunkle

ASIMOV DevLabs — Circuit Launch, Oakland
March 2026

How It Started

"I need a tool that lets an agent actually understand a codebase—not just grep through it."

How Agents See Code Today

Current Approach                   With Structured Graphs
grep -r "def " *.py                ?func a python:function_definition
String matching, regex, glob       Typed nodes with relationships
No understanding of structure      AST + LSP + dataflow
Context window = search results    Precise graph queries
Hopes the right file is nearby     Traverses cross-file dependencies

Markdown, grep, glob, regex—these are approximations. Graphs are the actual structure.

So we built it...

1 week. Claude + Python.

We built a parser that turned Python repos into RDF graphs, queryable with SPARQL.

.py files → Python AST + inspect → RDF triples → SPARQL
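The prototype's pipeline can be sketched with Python's stdlib `ast` module. The predicate names (`python:function_definition`, `python:text`, `python:startRow`) mirror the queries on the next slide, but the triple-emitting helper itself is illustrative, not the actual repolex code.

```python
import ast

def file_to_triples(source: str, file_iri: str):
    """Parse Python source and emit (subject, predicate, object)
    triples roughly in the shape the early prototype produced."""
    triples = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            subj = f"{file_iri}#func-{node.lineno}"
            triples.append((subj, "rdf:type", "python:function_definition"))
            triples.append((subj, "python:text", node.name))
            triples.append((subj, "python:startRow", node.lineno))
    return triples

triples = file_to_triples("def hello():\n    pass\n", "file:///demo.py")
```

Dump those triples into any RDF store and the SPARQL query on the next slide works as-is.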

It Worked!

SELECT ?func ?name ?startRow
WHERE {
  ?func a python:function_definition ;
        python:text ?name ;
        python:startRow ?startRow .
}
ORDER BY ?startRow

An agent could ask structured questions about code and get precise answers. No hallucination. No guessing file locations.

Agents are extremely adept at using SPARQL for generalized queries.

Find all functions, trace imports, map class hierarchies—all as graph queries.

What This Unlocked

Dead code & uncalled functions

Query for functions that exist but are never referenced anywhere. Instant cleanup list.

Cross-repo comparison

"Compare our repo to our competitor's." Same graph schema, same queries, side by side.

Great documentation

Agents can generate accurate docs because they have the full structural picture, not just string matches.

General grounded analysis

Everything grounded in actual code facts.

I even asked my friend Dan to try it ...

"Sorry, our codebase is Ruby."

And just like that, the real project began.

The Journey Begins

I knew this was a powerful idea. Store codebases as RDF, give agents SPARQL, and they get a clear view of the full repo.

But how could I make this readily available to everyone?

Not just Python. Not just my repos. A universal code intelligence layer that any agent could use against any codebase, regardless of the language.

The Competitive Analysis (ca. 6 Months Ago)

Is anyone else even doing this already?

It turned out: no, nobody was using SPARQL for code understanding.

Instead I found:

  • VS Code plugins — various tooling for working on a local repo. Tightly coupled to one IDE, not composable.
  • RAG / embedding tools — quick similarity search over code. Good for finding related snippets, but no structural understanding.
  • MCP tools (e.g. Serena) — leverage LSP for code navigation, but a pain to install and use. Not pre-computed or distributable.
  • Context aggregators (e.g. Context7) — pull info from the web to give agents more context. Often rely on LLM aggregation, not grounded in structure.
  • And interestingly... a defunct Java project called "Web of Code" that sought to link all Java code into a single semantic graph.

Problems I Didn't Want to Solve

Not another coding assistant

Cursor, Copilot, Claude Code—that space is covered.

Not realtime live graphs

IDE plugins that parse on every keystroke? That's an LSP's job.

Not another free MCP plugin

How would I even benefit from this?

Not a molecule graph view of code

That no one actually ever uses.

Problems I Did Want to Solve

Look at a repo, at a given commit, and give an agent the full picture.

Agent superpowers

  • Seeing structure across files
  • Seeing dead code / unused files
  • Seeing which functions call a given function
  • Tracing control flow and dataflow throughout the repo

User ergonomics

  • No complicated installs
  • As fast as possible

Universal / Cross-repo

  • Works with any language
  • Dependency code included in analysis

Built for LLMs

  • Tooling developed with LLMs as the primary user
  • Reduce the number of tool calls
  • Reduce token expense

Move to Tree-sitter

173 languages

Tree-sitter gives us fast, incremental AST parsing for basically every language that matters. One parser to rule them all.

Great, 173 languages! But... how do you organize the ontology for 173 different AST schemas?

Python has function_definition, Ruby has method, Go has function_declaration—same concept, different names.

Ontology-Driven Development

The ontology isn't documentation—it's the source of truth that drives everything:

  • Parser field types (terminal vs. structural)
  • RDFS reasoning rules
  • SPARQL prefixes
  • Cross-language unification
  • Auto-injected queries

Change the ontology, and the parser, materializer, and query tool all adapt. Zero code changes.

What the Ontology Looks Like

Language-Specific (Python)

python:function_definition
    a owl:Class ;
    rdfs:subClassOf ts-core:Node ;
    rdfs:subClassOf repolex:function_definition .

python:name
    a owl:DatatypeProperty ;
    python:isTerminalField true .

Cross-Language (Repolex)

repolex:function_definition
    a owl:Class ;
    rdfs:subClassOf ts-core:Node .

# Ruby, Go, Java all map here too:
ruby:method
    rdfs:subClassOf repolex:function_definition .
go:function_declaration
    rdfs:subClassOf repolex:function_definition .

RDFS reasoning materializes the cross-language types at parse time. Query once, get results across all languages.
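What that materialization does can be sketched with a toy subclass map. The real reasoner runs over the OWL/RDFS ontology, not a Python dict; the class names below come from the ontology snippets above.

```python
# Toy rdfs:subClassOf edges, mirroring the ontology snippets above.
SUBCLASS_OF = {
    "python:function_definition": ["repolex:function_definition"],
    "ruby:method": ["repolex:function_definition"],
    "go:function_declaration": ["repolex:function_definition"],
    "repolex:function_definition": ["ts-core:Node"],
}

def superclasses(cls: str) -> set:
    """Transitive closure of rdfs:subClassOf for one class."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def materialize(triples):
    """For every (s, rdf:type, C), also assert (s, rdf:type, Super).
    Afterwards, one query for repolex:function_definition matches
    functions from every language."""
    out = set(triples)
    for s, p, o in triples:
        if p == "rdf:type":
            for sup in superclasses(o):
                out.add((s, "rdf:type", sup))
    return out
```

After materialization, `?f a repolex:function_definition` matches the Ruby method and the Go declaration alike, with no per-language UNION in the query.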

The Big-AST Graph Problem

Great! We can now parse 173 different languages into our AST graph... but it takes a long time, and this graph is... BIG. I wonder what happens when I make a single code change...

Making It Faster: Mirroring Git's Internals

Key insight: Git already content-addresses files. Same content = same blob hash, regardless of filename or location.

Diagram: commits abc123 and def456 each carry a filetree mapping blob_hash → path; the blobs themselves (e.g. blob/e7f2a1, blob/3bc9d4, blob/a1f8e2) hold the pre-parsed ASTs and are shared between commits.

Content-addressed: parse once, reuse forever. Typical repo: 80-90% of blobs unchanged between commits.

Stored as gzipped N-Quads. Append-only commits + content-addressed blobs = efficient incremental updates.
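Git's blob addressing, which repolex mirrors, is just a SHA-1 over a `blob <len>\0` header plus the file content. This stdlib sketch shows why identical files collapse to one parse regardless of path; the cache dict stands in for the real AST-graph store.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Content-address a file exactly the way `git hash-object` does."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Illustrative parse cache: same bytes -> same hash -> reuse the stored graph.
ast_cache = {}

def parse_once(content: bytes):
    key = git_blob_hash(content)
    if key not in ast_cache:
        ast_cache[key] = f"parsed:{key[:8]}"  # stand-in for the real AST graph
    return ast_cache[key]
```

Rename a file, move it, or touch an unrelated commit: the bytes hash the same, so the parsed graph is reused.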

Adding Semantics: LSP via Multilspy

Tree-sitter alone (syntax)

# We see:
import jmespath

# Tree-sitter gives us:
#   import_statement
#     dotted_name: "jmespath"
#
# But where does jmespath
# come from? No idea.

+ LSP resolution (semantics)

# Now we know:
?import a repolex:lsp.import_statement ;
  repolex:scm.resolution_node ?node .

?node ts-core:sourceFile
  "/site-packages/jmespath/__init__.py" ;
  ts-core:startRow 1 .

SCM queries (.scm files) capture resolution nodes from tree-sitter. Multilspy resolves them to actual definitions.

Currently supporting 11 languages via Microsoft's multilspy. Enables call graphs, cross-file navigation.

The Graph Layers

AST

Structural syntax from tree-sitter. Every node, field, relationship in the source code.

?func a python:function_definition

LSP

Semantic resolution via multilspy. Where things are defined, what calls what.

?node a repolex:lsp.call_site

Dataflow

How data moves through the code. Variable assignments, return values, parameter passing.

?assign repolex:flowsTo ?usage

Control Flow

Execution paths, branches, loops. Which code runs under what conditions.

?block repolex:branchesTo ?target

Each layer is a separate named graph. Compose them for richer queries.

Solving the Dependency Problem: Composable Pre-Parsed Graphs

The dependency question: your code imports libraries. Those libraries import other libraries. How do you query across all of them?

your-app (AST + LSP graphs) + flask (pre-parsed graph) + sqlalchemy (pre-parsed graph) → unified SPARQL endpoint: query across everything.

We use deps.dev to resolve dependencies to their actual GitHub repos, and address all code with the org/repo structure. Load multiple graphs into one store. Named graphs keep them organized. SPARQL queries span them all.
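The org/repo addressing can be sketched as follows. The IRI scheme here is hypothetical, chosen only to illustrate one-named-graph-per-repo-per-commit-per-layer; the actual repolex graph naming may differ.

```python
def graph_iri(org: str, repo: str, commit: str, layer: str = "ast") -> str:
    """Hypothetical named-graph IRI: one graph per org/repo/commit/layer."""
    return f"http://example.org/graph/{org}/{repo}/{commit}/{layer}"

# Your app plus its deps.dev-resolved dependencies, loaded side by side
# into one store; SPARQL GRAPH clauses then span all of them.
graphs = [
    graph_iri("acme", "your-app", "abc123"),
    graph_iri("pallets", "flask", "def456"),
    graph_iri("sqlalchemy", "sqlalchemy", "0a1b2c"),
]
```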

What You Can Ask

Dead Code Detection

SELECT ?func ?name
WHERE {
  ?func a repolex:function_definition ;
        ts-core:text ?name .
  FILTER NOT EXISTS {
    ?call repolex:lsp.call_site ?func
  }
}

Import Map

SELECT ?file ?module
WHERE {
  ?node a repolex:lsp.import_statement ;
    repolex:scm.resolution_node ?res .
  ?res ts-core:sourceFile ?file .
  ?node ts-core:text ?module .
}

Architecture Overview

SELECT ?name ?method (COUNT(?call) AS ?n)
WHERE {
  ?class a repolex:class_definition ;
         ts-core:text ?name .
  ?method a repolex:function_definition .
  ?call repolex:lsp.call_site ?method .
} GROUP BY ?name ?method

Cross-Repo Dependencies

SELECT ?yourFunc ?libFunc ?lib
WHERE {
  GRAPH ?yourGraph {
    ?call repolex:lsp.call_site ?libFunc .
    ?call ts-core:sourceFile ?yourFunc . }
  GRAPH ?libGraph {
    ?libFunc a repolex:function_definition . }
}

Distribution on a Budget: repolex-forx

No servers. No cloud compute bills. Just GitHub.

GitHub Action triggered on push → repolex parse (incremental, content-addressed) → .nq.gz files committed to the repo or a GitHub Release → git clone or git pull → instant graphs. Infrastructure cost: $0.

Did you know? If your repos and org are public:

  • Up to 20 GitHub Actions running at once
  • That's 20 fast servers, running 24/7 (comparable to parsing on an M2 Ultra locally)
  • Release assets up to 2GB each, 100MB per repo file
  • Up to 100,000 repos per org
  • This is enough to parse all of open source

github.com/orgs/repolex-forx/repositories

ASIMOV Integration

Repolex fits into ASIMOV as a perception layer—structured knowledge that feeds into ASIMOV's intelligence architecture.

Diagram: the ASIMOV layers (Reasoning; Memory & Identity; Knowledge Graphs) sit on top of Repolex as the perception layer (AST + LSP + Dataflow + CFG).

Verifiable, local-first knowledge. No cloud dependency. The graphs carry provenance—you can trace every triple back to a specific commit, file, and line.

Where We Are

Done

Tree-sitter parsing for 173 languages • Ontology-driven development • Content-addressed blob storage • Git history graphs • RDFS cross-language reasoning • lexq query tool with JSON-LD compaction • SCM query capture • LSP integration (11 languages via multilspy) • Dataflow analysis • Control flow graphs • Slimmer AST graphs

In Progress

LLM query ergonomics • Premade CONSTRUCT queries • repolex-forx GitHub Actions pipeline fine-tuning • ASIMOV module integration

Beyond Code

We've built something that turns Git repositories into composable knowledge graphs.

But Git repos aren't just code.

Documentation repos

Config repos (IaC, K8s)

Data repos

Research paper repos

Legal document repos

Anything versioned in Git

Git repos as context graphs. Pre-parsed, composable, queryable. A public knowledge layer for AI.

repolex.ai

github.com/repolex-ai

Rob Kunkle

x/twitter: @lux
linkedin: linkedin.com/in/robkunkle
rob.kunkle@gmail.com

RDF SPARQL tree-sitter knowledge graphs