Why RDF for Code Analysis?
Most code analysis tools use custom databases or JSON blobs. We chose RDF and SPARQL. Here’s why.
The Problem with Traditional Approaches
When you analyze a codebase, you need to answer questions like:
- Which functions depend on this class?
- What changed between these two versions?
- Where are all the deprecated APIs used?
Traditional tools either invent a custom query language or lock you into a tool-specific data model. Every new tool reinvents the wheel.
RDF as a Universal Model
RDF (Resource Description Framework) is a W3C standard for representing knowledge as graphs. Think of it as a universal connector between different tools and data sources.
With Repolex, we model code as semantic triples:
<Function_foo> rdf:type python:FunctionDef .
<Function_foo> python:hasDocstring <Docstring_123> .
<Docstring_123> dsp:shortDescription "Processes user input" .
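Given triples like these, answering a question such as "what does every function do?" is a plain SPARQL query. A minimal sketch, where the prefix IRIs are illustrative placeholders rather than Repolex's actual namespaces:

```sparql
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX python: <http://example.org/python#>   # illustrative namespace
PREFIX dsp:    <http://example.org/dsp#>      # illustrative namespace

# Find every function and the short description from its docstring.
SELECT ?function ?summary
WHERE {
  ?function rdf:type python:FunctionDef ;
            python:hasDocstring ?doc .
  ?doc dsp:shortDescription ?summary .
}
```

The same graph answers the other questions above with the same query language; only the pattern changes, not the storage or the tooling.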
Why This Matters
Composability: Different analyses create compatible graphs. AST parsing, git history, and dependency tracking all contribute to one unified knowledge base.
Standards-based: SPARQL is a proven query language. OWL reasoning is well-understood. We’re building on decades of semantic web research.
Evolution: As we learn patterns in code, we evolve the ontology. The system gets smarter without code changes.
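As a sketch of what "evolving the ontology" can look like: marking deprecation could be a pure data change. Add a property to the ontology, assert it on the relevant terms, and every existing SPARQL consumer can start filtering on it. The names below are illustrative, not Repolex's published vocabulary:

```turtle
# Illustrative ontology extension: no parser or tool code changes required.
# Assumes the standard rdf:, rdfs:, owl:, and xsd: prefixes are declared.
python:isDeprecated rdf:type owl:DatatypeProperty ;
    rdfs:domain python:FunctionDef ;
    rdfs:range  xsd:boolean .

# Existing instance data can then be annotated in place:
<Function_legacy_parse> python:isDeprecated true .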
The Downsides
RDF isn’t free. There’s complexity in managing ontologies and learning SPARQL. But for a system designed to reason about millions of lines of code across hundreds of repositories, the tradeoffs are worth it.
More on the technical architecture in future posts.