odvcencio/gotreesitter: Pure Go tree-sitter runtime

Pure-Go tree-sitter runtime – no CGO, no C toolchain, WASM-ready.

go get github.com/odvcencio/gotreesitter

Implements the same parse-table format that tree-sitter uses, so existing grammars work without recompilation. It outperforms the CGO binding on every workload: incremental edits (the dominant operation in editors and language servers) are about 90 times faster than with the C implementation.

Every existing Go tree-sitter binding requires CGO. That means:

  • Cross-compilation breaks (GOOS=wasip1, GOARCH=arm64 Linux, Windows without MSYS2)
  • CI pipelines need a C toolchain in every build image
  • go install fails for end users without gcc
  • Race detector, fuzzing, and coverage tools work poorly across CGO boundaries

gotreesitter is pure Go. Run go get and build, on any target, on any platform.

package main

import (
    "fmt"

    "github.com/odvcencio/gotreesitter"
    "github.com/odvcencio/gotreesitter/grammars"
)

func main() {
    src := []byte(`package main

func main() {}
`)

    lang := grammars.GoLanguage()
    parser := gotreesitter.NewParser(lang)

    tree := parser.Parse(src)
    fmt.Println(tree.RootNode())

    // After editing source, reparse incrementally:
    //   tree.Edit(edit)
    //   tree2 := parser.ParseIncremental(newSrc, tree)
}

Tree-sitter's S-expression query language is supported, including predicates and cursor-based streaming. See Known Limitations for current caveats.

q, _ := gotreesitter.NewQuery(`(function_declaration name: (identifier) @fn)`, lang)
cursor := q.Exec(tree.RootNode(), lang, src)

for {
    match, ok := cursor.NextMatch()
    if !ok {
        break
    }
    for _, cap := range match.Captures {
        fmt.Println(cap.Node.Text(src))
    }
}

After the initial parse, re-parse only the changed region – unchanged subtrees are automatically reused.

// Initial parse
tree := parser.Parse(src)

// User types "x" at byte offset 42
src = append(src[:42], append([]byte("x"), src[42:]...)...)

tree.Edit(gotreesitter.InputEdit{
    StartByte:   42,
    OldEndByte:  42,
    NewEndByte:  43,
    StartPoint:  gotreesitter.Point{Row: 3, Column: 10},
    OldEndPoint: gotreesitter.Point{Row: 3, Column: 10},
    NewEndPoint: gotreesitter.Point{Row: 3, Column: 11},
})

// Incremental reparse — ~1.38 μs vs 124 μs for the CGo binding (90x faster)
tree2 := parser.ParseIncremental(src, tree)

Tip: Use grammars.DetectLanguage("main.go") to pick the right grammar by filename, which is useful for editor integration.

hl, _ := gotreesitter.NewHighlighter(lang, highlightQuery)
ranges := hl.Highlight(src)

for _, r := range ranges {
    fmt.Printf("%s: %q\n", r.Capture, src[r.StartByte:r.EndByte])
}

Note: Text predicates (#eq?, #match?, #any-of?, #not-eq?) need the source []byte to evaluate. Passing nil disables predicate checks.

Extract definitions and references from source code:

entry := grammars.DetectLanguage("main.go")
lang := entry.Language()

tagger, _ := gotreesitter.NewTagger(lang, entry.TagsQuery)
tags := tagger.Tag(src)

for _, tag := range tags {
    fmt.Printf("%s %s at %d:%d\n", tag.Kind, tag.Name,
        tag.NameRange.StartPoint.Row, tag.NameRange.StartPoint.Column)
}

Each LangEntry exposes a Quality field indicating how reliable the parse output is:

Quality   Meaning
full      Token source or DFA with external scanner; full fidelity
partial   DFA-partial; external scanner missing, tree may contain silent gaps
none      Cannot be parsed

entries := grammars.AllLanguages()
for _, e := range entries {
    fmt.Printf("%s: %s\n", e.Name, e.Quality)
}

Measured against go-tree-sitter (the standard CGO binding), parsing a Go source file with 500 function definitions.

goos: linux / goarch: amd64 / cpu: Intel(R) Core(TM) Ultra 9 285

# pure-Go parser benchmarks (root module)
go test -run '^$' -bench 'BenchmarkGoParse' -benchmem -count=3

# C baseline benchmarks (cgo_harness module)
cd cgo_harness
go test . -run '^$' -tags treesitter_c_bench -bench 'BenchmarkCTreeSitterGoParse' -benchmem -count=3

Benchmark                                              ns/op       B/op     allocs/op
BenchmarkCTreeSitterGoParseFull                        2,058,000   600      6
BenchmarkCTreeSitterGoParseIncrementalSingleByteEdit   124,100     648      7
BenchmarkCTreeSitterGoParseIncrementalNoEdit           121,100     600      6
BenchmarkGoParseFull                                   1,330,000   10,842   2,495
BenchmarkGoParseIncrementalSingleByteEdit              1,381       361      9
BenchmarkGoParseIncrementalNoEdit                      8.63        0        0

Summary:

Workload                         gotreesitter   CGO binding   Ratio
Full parse                       1,330 μs       2,058 μs      ~1.5x faster
Incremental (single-byte edit)   1.38 μs        124 μs        ~90x faster
Incremental (no-op reparse)      8.6 ns         121 μs        ~14,000x faster

The incremental hot path aggressively reuses subtrees: a single-byte edit reparses in microseconds, while the CGO binding pays the full C-runtime and call overhead. The no-edit fast path exits on a single nil check, with zero allocations and single-digit nanoseconds.


205 grammars ship in the registry. Run go run ./cmd/parity_report for live per-language status.

Current Summary:

  • 204 full – parse without errors (token source, or DFA with full external scanner)
  • 1 partial – norg (requires an external scanner with 122 tokens, not yet implemented)
  • 0 unsupported

Backend Breakdown:

  • 195 dfa – generated DFA lexer, with a hand-written Go external scanner where needed
  • 1 dfa-partial – generated DFA without an external scanner (norg)
  • 9 token_source – hand-written pure-Go lexer bridge (cert, c, go, html, java, json, lua, toml, yaml)

111 languages have hand-written Go external scanners attached via zzz_scanner_attachments.go.

Full language list (205):
ada, agda, angular, apex, arduino, asm, astro, authzed, awk, bash, bass, beancount, bibtex, bicep, bitbake, blade, brightscript, c, c_sharp, caddy, cairo, capnp, chatito, circom, clojure, cmake, cobol, comment, commonlisp, cooklang, corn, cpon, cpp, crystal, css, csv, cuda, cue, cylc, d, dart, desktop, devicetree, dhall, diff, disassembly, djot, dockerfile, dot, doxygen, dtd, earthfile, ebnf, editorconfig, eds, eex, elisp, elixir, elm, elsa, embedded_template, enforce, erlang, facility, faust, fennel, fidl, firrtl, fish, foam, forth, fortran, fsharp, gdscript, git_config, git_rebase, gitattributes, gitcommit, gitignore, gleam, glsl, gn, go, godot_resource, gomod, graphql, groovy, hack, hare, haskell, haxe, hcl, heex, hlsl, html, http, hurl, hyprlang, ini, janet, java, javascript, jinja2, jq, jsdoc, json, json5, jsonnet, julia, just, kconfig, kdl, kotlin, ledger, less, linkerscript, liquid, llvm, lua, luau, make, markdown, markdown_inline, matlab, mermaid, meson, mojo, move, nginx, nickel, nim, ninja, nix, norg, nushell, objc, ocaml, odin, org, pascal, pem, perl, php, pkl, powershell, prisma, prolog, promql, properties, proto, pug, puppet, purescript, python, ql, r, racket, regex, rego, requirements, rescript, robot, ron, rst, ruby, rust, scala, scheme, scss, smithy, solidity, sparql, sql, squirrel, ssh_config, starlark, svelte, swift, tablegen, tcl, teal, templ, textproto, thrift, tlaplus, tmux, todotxt, toml, tsx, turtle, twig, typescript, typst, uxntal, v, verilog, vhdl, vimdoc, vue, wgsl, wolfram, xml, yaml, yuck, zig


Feature                                              Status
Compile + execute (NewQuery, Execute, ExecuteNode)   Supported
Cursor streaming (Exec, NextMatch, NextCapture)      Supported
Structural quantifiers (?, *, +)                     Supported
Alternation ([...])                                  Supported
Field matching (name: (identifier))                  Supported
#eq? / #not-eq?                                      Supported
#match? / #not-match?                                Supported
#any-of? / #not-any-of?                              Supported
#lua-match?                                          Supported
#has-ancestor? / #not-has-ancestor?                  Supported
#not-has-parent?                                     Supported
#is? / #is-not?                                      Supported
#set! / #offset! directives                          Parsed and accepted


As of February 23, 2026, all shipped highlight and tags queries compile in this repo (156/156 non-empty HighlightQuery entries, 69/69 non-empty TagsQuery entries).

There are currently no known query-syntax gaps that block shipped highlight or tag queries.

1 language (norg) requires an external scanner that has not been ported to Go. It parses using the DFA lexer alone, but tokens that need the external scanner are silently skipped. The tree structure is valid but may contain gaps. Check entry.Quality to distinguish full from partial.


1. Add the grammar to grammars/languages.manifest.

2. Generate bindings:

go run ./cmd/ts2go -manifest grammars/languages.manifest -outdir ./grammars -package grammars -compact=true

This regenerates grammars/embedded_grammars_gen.go, grammars/grammar_blobs/*.bin, and the language registration stubs.

3. Add smoke samples to cmd/parity_report/main.go and grammars/parse_support_test.go.

4. Verify:

go run ./cmd/parity_report
go test ./grammars/...

gotreesitter reimplements the tree-sitter runtime in pure Go:

  • parser – table-driven LR(1) with GLR support for ambiguous grammars
  • incremental reuse – cursor-based subtree reuse; unchanged regions skip reparsing entirely
  • arena allocator – slab-based node allocation with ref counting, reducing GC pressure
  • dfa lexer – generated from grammar tables by ts2go, with hand-written bridges where needed
  • external scanner vm – bytecode interpreter for language-specific scanning (Python indentation, etc.)
  • query engine – S-expression pattern matching with predicate evaluation and streaming cursors
  • highlighter – query-based syntax highlighting with incremental support
  • tagger – symbol definition/reference extraction using tags queries

Grammar tables are extracted from upstream tree-sitter parser.c files by the ts2go tool, serialized into compressed binary blobs, and lazy-loaded on first use of a language. No C code runs at parse time.

To avoid embedding blobs in the binary, build with -tags grammar_blobs_external and set GOTREESITTER_GRAMMAR_BLOB_DIR to a directory containing the *.bin grammar blobs. External blob mode uses mmap by default on Unix (set GOTREESITTER_GRAMMAR_BLOB_MMAP=false to disable).

To ship a smaller embedded binary with a curated language set, build with -tags grammar_set_core (the core set includes common languages such as c, go, java, javascript, python, rust, typescript, etc.).

To restrict which languages are registered at runtime (embedded or external), set:

GOTREESITTER_GRAMMAR_SET=go,json,python

For long-running processes, the grammar cache memory is tunable:

// Keep only the 8 most recently used decoded grammars in cache.
grammars.SetEmbeddedLanguageCacheLimit(8)

// Drop one language blob from cache (e.g. "rust.bin").
grammars.UnloadEmbeddedLanguage("rust.bin")

// Drop all decoded grammars from cache.
grammars.PurgeEmbeddedLanguageCache()

You can also set GOTREESITTER_GRAMMAR_CACHE_LIMIT at process start to apply a cache cap without code changes. Set it to 0 only if you explicitly want no retention (every grammar access will decode again).
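
To make the effect of a cache cap concrete, here is a minimal least-recently-used cache sketch. It is not gotreesitter's implementation: the lruCache type and its methods are hypothetical, and the only behavior taken from the text above is that the N most recently used entries are kept and that a limit of 0 retains nothing.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry pairs a cache key with its decoded payload.
type entry struct {
	key   string
	value []byte
}

// lruCache keeps at most limit entries and evicts the least recently used.
type lruCache struct {
	limit int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func newLRU(limit int) *lruCache {
	return &lruCache{limit: limit, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the cached value and marks it as most recently used.
func (c *lruCache) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put stores a value, then evicts from the back until the cap holds.
func (c *lruCache) Put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	for c.order.Len() > c.limit {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

func main() {
	cache := newLRU(2)
	cache.Put("go.bin", []byte("go tables"))
	cache.Put("json.bin", []byte("json tables"))
	cache.Get("go.bin")                          // go.bin becomes most recently used
	cache.Put("rust.bin", []byte("rust tables")) // evicts json.bin

	_, ok := cache.Get("json.bin")
	fmt.Println("json.bin cached:", ok) // prints "json.bin cached: false"
}
```

With a limit of 0, every Put is evicted immediately, which matches the "decode again on every access" behavior described for a zero cap.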

Idle eviction can be enabled with env vars:

GOTREESITTER_GRAMMAR_IDLE_TTL=5m
GOTREESITTER_GRAMMAR_IDLE_SWEEP=30s

Loader compaction/interning is enabled by default and tunable via:

GOTREESITTER_GRAMMAR_COMPACT=true
GOTREESITTER_GRAMMAR_STRING_INTERN_LIMIT=200000
GOTREESITTER_GRAMMAR_TRANSITION_INTERN_LIMIT=20000

The test suite includes:

  • smoke tests – all 205 grammars parse their samples without crashing or producing error nodes
  • correctness snapshots – golden S-expression tests for 20 core languages catch parser and grammar regressions
  • highlight validation – end-to-end tests that compiled highlight queries produce highlight ranges
  • query tests – pattern matching, predicates, cursors, field-based matching
  • parser tests – incremental reparsing, error recovery, GLR ambiguity resolution
  • fuzzing – FuzzGoParseDoesNotPanic for parser robustness

go test ./... -race -count=1

Current: v0.4.0 – 205 grammars, stable parser, incremental reparsing, query engine, highlighting, tagging.

Next:

  • Query engine parity hardening – field-negation semantics, metadata directive behavior, and additional edge-case parity with upstream tree-sitter query execution
  • More hand-written external scanners for high-value dfa-partial languages
  • Parse() (*Tree, error) – return errors instead of silent nil trees
  • Automated parity testing against C tree-sitter output
  • Extend fuzzing to cover more languages and the query engine

MIT


