
Performance: 47% faster parse+render, 60% fewer allocations#2056

Open
tobi wants to merge 81 commits into main from autoresearch/liquid-perf-2026-03-11

Conversation

@tobi tobi commented Mar 11, 2026

Summary

47% faster combined parse+render time, 60% fewer object allocations on the ThemeRunner benchmark (real Shopify theme templates with production-like data). Zero test regressions — all 974 unit tests pass.

Metric                    Main       This PR    Change
Combined (parse+render)   7,488µs    3,967µs    -47%
Parse time                5,928µs    2,803µs    -53%
Render time               1,481µs    1,161µs    -22%
Object allocations        62,620     24,881     -60%

Measured with YJIT enabled on Ruby 3.4, using performance/bench_quick.rb (best of 3 runs, 10 iterations each with GC disabled, after 20-iteration warmup).

Methodology

This PR was developed through 85 automated experiments using an autoresearch loop: edit → commit → run tests → benchmark → keep/discard. Each change was validated against the full unit test suite before benchmarking. Changes that regressed either correctness or the primary metric were reverted immediately.

The approach was allocation-driven: profile where objects are created, eliminate the ones that aren't needed, and defer the ones that are. Ruby's GC scanning time dominates at these scales, so every avoided allocation also reduces GC pressure, and the savings compound.

Architecture: the Cursor class

The headline architectural change is Liquid::Cursor — a StringScanner wrapper with higher-level methods tuned for Liquid's grammar. One Cursor instance lives on each ParseContext and is reused across all tag/variable/expression parsing within a template.

cursor = parse_context.cursor
cursor.reset(markup)
cursor.skip_ws
tag_name = cursor.scan_tag_name   # C-level regex scan
cursor.expect_id("in")            # zero-alloc: regex skip + byte compare
cursor.skip_fragment              # zero-alloc: regex skip

Key design: scan_* methods return strings (allocate), skip_* / expect_* methods return lengths or booleans (zero-alloc). Methods delegate to C-level StringScanner.scan/skip with compiled regexes — benchmarking showed this is 2-3x faster than Ruby-level peek_byte/scan_byte loops.

This replaces ~150 scattered getbyte/byteslice calls across BlockBody, Variable, If, For with a shared vocabulary. It's also the foundation for eventual single-pass parsing — the Cursor can be advanced forward through an entire template source without intermediate token arrays.
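The shape of the class is roughly the following — a minimal sketch assuming a StringScanner-backed implementation. The method names (`reset`, `skip_ws`, `scan_id`, `expect_id`) follow the PR's described vocabulary, but the bodies here are illustrative, not the PR's exact code:

```ruby
require "strscan"

# Minimal Cursor sketch: scan_* methods return strings (allocate),
# skip_*/expect_* methods return lengths or booleans (zero-alloc).
class Cursor
  ID_REGEX = /[a-zA-Z_][\w-]*\??/
  WS_REGEX = /\s+/

  def initialize
    @ss = StringScanner.new("")
  end

  def reset(source)
    @ss.string = source   # reuse the one scanner across all parsing
    self
  end

  def skip_ws
    @ss.skip(WS_REGEX)    # C-level regex skip, no MatchData, no string
  end

  def scan_id
    @ss.scan(ID_REGEX)    # C-level regex scan; allocates only the result
  end

  # Zero-alloc: skip past an identifier, then byte-compare against the
  # expected word; rewind on mismatch.
  def expect_id(word)
    start = @ss.pos
    len = @ss.skip(ID_REGEX)
    return true if len == word.bytesize && @ss.string.byteslice(start, len) == word
    @ss.pos = start
    false
  end
end
```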

What changed (by impact)

Parse optimizations (~53% faster, ~38K fewer allocs)

Replace regex with byte-level parsing, then regex-delegate via Cursor. The original code used =~ regex matching with Regexp.last_match captures for tag tokens, variable lookups, for tag syntax, if conditions, and number literals. Each =~ call creates a MatchData object. Replaced with forward-only scanning via Cursor, which uses C-level StringScanner.scan/skip with compiled regexes — no MatchData, no Ruby-level byte loops:

  • BlockBody.parse_tag_token: FullToken regex → Cursor scan_tag_name + position math
  • VariableLookup.scan_variable: VariableParser regex → manual byte scanner
  • For#lax_parse: Syntax regex → Cursor skip_id/expect_id/scan_fragment
  • If#lax_parse: SIMPLE_CONDITION regex → Cursor parse_simple_condition
  • Expression.parse_number: INTEGER_REGEX/FLOAT_REGEX → Cursor scan_number
  • Variable.simple_variable_markup: getbyte chain replaces regex for identifier detection

Fast-path Variable initialization. 100% of variables in the benchmark (1,197) now parse through try_fast_parse — a byte-level scanner that extracts the name expression and filter chain without touching the Lexer or Parser. Zero Lexer/Parser fallbacks — even multi-argument filters like pluralize: 'item', 'items' are scanned directly with comma-separated arg handling. Only keyword arguments (key: value) would fall through (none appear in the benchmark templates).
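As a rough illustration of the fast-path shape (structure only, not the allocation behavior): `try_fast_parse` is the PR's name, but this sketch is an assumption — it handles only no-argument filters and signals fallback by returning nil, whereas the PR's version also scans comma-separated positional arguments:

```ruby
# Hypothetical simplified fast path: extract "name | filter | filter" without
# invoking a full Lexer/Parser. Returns nil when the markup is too complex,
# so the caller can fall back to the general parse path.
def try_fast_parse(markup)
  parts = markup.split("|")
  name = parts.shift.strip
  return nil unless name.match?(/\A[a-zA-Z_][\w.]*\z/)

  filters = parts.map do |f|
    f = f.strip
    # This sketch bails on any filter argument (the real fast path does not).
    return nil unless f.match?(/\A[a-z_]\w*\z/)
    [f, []]
  end
  [name, filters]
end
```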

Cached no-arg filter tuples. The [filtername, EMPTY_ARRAY] tuple for no-argument filters (75% of all filter calls) is now frozen and cached per filter name via NO_ARG_FILTER_CACHE. Saves ~650 array allocations.
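The cache idea can be sketched like this — `NO_ARG_FILTER_CACHE` is the PR's constant name; the surrounding code is an assumed illustration:

```ruby
EMPTY_ARRAY = [].freeze

# Lazily build one frozen [name, EMPTY_ARRAY] tuple per filter name and
# hand out the same object on every subsequent no-arg use of that filter.
NO_ARG_FILTER_CACHE = Hash.new do |cache, name|
  cache[name] = [name, EMPTY_ARRAY].freeze
end

def filter_tuple(name, args)
  args.empty? ? NO_ARG_FILTER_CACHE[name] : [name, args]
end
```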

Fast-path VariableLookup. Simple identifier chains (product.title, forloop.index) skip scan_variable entirely. A simple_lookup? byte check validates the pattern, then byteslice + dot-splitting creates the lookups array directly. For single-name variables (product), @lookups = Const::EMPTY_ARRAY — zero-alloc.
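A hedged sketch of the dotted-identifier fast path — the validation and splitting logic here is illustrative (the PR validates with a byte check rather than a regex):

```ruby
EMPTY_ARRAY = [].freeze

# Matches plain dotted identifier chains like "product.title".
SIMPLE_LOOKUP_REGEX = /\A[a-zA-Z_][\w-]*(\.[a-zA-Z_][\w-]*)*\z/

# Returns [name, lookups] for simple chains, nil for anything needing the
# general scan_variable path. Single names share a frozen empty array.
def parse_simple(markup)
  return nil unless SIMPLE_LOOKUP_REGEX.match?(markup)
  return [markup, EMPTY_ARRAY] unless markup.include?(".")

  parts = markup.split(".")
  [parts.first, parts.drop(1)]
end
```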

Avoid unnecessary string allocations. Expression.parse skips strip when no leading/trailing whitespace. Variable fast-path reuses the markup string directly when no trimming is needed (avoids byteslice). blank_string? uses match? regex instead of byte loop.

Render optimizations (~22% faster, ~3K fewer allocs)

Splat-free filter invocation. Filters without arguments (| escape, | strip_html — 75% of all filter calls) now use invoke_single(method, input) which avoids the *args array allocation. Single-arg filters use invoke_two. Only 59 calls per render still need the splat path.
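`invoke_single`/`invoke_two` are the PR's method names; the `Filters` module here is a stand-in to make the dispatch shape concrete:

```ruby
# Stand-in filter module (real filters live in StandardFilters).
module Filters
  module_function

  def upcase(input) = input.to_s.upcase
  def truncate(input, len) = input.to_s[0, len]
end

# Arity-specialized dispatch: no *args array is allocated for the
# common 0- and 1-argument filter calls.
def invoke_single(method, input)
  Filters.public_send(method, input)
end

def invoke_two(method, input, arg)
  Filters.public_send(method, input, arg)
end

# General splat path, kept for filters with two or more arguments.
def invoke(method, input, *args)
  Filters.public_send(method, input, *args)
end
```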

Primitive type fast paths. find_variable returns immediately for String, Integer, Float, Array, Hash, nil, true, false — skipping to_liquid (which returns self for all of these) and respond_to?(:context=) checks. Same optimization in VariableLookup#evaluate for hash key lookups and result handling. to_liquid_value skipped for String/Integer keys.
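The fast path can be sketched as a type check ahead of the general drop-handling logic — a simplified assumption of what a `find_variable`-style conversion does, not the PR's exact code:

```ruby
# Primitives are returned immediately: to_liquid returns self for all of
# them, so the respond_to? checks below would be pure overhead.
def liquify(value, context)
  case value
  when String, Integer, Float, Array, Hash, NilClass, TrueClass, FalseClass
    value
  else
    v = value.respond_to?(:to_liquid) ? value.to_liquid : value
    v.context = context if v.respond_to?(:context=)
    v
  end
end

# A minimal drop-like object to exercise the slow path.
class FakeDrop
  attr_accessor :context
  def to_liquid = self
end
```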

Hash fast-path in VariableLookup. instance_of?(Hash) check before the general respond_to?(:[]) / respond_to?(:key?) chain — hashes are the most common lookup target.

Context#find_variable optimizations. Top-scope fast path (most common in for loops). Single-scope shortcut — when only one scope exists, skip find_index and go straight to environments.

Cached small integer to_s. Utils.to_s returns pre-computed frozen strings for integers 0-999, avoiding 267 Integer#to_s allocations per render cycle.
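The cache is straightforward to sketch (constant and method names here are assumptions):

```ruby
# One frozen string per integer 0..999, computed once at load time.
SMALL_INT_STRINGS = (0..999).map { |i| i.to_s.freeze }.freeze

def int_to_s(obj)
  if obj.is_a?(Integer) && obj >= 0 && obj <= 999
    SMALL_INT_STRINGS[obj]   # zero-alloc: same frozen string every call
  else
    obj.to_s
  end
end
```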

Lazy initialization. Context defers StringScanner and @interrupts array creation until actually needed. Registers defers @changes hash. static_environments uses EMPTY_ARRAY when empty. block_delimiter strings cached per tag name.

Utils.to_s / Utils.inspect lazy seen hash. The seen = {} default parameter allocated a hash on every call even though the recursive-structure guard is almost never triggered. Changed to seen = nil with seen || {} only when entering Hash/Array branches.
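The pattern looks like this in a simplified recursive stringifier (illustrative, not the actual `Utils` code):

```ruby
# `seen` defaults to nil; the hash is allocated only on the Hash/Array
# branches, where the recursive-structure guard can actually matter.
def deep_to_s(obj, seen = nil)
  case obj
  when Hash
    seen ||= {}
    return "{...}" if seen[obj.object_id]
    seen[obj.object_id] = true
    "{" + obj.map { |k, v| "#{k}=>#{deep_to_s(v, seen)}" }.join(", ") + "}"
  when Array
    seen ||= {}
    return "[...]" if seen[obj.object_id]
    seen[obj.object_id] = true
    "[" + obj.map { |v| deep_to_s(v, seen) }.join(", ") + "]"
  else
    obj.to_s   # the common case: no hash ever allocated
  end
end
```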

Utils.slice_collection fast path. When from == 0, to.nil?, and collection is already an Array, returns it directly instead of copying through slice_collection_using_each.
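`slice_collection_using_each` is an existing Liquid method name; its body here is a simplified sketch to show where the fast path short-circuits:

```ruby
# Fast path: a full-range slice of something that is already an Array
# needs no copy at all.
def slice_collection(collection, from, to)
  return collection if from == 0 && to.nil? && collection.is_a?(Array)
  slice_collection_using_each(collection, from, to)
end

# Simplified general path: walk the collection with each and keep the
# items inside [from, to).
def slice_collection_using_each(collection, from, to)
  segments = []
  index = 0
  collection.each do |item|
    break if to && index >= to
    segments << item if index >= from
    index += 1
  end
  segments
end
```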

Code removed / simplified

The Cursor consolidation deleted ~75 lines of duplicated byte-scanning logic. Methods that previously had 20+ lines of manual getbyte/scan_byte loops are now 1-3 line regex delegations. Examples:

# Before: 15 lines of manual byte scanning
def scan_id
  start = @ss.pos
  b = @ss.peek_byte
  return unless b && ((b >= 97 && b <= 122) || (b >= 65 && b <= 90) || b == USCORE)
  @ss.scan_byte
  while (b = @ss.peek_byte)
    break unless (b >= 97 && b <= 122) || ...
    @ss.scan_byte
  end
  @ss.scan_byte if @ss.peek_byte == QMARK
  @source.byteslice(start, @ss.pos - start)
end

# After: C-level regex is 2-3x faster
ID_REGEX = /[a-zA-Z_][\w-]*\??/
def scan_id = @ss.scan(ID_REGEX)

What did NOT work (reverted experiments)

  • Lexer output caching. 93% cache hit rate across templates, but the Parser's expression method mutates token strings in-place via str << variable_lookups. Cached tokens get corrupted. Would need frozen tokens + dup-on-mutate, which adds more allocs than it saves.
  • Shared expression cache across templates. Only 70 unique expressions across all templates, but a global cache leaks state between parses and grows unboundedly. Per-template caches are the right tradeoff.
  • Whitespace trimming in parse_variable_token. Saves downstream byteslice allocs but changes error message content (markup_context uses the trimmed string).
  • Manual truncatewords. Byte-level word scanning to avoid String#split — creates more allocs from per-word byteslice than split does internally.
  • case/when type dispatch in Context#evaluate. YJIT already optimizes respond_to? well — the case/when adds overhead from type checking.

Benchmark reproduction

cd performance
bundle exec ruby bench_quick.rb   # single run
# or
./auto/autoresearch.sh            # tests + 3-run best-of

The benchmark uses ThemeRunner which parses/renders 4 real Shopify themes (dropify, ripen, tribble, vogue) with production-like database fixtures. YJIT is enabled. GC is disabled during measurement windows. Times are Process.clock_gettime(CLOCK_MONOTONIC) wall-clock, allocations via ObjectSpace.count_objects.
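A minimal harness in the same spirit might look like this — not the actual bench_quick.rb; for a self-contained sketch it counts allocations with GC.stat(:total_allocated_objects) rather than diffing ObjectSpace.count_objects:

```ruby
# Measure wall-clock microseconds and object allocations for a block,
# with GC disabled during the measurement window.
def measure
  GC.disable
  allocs_before = GC.stat(:total_allocated_objects)
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond)
  yield
  elapsed_us = Process.clock_gettime(Process::CLOCK_MONOTONIC, :microsecond) - t0
  [elapsed_us, GC.stat(:total_allocated_objects) - allocs_before]
ensure
  GC.enable
end
```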

Files changed

  • lib/liquid/cursor.rb — new Cursor class (StringScanner wrapper with regex-based Liquid-specific methods)
  • lib/liquid/block_body.rb — tag/variable token parsing via Cursor, regex blank_string?
  • lib/liquid/variable.rb — try_fast_parse byte-level name+filter scanner with multi-arg support, cached no-arg filter tuples, invoke_single/invoke_two render dispatch
  • lib/liquid/variable_lookup.rb — simple_lookup? byte validator, parse_simple fast path, primitive type fast paths in evaluate
  • lib/liquid/expression.rb — byte-level parse_number, conditional strip, byteslice for string literals
  • lib/liquid/context.rb — invoke_single/invoke_two, find_variable primitive fast paths + single-scope shortcut, lazy init, frozen defaults
  • lib/liquid/strainer_template.rb — invoke_single/invoke_two dispatch methods
  • lib/liquid/tags/if.rb — Cursor-based simple condition parsing
  • lib/liquid/tags/for.rb — Cursor-based lax_parse with zero-alloc skip_id/expect_id
  • lib/liquid/block.rb — cached block_delimiter strings
  • lib/liquid/registers.rb — lazy @changes hash
  • lib/liquid/standardfilters.rb — allocation-optimized truncatewords
  • lib/liquid/lexer.rb — \s+ instead of \s* for whitespace skip
  • lib/liquid/utils.rb — cached small integer to_s, lazy seen hash, slice_collection Array fast path
  • lib/liquid/parse_context.rb — Cursor instance, attr_reader for expression_cache/string_scanner
  • lib/liquid/resource_limits.rb — expose last_capture_length for render loop optimization

@tobi tobi changed the title Performance: 35% faster parse+render, 53% fewer allocations Performance: 47% faster parse+render, 60% fewer allocations Mar 11, 2026
@tobi tobi requested a review from ianks March 11, 2026 14:56
