
Conversation

@niaow (Member) commented Dec 1, 2025

The allocator originally just scanned the blocks linearly until it found a sufficiently long range. This is simple, but it fragments very easily and can degrade to a full heap scan for long requests.

Instead, we now maintain a nested list of free ranges, sorted by size. The allocator selects the shortest range that is long enough, which generally reduces fragmentation. This data structure can find a range in time directly proportional to the requested length.
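
Roughly, the selection rule is the following. This is a minimal sketch with illustrative names; the real structure is the nested list described above, flattened here to show only the best-fit rule:

// freeRange is an illustrative stand-in for the runtime's free-range
// bookkeeping; the list is kept in ascending order of length.
type freeRange struct {
	start, length uintptr // block index and run length, in blocks
	next          *freeRange
}

// bestFit returns the first range with at least n blocks. Because the
// list is ordered by length, this is also the shortest sufficient one.
func bestFit(head *freeRange, n uintptr) *freeRange {
	for r := head; r != nil; r = r.next {
		if r.length >= n {
			return r
		}
	}
	return nil // nothing fits: run the GC or grow the heap
}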

Performance in the problematic go/format benchmark:

                    │ linear.txt  │            best-fit.txt             │
                    │   sec/op    │   sec/op     vs base                │
Format/array1-10000   31.77m ± 4%   25.71m ± 2%  -19.08% (p=0.000 n=20)

                    │  linear.txt  │             best-fit.txt             │
                    │     B/s      │     B/s       vs base                │
Format/array1-10000   1.945Mi ± 4%   2.403Mi ± 2%  +23.53% (p=0.000 n=20)

@niaow (Member, Author) commented Dec 1, 2025

This is the same basic mechanism as #1181, but it is a lot cleaner.

@niaow (Member, Author) commented Dec 1, 2025

This adds 100-300 bytes of code. We need to decide whether that is worth it.

@eliasnaur (Contributor):

I often run out of memory because of fragmentation, so I heartily support anything that combats it.

@aykevl (Member) commented Dec 2, 2025

@dgryski can you take a look to see whether it helps with GC performance?

@dgryski (Member) commented Dec 2, 2025

In general, "best fit" is going to reduce fragmentation at the expense of CPU time. An allocation-heavy benchmark (in this case the binary trees benchmark game) shows this to be the case:

~/go/src/github.com/dgryski/trifles/binarytrees $ hyperfine -N "./trees-dev.exe 15"  "./trees-best.exe 15"
Benchmark 1: ./trees-dev.exe 15
  Time (mean ± σ):     784.9 ms ±  15.6 ms    [User: 1507.8 ms, System: 2174.3 ms]
  Range (min … max):   758.9 ms … 804.3 ms    10 runs

Benchmark 2: ./trees-best.exe 15
  Time (mean ± σ):      1.027 s ±  0.022 s    [User: 1.877 s, System: 2.854 s]
  Range (min … max):    0.998 s …  1.057 s    10 runs

Summary
  ./trees-dev.exe 15 ran
    1.31 ± 0.04 times faster than ./trees-best.exe 15

Our current allocation scheme is "next fit".
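
For reference, next fit keeps a cursor between allocations and takes the first sufficient run it encounters. A sketch, using a []bool as a stand-in for the real block metadata:

var free []bool // per-block free flags (stand-in for the real metadata)
var cursor int  // where the previous allocation left off

// nextFit returns the start of the first run of n free blocks at or
// after the cursor, wrapping around to the heap start at most once.
func nextFit(n int) (int, bool) {
	for pass := 0; pass < 2; pass++ {
		run := 0
		for i := cursor; i < len(free); i++ {
			if !free[i] {
				run = 0
				continue
			}
			run++
			if run == n {
				cursor = i + 1
				return i + 1 - n, true
			}
		}
		cursor = 0 // restart from the heap start for the second pass
	}
	return 0, false // no run found anywhere
}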

Interestingly, using -gc=precise -target=wasip1, best fit comes out faster.

Running the binary trees benchmark with -gc=precise on native instead of -gc=conservative occasionally gives a SEGV. :(

@niaow (Member, Author) commented Dec 2, 2025

It might be worth waiting until #5104 is merged. I remember I was able to optimize the free-range construction on my experiments branch by exploiting the new metadata format; the current free-range construction code just loops over the individual blocks.
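
For context, the current construction is a single pass that coalesces consecutive free blocks, roughly like this (again with a []bool stand-in for the block metadata):

type span struct{ start, length int }

// buildFreeRanges coalesces runs of free blocks into spans by checking
// every block individually; a denser metadata format would allow
// skipping whole groups of allocated blocks at once.
func buildFreeRanges(free []bool) []span {
	var ranges []span
	for i := 0; i < len(free); {
		if !free[i] {
			i++
			continue
		}
		start := i
		for i < len(free) && free[i] {
			i++
		}
		ranges = append(ranges, span{start: start, length: i - start})
	}
	return ranges
}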

Also, are you using array-backed trees, or trees where each node is allocated as a separate fixed-size object? If the latter, that is basically the worst case for this change: there isn't really any meaningful fragmentation in the first place.

@niaow (Member, Author) commented Dec 2, 2025

Can you link the trees code so I can debug the SEGV?

@dgryski (Member) commented Dec 2, 2025

I'm using https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-go-2.html . So yes, fixed-size allocations for the tree nodes.

@niaow (Member, Author) commented Dec 2, 2025

Oh right, that SEGV is the race condition where we release the GC lock before writing the layout bitmap. I fixed it in #5102 while reorganizing the alloc code, but then kind of forgot about it.
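
In other words, the fix is an ordering change. A sketch with made-up helper names (reserveBlocks, recordLayout), not the actual runtime code:

import "sync"

var gcLock sync.Mutex // stand-in for the runtime's GC lock

func reserveBlocks(size uintptr) uintptr { return 0 } // hypothetical: find and mark blocks
func recordLayout(ptr, layout uintptr)   {}           // hypothetical: write the layout bitmap

func alloc(size, layout uintptr) uintptr {
	gcLock.Lock()
	ptr := reserveBlocks(size)
	recordLayout(ptr, layout) // must happen before Unlock: otherwise a
	gcLock.Unlock()           // concurrent collection can scan the new
	return ptr                // object before its layout is written
}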

@niaow (Member, Author) commented Dec 2, 2025

Also, the main issue with the binary trees benchmark here is that the collector is not the bottleneck; the lock is. If you switch to -scheduler=tasks to eliminate the lock contention, there is a gigantic performance improvement, and the actual impact of the best-fit change is negligible.

[niaow@finch tinygo]$ time /tmp/bintree-dev.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m5.167s
user    0m5.343s
sys     0m22.502s
[niaow@finch tinygo]$ time /tmp/dev-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.220s
user    0m0.209s
sys     0m0.012s
[niaow@finch tinygo]$ time /tmp/bintree-best-fit-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.226s
user    0m0.218s
sys     0m0.009s

amken3d pushed a commit to amken3d/tinygo that referenced this pull request Dec 3, 2025
Add SSTGCHint() and related functions to optimize GC behavior for
SST's simpler memory patterns.

SST has fundamentally different memory characteristics:
- Single shared stack (no per-goroutine allocations)
- Fixed-size event queues (pre-allocated)
- Tasks created once at startup
- Run-to-completion (no blocking state)

The best-fit allocator from PR tinygo-org#5105 is not critical for SST because
the allocation patterns are much more predictable and less prone to
fragmentation.
@deadprogram (Member) commented Dec 4, 2025

I ran the tinybench benchmarks, first on the dev branch and then with this branch applied.

https://github.com/tinygo-org/tinybench

I do understand that these benchmarks are probably not the best for testing GC performance.

Here are my results:

Before

tinygo version 0.40.0-dev-9404bb87 linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1482            805580 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1054           1065539 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               61          18088050 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1393121093 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2772118003 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  207           5809472 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 135           9688843 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  63          16362101 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            72          18954161 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           38          30118819 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           21          55955333 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          46763151 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         273176370 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1287202089 ns/op

After

tinygo version 0.40.0-dev-386d078f linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1518            793471 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1107           1014044 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               64          18540136 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1394121663 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2783834336 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  206           5595721 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 127           9197207 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  90          16001350 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            66          18966750 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           38          30509022 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           18          55581540 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          47555624 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         273767775 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1282017493 ns/op

@dgryski (Member) left a comment:

LGTM. I think let's go ahead with this. The improved memory usage is worth the slight slowdown. (Although having the free-range list means we spend less time searching for a spot, so I'm not even sure there is a significant slowdown.)

@deadprogram (Member):

Please resolve merge conflicts now @niaow since #5102 was merged.

@soypat (Contributor) commented Dec 5, 2025

Why don't we see a significant size diff in CI's sizediff comparison, except for some Arduino targets?

@niaow (Member, Author) commented Dec 5, 2025

AVR generates more instructions because:

  1. It can only shift by one bit at a time, so some of the state accesses get lowered awkwardly.
  2. Memory accesses (8-bit) are smaller than pointers (16-bit), so each pointer load/store gets split in two.

This also shows up as a larger percentage because AVR leaves most features (e.g. the scheduler) off by default.

@niaow (Member, Author) commented Dec 5, 2025

The rebase is proving troublesome; it seems that something in wasm is breaking.
