
Conversation

@niaow (Member) commented Dec 1, 2025

The allocator originally just scanned the blocks linearly until it found a sufficiently long range. This is simple, but it fragments very easily and can degrade to a full heap scan for long requests.

Instead, we now maintain a nested list of free ranges, sorted by size. The allocator selects the shortest range that is long enough, which generally reduces fragmentation. This data structure can find a range in time directly proportional to the requested length.
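
Roughly, the selection rule is the following. This is a minimal sketch with illustrative names; the real structure is the nested list described above, flattened here to show only the best-fit rule:

// freeRange is an illustrative stand-in for the runtime's free-range
// bookkeeping; the list is kept in ascending order of length.
type freeRange struct {
	start, length uintptr // block index and run length, in blocks
	next          *freeRange
}

// bestFit returns the first range with at least n blocks. Because the
// list is ordered by length, this is also the shortest sufficient one.
func bestFit(head *freeRange, n uintptr) *freeRange {
	for r := head; r != nil; r = r.next {
		if r.length >= n {
			return r
		}
	}
	return nil // nothing fits: run the GC or grow the heap
}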

Performance in the problematic go/format benchmark:

                    │ linear.txt  │            best-fit.txt             │
                    │   sec/op    │   sec/op     vs base                │
Format/array1-10000   31.77m ± 4%   25.71m ± 2%  -19.08% (p=0.000 n=20)

                    │  linear.txt  │             best-fit.txt             │
                    │     B/s      │     B/s       vs base                │
Format/array1-10000   1.945Mi ± 4%   2.403Mi ± 2%  +23.53% (p=0.000 n=20)

@niaow (Member, Author) commented Dec 1, 2025

This is the same basic mechanism as #1181, but it is a lot cleaner.

@niaow (Member, Author) commented Dec 1, 2025

This adds 100-300 bytes of code. We need to decide whether that is worth it.

@eliasnaur (Contributor):

I often run out of memory because of fragmentation, so I heartily support anything that combats it.

@aykevl (Member) commented Dec 2, 2025

@dgryski can you take a look to see whether it helps with GC performance?

@dgryski (Member) commented Dec 2, 2025

In general, "best fit" is going to reduce fragmentation at the expense of CPU time. An allocation-heavy benchmark (in this case the binary trees benchmark game) shows this to be the case:

~/go/src/github.com/dgryski/trifles/binarytrees $ hyperfine -N "./trees-dev.exe 15"  "./trees-best.exe 15"
Benchmark 1: ./trees-dev.exe 15
  Time (mean ± σ):     784.9 ms ±  15.6 ms    [User: 1507.8 ms, System: 2174.3 ms]
  Range (min … max):   758.9 ms … 804.3 ms    10 runs

Benchmark 2: ./trees-best.exe 15
  Time (mean ± σ):      1.027 s ±  0.022 s    [User: 1.877 s, System: 2.854 s]
  Range (min … max):    0.998 s …  1.057 s    10 runs

Summary
  ./trees-dev.exe 15 ran
    1.31 ± 0.04 times faster than ./trees-best.exe 15

Our current allocation scheme is "next fit".
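
For reference, next fit keeps a cursor between allocations and takes the first sufficient run it encounters. A sketch, using a []bool as a stand-in for the real block metadata:

var free []bool // per-block free flags (stand-in for the real metadata)
var cursor int  // where the previous allocation left off

// nextFit returns the start of the first run of n free blocks at or
// after the cursor, wrapping around to the heap start at most once.
func nextFit(n int) (int, bool) {
	for pass := 0; pass < 2; pass++ {
		run := 0
		for i := cursor; i < len(free); i++ {
			if !free[i] {
				run = 0
				continue
			}
			run++
			if run == n {
				cursor = i + 1
				return i + 1 - n, true
			}
		}
		cursor = 0 // restart from the heap start for the second pass
	}
	return 0, false // no run found anywhere
}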

Interestingly, using -gc=precise -target=wasip1, best fit comes out faster.

Running the binary trees benchmark with -gc=precise on native instead of -gc=conservative occasionally gives a SEGV. :(

@niaow (Member, Author) commented Dec 2, 2025

It might be worth waiting until #5104 is merged. I remember I was able to optimize the free-range construction on my experiments branch by exploiting the new metadata format; the current free-range construction code just loops over the individual blocks.
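
For context, the current construction is a single pass that coalesces consecutive free blocks, roughly like this (again with a []bool stand-in for the block metadata):

type span struct{ start, length int }

// buildFreeRanges coalesces runs of free blocks into spans by checking
// every block individually; a denser metadata format would allow
// skipping whole groups of allocated blocks at once.
func buildFreeRanges(free []bool) []span {
	var ranges []span
	for i := 0; i < len(free); {
		if !free[i] {
			i++
			continue
		}
		start := i
		for i < len(free) && free[i] {
			i++
		}
		ranges = append(ranges, span{start: start, length: i - start})
	}
	return ranges
}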

Also, are you using array-backed trees, or trees where each node is allocated as a separate fixed-size object? If the latter, that is basically the worst case for this change: there isn't really any meaningful fragmentation in the first place.

@niaow (Member, Author) commented Dec 2, 2025

Can you link the trees code so I can debug the SEGV?

@dgryski (Member) commented Dec 2, 2025

I'm using https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-go-2.html . So yes, fixed-size allocations for the tree nodes.

@niaow (Member, Author) commented Dec 2, 2025

Oh right, that SEGV is the race condition where we release the GC lock before writing the layout bitmap. I fixed it in #5102 while reorganizing the alloc code, but then kind of forgot about it.
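
In other words, the fix is an ordering change. A sketch with made-up helper names (reserveBlocks, recordLayout), not the actual runtime code:

import "sync"

var gcLock sync.Mutex // stand-in for the runtime's GC lock

func reserveBlocks(size uintptr) uintptr { return 0 } // hypothetical: find and mark blocks
func recordLayout(ptr, layout uintptr)   {}           // hypothetical: write the layout bitmap

func alloc(size, layout uintptr) uintptr {
	gcLock.Lock()
	ptr := reserveBlocks(size)
	recordLayout(ptr, layout) // must happen before Unlock: otherwise a
	gcLock.Unlock()           // concurrent collection can scan the new
	return ptr                // object before its layout is written
}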

@niaow (Member, Author) commented Dec 2, 2025

Also, the main issue with the binary trees benchmark here is that the collector is not the bottleneck; the lock is. If you switch to -scheduler=tasks to eliminate the lock contention, there is a gigantic performance improvement, and the actual impact of the best-fit change is negligible.

[niaow@finch tinygo]$ time /tmp/bintree-dev.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m5.167s
user    0m5.343s
sys     0m22.502s
[niaow@finch tinygo]$ time /tmp/dev-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.220s
user    0m0.209s
sys     0m0.012s
[niaow@finch tinygo]$ time /tmp/bintree-best-fit-tasks.elf 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

real    0m0.226s
user    0m0.218s
sys     0m0.009s

amken3d pushed a commit to amken3d/tinygo that referenced this pull request Dec 3, 2025
Add SSTGCHint() and related functions to optimize GC behavior for
SST's simpler memory patterns.

SST has fundamentally different memory characteristics:
- Single shared stack (no per-goroutine allocations)
- Fixed-size event queues (pre-allocated)
- Tasks created once at startup
- Run-to-completion (no blocking state)

The best-fit allocator from PR tinygo-org#5105 is not critical for SST because
the allocation patterns are much more predictable and less prone to
fragmentation.
@deadprogram (Member) commented Dec 4, 2025

I ran the tinybench benchmarks, first on the dev branch and then with this branch applied.

https://github.com/tinygo-org/tinybench

I do understand that these benchmarks are probably not the best for testing GC performance.

Here are my results:

Before

tinygo version 0.40.0-dev-9404bb87 linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1482            805580 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1054           1065539 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               61          18088050 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1393121093 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2772118003 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  207           5809472 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 135           9688843 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  63          16362101 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            72          18954161 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           38          30118819 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           21          55955333 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          46763151 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         273176370 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1287202089 ns/op

After

tinygo version 0.40.0-dev-386d078f linux/amd64 (using go version go1.25.3 and LLVM version 20.1.1)

    bench_test.go:145: name="fannkuch-redux" compiler="tinygo" binarysize=1544008 version=0.40.0
BenchmarkAll/fannkuch-redux:args=6/go/tinygo-32             1518            793471 ns/op
BenchmarkAll/fannkuch-redux:args=7/go/tinygo
BenchmarkAll/fannkuch-redux:args=7/go/tinygo-32             1107           1014044 ns/op
BenchmarkAll/fannkuch-redux:args=9/go/tinygo
BenchmarkAll/fannkuch-redux:args=9/go/tinygo-32               64          18540136 ns/op

    bench_test.go:145: name="fasta" compiler="tinygo" binarysize=1674984 version=0.40.0
BenchmarkAll/fasta:args=12500000/go/tinygo-32                  1        1394121663 ns/op
BenchmarkAll/fasta:args=25000000/go/tinygo
BenchmarkAll/fasta:args=25000000/go/tinygo-32                  1        2783834336 ns/op

    bench_test.go:145: name="n-body" compiler="tinygo" binarysize=1549928 version=0.40.0
BenchmarkAll/n-body:args=50000/go/tinygo-32                  206           5595721 ns/op
BenchmarkAll/n-body:args=100000/go/tinygo
BenchmarkAll/n-body:args=100000/go/tinygo-32                 127           9197207 ns/op
BenchmarkAll/n-body:args=200000/go/tinygo
BenchmarkAll/n-body:args=200000/go/tinygo-32                  90          16001350 ns/op

    bench_test.go:145: name="n-body-nosqrt" compiler="tinygo" binarysize=1550944 version=0.40.0
BenchmarkAll/n-body-nosqrt:args=50000/go/tinygo-32            66          18966750 ns/op
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=100000/go/tinygo-32           38          30509022 ns/op
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo
BenchmarkAll/n-body-nosqrt:args=200000/go/tinygo-32           18          55581540 ns/op

    bench_test.go:145: name="spectral-norm" compiler="tinygo" binarysize=1656968 version=0.40.0
BenchmarkAll/spectral-norm:args=1000/go/tinygo-32             25          47555624 ns/op
BenchmarkAll/spectral-norm:args=2500/go/tinygo
BenchmarkAll/spectral-norm:args=2500/go/tinygo-32              4         273767775 ns/op
BenchmarkAll/spectral-norm:args=5500/go/tinygo
BenchmarkAll/spectral-norm:args=5500/go/tinygo-32              1        1282017493 ns/op

@dgryski (Member) left a comment:

LGTM. I think let's go ahead with this. The improved memory usage is worth the slight slowdown. (Although having the free-range list means we spend less time searching for a spot, so I'm not even sure there is a significant slowdown.)

@deadprogram (Member):

Please resolve merge conflicts now @niaow since #5102 was merged.

@soypat (Contributor) commented Dec 5, 2025

Why don't we see a significant size diff in CI's sizediff comparison, except for some Arduino targets?

@niaow (Member, Author) commented Dec 5, 2025

AVR generates more instructions because:

  1. It can only shift by one bit at a time, so some of the state accesses get lowered awkwardly.
  2. Memory accesses (8-bit) are smaller than pointers (16-bit), so each pointer load/store gets split in two.

This also shows up as a larger percentage because AVR leaves most features (e.g. the scheduler) off by default.

@niaow (Member, Author) commented Dec 5, 2025

The rebase is proving troublesome; it seems that something in wasm is breaking.
