
Conversation

@taronaeo (Collaborator) commented Nov 16, 2025

fixes #17229

Enables storing symlink information in the archive to avoid duplicating libraries such as *.dylib, saving storage space and bandwidth.

Introduces the .tar.gz archive format for Unix/Linux systems, while .zip remains the format for Windows. Also adds quick download links to the release description.
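For context, the difference boils down to how the two archivers treat symlinks. A minimal sketch (paths and file names are illustrative, not the exact release.yml commands):

```bash
# Minimal sketch of the packaging difference (illustrative names only,
# not the exact release.yml commands).
cd build/bin

# Hypothetical dylib symlink, for illustration.
ln -sf libggml-base.dylib libggml.dylib

# tar stores the symlink entry itself, so the target is not duplicated.
tar -czf llama-bin-macos-arm64.tar.gz .

# Info-ZIP's zip follows symlinks and copies the target unless --symlinks
# (-y) is given, which is one source of the duplicated *.dylib files.
zip --symlinks -r llama-bin-macos-arm64.zip .
```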

@taronaeo taronaeo requested a review from slaren as a code owner November 16, 2025 11:55
@taronaeo taronaeo requested review from CISC and removed request for slaren November 16, 2025 11:55
@github-actions github-actions bot added the devops improvements to build systems and github actions label Nov 16, 2025
@CISC (Collaborator) commented Nov 16, 2025

Have you tested the archives on various platforms? AFAIK there are several issues with symlinks in zip.

@taronaeo (Collaborator, Author)

Tested this on an M1 MacBook Pro and it worked as intended. Any specific OS I should test this on?

I can try to spin up VMs to test this coming weekend.

@taronaeo (Collaborator, Author)

Also, I feel it would be better to incrementally move to .tar.gz for Linux releases and keep .zip for Windows. WDYT?

@CISC (Collaborator) commented Nov 17, 2025

Windows in particular would be nice to confirm still works.

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

@CISC CISC requested a review from slaren November 17, 2025 08:41
@taronaeo (Collaborator, Author)

Windows in particular would be nice to confirm still works.

Will test it this weekend and report back.

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

Sweet. I'll start including .tar archives in this PR, and maybe we can put a deprecation warning somewhere (README.md?) noting that Linux releases will be moving to .tar instead of .zip.

Then maybe after a few months, we can officially deprecate .zip for Linux.

@taronaeo taronaeo force-pushed the fix/release-duplicate-libs branch from 61a0bce to dec86a0 Compare November 21, 2025 14:47
@taronaeo (Collaborator, Author) commented Nov 22, 2025

Windows in particular would be nice to confirm still works.

Can confirm that this still works on Windows, although I'm unsure why the release now generates .zip.zip instead of a single .zip...

Click to expand `llama-cli` output on Windows
PS C:\Users\llama.cpp\Documents\llama-bin-win-cpu-x64> .\llama-cli -m C:\Users\llama.cpp\Downloads\stories15M_MOE-Q8_0.gguf -no-cnv -n 25
load_backend: loaded RPC backend from C:\Users\llama.cpp\Documents\llama-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\llama.cpp\Documents\llama-bin-win-cpu-x64\ggml-cpu-haswell.dll
build: 7083 (d6abfe8c8) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 63 tensors from C:\Users\llama.cpp\Downloads\stories15M_MOE-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 4x24M
llama_model_loader: - kv   3:                            general.license str              = mit
llama_model_loader: - kv   4:                          llama.block_count u32              = 6
llama_model_loader: - kv   5:                       llama.context_length u32              = 256
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 288
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 768
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 6
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 6
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                         llama.expert_count u32              = 4
llama_model_loader: - kv  13:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 48
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  20:                      tokenizer.ggml.scores arr[f32,32000]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,32000]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   19 tensors
llama_model_loader: - type q8_0:   44 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 36.87 MiB (8.51 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 256
print_info: n_embd           = 288
print_info: n_embd_inp       = 288
print_info: n_layer          = 6
print_info: n_head           = 6
print_info: n_head_kv        = 6
print_info: n_rot            = 48
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 48
print_info: n_embd_head_v    = 48
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 288
print_info: n_embd_v_gqa     = 288
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 768
print_info: n_expert         = 4
print_info: n_expert_used    = 2
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 256
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 36.36 M
print_info: general.name     = n/a
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 6 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 7/7 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    36.87 MiB
..........................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) > n_ctx_train (256) -- possible training context overflow
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache:        CPU KV buffer size =    27.00 MiB
llama_kv_cache: size =   27.00 MiB (  4096 cells,   6 layers,  1/1 seqs), K (f16):   13.50 MiB, V (f16):   13.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =    63.06 MiB
llama_context: graph nodes  = 289
llama_context: graph splits = 1
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: model was trained on only 256 context tokens (4096 specified)
 
system_info: n_threads = 2 (n_threads_batch = 2) / 2 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
 
sampler seed: 583020045
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 25, n_keep = 1
 
 Once upon a time, there was a little boy named Timmy. Timmy was walking with his mommy and daddy
 
llama_perf_sampler_print:    sampling time =       0.80 ms /    26 runs   (    0.03 ms per token, 32378.58 tokens per second)
llama_perf_context_print:        load time =     124.81 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =      68.11 ms /    25 runs   (    2.72 ms per token,   367.05 tokens per second)
llama_perf_context_print:       total time =      74.19 ms /    26 tokens
llama_perf_context_print:    graphs reused =         24
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                  126 =    36 +      27 +      63                |

@taronaeo (Collaborator, Author) commented Nov 22, 2025

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

Found out the problem. Quoting from https://github.com/actions/upload-artifact?tab=readme-ov-file#limitations:

When an Artifact is uploaded, all the files are assembled into an immutable Zip archive. There is currently no way to download artifacts in a format other than a Zip or to download individual artifact contents.

So even if we package it as a .tar archive, it will eventually be uploaded as a .tar.zip file because of how actions/upload-artifact@v4 works. We can circumvent this by creating our own upload script similar to how LLVM is doing it:

https://github.com/llvm/llvm-project/blob/89189218b8ba0a2a64ae1f0f76f485eaf06ad6c6/.github/workflows/release-binaries.yml#L282-L289

But I'm not sure how keen you guys are on having one additional script to handle this limitation. Please let me know if it's okay to have an additional script :)
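For reference, a rough sketch of what such a script could look like, assuming the gh CLI is available on the runner and GH_TOKEN is set (not the LLVM workflow verbatim):

```bash
#!/usr/bin/env bash
# Sketch only: upload release assets directly, bypassing actions/upload-artifact.
# Assumes the gh CLI is installed on the runner and GH_TOKEN is exported.
set -euo pipefail

TAG="${1:?usage: upload-release-assets.sh <tag> <file>...}"
shift

# gh uploads the files as-is, so a .tar.gz asset stays a .tar.gz.
gh release upload "$TAG" "$@" --clobber
```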

@taronaeo taronaeo force-pushed the fix/release-duplicate-libs branch from 07ea8ed to f95db68 Compare November 22, 2025 05:37
Commits in this push (signed-off-by Aaron Teo <aaron.teo1@ibm.com>):

release: add .tar release
release: rm gunzip
release: add deprecation notice to release.yml
release: fix .tar archives not uploaded
release: forgot to upload .tar archives
release: fix more missing .tar uploads
@taronaeo taronaeo force-pushed the fix/release-duplicate-libs branch from f95db68 to 49d4164 Compare November 22, 2025 05:43
@CISC (Collaborator) commented Nov 22, 2025

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

Found out the problem. Quoting from https://github.com/actions/upload-artifact?tab=readme-ov-file#limitations:

When an Artifact is uploaded, all the files are assembled into an immutable Zip archive. There is currently no way to download artifacts in a format other than a Zip or to download individual artifact contents.

So even if we package it as a .tar archive, it will eventually be uploaded as a .tar.zip file because of how actions/upload-artifact@v4 works.

Oh, that answers my previous question then. :P

But I'm not sure how keen you guys are on having one additional script to handle this limitation. Please let me know if it's okay to have an additional script :)

Up to @slaren

@CISC (Collaborator) commented Nov 22, 2025

BTW, I didn't mean to upload plain .tar files earlier, I meant that there are probably better alternatives than .gz, though I don't think .tar.zip is one of them... :D

@taronaeo (Collaborator, Author)

BTW, I didn't mean to upload plain .tar files earlier, I meant that there are probably better alternatives than .gz, though I don't think .tar.zip is one of them... :D

Haha, I was wondering why you would create an archive without compressing the contents :) I will add the compression algorithm once we've decided which one to use.

I was deciding between .tar.gz and .tar.xz, and I would prefer .tar.gz mainly for compatibility's sake, though I am seeing a lot more use of .tar.xz in newer/bigger projects (e.g., LLVM). If anyone has any opinions, feel free to chime in.
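For anyone weighing in, this is roughly the comparison I have in mind (standard GNU tar flags; sizes will vary by build):

```bash
# Illustrative only: compare the two candidates on the same build output.
tar -czf llama-bin.tar.gz build/bin   # gzip: fastest, available everywhere
tar -cJf llama-bin.tar.xz build/bin   # xz: slower to create, usually smaller

ls -lh llama-bin.tar.gz llama-bin.tar.xz
```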

@taronaeo taronaeo marked this pull request as draft November 29, 2025 10:17
@taronaeo taronaeo marked this pull request as ready for review November 29, 2025 16:43
@taronaeo (Collaborator, Author) commented Nov 29, 2025

@CISC Requesting your review for this PR again :)

It turns out that we can actually upload any archive type (e.g., .tar.gz) to GitHub releases without it being converted into a .zip, and without us having to maintain an additional script. The .zip limitation only applies to GitHub Artifacts, which are different from GitHub releases.

So I've chosen .tar.gz as the preferred archive for Linux for now, and included a deprecation notice in the GitHub release description plus some quick download links.

Please let me know if I need to change anything.
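For users, the Linux flow would then look roughly like this (asset name below is illustrative, not an actual release file name):

```bash
# Illustrative asset name; actual names differ per release/build.
curl -LO https://github.com/ggml-org/llama.cpp/releases/latest/download/llama-bin-ubuntu-x64.tar.gz
tar -xzf llama-bin-ubuntu-x64.tar.gz   # symlinked libraries are preserved
```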

@taronaeo (Collaborator, Author) commented Dec 1, 2025

As I'm sure you're aware, Diego is currently on hiatus.

Is anyone else able to review this PR on behalf of @/slaren?

From: #17639 (comment)

@CISC (Collaborator) left a comment


Sorry, missed the latest update.

Is .tar.gz ideal for macOS (as a desktop experience)?

@taronaeo (Collaborator, Author) commented Dec 2, 2025

Is .tar.gz ideal for macOS (as a desktop experience)?

Tested it with the last commit, had no issues with symlinks :)
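In case it's useful, this is roughly how I checked it (archive and dylib names are illustrative):

```bash
# Illustrative names; extract and confirm the symlinks are intact.
mkdir -p llama-bin && tar -xzf llama-bin-macos-arm64.tar.gz -C llama-bin
ls -l llama-bin/*.dylib   # symlinks should show as "name -> target"
```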

@taronaeo taronaeo merged commit 7b6d745 into ggml-org:master Dec 2, 2025
3 checks passed