
Conversation

@taronaeo (Collaborator) commented Nov 16, 2025

fixes #17229

Enables storing symlink information in the archive to avoid duplicating libraries such as *.dylib, saving storage space and bandwidth.

Introduces the .tar.gz archive format for Unix/Linux systems, while .zip remains the format for Windows. Also adds quick download links to the release description.
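For context, the difference boils down to how the two archivers treat symlinks. A minimal sketch (paths and file names are illustrative, not the exact release.yml commands):

```bash
# Minimal sketch of the packaging difference (illustrative names only,
# not the exact release.yml commands).
cd build/bin

# Hypothetical dylib symlink, for illustration.
ln -sf libggml-base.dylib libggml.dylib

# tar stores the symlink entry itself, so the target is not duplicated.
tar -czf llama-bin-macos-arm64.tar.gz .

# Info-ZIP's zip follows symlinks and copies the target unless --symlinks
# (-y) is given, which is one source of the duplicated *.dylib files.
zip --symlinks -r llama-bin-macos-arm64.zip .
```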

@taronaeo taronaeo requested a review from slaren as a code owner November 16, 2025 11:55
@taronaeo taronaeo requested review from CISC and removed request for slaren November 16, 2025 11:55
@github-actions github-actions bot added the devops improvements to build systems and github actions label Nov 16, 2025
@CISC (Collaborator) commented Nov 16, 2025

Have you tested the archives on various platforms? AFAIK there are several issues with symlinks in zip.

@taronaeo (Collaborator, Author)

Tested this on an M1 MacBook Pro and it worked as intended. Any specific OS I should test this on?

I can try to spin up VMs to test this coming weekend.

@taronaeo (Collaborator, Author)

Also, I feel it would be better to incrementally move to .tar.gz for Linux releases and keep .zip for Windows. WDYT?

@CISC (Collaborator) commented Nov 17, 2025

Windows in particular would be nice to confirm still works.

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

@CISC CISC requested a review from slaren November 17, 2025 08:41
@taronaeo (Collaborator, Author)

Windows in particular would be nice to confirm still works.

Will test it this weekend and report back.

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

Sweet. I'll start including .tar archives in this PR, and maybe we can put a deprecation warning somewhere (README.md?) noting that Linux releases will be moving to .tar instead of .zip.

Then maybe after a few months, we can officially deprecate .zip for Linux.

@taronaeo taronaeo force-pushed the fix/release-duplicate-libs branch from 61a0bce to dec86a0 Compare November 21, 2025 14:47
@taronaeo (Collaborator, Author) commented Nov 22, 2025

Windows in particular would be nice to confirm still works.

Can confirm that this still works on Windows, although I'm unsure why the release now generates .zip.zip instead of a single .zip...

Click to expand `llama-cli` output on Windows
PS C:\Users\llama.cpp\Documents\llama-bin-win-cpu-x64> .\llama-cli -m C:\Users\llama.cpp\Downloads\stories15M_MOE-Q8_0.gguf -no-cnv -n 25
load_backend: loaded RPC backend from C:\Users\llama.cpp\Documents\llama-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\Users\llama.cpp\Documents\llama-bin-win-cpu-x64\ggml-cpu-haswell.dll
build: 7083 (d6abfe8c8) with clang version 19.1.5 for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 26 key-value pairs and 63 tensors from C:\Users\llama.cpp\Downloads\stories15M_MOE-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 4x24M
llama_model_loader: - kv   3:                            general.license str              = mit
llama_model_loader: - kv   4:                          llama.block_count u32              = 6
llama_model_loader: - kv   5:                       llama.context_length u32              = 256
llama_model_loader: - kv   6:                     llama.embedding_length u32              = 288
llama_model_loader: - kv   7:                  llama.feed_forward_length u32              = 768
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 6
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 6
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                         llama.expert_count u32              = 4
llama_model_loader: - kv  13:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  14:                          general.file_type u32              = 7
llama_model_loader: - kv  15:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  16:                 llama.rope.dimension_count u32              = 48
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  20:                      tokenizer.ggml.scores arr[f32,32000]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,32000]   = [3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   19 tensors
llama_model_loader: - type q8_0:   44 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 36.87 MiB (8.51 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 2 ('</s>')
load: special tokens cache size = 3
load: token to piece cache size = 0.1684 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 256
print_info: n_embd           = 288
print_info: n_embd_inp       = 288
print_info: n_layer          = 6
print_info: n_head           = 6
print_info: n_head_kv        = 6
print_info: n_rot            = 48
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 48
print_info: n_embd_head_v    = 48
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 288
print_info: n_embd_v_gqa     = 288
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 768
print_info: n_expert         = 4
print_info: n_expert_used    = 2
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 256
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 36.36 M
print_info: general.name     = n/a
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 6 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 7/7 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    36.87 MiB
..........................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) > n_ctx_train (256) -- possible training context overflow
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache:        CPU KV buffer size =    27.00 MiB
llama_kv_cache: size =   27.00 MiB (  4096 cells,   6 layers,  1/1 seqs), K (f16):   13.50 MiB, V (f16):   13.50 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:        CPU compute buffer size =    63.06 MiB
llama_context: graph nodes  = 289
llama_context: graph splits = 1
common_init_from_params: added </s> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
main: model was trained on only 256 context tokens (4096 specified)
 
system_info: n_threads = 2 (n_threads_batch = 2) / 2 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
 
sampler seed: 583020045
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 25, n_keep = 1
 
 Once upon a time, there was a little boy named Timmy. Timmy was walking with his mommy and daddy
 
llama_perf_sampler_print:    sampling time =       0.80 ms /    26 runs   (    0.03 ms per token, 32378.58 tokens per second)
llama_perf_context_print:        load time =     124.81 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =      68.11 ms /    25 runs   (    2.72 ms per token,   367.05 tokens per second)
llama_perf_context_print:       total time =      74.19 ms /    26 tokens
llama_perf_context_print:    graphs reused =         24
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                  126 =    36 +      27 +      63                |

@taronaeo (Collaborator, Author) commented Nov 22, 2025

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

Found out the problem. Quoting from https://github.com/actions/upload-artifact?tab=readme-ov-file#limitations:

When an Artifact is uploaded, all the files are assembled into an immutable Zip archive. There is currently no way to download artifacts in a format other than a Zip or to download individual artifact contents.

So even if we package it as a .tar archive, it will eventually be uploaded as a .tar.zip file because of how actions/upload-artifact@v4 works. We can circumvent this by creating our own upload script similar to how LLVM is doing it:

https://github.com/llvm/llvm-project/blob/89189218b8ba0a2a64ae1f0f76f485eaf06ad6c6/.github/workflows/release-binaries.yml#L282-L289

But I'm not sure how keen you guys are on having one additional script to handle this limitation. Please let me know if it's okay to have an additional script :)
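For reference, a rough sketch of what such a script could look like, assuming the gh CLI is available on the runner and GH_TOKEN is set (not the LLVM workflow verbatim):

```bash
#!/usr/bin/env bash
# Sketch only: upload release assets directly, bypassing actions/upload-artifact.
# Assumes the gh CLI is installed on the runner and GH_TOKEN is exported.
set -euo pipefail

TAG="${1:?usage: upload-release-assets.sh <tag> <file>...}"
shift

# gh uploads the files as-is, so a .tar.gz asset stays a .tar.gz.
gh release upload "$TAG" "$@" --clobber
```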

@taronaeo taronaeo force-pushed the fix/release-duplicate-libs branch from 07ea8ed to f95db68 Compare November 22, 2025 05:37
Commits in this push (signed-off-by Aaron Teo <aaron.teo1@ibm.com>):

release: add .tar release
release: rm gunzip
release: add deprecation notice to release.yml
release: fix .tar archives not uploaded
release: forgot to upload .tar archives
release: fix more missing .tar uploads
@taronaeo taronaeo force-pushed the fix/release-duplicate-libs branch from f95db68 to 49d4164 Compare November 22, 2025 05:43
@CISC (Collaborator) commented Nov 22, 2025

Moving to tar (not necessarily gzip) on Linux makes sense, not sure why we're using zip?

Found out the problem. Quoting from https://github.com/actions/upload-artifact?tab=readme-ov-file#limitations:

When an Artifact is uploaded, all the files are assembled into an immutable Zip archive. There is currently no way to download artifacts in a format other than a Zip or to download individual artifact contents.

So even if we package it as a .tar archive, it will eventually be uploaded as a .tar.zip file because of how actions/upload-artifact@v4 works.

Oh, that answers my previous question then. :P

But I'm not sure how keen you guys are on having one additional script to handle this limitation. Please let me know if it's okay to have an additional script :)

Up to @slaren

@CISC (Collaborator) commented Nov 22, 2025

BTW, I didn't mean to upload plain .tar files earlier, I meant that there are probably better alternatives than .gz, though I don't think .tar.zip is one of them... :D

@taronaeo (Collaborator, Author)

BTW, I didn't mean to upload plain .tar files earlier, I meant that there are probably better alternatives than .gz, though I don't think .tar.zip is one of them... :D

Haha, I was wondering why you would create an archive without compressing the contents :) I will add the compression algorithm once we've decided which one to use.

I was deciding between .tar.gz and .tar.xz, and I would prefer .tar.gz mainly for compatibility's sake, though I am seeing a lot more use of .tar.xz in newer/bigger projects (e.g., LLVM). If anyone has any opinions, feel free to chime in.
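For anyone weighing in, this is roughly the comparison I have in mind (standard GNU tar flags; sizes will vary by build):

```bash
# Illustrative only: compare the two candidates on the same build output.
tar -czf llama-bin.tar.gz build/bin   # gzip: fastest, available everywhere
tar -cJf llama-bin.tar.xz build/bin   # xz: slower to create, usually smaller

ls -lh llama-bin.tar.gz llama-bin.tar.xz
```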

@taronaeo taronaeo marked this pull request as draft November 29, 2025 10:17
@taronaeo taronaeo marked this pull request as ready for review November 29, 2025 16:43
@taronaeo (Collaborator, Author) commented Nov 29, 2025

@CISC Requesting your review for this PR again :)

It turns out that we can actually upload any archive type (e.g., .tar.gz) to GitHub releases without it being converted into a .zip, and without us having to maintain an additional script. The .zip limitation only applies to GitHub Artifacts, which are different from GitHub releases.

So I've chosen .tar.gz as the preferred archive for Linux for now, and included a deprecation notice in the GitHub release description plus some quick download links.

Please let me know if I need to change anything.
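For users, the Linux flow would then look roughly like this (asset name below is illustrative, not an actual release file name):

```bash
# Illustrative asset name; actual names differ per release/build.
curl -LO https://github.com/ggml-org/llama.cpp/releases/latest/download/llama-bin-ubuntu-x64.tar.gz
tar -xzf llama-bin-ubuntu-x64.tar.gz   # symlinked libraries are preserved
```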

@taronaeo (Collaborator, Author) commented Dec 1, 2025

As I'm sure you're aware, Diego is currently on hiatus.

Is anyone else able to review this PR on behalf of @/slaren?

From: #17639 (comment)

@CISC (Collaborator) left a comment


Sorry, missed the latest update.

Is .tar.gz ideal for macOS (as a desktop experience)?

@taronaeo (Collaborator, Author) commented Dec 2, 2025

Is .tar.gz ideal for macOS (as a desktop experience)?

Tested it with the last commit, had no issues with symlinks :)
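In case it's useful, this is roughly how I checked it (archive and dylib names are illustrative):

```bash
# Illustrative names; extract and confirm the symlinks are intact.
mkdir -p llama-bin && tar -xzf llama-bin-macos-arm64.tar.gz -C llama-bin
ls -l llama-bin/*.dylib   # symlinks should show as "name -> target"
```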

@taronaeo taronaeo merged commit 7b6d745 into ggml-org:master Dec 2, 2025
3 checks passed