PHI3-vision gguf conversion #7705
base: master
`examples/llava/README.md`:

@@ -103,6 +103,59 @@ python ./examples/convert-legacy-llama.py ../llava-v1.6-vicuna-7b/ --skip-unknow
**note** llava-1.6 needs more context than llava-1.5, at least 3000 is needed (just run it at -c 4096)
**note** llava-1.6 greatly benefits from batched prompt processing (defaults work)

## Phi-3-Vision-128K-Instruct gguf conversion
1) Set up a working directory for PHI3V and PHI3-instruct and clone both models into it (alternatively, `cd` into your local Hugging Face cache and copy the models from there):
```console
mkdir phi3-fun
cd phi3-fun

mkdir phi3-base
git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

mkdir phi3-vision
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
```

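Note: a review comment at the bottom of this PR points out that `git clone` creates its own subdirectories, so the `phi3-base`/`phi3-vision` paths used in the later steps will not match as written. A possible variant, not part of the PR, that clones straight into the directory names the later commands expect:

```console
mkdir phi3-fun
cd phi3-fun
git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct phi3-base
git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct phi3-vision
```
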
2) Use `llava-surgery-v2.py` to extract the vision encoder (CLIP) from PHI3V:
```console
python examples/llava/llava-surgery-v2.py -C -m phi3-fun/phi3-vision/
```
- you will find a `llava.projector` and a `llava.clip` file in your model directory

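A quick sanity check that both files were produced (paths assume the commands above were run from the llama.cpp repository root):

```console
ls phi3-fun/phi3-vision/llava.clip phi3-fun/phi3-vision/llava.projector
```
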
3) Copy the `llava.clip` file into a subdirectory (like `vit`), rename it to `pytorch_model.bin` and add a fitting vit configuration to the directory:
```console
# run from the phi3-fun/phi3-vision directory
mkdir vit
cp llava.clip vit/pytorch_model.bin
cp llava.projector vit/
curl -s -q https://huggingface.co/cmp-nct/llava-1.6-gguf/raw/main/config_vit.json -o vit/config.json
```
Then set `mm_projector_type` to `mlp_phi` in `vit/config.json`, for example as sketched below.

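One way to apply that edit without opening an editor (a sketch; it assumes `vit/config.json` is a flat JSON object and simply adds or overwrites the `mm_projector_type` key):

```console
python -c "import json; p = 'phi3-fun/phi3-vision/vit/config.json'; c = json.load(open(p)); c['mm_projector_type'] = 'mlp_phi'; json.dump(c, open(p, 'w'), indent=2)"
```
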
4) Create the visual gguf model:
```console
python examples/llava/convert-image-encoder-to-gguf.py -m phi3-fun/phi3-vision/vit --llava-projector phi3-fun/phi3-vision/vit/llava.projector --output-dir phi3-fun/phi3-vision/vit --clip-model-is-vision
```

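If the conversion succeeds, the projector gguf that the invoke step below points at should now exist:

```console
ls phi3-fun/phi3-vision/vit/mmproj-model-f16.gguf
```
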
5) Extract the language-modelling part of PHI3V (everything except CLIP) and assign its weights to a normal PHI3 model:

```console
python examples/llava/phi3-weight-transfer.py --phi3-instruct-base-path phi3-fun/phi3-base --phi3v-base-path phi3-fun/phi3-vision
```

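After the transfer, the script has written `phi3-instruct-vision-weight-transfer.safetensors` into `phi3-fun/phi3-base` and re-pointed the safetensors index at it. A quick check (a sketch, not part of the PR):

```console
python -c "import json; m = json.load(open('phi3-fun/phi3-base/model.safetensors.index.json'))['weight_map']; print(set(m.values()))"
# expected: {'phi3-instruct-vision-weight-transfer.safetensors'}
```
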
6) Convert this to a normal gguf. First delete the old safetensors shards from `phi3-fun/phi3-base` (a sketch follows below this step), then run:
```console
python convert-hf-to-gguf.py phi3-fun/phi3-base
```

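A sketch of that cleanup, assuming the Phi-3-mini checkout uses the usual sharded `model-0000N-of-0000M.safetensors` naming; keep `model.safetensors.index.json` and the `phi3-instruct-vision-weight-transfer.safetensors` file written in the previous step:

```console
cd phi3-fun/phi3-base
ls *.safetensors                 # check what is there first
rm model-*.safetensors           # remove only the original shards
cd ../..
```
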
7) Invoke:
```console
./llava-cli -m phi3-fun/phi3-base/ggml-model-f16.gguf --mmproj phi3-fun/phi3-vision/vit/mmproj-model-f16.gguf --image IMAGE -c 4096 --temp .1 -p "PROMPT"
```

> **Contributor comment:** Templating should be recommended.

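Regarding the templating suggestion above: a hedged sketch of an explicit Phi-3-instruct style prompt (the `<|user|>`, `<|end|>` and `<|assistant|>` markers come from the Phi-3 chat format; where llava-cli injects the image embedding, and whether this wording helps, is not verified here):

```console
./llava-cli -m phi3-fun/phi3-base/ggml-model-f16.gguf \
    --mmproj phi3-fun/phi3-vision/vit/mmproj-model-f16.gguf \
    --image IMAGE -c 4096 --temp .1 \
    -p $'<|user|>\nDescribe the image in detail.<|end|>\n<|assistant|>\n'
```
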
## llava-cli templating and llava-1.6 prompting

llava-1.5 models all use the same vicuna prompt, here you can just add your image question like `-p "Provide a full description."`

@@ -137,3 +190,4 @@ Alternatively just pay notice to how many "tokens" have been used for your promp
- [x] Support non-CPU backend for the image encoding part.
- [ ] Support different sampling methods.
- [ ] Support more model variants.

`examples/llava/convert-image-encoder-to-gguf.py`:

```diff
@@ -86,7 +86,7 @@ def bytes_to_unicode():
 ap.add_argument("--clip-model-is-openclip", action="store_true", required=False,
                 help="The clip model is from openclip (for ViT-SO400M type))")
 ap.add_argument("--llava-projector", help="Path to llava.projector file. If specified, save an image encoder for LLaVA models.")
-ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2"], default="mlp")
+ap.add_argument("--projector-type", help="Type of projector. Possible values: mlp, ldp, ldpv2", choices=["mlp", "ldp", "ldpv2", "mlp_phi"], default="mlp_phi")
 ap.add_argument("-o", "--output-dir", help="Directory to save GGUF files. Default is the original model directory", default=None)
 # Example --image_mean 0.48145466 0.4578275 0.40821073 --image_std 0.26862954 0.26130258 0.27577711
 # Example --image_mean 0.5 0.5 0.5 --image_std 0.5 0.5 0.5
```

> **Contributor comment:** I'd not change the default.
>
> **Contributor comment:** Change back to "mlp" default and add the phi one into "possible values" help text.

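If the default is reverted to `mlp` as suggested, the Phi-3 projector would have to be selected explicitly when creating the visual gguf, e.g.:

```console
python examples/llava/convert-image-encoder-to-gguf.py -m phi3-fun/phi3-vision/vit \
    --llava-projector phi3-fun/phi3-vision/vit/llava.projector \
    --output-dir phi3-fun/phi3-vision/vit --clip-model-is-vision \
    --projector-type mlp_phi
```
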
```python
@@ -206,39 +206,39 @@ def bytes_to_unicode():
fout.add_float32(k(KEY_ATTENTION_LAYERNORM_EPS, VISION), v_hparams["layer_norm_eps"])
block_count = v_hparams["num_hidden_layers"] - 1 if has_llava_projector else v_hparams["num_hidden_layers"]
fout.add_uint32(k(KEY_BLOCK_COUNT, VISION), block_count)
# /**
# "image_grid_pinpoints": [
#     [
#         336,
#         672
#     ],
#     [
#         672,
#         336
#     ],
#     [
#         672,
#         672
#     ],
#     [
#         1008,
#         336
#     ],
#     [
#         336,
#         1008
#     ]
# ],
# Flattened:
# [
#     336, 672,
#     672, 336,
#     672, 672,
#     1008, 336,
#     336, 1008
# ]
# *
# */
if "image_grid_pinpoints" in v_hparams:
    # flatten it
    image_grid_pinpoints = []

@@ -257,7 +257,6 @@ def bytes_to_unicode():
if "mm_projector_type" in v_hparams:
    fout.add_string("clip.vision.mm_projector_type", v_hparams["mm_projector_type"])

if processor is not None:
    image_mean = processor.image_processor.image_mean if args.image_mean is None or args.image_mean == default_image_mean else args.image_mean
    image_std = processor.image_processor.image_std if args.image_std is None or args.image_std == default_image_std else args.image_std
```

> **Contributor comment:** Consider putting this entire logic into `llava_surgery_v2.py`.

`examples/llava/phi3-weight-transfer.py` (new file):

```python
import argparse
import json
import os

import torch
from safetensors.torch import save_file
from transformers import AutoModelForCausalLM


def main(args):

    # https://stackoverflow.com/questions/67689219/copy-one-layers-weights-from-one-huggingface-bert-model-to-another

    phi3_vision = AutoModelForCausalLM.from_pretrained(args.phi3v_base_path,
                                                       device_map="auto",
                                                       trust_remote_code=True,
                                                       torch_dtype=torch.float16,
                                                       _attn_implementation='eager')

    print("PHI3 VISION LOADED IN MEMORY")

    phi3_base = AutoModelForCausalLM.from_pretrained(args.phi3_instruct_base_path,
                                                     device_map="auto",
                                                     trust_remote_code=True,
                                                     torch_dtype=torch.float16,
                                                     _attn_implementation='eager')

    print("PHI3 BASE LOADED IN MEMORY")

    phi3_vision_layers = dict(phi3_vision.named_parameters())
    phi3_base_layers = dict(phi3_base.named_parameters())

    # parameter names present in both models (the shared language-model part)
    parts = list(set(phi3_vision_layers.keys()) & set(phi3_base_layers.keys()))

    print("----------------------------------------------------")
    print("before transfer")
    print(dict(phi3_vision.named_parameters())["model.layers.19.mlp.gate_up_proj.weight"]
          == dict(phi3_base.named_parameters())["model.layers.19.mlp.gate_up_proj.weight"])
    print("----------------------------------------------------")

    for part in parts:
        # copy source (vision) weights into target (base)
        phi3_base_layers[part].data.copy_(phi3_vision_layers[part].data)

    print("----------------------------------------------------")
    print("after transfer")
    print(dict(phi3_vision.named_parameters())["model.layers.19.mlp.gate_up_proj.weight"]
          == dict(phi3_base.named_parameters())["model.layers.19.mlp.gate_up_proj.weight"])
    print("----------------------------------------------------")

    # save updated model weights
    outfile = "phi3-instruct-vision-weight-transfer.safetensors"
    outpath = os.path.join(args.phi3_instruct_base_path, outfile)
    save_file(phi3_base_layers, outpath)
    print(f"updated .safetensors saved to {outpath}")

    # update safetensors index config
    weight_index_path = os.path.join(args.phi3_instruct_base_path, "model.safetensors.index.json")

    with open(weight_index_path, "r") as f:
        index_data = json.load(f)

    for k, v in index_data["weight_map"].items():
        if v != "phi3-instruct-vision-weight-transfer.safetensors":
            index_data["weight_map"][k] = outfile

    with open(weight_index_path, "w") as f:
        json.dump(index_data, f)

    print("hf safetensors mapping updated!")


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="script to copy weights from PHI3V language model to PHI3-instruct")

    parser.add_argument("--phi3-instruct-base-path", type=str, default="microsoft/Phi-3-mini-128k-instruct", help="model path or model card for PHI3-instruct")
    parser.add_argument("--phi3v-base-path", type=str, default="microsoft/Phi-3-vision-128k-instruct", help="model path or model card for PHI3V")

    main(parser.parse_args())
```
> **Reviewer comment:** The directories won't match up at this point, as `git clone` creates its own subdirectories. Renaming them would be better, and the two `mkdir` calls could be removed.