
Failed to load the model Phi 4 on ROCm-runtime #296

Open
jurykor opened this issue Jan 10, 2025 · 1 comment

jurykor commented Jan 10, 2025

🥲 Failed to load the model

Error loading model.

(Exit code: 18446744072635812000). Unknown error. Try a different model and/or config.
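Aside: the huge unsigned number looks like a negative 32-bit NTSTATUS crash code that was widened to 64 bits and rounded through a double on the way to the log (hence the trailing zeros), so only the high bits survive. A minimal decoding sketch in Python (not an official LM Studio tool):

    # Reinterpret LM Studio's unsigned exit code as a signed 32-bit status.
    raw = 18446744072635812000
    signed = raw - (1 << 64)          # -1073739616
    print(hex(signed & 0xFFFFFFFF))   # 0xc00008a0 -- low bits lost to rounding

0xC0000xxx is the Windows fatal-error status range, which suggests the ROCm runtime crashed rather than returning a normal load error.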

Windows 11 Pro 24H2, build 26100.2605
AMD Ryzen 7 7700 / 64 GB RAM
AMD Radeon RX 7800 XT / 16 GB VRAM / AMD Software 24.12.1
LM Studio 0.3.6

With ROCm runtime 1.1.10 and 1.1.13: "Error loading model".
With Vulkan: everything works.

2025-01-10.1-rocm-1.10-err.log
2025-01-10.1-rocm-1.13-err.log
2025-01-10.1-vulkan-ok.log

jurykor commented Jan 11, 2025

The same model, imported from Ollama, loads normally:

Import:

ollama pull phi4:14b-q4_K_M
copy G:\AI\.ollama\blobs\sha256-fd7b6731c33c57f61767612f56517460ec2d1e2e5a3f0163e0eb3d8d8cb5df20 G:\AI\.lmstudio\models\_ollama\phi4\phi4-14b-q4_K_M.gguf 
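For reference, a hedged Python sketch that resolves the blob hash from Ollama's manifest instead of hardcoding it. The paths come from the commands above; the manifest layout is the standard Ollama one, but treat it as an assumption:

    import json, pathlib, shutil

    # Assumed layout: <models dir>/manifests/... is a JSON manifest whose model
    # layer digest ("sha256:...") names the GGUF file under <models dir>/blobs/.
    OLLAMA = pathlib.Path(r"G:\AI\.ollama")
    manifest = OLLAMA / "manifests" / "registry.ollama.ai" / "library" / "phi4" / "14b-q4_K_M"

    layers = json.loads(manifest.read_text())["layers"]
    digest = next(l["digest"] for l in layers
                  if l["mediaType"] == "application/vnd.ollama.image.model")

    dest = pathlib.Path(r"G:\AI\.lmstudio\models\_ollama\phi4\phi4-14b-q4_K_M.gguf")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(OLLAMA / "blobs" / digest.replace(":", "-"), dest)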

Loading with ROCm 1.1.10:

[2025-01-11 15:56:20][DEBUG] AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

[2025-01-11 15:56:20][DEBUG] llama_model_loader: loaded meta data with 33 key-value pairs and 243 tensors from G:\AI\.lmstudio\models\_ollama\phi4\phi4-14b-q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Phi 4
llama_model_loader: - kv   3:                            general.version str              = 4
llama_model_loader: - kv   4:                       general.organization str              = Microsoft
llama_model_loader: - kv   5:                           general.basename str              = phi
llama_model_loader: - kv   6:                         general.size_label str              = 15B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,7]       = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  11:                        phi3.context_length u32              = 16384
llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 16384
llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv  15:                           phi3.block_count u32              = 40
llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 250000.000000
llama_model_loader: - kv  21:                          general.file_type u32              = 15

[2025-01-11 15:56:20][DEBUG] llama_model_loader: - kv  22:              phi3.attention.sliding_window u32              = 131072
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = dbrx

[2025-01-11 15:56:20][DEBUG] llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...

[2025-01-11 15:56:20][DEBUG] llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

[2025-01-11 15:56:20][DEBUG] llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 100257
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  101 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:   21 tensors

[2025-01-11 15:56:20][DEBUG] llm_load_vocab: special tokens cache size = 96

[2025-01-11 15:56:20][DEBUG] llm_load_vocab: token to piece cache size = 0.6151 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 100352
llm_load_print_meta: n_merges         = 100000
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 10
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 131072
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4

[2025-01-11 15:56:20][DEBUG] llm_load_print_meta: n_embd_k_gqa     = 1280
llm_load_print_meta: n_embd_v_gqa     = 1280
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 17920
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 250000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 14B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 14.66 B
llm_load_print_meta: model size       = 8.43 GiB (4.94 BPW) 
llm_load_print_meta: general.name     = Phi 4
llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: PAD token        = 100257 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 100265 '<|im_end|>'
llm_load_print_meta: max token length = 256
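
As a sanity check, the 4.94 BPW figure is consistent with the printed size and parameter count:

    # 8.43 GiB of weights spread over 14.66 B parameters.
    size_bits = 8.43 * 2**30 * 8
    print(round(size_bits / 14.66e9, 2))   # 4.94 bits per weight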

[2025-01-11 15:56:21][DEBUG] ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.26 MiB

[2025-01-11 15:56:25][DEBUG] llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  8354.71 MiB
llm_load_tensors:        CPU buffer size =   275.62 MiB

[2025-01-11 15:56:28][DEBUG] llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 250000.0
llama_new_context_with_model: freq_scale = 1

[2025-01-11 15:56:28][DEBUG] llama_kv_cache_init:      ROCm0 KV buffer size =  3200.00 MiB
llama_new_context_with_model: KV self size  = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
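
The KV figures also follow directly from the metadata above (f16 = 2 bytes per element):

    # n_ctx * n_layer * n_embd_k_gqa * 2 bytes for K, and the same for V.
    n_ctx, n_layer, n_embd_gqa = 16384, 40, 1280
    print(n_ctx * n_layer * n_embd_gqa * 2 / 2**20)   # 1600.0 MiB each -> 3200 MiB total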

[2025-01-11 15:56:28][DEBUG] llama_new_context_with_model:  ROCm_Host  output buffer size =     0.38 MiB

[2025-01-11 15:56:28][DEBUG] llama_new_context_with_model:      ROCm0 compute buffer size =  1357.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    42.01 MiB
llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 2
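
Summing the device-side buffers from this log, the fully offloaded model fits comfortably in the card's 16 GiB, so the original ROCm failure does not look like a simple out-of-memory:

    # ROCm0 buffers reported above, in MiB.
    weights, kv, compute = 8354.71, 3200.00, 1357.00
    print(weights + kv + compute)   # 12911.71 MiB, i.e. ~12.6 GiB of 16 GiB VRAM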
