Model fails to load after conversion to sentencepiece #13

Converted the model to an sp model via the class method convert_to_sentencepiece; loading the result then raises an error.
Related issue: google/sentencepiece#156
The model contains "\0". Should it be stripped during conversion, and would that have any side effects?
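A minimal diagnostic sketch (not from the original report) for checking whether the converted model is hitting the NUL-byte restriction described in google/sentencepiece#156. It assumes a sentencepiece version recent enough to ship sentencepiece_model_pb2 inside the package, and reuses the sp.model filename from the report:

```python
# Diagnostic sketch: parse the converted sp.model and count pieces that
# contain a NUL byte, which sentencepiece refuses to load.
# Assumes sentencepiece ships sentencepiece_model_pb2 (recent versions do).
from sentencepiece import sentencepiece_model_pb2 as sp_model

proto = sp_model.ModelProto()
with open('sp.model', 'rb') as f:
    proto.ParseFromString(f.read())

bad = [(i, p.piece) for i, p in enumerate(proto.pieces) if '\0' in p.piece]
print(f'{len(bad)} of {len(proto.pieces)} pieces contain a NUL byte')
for i, piece in bad[:10]:
    print(i, repr(piece))
```

Note that simply deleting such pieces during conversion would shift every later piece id, so the converted vocabulary would no longer line up with the original bytepiece ids; that is one possible side effect to weigh.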
Comments
Could you share the model from before the conversion? Or provide a minimal reproduction script?
@bojone Reproduced it following the example in the README. The model is here: https://microbin.yzlnew.com/upload/sloth-worm-falcon

```python
from bytepiece import Tokenizer

tokenizer1 = Tokenizer('tokenizer_80k_small_isolated.model')
tokenizer1.convert_to_sentencepiece('sp.model')

import sentencepiece as spm

tokenizer2 = spm.SentencePieceProcessor("sp.model")
```
@yzlnew It looks like your model is not an ensure_unicode one? Only models trained with ensure_unicode are guaranteed to convert cleanly to sentencepiece (in more recent versions ensure_unicode is enabled by default; you can check).
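One way such a check might look (a sketch added here, not part of the conversation): tokenize a sample string and confirm that each returned piece decodes as UTF-8, which is what ensure_unicode is meant to guarantee. It assumes Tokenizer.tokenize() returns raw byte pieces as shown in the bytepiece README, and it only covers pieces that the sample text happens to hit:

```python
# Spot-check sketch: do the pieces this tokenizer emits decode as UTF-8?
# Assumes Tokenizer.tokenize() returns a list of bytes, per the bytepiece README.
from bytepiece import Tokenizer

tokenizer = Tokenizer('tokenizer_80k_small_isolated.model')
sample = '今天天气不错 the quick brown fox 1234'  # arbitrary mixed sample
bad = []
for piece in tokenizer.tokenize(sample):
    try:
        piece.decode('utf-8')
    except UnicodeDecodeError:
        bad.append(piece)
print(f'{len(bad)} pieces are not valid UTF-8')
```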
@bojone That is odd; this model was trained with 0.6.3, and it is an ensure_unicode one.
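If the converted model does turn out to contain NUL bytes despite ensure_unicode, one possible workaround (a sketch under that assumption, not a fix proposed in the thread) is to overwrite the offending pieces with unique placeholder strings rather than deleting them, so piece ids stay aligned with the bytepiece vocabulary; the side effect is that those ids can no longer decode back to their original bytes. The placeholder naming and the sp_patched.model filename below are hypothetical:

```python
# Workaround sketch (an assumption, not the maintainers' fix): replace
# NUL-containing pieces with unique placeholders so ids do not shift,
# re-serialize, then try loading the patched model.
from sentencepiece import sentencepiece_model_pb2 as sp_model
import sentencepiece as spm

proto = sp_model.ModelProto()
with open('sp.model', 'rb') as f:
    proto.ParseFromString(f.read())

for i, p in enumerate(proto.pieces):
    if '\0' in p.piece:
        p.piece = f'<bytepiece_unused_{i}>'  # hypothetical placeholder name

with open('sp_patched.model', 'wb') as f:
    f.write(proto.SerializeToString())

tokenizer2 = spm.SentencePieceProcessor(model_file='sp_patched.model')
print(tokenizer2.get_piece_size())
```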