Analyzing the Information Density of Tokenization Methods to Achieve Effective Training of NLP Models
Riya Bhatia, Shivam Syal, Angela Yuan, Komal Keesara, Tawshia Chowdhury, Dmitri Pavlichin
Stanford STEM to SHTEM
Finalized article (some figures here no longer display properly; please refer to our GitHub files for them): https://theinformaticists.com/2021/08/26/analyzing-the-information-density-of-various-tokenizations-for-the-optimization-of-natural-language-processing-models/