
Write a tool to convert Bamini text to Unicode text, keeping English words as English #235

Open
tshrinivasan opened this issue Nov 6, 2024 · 3 comments

@tshrinivasan (Member)

Bamini is a legacy Tamil typing system; it is not Unicode-based.

Sample:
nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy fw;W Diploma in Indigendus Medicine and Surgery gl;lk; ngw;W itj;jpa fyhepjpahshu;. ,q;F fy;tp fw;Fk; fhyj;jpy; tpupTiuahsu;fs; Nguhrpupau;fs; rf khztu;fs; KjypNahupd; ed;kjpg;ig ngw;wpUe;jhu;. Nguhrpupau; gzpf;fu; lhf;lu; ey;yehjd; lhf;lu; rghgjpg;gps;is KjypNahupd; ed;kjpg;Gg; ngw;w cj;jkhd khztdhf tpsq;fpdhu;.

We can convert it to Unicode using the open-tamil library:

from tamil.txt2unicode import bamini2unicode

bamini_words = "nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy fw;W Diploma in Indigendus Medicine and Surgery gl;lk; ngw;W itj;jpa fyhepjpahshu;. ,q;F fy;tp fw;Fk; fhyj;jpy; tpupTiuahsu;fs; Nguhrpupau;fs; rf khztu;fs; KjypNahupd; ed;kjpg;ig ngw;wpUe;jhu;. Nguhrpupau; gzpf;fu; lhf;lu; ey;yehjd; lhf;lu; rghgjpg;gps;is KjypNahupd; ed;kjpg;Gg; ngw;w cj;jkhd khztdhf tpsq;fpdhu;."

# Convert the Bamini-encoded text to Unicode Tamil
unicode_words = bamini2unicode(bamini_words)
print(unicode_words)

Result:
கொழும்பிலுள்ள ஆயுர் வேத மருத்துவக் கல்லூரியிலே கற்று னுipடழஅய in ஐனெபைநனெரள ஆநனiஉiநெ யனெ ளுரசபநசல பட்டம் பெற்று வைத்திய கலாநிதியாளார். இங்கு கல்வி கற்கும் காலத்தில் விரிவுரையாளர்கள் பேராசிரியர்கள் சக மாணவர்கள் முதலியோரின் நன்மதிப்பை பெற்றிருந்தார். பேராசிரியர் பணிக்கர் டாக்டர் நல்லநாதன் டாக்டர் சபாபதிப்பிள்ளை முதலியோரின் நன்மதிப்புப் பெற்ற உத்தமான மாணவனாக விளங்கினார்.

The Bamini source consists of English characters only, and we can convert it all to Tamil with the bamini2unicode method of the open-tamil Python library.

But when the source contains regular English words, those are converted to Tamil as well, which produces junk characters.

Example -
source = nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy
result = கொழும்பிலுள்ள ஆயுர் வேத மருத்துவக் கல்லூரியிலே

source = Diploma in Indigendus Medicine and Surgery
result = னுipடழஅய in ஐனெபைநனெரள ஆநனiஉiநெ யனெ ளுரசபநசல

We have to write code that checks each source word and skips the conversion when the word is a real English word, as sketched below.
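
A minimal sketch of that idea (untested; it assumes words are whitespace-separated and takes the English check as a parameter, e.g. the is_english_word function developed in the next comment; note a Bamini word that happens to spell a real English word would be wrongly skipped):

from tamil.txt2unicode import bamini2unicode

def convert_mixed_text(text, is_english_word):
    converted = []
    for word in text.split():
        # Keep recognised English words as-is; convert everything else
        if is_english_word(word):
            converted.append(word)
        else:
            converted.append(bamini2unicode(word))
    return " ".join(converted)

# Example (using the is_english_word checker from the next comment):
# print(convert_mixed_text(bamini_words, is_english_word))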

@tshrinivasan (Member Author) commented Nov 6, 2024

Now, trying to find out whether the word "Indigendus" is English or not, using Python.

import re

import enchant
from nltk.corpus import names, wordnet, words

# The NLTK corpora must be downloaded once beforehand, e.g.
# nltk.download('words'); nltk.download('wordnet'); nltk.download('names')

# Load the NLTK word lists (lower-cased, so the lookup below is
# case-insensitive) and initialize the pyenchant dictionary
nltk_words = set(w.lower() for w in words.words())
wordnet_words = set(s.name().split('.')[0] for s in wordnet.all_synsets())
names_words = set(n.lower() for n in names.words())
enchant_dict = enchant.Dict("en_US")

def is_english_word(word):
    # Basic check: the word contains only alphabetic English characters
    if not re.fullmatch(r"[A-Za-z]+", word):
        return False

    word_lower = word.lower()
    return (
        word_lower in nltk_words or
        word_lower in wordnet_words or
        word_lower in names_words or
        enchant_dict.check(word_lower)
    )

print(is_english_word("Indigendus"))

Result:
False

The word "indigendus" is a Latin word:
https://en.wiktionary.org/wiki/indigendus

We still have to figure out how to identify meaningful English or Latin words in a given input string, using Python; one possible direction is sketched below.
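
One possible extension (a sketch, untested): pyenchant can consult any spell-checker dictionary installed on the system, so if a Latin hunspell/aspell dictionary is available it could catch words like "indigendus". The "la" tag here is hypothetical; enchant.dict_exists() guards against tags with no installed dictionary.

import enchant

# Dictionaries to consult: US English plus, optionally, Latin.
# Tags are skipped when no matching dictionary is installed.
tags = ["en_US", "la"]
dicts = [enchant.Dict(tag) for tag in tags if enchant.dict_exists(tag)]

def is_known_word(word):
    # Accept the word if any available dictionary recognises it
    return any(d.check(word) for d in dicts)

print(is_known_word("Indigendus"))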

@tshrinivasan (Member Author)

A non-traditional solution will be to use a large language model (LLM) or a small language model (SLM) like Phi3. I find it perfect for anything to do with language.

I checked with GPT and this prompt works:


The following string is a non-unicode representation using a custom font of a non-english language. The string has mixed non english, english and latin words in it. Extract the english and latin words in this format:

english_and_latin_words = ["word1", "word2"...]

The string : "nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy fw;W Diploma in Indigendus Medicine and Surgery gl;lk; ngw;W itj;jpa fyhepjpahshu;. ,q;F fy;tp fw;Fk; fhyj;jpy; tpupTiuahsu;fs; Nguhrpupau;fs; rf khztu;fs; KjypNahupd; ed;kjpg;ig ngw;wpUe;jhu;. Nguhrpupau; gzpf;fu; lhf;lu; ey;yehjd; lhf;lu; rghgjpg;gps;is KjypNahupd; ed;kjpg;Gg; ngw;w cj;jkhd khztdhf tpsq;fpdhu;."

Your Answer:

english_and_latin_words = ["Diploma", "in", "Indigendus", "Medicine", "and", "Surgery"]


You could use a small language model for lower cost by running it yourself. This is a very simple use case, and small models should work. I have not tested them for this, though.
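
For reference, a minimal sketch of running such a prompt against a local model with the ollama Python client (untested here; it assumes the ollama service is running and the phi3 model has already been pulled):

import ollama

# The full prompt quoted above, with the Bamini string appended
prompt = 'The following string is a non-unicode representation ... The string : "nfhOk;gpYs;s ..."'

response = ollama.chat(
    model="phi3",
    messages=[{"role": "user", "content": prompt}],
)
# The model's reply should contain the english_and_latin_words list
print(response["message"]["content"])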

BTW, we make it super easy to implement the above and deploy it as an API with our project "Unstract":
https://github.com/Zipstack/unstract


Received the above solution from the ILUGC mailing list. Thanks, Arun Venkataswamy.

@tshrinivasan (Member Author)

Thanks, Arun, for the idea.

On my CPU-only desktop (4 vCPUs), Ollama with Phi3 hit 100% CPU within a few seconds. I killed it and turned off the Ollama service to stop the machine from rebooting.
