
Write a tool to convert Bamini text to Unicode text, keeping English words as English #235

Open
tshrinivasan opened this issue Nov 6, 2024 · 3 comments

@tshrinivasan (Member)

Bamini is a legacy Tamil typing system; it is not Unicode-based.

Sample:
nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy fw;W Diploma in Indigendus Medicine and Surgery gl;lk; ngw;W itj;jpa fyhepjpahshu;. ,q;F fy;tp fw;Fk; fhyj;jpy; tpupTiuahsu;fs; Nguhrpupau;fs; rf khztu;fs; KjypNahupd; ed;kjpg;ig ngw;wpUe;jhu;. Nguhrpupau; gzpf;fu; lhf;lu; ey;yehjd; lhf;lu; rghgjpg;gps;is KjypNahupd; ed;kjpg;Gg; ngw;w cj;jkhd khztdhf tpsq;fpdhu;.

We can convert it to Unicode using the open-tamil library:

from tamil.txt2unicode import bamini2unicode

bamini_words = "nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy fw;W Diploma in Indigendus Medicine and Surgery gl;lk; ngw;W itj;jpa fyhepjpahshu;. ,q;F fy;tp fw;Fk; fhyj;jpy; tpupTiuahsu;fs; Nguhrpupau;fs; rf khztu;fs; KjypNahupd; ed;kjpg;ig ngw;wpUe;jhu;. Nguhrpupau; gzpf;fu; lhf;lu; ey;yehjd; lhf;lu; rghgjpg;gps;is KjypNahupd; ed;kjpg;Gg; ngw;w cj;jkhd khztdhf tpsq;fpdhu;."

# Convert the Bamini-encoded text to Unicode Tamil
unicode_words = bamini2unicode(bamini_words)
print(unicode_words)

Result:
கொழும்பிலுள்ள ஆயுர் வேத மருத்துவக் கல்லூரியிலே கற்று னுipடழஅய in ஐனெபைநனெரள ஆநனiஉiநெ யனெ ளுரசபநசல பட்டம் பெற்று வைத்திய கலாநிதியாளார். இங்கு கல்வி கற்கும் காலத்தில் விரிவுரையாளர்கள் பேராசிரியர்கள் சக மாணவர்கள் முதலியோரின் நன்மதிப்பை பெற்றிருந்தார். பேராசிரியர் பணிக்கர் டாக்டர் நல்லநாதன் டாக்டர் சபாபதிப்பிள்ளை முதலியோரின் நன்மதிப்புப் பெற்ற உத்தமான மாணவனாக விளங்கினார்.

The Bamini source consists of English characters only, and we can convert it all to Tamil with the bamini2unicode method of the open-tamil Python library.

But when the source contains regular English words, those are converted to Tamil as well, which produces junk characters.

Example -
source = nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy
result = கொழும்பிலுள்ள ஆயுர் வேத மருத்துவக் கல்லூரியிலே

source = Diploma in Indigendus Medicine and Surgery
result = னுipடழஅய in ஐனெபைநனெரள ஆநனiஉiநெ யனெ ளுரசபநசல

We have to write code that checks each source word and skips the conversion when the word is a real English word, as sketched below.
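
A minimal sketch of that idea (untested; it assumes words are whitespace-separated and takes the English check as a parameter, e.g. the is_english_word function developed in the next comment; note a Bamini word that happens to spell a real English word would be wrongly skipped):

from tamil.txt2unicode import bamini2unicode

def convert_mixed_text(text, is_english_word):
    converted = []
    for word in text.split():
        # Keep recognised English words as-is; convert everything else
        if is_english_word(word):
            converted.append(word)
        else:
            converted.append(bamini2unicode(word))
    return " ".join(converted)

# Example (using the is_english_word checker from the next comment):
# print(convert_mixed_text(bamini_words, is_english_word))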

@tshrinivasan (Member Author) commented Nov 6, 2024

Now, trying to find out whether the word "Indigendus" is English or not, using Python.

import re

import enchant
from nltk.corpus import names, wordnet, words

# The NLTK corpora must be downloaded once beforehand, e.g.
# nltk.download('words'); nltk.download('wordnet'); nltk.download('names')

# Load the NLTK word lists (lower-cased, so the lookup below is
# case-insensitive) and initialize the pyenchant dictionary
nltk_words = set(w.lower() for w in words.words())
wordnet_words = set(s.name().split('.')[0] for s in wordnet.all_synsets())
names_words = set(n.lower() for n in names.words())
enchant_dict = enchant.Dict("en_US")

def is_english_word(word):
    # Basic check: the word contains only alphabetic English characters
    if not re.fullmatch(r"[A-Za-z]+", word):
        return False

    word_lower = word.lower()
    return (
        word_lower in nltk_words or
        word_lower in wordnet_words or
        word_lower in names_words or
        enchant_dict.check(word_lower)
    )

print(is_english_word("Indigendus"))

Result:
False

The word "indigendus" is a Latin word:
https://en.wiktionary.org/wiki/indigendus

We still have to figure out how to identify meaningful English or Latin words in a given input string, using Python; one possible direction is sketched below.
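
One possible extension (a sketch, untested): pyenchant can consult any spell-checker dictionary installed on the system, so if a Latin hunspell/aspell dictionary is available it could catch words like "indigendus". The "la" tag here is hypothetical; enchant.dict_exists() guards against tags with no installed dictionary.

import enchant

# Dictionaries to consult: US English plus, optionally, Latin.
# Tags are skipped when no matching dictionary is installed.
tags = ["en_US", "la"]
dicts = [enchant.Dict(tag) for tag in tags if enchant.dict_exists(tag)]

def is_known_word(word):
    # Accept the word if any available dictionary recognises it
    return any(d.check(word) for d in dicts)

print(is_known_word("Indigendus"))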

@tshrinivasan (Member Author)

A non-traditional solution will be to use a large language model (LLM) or a small language model (SLM) like Phi3. I find it perfect for anything to do with language.

I checked with GPT and this prompt works:


The following string is a non-unicode representation using a custom font of a non-english language. The string has mixed non english, english and latin words in it. Extract the english and latin words in this format:

english_and_latin_words = ["word1", "word2"...]

The string : "nfhOk;gpYs;s MAu; Ntj kUj;Jtf; fy;Y}upapNy fw;W Diploma in Indigendus Medicine and Surgery gl;lk; ngw;W itj;jpa fyhepjpahshu;. ,q;F fy;tp fw;Fk; fhyj;jpy; tpupTiuahsu;fs; Nguhrpupau;fs; rf khztu;fs; KjypNahupd; ed;kjpg;ig ngw;wpUe;jhu;. Nguhrpupau; gzpf;fu; lhf;lu; ey;yehjd; lhf;lu; rghgjpg;gps;is KjypNahupd; ed;kjpg;Gg; ngw;w cj;jkhd khztdhf tpsq;fpdhu;."

Your Answer:

english_and_latin_words = ["Diploma", "in", "Indigendus", "Medicine", "and", "Surgery"]


You could use a small language model for lower cost by running it yourself. This is a very simple use case, and small models should work. I have not tested them for this, though.
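
For reference, a minimal sketch of running such a prompt against a local model with the ollama Python client (untested here; it assumes the ollama service is running and the phi3 model has already been pulled):

import ollama

# The full prompt quoted above, with the Bamini string appended
prompt = 'The following string is a non-unicode representation ... The string : "nfhOk;gpYs;s ..."'

response = ollama.chat(
    model="phi3",
    messages=[{"role": "user", "content": prompt}],
)
# The model's reply should contain the english_and_latin_words list
print(response["message"]["content"])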

BTW, we make it super easy to implement the above and deploy it as an API with our project "Unstract":
https://github.com/Zipstack/unstract


Received the above solution from the ILUGC mailing list. Thanks, Arun Venkataswamy.

@tshrinivasan (Member Author)

Thanks, Arun, for the idea.

On my CPU-only desktop (4 vCPUs), Ollama with Phi3 hit 100% CPU within a few seconds. I killed it and turned off the Ollama service to stop the machine from rebooting.
