Home
Platform and use case examples
Unsupervised machine learning aided language modelling
Technical Guide and Documentation
In a nutshell, it is about natural language processing. Alice stands for 'A Language Interpreter as semantiC Experiment'. As it's stated in the abstract of Yacc: "Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an input language which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers."
Instead of learning such program-specific input languages, I'm trying to build a reusable library that acts as a human interface (simply called 'hi') and turns natural language text into structured analyses. These can be used to extract all the information that is necessary to translate the text into a language that computers can understand. The translation process is carried out in two steps: first, morphological, syntactic and semantic analyses are carried out and their results are handed over to the caller/client program in a JSON structure; then the client needs to parse the analyses to generate a translation or to extract information for tagging and classification. Examples for different platforms/use cases are also provided in the project to demonstrate how it can be done.
Android: You can currently make phone calls or look for contacts in Hungarian and English (even offline if a dictionary is available for download for your choice of language) like:
list contacts with Peter
keress névjegyeket Péterrel
Check it out in Play Store in English
Check it out in Play Store in Hungarian
Javascript (browsers or Node.js): You can find an example on embedding the compiled js lib into a website which demonstrates how sentences about searching for a location on a map could be interpreted like:
show location of thai restaurants in erding
Check it out on the project page
Desktop: In this use case, file handling commands are interpreted -currently tuned for filtered file and directory listing using logical expressions like:
list symlinked or executable directories in directory abc that are not empty
Clone the project and build it on your desktop according to the How to build section.
I spent the last months improving the performance of prep_abl, which is used to preprocess the text corpus for the machine learning tool (ABL). As preprocessing a big text corpus required too much memory, I rewrote prep_abl to write files instead. The calculation of token paths for all possible combinations of the word analyses returned by the morphological analyser (foma) was originally designed for analysis (interpretation) only and not for generation. So calculating every possible token path for a big text corpus in the preprocessing phase for ml was also slow and used too much memory. Therefore, I started to redesign it so that only valid token paths get generated, thus spending neither time nor memory on invalid ones. Thanks to NG's idea about calculating a path for a unique path number in an array representation of a Cartesian product (see the int2indices() function in NG's solution or its reimplementation in tokenpaths.cpp in this project as tokenpaths::path_nr_to_indices), the performance has improved dramatically. In addition to that, prep_abl got multithreading support as well to make it even faster. I haven't had time to deal with anything else (like updating the Android or js clients) and testing is still in progress: so far mainly functional/performance tests have been carried out to make sure that both the machine learning scenario and the interpreter work as well as they did earlier. I'll now prepare a bigger text corpus to see how the ml processes cope with that.
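The idea of mapping a path number to one index per word can be sketched as a mixed-radix decoding. This is only an illustration in Python, not the code of tokenpaths::path_nr_to_indices or of NG's int2indices(); the digit order (last word varies fastest) and the argument names are assumptions.

```python
from math import prod


def path_nr_to_indices(path_nr, sizes):
    """Decode a path number into one analysis index per word.

    sizes[i] is the number of morphological analyses of word i.
    Assumption of this sketch: the last word varies fastest, like the
    rightmost digit of an ordinary number.
    """
    total = prod(sizes)
    if not 0 <= path_nr < total:
        raise ValueError(f"path_nr must be in [0, {total})")
    indices = []
    for size in reversed(sizes):
        # The remainder is the index for this word, the quotient carries on.
        path_nr, index = divmod(path_nr, size)
        indices.append(index)
    indices.reverse()
    return indices


# Two words with two analyses each give four token paths.
sizes = [2, 2]
paths = [path_nr_to_indices(n, sizes) for n in range(prod(sizes))]
```

Because each path is computed from its number alone, no list of all combinations has to be kept in memory, and ranges of path numbers can be handed to different threads.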
With the last few commits, an initial machine learning integration has landed as part of the project. The machine learning framework and the ways you can use it are described in the Unsupervised machine learning aided language modelling section. To test the generated grammar, the test tools have been extended as well, which is described in the corresponding Tests (NLTK based) section. As mentioned, it is for generating the grammar for a given corpus, but having a foma fst for the language of the corpus is a prerequisite. It also won't give you the semantics, so there are still some modelling tasks that you need to do yourself. Theoretically, using a machine learning tool for morphology like Morfessor would be an option, but you'd still need to find a way to create a foma fst out of that (or extend hilib to handle other morphological analyzers besides foma). Concerning semantics, the dependency graphs of certain ml/ai platforms may prove useful, at least for setting up the DEPOLEX content (DEPendencies-Of-LEXemes, exactly as its name suggests), but the rule-to-rule-map still has to be maintained to connect the grammar to the semantics, as I doubt that there's any way to generate it. Anyway, until the project gets far enough to generate everything out of the box, you can use the new ml_tools make target to build the tools you need for machine learning. Of course, you'll first need to build the Alignment Based Learning framework itself.
Managed to get rid of numbering the tokens manually in the GCAT db table, which has its advantages and disadvantages. Fortunately, the disadvantages are only short term: there are some adjustments you need to make to already existing language models in the db. First, if you have NULLs in any of the key fields of the gcat db table, you must put a value there. It was never a good idea to allow NULL in a key, but sqlite allows it and so did I; that's now hardened. If you had NULL in the feature field, you need to put the value 'Stem' there, as NULL was interpreted as a lazy fallback for 'Stem' anyway. If your model handles constants or unknown words (concealed words in hilib terms), you'll need to make sure that you have an entry in gcat with the reserved 'CON' gcat and the 'Stem' feature with a token number greater than zero for the modelled language, so that a token gets generated for it in bison. Until now, you could have only one entry for constants/concealed words for all the languages in the db, but now it is language specific. So instead of the language independent t_Con token in bison, you'll have language specific ones like t_ENG_CON_Stem for English. This is another thing you have to adjust in your grammar, i.e. replace t_Con with a language specific symbol. The good news is that you don't need to number your terminal tokens manually any more. You don't even need to change anything in the token column of the gcat db table: the parser generator (gensrc) generates a token in the bison source for every value greater than zero, which was the case for the token numbers anyway, except for the entry with the reserved gcat 'CON', as that had to be zero. (That's why I mentioned that any such entry needs a token number greater than zero.)
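The two db adjustments could look roughly like the sketch below, run here against a throwaway in-memory table. The column names (gcat, feature, lid, token) and the sample rows are assumptions made for illustration; check hi_db.sql for the real schema before adapting anything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the gcat table; the real one has more columns and constraints.
conn.execute("CREATE TABLE gcat (gcat TEXT, feature TEXT, lid TEXT, token INTEGER)")
conn.executemany(
    "INSERT INTO gcat VALUES (?, ?, ?, ?)",
    [
        ("N", None, "ENG", 1),        # NULL feature, formerly read as 'Stem'
        ("CON", "Stem", "ENG", 0),    # constants entry with the old zero token
    ],
)

# 1. NULL was a lazy fallback for 'Stem', so make it explicit.
conn.execute("UPDATE gcat SET feature = 'Stem' WHERE feature IS NULL")

# 2. The reserved 'CON' entry needs a token value greater than zero.
conn.execute(
    "UPDATE gcat SET token = 1 "
    "WHERE gcat = 'CON' AND feature = 'Stem' AND token = 0"
)

rows = conn.execute(
    "SELECT gcat, feature, lid, token FROM gcat ORDER BY gcat"
).fetchall()
conn.close()
```

The same two UPDATE statements can of course go straight into a content sql file instead of being run from Python.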
Finally added a makefile to the project which aims to be cross platform. I have not yet tested it on Linux, only on my own NetBSD but both with (bsd)make and gmake (i.e. the gnu make). First I tried to get the job done by writing a cmake script which worked but I just didn't like it so threw it away and wrote a posix makefile. As indicated in the 'How to build' section as well, for the time being I keep the build scripts I used till now just to give the makefile some time to prove that it works fine. I built a help target for the makefile so until I write the documentation for it here, you can only rely on that which you can simply invoke by typing 'make help'. Another small step forward is switching over the c bison parser to c++ which turned out to be simpler than I thought. As a kind of heads up: I'll now try to get rid of numbering the tokens which means that in case of success, the hi_db.sql db schema will need to be changed as the corresponding column will be removed from the GCAT db table.
I didn't want to add more features until the first release is out, but I could not avoid doing so, mainly because I wanted to make sure that the logical operators (not, and, or) can be modelled not only in English but in Hungarian as well, which turned out to be a bit more difficult than in English:) So I had to add support for bison operator precedence and context dependent precedence. This means that the GCAT table got extended with two fields (precedence and precedence_level) for operator precedence support, which makes previous model db-s incompatible, so you'll have to adjust your content sql files and rebuild your db. The technical documentation does not yet reflect these changes, so please refer to the example contents and the db schema file (hi_db.sql). The GRAMMAR table had to be extended as well with a new field (called precedence) to support context dependent precedence. Last but not least, the RULE_TO_RULE_MAP table got extended with two fields (main_set_op and dependent_set_op), which make it possible to carry out set operations on the set of symbols collected for the main or the dependent nodes. Note that set operations are applied to two sets of symbols collected for the same node, so don't expect that you can e.g. merge (make a union of) the symbols collected for the main node and those of the dependent node. This leads to simpler syntactic models, as in many cases a syntax rule had to be introduced just to declare a new symbol which you could use as a restriction during the semantic validation.
A small improvement worth mentioning is that I introduced the END symbol for $end (end of file) so that you can use it in your grammar. If you want to use it, just add an entry to the SYMBOL table with the value "END" as symbol and the necessary language id (lid).
I also introduced some tests (based on NLTK) to be able to check the language models, of which you can read more at the Tests section.
Another new functionality that makes functor development easier is that the gensrc tool can now copy the functor definitions from files into the db. You may have already noticed the functors subdirectories in the platform specific directories where the shell script and javascript files reside. The content of a functor implementation file gets copied into the definition field of FUNCTOR_DEFS if a file can be found with the name you put in the definition field (without quotes) within the directory you specify as the fourth parameter when invoking gensrc. Putting the value in quotes means that the definition field is left untouched. As a corollary, you'll need to invoke gensrc each time you rebuild your db.
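The lookup rule described above can be sketched as follows. This is a guess at the behaviour written in Python, not the gensrc source; the function name, the handling of both quote styles and the sample file name are assumptions.

```python
import tempfile
from pathlib import Path


def resolve_definition(definition, functor_dir):
    """Return what should end up in the definition field.

    Assumed rule: a quoted value is a literal and stays untouched; an
    unquoted value is treated as a file name inside functor_dir and is
    replaced by the file content if such a file exists.
    """
    quoted = (
        len(definition) >= 2
        and definition[0] == definition[-1]
        and definition[0] in "'\""
    )
    if quoted:
        return definition
    candidate = Path(functor_dir) / definition
    if candidate.is_file():
        return candidate.read_text(encoding="utf-8")
    return definition


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical functor implementation file.
    (Path(tmp) / "list.sh").write_text("ls -l\n", encoding="utf-8")
    from_file = resolve_definition("list.sh", tmp)
    literal = resolve_definition("'list.sh'", tmp)
    missing = resolve_definition("nosuch.sh", tmp)
```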
What got pushed yesterday is the result of squeezing out of the framework what I wanted to achieve in the last few years: introducing logical operators (not, and, or). It took a long time to fix all the bugs that got in the way of achieving this and I must admit that it's still not perfect but at least it's possible now. This time only the desktop client got updated to make use of this feature but as the library supports it now you can build any other clients to make use of it as well. The tricky thing was to preserve the logical order of the dependencies involved in a logical expression. In a sentence like:
list symlinked or executable directories in directory abc that are not empty
this is not at all trivial. Don't ponder much about the adjectives used here; I just wanted something to play with and the file flags came in handy. The functor implementation is done in posix shell script as usual, but as I'm not a shell script guru, I'm pretty sure that there are many bugs in the functors besides the ones I know of, for which I'll create some issues. The list of words that can be used in the desktop client finally got updated below. The biggest challenge currently, when building a model that accepts logical operators while preserving the logical order of dependencies, is that the framework can only bind a dependency to its direct ancestor, e.g.:
list directories that are not empty and symlinked
had to be modelled in depolex so that the verb "are" (be) has two dependencies: "not" and a logical group of adjectives describing directories (like "empty" and "symlinked"), which I called "dirbeprop". At the same time, "not" also has "dirbeprop" as its dependency. This is necessary because once the interpreter finds "are" and looks for its dependencies, it'll find "not" and bind it to "are". Then it finds "empty", which gets bound to "not". (I'm ignoring the "and" operator here in order not to complicate matters.) When the interpreter then finds "symlinked", it can only bind it to "are" if "are" is its direct ancestor in the dependency hierarchy (depolex). Suppose I had modelled it so that "dirbeprop" was a dependency of "not" only, with "not" marked as optional under "are". Then, once "not" was found for "empty", "symlinked" could not be bound to "are" across "not": "are" would be two levels higher in the hierarchy, and as "symlinked" isn't negated, it doesn't get bound to "not" either. So the interpreter currently cannot bind dependencies across levels, but for the time being I treat this as a feature rather than a bug, as there's a workaround and the solution is not trivial. This means that I had to apply the same strategy wherever negation popped up, so I had to create negative and positive paths for the dependencies of each logical operator. It does work but I don't know yet how it scales. Nevertheless, I'll make use of the logical operators as soon as possible in the Android client and the js client as well.
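The direct-ancestor restriction can be illustrated with a toy lookup. The dictionaries below only mimic the two modelling variants discussed above; they are not the depolex table layout, and can_bind is a hypothetical helper, not a hilib function.

```python
# Which logical group a word belongs to (toy data for this example only).
WORD_GROUP = {"empty": "dirbeprop", "symlinked": "dirbeprop", "not": "not"}


def can_bind(depolex, head, word):
    """True only if the group of `word` is a direct dependency of `head`."""
    return WORD_GROUP[word] in depolex.get(head, set())


# Variant that works: "dirbeprop" hangs both on "are" and on "not".
with_direct_link = {"are": {"not", "dirbeprop"}, "not": {"dirbeprop"}}

# Variant that fails: "dirbeprop" is reachable from "are" only through "not".
only_via_not = {"are": {"not"}, "not": {"dirbeprop"}}

negated_ok = can_bind(with_direct_link, "not", "empty")
plain_ok = can_bind(with_direct_link, "are", "symlinked")
plain_blocked = can_bind(only_via_not, "are", "symlinked")
```

In the second variant "symlinked" has nowhere to go: it is not negated, and "are" does not list its group directly, which is exactly why the positive and negative paths both have to be modelled.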
The android client got also an update to support 64 bit libraries as it was anyway a must due to google pushing this. So I'll need to bring this change into the playstore as well otherwise they'll kick the app out soon. Let's see if I manage to add some new features at the same time like searching in the contact list with conditions to make use of the logical operators.
Nothing peculiar, just creating an entry to mark the commit of today as RC1 alpha:) At least, I consider it feature complete so no new features are in the pipeline for the first release unless it turns out to be inevitable. The features added now are: support for more than one target language, functor tags, syntactic analysis and analysis switches.
By supporting more than one target language one can implement a functor in different languages. The language chosen as target must be passed over when calling the interpreter.
Functor tags are useful when parsing the semantic analysis you get back from the interpreter. E.g. you may want to tag some functors of verbs like 'go', 'drive', etc. representing a certain type of activity as 'navigate'. To achieve this, you'll need to add entries in the FUNCTOR_TAGS db table with your choice of tag-value pairs. You can add any number of tag-value pairs to a functor and in case of providing a trigger tag, they'll only be added if the trigger tag is present in the feature set of the morpheme belonging to the functor at the time of creating the analysis.
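How the trigger tag decides whether a tag-value pair is added can be sketched like this. The row keys (functor, tag, value, trigger_tag), the functor names and the feature names are assumptions for the example, not the actual FUNCTOR_TAGS columns.

```python
def tags_for(functor, morpheme_features, functor_tags):
    """Collect the tag-value pairs that apply to one functor.

    functor_tags mimics rows of the FUNCTOR_TAGS table; the key names
    are assumptions made for this sketch.
    """
    result = {}
    for row in functor_tags:
        if row["functor"] != functor:
            continue
        trigger = row.get("trigger_tag")
        # Without a trigger the pair always applies; with one, the trigger
        # must be among the features of the morpheme behind the functor.
        if trigger is None or trigger in morpheme_features:
            result[row["tag"]] = row["value"]
    return result


rows = [
    {"functor": "GO", "tag": "activity", "value": "navigate", "trigger_tag": None},
    {"functor": "GO", "tag": "mood", "value": "command", "trigger_tag": "Imp"},
    {"functor": "DRIVE", "tag": "activity", "value": "navigate", "trigger_tag": None},
]
plain = tags_for("GO", {"Stem", "V"}, rows)
triggered = tags_for("GO", {"Stem", "V", "Imp"}, rows)
```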
Support for syntactic analysis finally got added, so now you can get that from the interpreter as well, along with the morphological and semantic analyses.
Adding switches for different types of analyses makes it possible to get only the type of analysis you want. It's not simply about creating the requested type of analysis at the end: whenever possible, the implementation only executes the code that's necessary for the requested analysis. E.g. when requesting only morphological analysis, no syntactic or semantic analysis will be carried out. Similarly, when requesting morphological and syntactic analysis, no semantic analysis is carried out at all. Of course, requesting semantic analysis implies carrying out morphological and syntactic analyses as well.
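The implication between the switches can be written down as a small sketch. The Analysis flag and stages_to_run are hypothetical names, not the hilib interface, and the rule that a syntax-only request also runs the morphological stage is my assumption (the text only states it for semantics).

```python
from enum import Flag, auto


class Analysis(Flag):
    MORPHOLOGY = auto()
    SYNTAX = auto()
    SEMANTICS = auto()


def stages_to_run(requested):
    """Expand the requested analyses with the stages they depend on."""
    stages = requested
    if Analysis.SEMANTICS in stages:
        stages |= Analysis.SYNTAX
    if Analysis.SYNTAX in stages:
        # Assumption: syntax cannot run without the morphological stage.
        stages |= Analysis.MORPHOLOGY
    return stages


def stage_names(stages):
    """List the stages in pipeline order, for display."""
    return [member.name for member in Analysis if member in stages]


only_morphology = stage_names(stages_to_run(Analysis.MORPHOLOGY))
semantics = stage_names(stages_to_run(Analysis.SEMANTICS))
```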
No client code has been updated but I'll focus more now on preparing the project for a release instead of adding new features:)
Today's commit is a huge one -not only considering its size. Unfortunately though, I didn't have time to fix all issues for all clients as I was mainly focusing on the library itself and the Android client of which the Hungarian version can now be accessed in Play Store to check out the details or for alpha testing. Concerning the library, I made the design decision to take a step back in order to be able to take two forward in a different direction: I abandoned supporting different transcriptors inside the library and left only one for JSON. This does not mean that the functors cannot be implemented in any language, it only means that no executable code will be assembled by the library. Instead of that, an analysis will be returned in a JSON structure. The analysis needs to be parsed by the platform specific clients and create the executable code out of that. As mentioned, I was focusing this time on Android, so you can find an example implementation of such a client in the hi_android folder.
Another important improvement is the database design hardening, though it's rather a hardening of the implementation: many foreign key constraints were not turned on until now, only because the SQLite version (3.8) that supports e.g. partial indexes came with Android API level 21.
There's also a significant performance improvement due to adding a cache to the lexer and restructuring the process of analysis so that tokenization is completed before the syntactic analysis starts. This makes it possible to build in some switches influencing the type of analysis to be carried out, be it morphological, syntactic or semantic. However, this is not yet complete. Another thing this change paves the way for is the parallel processing of different token paths. As one word may have more than one morphological analysis, it can easily happen that if A and B are words both having two morphological analyses, then A1-B1, A1-B2, A2-B1 and A2-B2 all need to be analysed. Now that the morphological analyses are done before the syntactic analysis starts, a separate thread can be started for each token path built from the different morphological analyses.
From now on, the interpreter will traverse all possible token paths that can be constructed from the morphological analyses, reporting all successful interpretations and all emerging errors. This means that if there is more than one successful interpretation, the caller has to decide which of them is more relevant.
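Traversing every token path and collecting both the successes and the errors might be pictured as below. analyse_path is a made-up stand-in for the syntactic and semantic analysis of one path (it simply rejects one combination), and the thread pool only illustrates the idea of processing paths independently; it says nothing about how hilib schedules its threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def analyse_path(path):
    """Hypothetical analysis of one token path; fails for one combination."""
    if path == ("A2", "B1"):
        raise ValueError("syntax error")
    return "-".join(path)


def try_path(path):
    """Run the analysis and turn the outcome into a (status, text) pair."""
    try:
        return ("ok", analyse_path(path))
    except ValueError as err:
        return ("error", f"{'-'.join(path)}: {err}")


# Words A and B with two morphological analyses each.
analyses = [["A1", "A2"], ["B1", "B2"]]
paths = list(product(*analyses))

with ThreadPoolExecutor() as pool:
    # map() keeps the results in the order of the submitted paths.
    results = list(pool.map(try_path, paths))

successes = [text for status, text in results if status == "ok"]
errors = [text for status, text in results if status == "error"]
```

With more than one entry in successes, it is up to the caller to pick the interpretation that fits best.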
Partial analysis has been enabled as well, so you can still get a full morphological and semantic analysis even if the sentence is ungrammatical. This makes it possible for the client to figure out what the sentence could have meant. E.g. 'list contacts with Peter' is a grammatical sentence, unlike 'list contact with Peter', but as you get back the functors along with their dependencies in the analysis, the client can try to solve the puzzle. Check out the Android client for details.
Another new feature makes it possible to bind lexemes that don't exist in the lexicon to a functor by their grammatical categories. This solves e.g. the problem of interpreting numbers, as they don't need to be enumerated in the lexicon to assign them to a functor. Furthermore, if neither a lexeme is registered for a stem in the lexicon nor a dependency entry in the depolex table, it poses no problem any more unless the node of the stem is combined. Along with that, I made the usage of foma files easier by getting rid of the [lfea] and [gcat] tags. So if you want to use an existing foma fst with this interpreter, you only need to change the rules that access the stems to add the [stem] tag.
Last but not least, a few words about the clients. The desktop client currently only prints the JSON analysis so that you can verify it. The morphological model has been updated to get rid of the [lfea] and [gcat] tags, but neither the syntactic nor the semantic part has been touched or tested thoroughly. Roughly the same can be said about the JS client; however, the github pages now run the code of this commit along with all the bugs that may occur. I'll fix/improve these clients as soon as possible. The Android client got a major update, making the project gradle aware so that it can be developed in Android Studio 3.0. It can be considered a demo of all these changes. The English part is not that useful IRL though, as you can only ask your phone to list your contacts matching a certain name using the words: list, contact(s), with, name. E.g. 'list contacts with (name) peter' or just 'list contacts'. You can now use it offline as well, though, since the EXTRA_PREFER_OFFLINE option got introduced for the speech recognizer intent in API level 23, which is therefore now the minimum API level (i.e. Android version 6.0) required by the app. The Hungarian part is a bit more useful, as you can make real phone calls by using the words: hív(d), fel, a, az. E.g. 'hívd fel Pétert', 'hívd Pétert', 'hívd az orvost', 'hívd a 112-t', 'hívd fel a 00 36 1 234 56 78-at', etc. If more than one number is found for the specified contact name, the client gives you a list of numbers and you can choose which of them to call simply by saying the sequence number assigned to the phone number in the list, like: 'hívd az elsőt', 'hívd az utolsót', 'hívd a másodikat' or just 'a harmadikat'. This also works with names if google speech recognition delivers a wrong result.
If you already had a successful interpretation but the contact name did not match any of your contacts, you just have to repeat the name, part of the name or a spelled variant and the client will try to figure out a match in your contacts.
Most of the build scripts have been reworked so that some arguments can be passed to them. Another major change is that the design efforts finally bear fruit: I could write a simple program called gensrc that generates the bison source from the grammar db table, where you only need to provide the syntactic rules as if they were usual A->B C or A->B bison rules but without any coding. So now those who don't like coding can model a language more easily, but in order to add linguistic features at runtime (like the obligatory main_verb symbol), you'll still need code snippets that take care of it. I also adjusted the documentation slightly on the main wiki page to reflect these changes, see the "Modelling a language" and "How to build" sections. Another important thing got improved in the meantime, namely the runtime error reporting, which now provides better error messages about missing/inconsistent model configuration in the db file, replacing the old slothful behavior of simply quitting with exit_failure.
Changed the transcriptor for js so that instead of real js, now only a JSON structure is generated. I wouldn't drop the possibility of generating js code directly either, but for online use cases the JSON structure seems to be more suitable, just like api.ai does it. So now you can check out on the github pages of the project what you get back if you submit any of these examples (where 'abc', 'def', 'erding', 'thai' are not part of the dictionary, just constants, so you can use whatever you like there):
- show abc
- show restaurant abc
- show thai restaurants
- show abc in def
- show location of thai restaurants
- show location of thai restaurants in erding
- show location of restaurant abc
- show location of abc
- show location of abc in erding
- show location of restaurant abc in erding
- show thai restaurants in erding
- show restaurant abc in erding
- show restaurants in erding
- show location of restaurants
- show location of restaurants in erding
- show restaurants
If you copy the JSON result from the browser's debugger console, you can simply validate it at e.g. http://jsonlint.com
Adding support for javascript thanks to emscripten. You can even try it on the github pages of the project. The words you can use are: show, restaurant, location, of, in. Even though it's a handful of words you can already create such sentences like 'show locations of thai restaurants in madrid' or just 'show restaurants in paris' or even just 'show mcdonalds'. It spits out an alert with a generated javascript translation of the command but without any functors being implemented so currently you won't get anything useful except the process of analysis in the browser's debugger console:)
Multiple language support has been validated on Android by adding a small Hungarian lexicon and morphological analyzer derived from the comprehensive work of Eleonora, whose project Hunmorph-foma is hosted by me among my projects. Thanks to that, one can now make phone calls by saying 'hívd fel ...t' where ... is the name of the person:) There are of course many things to improve but the multiple language support seems to work.
Improved error handling: until now the interpreter gave back either a string with an executable script or nothing. For the time being, if it cannot interpret the input, it gives back a string containing something like 'interpreted phrase/stuck at word'. Besides that, you get feedback about the error from bison as well on the standard output.
And finally: hopefully managed to commit more bugfixes than new bugs:)
Adding support for Android. Check out the Android crosscompile steps file for details in the corresponding folder. Only a smoke test has been done on my own phone with Android 4.2.2. It accepts only the following voice commands in English: "list contacts", "list contacts with <name>" and "list contacts with name <name>". Instead of executing shell scripts and native executables on Android, which is actually possible but not the most convenient, javascript and java shall be used to implement the functors. Technical details about developing on Android are described here, especially in the section 'Binding JavaScript code to Android code': https://developer.android.com/guide/webapps/webview.html#BindingJavaScript
Have fun:)
Translation capability is back in the framework and added support for defining relative clauses with the restriction that they can only have an auxiliary but no main verb so the following sentence can now be interpreted: "list files that are in directory abc".
Managed to screw up the whole repository and had to restore it from the available versions so the changes between the initial commits are not that gradual any more and the commit dates of course don't reflect the past dates when the changes were originally committed.
Since the first commit the framework has changed a lot and the current version is not capable of translating the commands into shell scripts but simply validates their feasibility according to the model set up in customizing. However, the feature will be back as soon as possible.
Support for morphosyntactic rules has been added using the foma library. In addition to it, a second goal has been set up and already partially achieved (while breaking the translation capability for the time being): providing a reusable framework either for development or educational purposes in natural language processing.
Moving to C++.
Uploading the initial version of the interpreter which was the thesis for my degree done in programming.
This software is provided "as is" and without any warranty. Use it at your own risk under the license of GPL v3.
This software does not store your data.
The interpreter itself is relying on the following components: a lexical analyzer, a syntactic parser (call it parser), a morphological analyser, a semantic parser (call it interpreter) and a database connector. The lexical analyzer used to be built with Lex but later got replaced with a hi specific function, the morphological analyser is built with foma, the syntactic parser is built with bison while the database connector is based on SQLite3.
The lexical analyzer scans the input and validates the words in it against the stems returned by foma, which are checked against the lexicon, i.e. the dictionary in the database. The parser validates the input against syntactic rules while the interpreter checks the semantics and translates the command.
The library itself contains only one function, to which one needs to pass the command; it returns the analyses in a JSON structure, which the client must parse to assemble the corresponding shell script (for browsers and Android: javascript). To execute the shell script, I created a main program which takes the commands from the CLI, passes them to the mentioned library function and executes the shell script in a child process. This way it really feels like talking to a machine. One simply invokes the executable by typing 'hi' and after hitting enter the command can be formulated in English.
The interpreter is developed and tested nowadays on NetBSD (Linux would fit the bill as well though) as I don't really have time to get all the development environment tools (e.g. android ndk, emscripten) working on Minix3 which I used earlier. The development language has changed in the meantime from C to C++. However, the C heritage can still be seen here and there. The shell scripts are aimed to be POSIX compliant but it may not always be the case -which is then considered as bug.
This is highly experimental but finally I prepared an end to end toolchain to integrate it. The machine learning tool is that of the Alignment Based Learning. What I did in addition is that there's a preparation step that breaks down the words in a text corpus (without punctuation currently) first by foma to morpheme tags based on the foma fst assigned to the language in the language model db. You can invoke it as:
prep_abl /path/to/dbfile.db /path/to/abl/training/corpus <language id> /path/to/output/file/name
The first parameter is the language model db file having the content prepared in the tables LANGUAGES, GCAT, SYMBOLS, LEXICON and the minimum required data by the foreign key constraints in other tables. However, for the machine learning phase you don't need to have any other syntax or semantics related content. The language id is one of the ids you have in your language model db available which at the same time identifies the foma fst as well. The output of prep_abl is the corpus for the training. You need to feed the ABL tool with that corpus, containing the morpheme tags (instead of the words) in the right sequence.
Of course, you need to know how to use the ABL tools but that project (see link above) has a nice documentation and the tools have a short help as well. However, for the impatient, this is how I invoked the ABL commands for my test corpus:
abl_align -a a -p b -e -i /path/to/corpus/file -o /path/to/corpus/file/name/aligned
abl_cluster -i /path/to/aligned/corpus/file -o /path/to/corpus/file/name/clustered
abl_select -s b -i /path/to/clustered/corpus/file -o /path/to/corpus/file/name/selected
Once you have done the training, you need to put the rules learned from the corpus into your language model db. As ABL does not provide the grammar rules directly, I had to write a postprocessing tool that extracts the rules from the ABL output. You need to invoke it as:
proc_abl /path/to/abl_select/output/file <language id> [/path/to/dbfile.db]
The parameters are pretty obvious to use I guess. The language model db is optional in order that you can make test runs without writing the grammar rules and symbols in the db.
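For convenience, the whole toolchain can be chained from a small script. The sketch below only assembles the command lines shown above in the same order; the intermediate file names inside the work directory, the sample language id 'ENG' and the helper names are assumptions, and nothing is executed until run_all is called.

```python
import subprocess


def abl_commands(db, corpus, lid, workdir, write_db=False):
    """Build the command lines of the toolchain, from prep_abl to proc_abl."""
    # Intermediate file names are arbitrary choices of this sketch.
    prepared = f"{workdir}/prepared"
    aligned = f"{workdir}/aligned"
    clustered = f"{workdir}/clustered"
    selected = f"{workdir}/selected"
    proc = ["proc_abl", selected, lid]
    if write_db:
        # Passing the db makes proc_abl write the rules and symbols into it.
        proc.append(db)
    return [
        ["prep_abl", db, corpus, lid, prepared],
        ["abl_align", "-a", "a", "-p", "b", "-e", "-i", prepared, "-o", aligned],
        ["abl_cluster", "-i", aligned, "-o", clustered],
        ["abl_select", "-s", "b", "-i", clustered, "-o", selected],
        proc,
    ]


def run_all(commands):
    """Run the commands one after the other, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = abl_commands("model.db", "corpus.txt", "ENG", "/tmp/abl")
```

Leaving write_db at False gives a test run that does not touch the language model db.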
As mentioned in the beginning, this is highly experimental, so there's a lot of room for improvement: e.g. the machine generated symbols for the rules are pretty hard to read, there's no conversion to right recursion, etc. Besides all that, this will NOT give you the semantics. That you'll still need to write yourself.
If you want to test the grammar the machine built, you can do so by using the test tools: stex and stax. Ideally, you should get back the sentences from stax which you have in the corpus you used in the abl preparation step.
If you'd like to create your own model for a language, you'll need to think through the following:
- Phonology
- Morphology
- Lexicon
- Grammar (syntax)
- Semantics
You can maintain your own rules for all of those. Unfortunately, documentation is lagging behind, but you can ask for help either by email or by creating an issue describing your problem, and I'll try to help. You can find some technical help in the technical documentation, but it's not always up to date either, so as usual the best way is to browse the source, especially the hi_db.sql file and, for modelling, the content sql files created for the different platforms.
The rules for phonology and morphology belong to foma, so please check the technical documentation for some examples and for links to the original foma documentation in order to create your own morphological analyser. There are two analysers in development that I usually use: a Hungarian and an English one. Depending on the target, the morphological analyser can be built as:
desktop:
make desktop_fst DESKTOPFOMAPATH=/path/to/your/foma/file DESKTOPLEXCFILES=/your/lexc/files/directory
Android:
make android_fst ANDROIDFOMAPATH=/path/to/your/foma/file ANDROIDLEXCFILES=/your/lexc/files/directory
javascript:
make js_fst JSFOMAPATH=/path/to/your/foma/file JSLEXCFILES=/your/lexc/files/directory
The grammar rules can either be coded manually in a bison file, as shown in the corresponding section of the technical documentation, or entered as syntactic rules in the grammar db table in your content sql file. For each rule you enter the language id for which the rule is relevant, the parent symbol, the head symbol and the non-head symbol, as if it was a bison rule like: A->B C. To add a linguistic feature to a node (like main_verb), put the corresponding action code snippet in the action field of the rule's grammar table entry.
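To illustrate how a grammar table entry corresponds to a bison rule of the form A->B C, here is a small Python sketch using an in-memory SQLite table. The table layout and the column names (`lid`, `parent_symbol`, `head_symbol`, `non_head_symbol`, `action`) are assumptions made for this example, as is the language id value `ENG`; the real schema is in hi_db.sql. The action text is just a placeholder, not a real snippet.

```python
import sqlite3

def bison_rules(conn, lid):
    """Render every grammar row of one language as a bison-style rule string."""
    rows = conn.execute(
        "SELECT parent_symbol, head_symbol, non_head_symbol, action "
        "FROM grammar WHERE lid = ? ORDER BY rowid", (lid,))
    rules = []
    for parent, head, non_head, action in rows:
        # A rule with no non-head symbol is a unit rule like A->B.
        rhs = " ".join(symbol for symbol in (head, non_head) if symbol)
        body = f" {{ {action} }}" if action else ""
        rules.append(f"{parent} : {rhs}{body} ;")
    return rules

# Hypothetical table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grammar (lid TEXT, parent_symbol TEXT, "
             "head_symbol TEXT, non_head_symbol TEXT, action TEXT)")
conn.executemany("INSERT INTO grammar VALUES (?, ?, ?, ?, ?)", [
    ("ENG", "ENG_Vbar1", "ENG_V", "ENG_NP", "/* action snippet goes here */"),
    ("ENG", "ENG_NP", "ENG_CNP", None, None),
])
print("\n".join(bison_rules(conn, "ENG")))
```

The head symbol is printed first and the non-head symbol second, following the A->B C order described above.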
After you've created your content sql file with the lexicon, grammar, semantic rules, etc. you have to create a db file from it as follows:
desktop (supposing your mycontent.sql is in the build/hi_desktop subdirectory of the project directory):
make desktop_parser_db NATIVEPARSERDBNAME=mymodel.db NATIVEPARSERDBCONTENT=build/hi_desktop/mycontent.sql
Android (supposing your mycontent.sql is in the build/hi_android subdirectory of the project directory):
make android_parser_db ANDROIDPARSERDBNAME=mymodel.db ANDROIDPARSERDBCONTENT=build/hi_android/mycontent.sql
javascript (supposing your mycontent.sql is in the build/hi_js subdirectory of the project directory):
make js_parser_db JSPARSERDBNAME=mymodel.db JSPARSERDBCONTENT=build/hi_js/mycontent.sql
Now, you can generate the bison source from your db file:
desktop (with mymodel.db in build/hi_desktop):
make desktop_bison_parser NATIVEPARSERDBNAME=mymodel.db
Android (with mymodel.db in build/hi_android):
make android_bison_parser ANDROIDPARSERDBNAME=mymodel.db
javascript (with mymodel.db in build/hi_js):
make js_bison_parser JSPARSERDBNAME=mymodel.db
If you have action snippets and functor implementations you can also pass their location in the following parameters for the corresponding target:
DESKTOPACTIONSNIPPETS, DESKTOPFUNCTORPATH
ANDROIDACTIONSNIPPETS, ANDROIDFUNCTORPATH
JSACTIONSNIPPETS, JSFUNCTORPATH
Once you have your foma fst file, db file and bison source, you can build your own interpreter out of these.
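The three model building steps (fst, db file, bison source) differ between targets only in the make target and variable names. The following Python sketch collects those names, copied from the commands above, and assembles the invocations for one target. The helper and its parameter names are hypothetical and not part of the project.

```python
# Make targets and variable names per platform, copied from the commands above.
TARGETS = {
    "desktop": ("desktop_fst", "DESKTOPFOMAPATH", "DESKTOPLEXCFILES",
                "desktop_parser_db", "NATIVEPARSERDBNAME", "NATIVEPARSERDBCONTENT",
                "desktop_bison_parser", "build/hi_desktop"),
    "android": ("android_fst", "ANDROIDFOMAPATH", "ANDROIDLEXCFILES",
                "android_parser_db", "ANDROIDPARSERDBNAME", "ANDROIDPARSERDBCONTENT",
                "android_bison_parser", "build/hi_android"),
    "js": ("js_fst", "JSFOMAPATH", "JSLEXCFILES",
           "js_parser_db", "JSPARSERDBNAME", "JSPARSERDBCONTENT",
           "js_bison_parser", "build/hi_js"),
}

def model_build_commands(target, foma_path, lexc_dir, db_name, content_sql):
    """Return the fst, db and bison make invocations for one target, in order."""
    (fst, foma_var, lexc_var, db, name_var,
     content_var, bison, build_dir) = TARGETS[target]
    return [
        ["make", fst, f"{foma_var}={foma_path}", f"{lexc_var}={lexc_dir}"],
        ["make", db, f"{name_var}={db_name}",
         f"{content_var}={build_dir}/{content_sql}"],
        ["make", bison, f"{name_var}={db_name}"],
    ]

for argv in model_build_commands("desktop", "/path/to/your/foma/file",
                                 "/your/lexc/files/directory",
                                 "mymodel.db", "mycontent.sql"):
    print(" ".join(argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from the project directory.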
A makefile is now available, but it's pretty bare bones, with no external dependency checks and a minimal target dependency setup. Until there is more documentation here, use the help target to find out more by simply typing:
make help
That will give you all the parameters that can be used for each target, with the dependencies listed at the end. There are only a few steps to get it up and running if the dependencies are installed on the target system:
Desktop:
make desktop_parser
make shared_native_lib
make desktop_client
- Now you have an executable by default in build/hi_desktop called 'hi' which interprets the text input entered.
Android (requires Android NDK):
make android_parser
make arm32_lib NDK32BITTOOLCHAINDIR=/your/32bit/android/NDK/toolchain/directory
make arm64_lib NDK64BITTOOLCHAINDIR=/your/64bit/android/NDK/toolchain/directory
- Build your Android project that links the library file. Please refer to the hi_android directory, which contains an example project that you can import directly using Android Studio. If you'd like to replace the library file in the example with the one you compiled, copy yours into the hi_android/hi/app/src/main/jniLibs/arm64-v8a or hi_android/hi/app/src/main/jniLibs/armeabi-v7a directory.
Javascript (requires Emscripten):
make js_parser
make embedded_js_lib EMSCRIPTENDIR=/your/emscripten/directory
- Now you have a js file by default in build/hi_js/embedded/ which you can use in the index.html file after modifying it according to your needs.
Testing a language model is generally pretty difficult, as grammars can generate a huge number of sentences even for a relatively small set of words. For the time being, the only thing I could come up with is to make use of NLTK's sentence generation capabilities, which means that you'll need Python 3.6 and NLTK installed. There are currently two tools in the tests directory: stex and stax. stex is a wrapper around NLTK that generates every possible sentence structure with the terminal symbols of the word forms specified. I'd recommend restricting the generation by depth instead of by number of sentences, since as soon as the generator runs into a recursive rule (which may even happen in the first sentence) it ends up in an infinite loop and crashes. As a hint, when I was testing the grammar for the desktop use case for file/directory listing using logical operators, the syntax trees built for such sentences reached a depth of at least 10, so below that I did not get any results. Once you're satisfied with the result set, redirect the output to a file to feed the other tool, stax. Invoke stex like:
`./stex /path/to/dbfile.db <language id> <sentence nr limit>n|<tree_depth>d list,of,all,wordforms,to,be,generated`
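To see why a depth limit copes with recursive rules, here is a dependency-free Python sketch of depth-limited sentence generation. It is neither stex nor NLTK, only an illustration of the idea; the toy grammar is loosely based on the noun phrase rules of the English model, and the way depth is counted here is an assumption that need not match how NLTK counts it.

```python
from itertools import product

# Toy grammar: each nonterminal maps to its alternative right-hand sides.
# CNP -> A CNP is recursive, so unrestricted generation would never finish.
GRAMMAR = {
    "NP": [["CNP"], ["QPro", "CNP"]],
    "CNP": [["A", "CNP"], ["N"]],
    "QPro": [["all"]],
    "A": [["empty"]],
    "N": [["files"]],
}

def generate(symbol, depth):
    """Yield every word sequence derivable from `symbol` within `depth` rule applications."""
    if symbol not in GRAMMAR:
        yield [symbol]          # a terminal costs no depth
        return
    if depth == 0:
        return                  # depth budget used up: abandon this branch
    for rhs in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol with one level less of depth.
        expansions = [list(generate(s, depth - 1)) for s in rhs]
        for combination in product(*expansions):
            yield [word for part in combination for word in part]

for sentence in generate("NP", 4):
    print(" ".join(sentence))
```

Because every rule application consumes one unit of depth, the recursion through CNP is cut off after a fixed number of steps, whereas a limit on the number of sentences alone gives no such guarantee.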
To make the stex output lines unique, there's now a small script which you can invoke as:
`remove_stex_output_duplicates.sh /path/to/stex_output_file`
The script will generate a file with the same name as the stex output suffixed with _unique. Stax simply takes the output of stex (preferably made unique):
`./stax /path/to/stex_output_file [/path/to/prep_abl/output/file]`
It generates the word forms from the terminal symbols (tokens) in the sentence structures so you'll get sentences that your grammar accepts for the given set of words. Optionally, if you provide the output of the abl preparation step as the second parameter, you'll get some basic statistics to show how the sentences generated based on the grammar rules induced by machine learning relate to the sentences used for the training.
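The de-duplication step and the comparison against the training corpus can be sketched in Python as follows. This is not the code of remove_stex_output_duplicates.sh or stax: the one-sentence-per-line format, the order-preserving behaviour and the particular counts reported are all assumptions for illustration.

```python
def unique_lines(lines):
    """Drop blank and repeated lines, keeping first occurrences in their original order."""
    seen = set()
    result = []
    for line in lines:
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            result.append(line)
    return result

def coverage(generated, training):
    """Count how the generated sentences relate to the training sentences."""
    generated_set, training_set = set(generated), set(training)
    return {
        "training": len(training_set),                   # distinct training sentences
        "recovered": len(generated_set & training_set),  # training sentences generated again
        "novel": len(generated_set - training_set),      # generated but never seen in training
    }

generated = unique_lines(["list files\n", "list files\n", "list all files\n"])
print(coverage(generated, ["list files", "list directories"]))
```

In the ideal case mentioned earlier, where stax gives back every sentence of the training corpus, `recovered` equals `training`.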
Once you have an executable built, you just need to invoke it and it will be waiting for your command. The accepted input is limited to only one command at a time. Punctuation is not checked so hitting enter indicates the end of the sentence and triggers execution. The dictionary currently contains the following words:
Verbs:
- list
- copy
- remove
- delete
- change
- move
- make
Nouns:
- file
- files
- directory
- directories
Adjectives:
- executable
- non-executable
- empty
- symlinked
Pronouns:
- all
Prepositions:
- to
- from
- in
Polarity:
- not
Conjunction:
- and
- or
Commands need to be formulated as English imperatives. Their syntax was initially implemented as follows:
(NOTE: THIS IS NOT UP TO DATE ANY MORE AS THE NUMBER OF RULES IS GROWING AND THE RULES THEMSELVES ARE CHANGING A LOT AS WELL, SO JUST TAKE IT AS AN EXAMPLE AND RATHER CHECK OUT THE GRAMMAR DB TABLES IN THE PLATFORM SPECIFIC CONTENT SQL FILES)
S------------->ENG_VP
ENG_VP-------->ENG_Vbar1
ENG_VP-------->ENG_Vbar1 ENG_AdvP
ENG_VP-------->ENG_Vbar2
ENG_VP-------->ENG_Vbar2 ENG_PP
ENG_VP-------->ENG_Vbar3 ENG_NP
ENG_Vbar3----->ENG_V ENG_AdvP
ENG_Vbar2----->ENG_Vbar1 ENG_PP
ENG_Vbar2----->ENG_Vbar1 ENG_NP
ENG_Vbar1----->ENG_V ENG_NP
ENG_PP-------->ENG_Prep ENG_NP
ENG_NP-------->ENG_CNP
ENG_NP-------->ENG_QPro ENG_CNP
ENG_CNP------->ENG_A ENG_CNP
ENG_CNP------->ENG_N
ENG_AdvP------>ENG_Adv
ENG_V--------->t_ENG_V_stem
ENG_QPro------>t_ENG_QPro
ENG_N--------->ENG_N_Sg
ENG_N--------->ENG_N_Pl
ENG_N_Sg_0Con->ENG_N_Stem ENG_N_lfea_Sg
ENG_N_Sg------>ENG_N_Sg_0Con ENG_1Con
ENG_N_Sg------>ENG_1Con
ENG_N_Pl_0Con->ENG_N_Stem ENG_N_lfea_Pl
ENG_1Con------>ENG_Con
ENG_nCon------>ENG_1Con ENG_Con
ENG_nCon------>ENG_nCon ENG_Con
ENG_N_Pl------>ENG_N_Pl_0Con
ENG_N_Pl------>ENG_N_Pl_0Con ENG_nCon
ENG_N_Pl------>ENG_nCon
ENG_N_Stem---->t_ENG_N_stem
ENG_N_lfea_Sg->t_ENG_N_lfea_Sg
ENG_N_lfea_Pl->t_ENG_N_lfea_Pl
ENG_A--------->t_ENG_A
ENG_Prep------>t_ENG_Prep
ENG_Con------->t_Con
ENG_Adv------->t_ENG_Adv
Symbols:
Since the framework has been prepared -at least technically- to be able to handle different languages, the symbols are prefixed with a language id.
S - Start symbol
ENG_NP - Noun Phrase
ENG_VP - Verb Phrase
ENG_VbarX - Intermediate node for verbs
ENG_PP - Prepositional Phrase
ENG_AdvP - Adverbial Phrase
ENG_Prep - Preposition
ENG_CNP - Common Noun Phrase
ENG_QPro - Quantified Pronoun
ENG_Con - Constant
ENG_1Con - 1 constant
ENG_nCon - >1 constant
ENG_V - Verb
ENG_Adv - Adverb
ENG_A - Adjective
ENG_N - Noun
ENG_N_Sg_0Con - Singular noun without constant
ENG_N_Pl_0Con - Plural noun without constant
ENG_N_Pl - Plural Noun
ENG_N_Sg - Singular Noun
ENG_N_Stem - Noun stem
ENG_N_lfea_Sg - Morpheme indicating singular noun
ENG_N_lfea_Pl - Morpheme indicating plural noun
t_* - terminal symbol
e - empty word
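To see how the rules listed earlier fit together, the following Python sketch checks whether a sequence of terminal symbols can be derived from S. It uses only a subset of the rules (no adverbs, no constants, no singular nouns) and is a plain memoised recogniser, not the bison parser the project generates. The token sequence shown for 'list all empty files' is an assumption about what the morphological analysis would yield.

```python
from functools import lru_cache

# Subset of the rules above; anything that is not a key here is a terminal.
RULES = {
    "S": [("ENG_VP",)],
    "ENG_VP": [("ENG_Vbar1",), ("ENG_Vbar2",)],
    "ENG_Vbar2": [("ENG_Vbar1", "ENG_PP")],
    "ENG_Vbar1": [("ENG_V", "ENG_NP")],
    "ENG_PP": [("ENG_Prep", "ENG_NP")],
    "ENG_NP": [("ENG_CNP",), ("ENG_QPro", "ENG_CNP")],
    "ENG_CNP": [("ENG_A", "ENG_CNP"), ("ENG_N",)],
    "ENG_N": [("ENG_N_Pl",)],
    "ENG_N_Pl": [("ENG_N_Pl_0Con",)],
    "ENG_N_Pl_0Con": [("ENG_N_Stem", "ENG_N_lfea_Pl")],
    "ENG_V": [("t_ENG_V_stem",)],
    "ENG_QPro": [("t_ENG_QPro",)],
    "ENG_A": [("t_ENG_A",)],
    "ENG_Prep": [("t_ENG_Prep",)],
    "ENG_N_Stem": [("t_ENG_N_stem",)],
    "ENG_N_lfea_Pl": [("t_ENG_N_lfea_Pl",)],
}

def accepts(tokens, start="S"):
    """Return True if `tokens` can be derived from `start` using RULES."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        if symbol not in RULES:                  # terminal: must match exactly one token
            return j == i + 1 and tokens[i] == symbol
        for rhs in RULES[symbol]:
            if len(rhs) == 1:                    # unit rule: same span
                if derives(rhs[0], i, j):
                    return True
            else:                                # binary rule: try every split point
                left, right = rhs
                if any(derives(left, i, k) and derives(right, k, j)
                       for k in range(i + 1, j)):
                    return True
        return False

    return len(tokens) > 0 and derives(start, 0, len(tokens))

# Assumed analysis of 'list all empty files': verb, pronoun, adjective, noun stem + plural.
print(accepts(["t_ENG_V_stem", "t_ENG_QPro", "t_ENG_A",
               "t_ENG_N_stem", "t_ENG_N_lfea_Pl"]))
```

The recogniser terminates because binary rules always split the span into two shorter parts and this subset contains no cycles among its unit rules.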
- Introducing and supporting more syntactic categories*
- Enhancing the lexicon*
- Handling compound sentences
- Handling defining relative clause >> done
- Supporting noun specific adjective interpretation (getting rid of classification problems posed by the semantic tree) >> done
- Reengineering Yacc source to avoid recompiling if only dictionary changes but syntactic rules aren't changed >> done
- Resolving conflicting adjectives*
- Handling statements*
- Handling questions
- Introducing interaction (Did you mean ...?)
- Context handling (e.g. Copy all non-executable files to directory abc. Copy those files to def as well.)
- Machine learning*
- Support any language as source of translation (just by technically providing the possibility in DB) >> done
- Partial/error tolerant sentence analysis, simply by omitting words that cannot be analysed, or that can be analysed but don't fit in the given syntactic model. If no sentence analysis can be carried out, giving back just the morphological analysis of the words may still make sense, as in a mobile phone use case users may just say one-word commands. >> done
*=Under development