This is not meant to be comprehensive, but it should at least touch on the main aspects of adding a language.
- Add a dictionary file
  - `./add-lang.sh de DE` (will require the relevant GNU Aspell library to be installed)
- Attribute dictionary source
  - Add a relevant `assets/dictionaries/dictionary.*.LICENSE` file
- Make Gradle aware of the language
  - Add `"de_DE",` to the `languages` array in `build.gradle`
- Generate a trie representation of the dictionary
  - `./gradlew buildDictionary_de`
- Subclass `Language`
  - Add it to the `libraries/trie` library
- Tell Lexica about your `Language` class
  - Add an entry to `Language#getAllLanguages()`
- Generate random letter distributions
  - `./gradlew analyseLanguage_de`
- Add language name (in English)
  - Edit `app/src/main/res/values/strings.xml`, adding `pref_dict_LANG` and optionally `pref_dict_LANG_description`
- Add scrabble scores
  - Edit your `Language` subclass, adding letter scores from the Wikipedia article "Scrabble letter distributions"
- Run tests
  - `./gradlew check && ./gradlew connectedCheck`

This project uses the GNU Aspell project to obtain dictionaries.
The script `add_lang.sh` in the root directory of this project will:

- Dump all words from a specific dictionary (e.g. `en_GB` or `de_DE`).
- Omit words shorter than 3 characters or longer than 9.

Although it is technically possible for words longer than 9 letters to appear on Lexica boards, in practice it is so unlikely that including them causes problems when developing new random board generators. It is hard to measure how successful a board generator is if the vast majority of words in a language are very long (e.g. in German).
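The length rule above can be illustrated with a minimal Python sketch. This is not the actual script; the function name `filter_words` and the constants are hypothetical, and it assumes the dictionary has already been dumped to one word per line.

```python
MIN_LEN = 3  # words shorter than this are omitted
MAX_LEN = 9  # words longer than this are omitted


def filter_words(words):
    """Keep only words whose length is within [MIN_LEN, MAX_LEN]."""
    kept = []
    for word in words:
        word = word.strip()  # drop surrounding whitespace and newlines
        if MIN_LEN <= len(word) <= MAX_LEN:
            kept.append(word)
    return kept
```

For example, `filter_words(["an", "cat", "Handschuhe"])` keeps only `"cat"`: the first word is too short and the last has ten letters.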
Other dictionaries have also been included, such as the Japanese dictionary. It comes from the JMdict project and is used under the CC-BY-SA-3.0 license. This was made possible by the work of @wichmann.
Once a dictionary is available, the next trick is to create a set of probability distributions to be used by random board generators, so that the generated boards tend to have nice properties (i.e. lots of words).
The format used by these probability distributions is inherited from the original Lexic project:
```
a 12 3 2 1 1
b 5 3 1
c 3 1
d 8 4 1
e 24 12 3 1 1
...
```
Boards are generated by consulting this probability distribution, and performing a weighted random choice of letters. This is done by looking at the first column of numbers, and performing roulette wheel selection based on this value. The higher the value, the more likely a letter will be chosen.
Once a letter is chosen (e.g. `d` in the example distribution above), the first number in that row is removed (leaving `d 4 1` in this example). Each successive number is lower, so the probability of the same letter being chosen again is lower than it was for the previous choice of that letter.
Once a letter has had all of its numbers exhausted, it cannot be chosen again for that board. Thus, in the example above, `c 3 1` means that the letter `c` can be chosen at most twice per board.
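The selection procedure described above can be sketched in Python as follows. This is an illustration of the idea rather than Lexica's actual implementation: the names `parse_distribution` and `generate_board` are hypothetical, and the sketch assumes the board is simply a flat list of letters.

```python
import random


def parse_distribution(text):
    """Parse lines such as 'd 8 4 1' into {'d': [8, 4, 1]}."""
    weights = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        weights[parts[0]] = [int(n) for n in parts[1:]]
    return weights


def generate_board(weights, size, rng=None):
    """Pick `size` letters by repeated roulette wheel selection."""
    rng = rng or random.Random()
    # Work on a copy so the caller's distribution is left untouched.
    remaining = {letter: list(nums) for letter, nums in weights.items()}
    board = []
    for _ in range(size):
        # Only letters that still have numbers left can be chosen.
        candidates = [l for l, nums in remaining.items() if nums]
        if not candidates:
            raise ValueError("distribution exhausted before the board was full")
        # The first remaining number in each row is its current weight.
        current = [remaining[l][0] for l in candidates]
        letter = rng.choices(candidates, weights=current, k=1)[0]
        board.append(letter)
        remaining[letter].pop(0)  # e.g. 'd 8 4 1' becomes 'd 4 1'
    return board
```

With a row such as `c 3 1`, the letter `c` starts with weight 3, drops to weight 1 after it is first chosen, and is removed from the candidates after the second choice.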
The original source code of Lexic included hard-coded versions of these letter frequencies, without a way to generate them. This version of Lexica includes a script to count the number of times each letter occurs.
To run the algorithm: `./gradlew analyseLanguage_LANG`.
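As a rough illustration of the counting step only, the Python sketch below tallies letter occurrences across a word list. The function name `count_letters` is hypothetical, and how the real task turns such counts into the rows of the distribution file is not described here, so that part is deliberately left out.

```python
from collections import Counter


def count_letters(words):
    """Count how often each letter occurs across all words."""
    counts = Counter()
    for word in words:
        counts.update(word)  # adds one per character in the word
    return counts
```

For the words `cat`, `act` and `tee`, this counts `t` three times and each of `c`, `a` and `e` twice.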