Introduce new break types and phrase splitting for Japanese addresses #3629
This PR is based on and supersedes #3158.
This PR enables two new types of breaks between terms for query parsing. PART (-) breaks indicate that two terms are connected by some kind of punctuation and may therefore form a single word. SOFT PHRASE (:) breaks can be introduced by preprocessors and indicate that a phrase break at this point is highly likely.
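To illustrate the distinction between the break types, here is a minimal, hypothetical sketch. It is not the code from this PR: the names `BreakType`, `classify_separator` and `split_with_breaks` are invented for illustration, and the WORD and PHRASE markers are assumptions added only to give the two new types some context.

```python
import enum
import re


class BreakType(enum.Enum):
    WORD = " "         # assumption: plain whitespace between terms
    PART = "-"         # from the PR text: terms joined by punctuation
    SOFT_PHRASE = ":"  # from the PR text: phrase break is highly likely
    PHRASE = ","       # assumption: explicit phrase break


def classify_separator(sep: str) -> BreakType:
    """Map the characters found between two terms to a break type."""
    if "," in sep:
        return BreakType.PHRASE
    if ":" in sep:
        return BreakType.SOFT_PHRASE
    if sep.strip() == "":
        return BreakType.WORD
    # Any other punctuation (hyphen, slash, dot, ...) may glue two
    # terms into one word, so it becomes a PART break.
    return BreakType.PART


def split_with_breaks(query: str):
    """Return the terms of a query and the break type between each pair."""
    pieces = re.split(r"(\W+)", query.strip())
    terms = pieces[0::2]
    breaks = [classify_separator(sep) for sep in pieces[1::2]]
    return terms, breaks
```

In this sketch, `split_with_breaks("foo-bar baz, qux")` yields the terms `foo`, `bar`, `baz`, `qux` with a PART, a WORD and a PHRASE break between them.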
The Japanese phrase splitter makes use of soft phrase breaks to mark the boundaries between address parts. As Japanese addresses tend to be written entirely without spaces, these soft phrase breaks are a useful indicator for Nominatim on how to break up the query.
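The idea can be sketched as follows. This is a deliberately naive illustration, not the splitter added by this PR: the function name `split_soft_phrases` and the regular expression are assumptions, and the pattern only looks for a prefecture suffix (都, 道, 府, 県) followed by a municipality suffix (市, 区, 町, 村). Place names that themselves contain one of these characters (for example 四日市市) will be split incorrectly.

```python
import re

SOFT_PHRASE = ":"  # soft phrase break marker as described above

# Hypothetical pattern: optional prefecture, optional municipality,
# then whatever is left of the address.
ADDRESS_RE = re.compile(
    r"^(?P<pref>.{2,3}?[都道府県])?"
    r"(?P<city>.+?[市区町村])?"
    r"(?P<rest>.*)$"
)


def split_soft_phrases(address: str) -> str:
    """Insert soft phrase breaks between the recognised address parts."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return address
    # Drop groups that did not match or matched nothing.
    parts = [part for part in match.groups() if part]
    return SOFT_PHRASE.join(parts)


print(split_soft_phrases("東京都新宿区西新宿2丁目8-1"))
```

With this pattern the unspaced input is turned into three parts separated by `:`, which a query parser could then treat as likely phrase boundaries.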
In tests against a subset of Japanese addresses from the Overture POI corpus, Nominatim recognises about 3 times as many addresses with the new phrase splitter as without it. The main remaining obstacle is now missing breaks between numbers and letters.
Many thanks to @miku0 for all her hard work on this.