Introduce new break types and phrase splitting for Japanese addresses #3629

Merged: 4 commits, Jan 9, 2025


@lonvia lonvia commented Jan 9, 2025

This PR is based on and supersedes #3158.

This PR enables two new types of breaks between terms for query parsing. PART (-) breaks indicate that two terms are connected by some kind of punctuation and may therefore form a single word together. SOFT PHRASE (:) breaks can be introduced by preprocessors and indicate that a phrase break at this point is highly likely.
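To make the two break types concrete, here is a minimal sketch of how a normalised query string could be split into terms, each tagged with the break that follows it. This is not the code from this PR: the function `split_terms`, the `BREAK_NAMES` table and the label strings are hypothetical names chosen for illustration, and the sketch assumes the query has already been normalised so that only a space, '-' and ':' act as break signs.

```python
import re

# Hypothetical labels for the break signs described above.
# Assumption: ' ' is an ordinary word break, '-' a PART break,
# ':' a SOFT PHRASE break.
BREAK_NAMES = {" ": "WORD", "-": "PART", ":": "SOFT_PHRASE"}


def split_terms(query: str) -> list[tuple[str, str]]:
    """Return (term, break_after) pairs; the last term is tagged 'END'."""
    # With a capturing group, re.split returns
    # [term, sep, term, sep, ..., term].
    pieces = re.split(r"([ :-])", query)
    result = []
    for i in range(0, len(pieces), 2):
        term = pieces[i]
        sep = pieces[i + 1] if i + 1 < len(pieces) else None
        if term:  # skip empty terms produced by adjacent break signs
            result.append((term, BREAK_NAMES.get(sep, "END")))
    return result


if __name__ == "__main__":
    for term, brk in split_terms("main-street 5:berlin"):
        print(term, brk)
```

A later parsing stage could then treat a PART break as "these two terms may belong to one word" and a SOFT_PHRASE break as "a phrase boundary is likely here", without either being a hard constraint.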

The Japanese phrase splitter makes use of soft phrase breaks to mark the boundaries between address parts. As Japanese addresses tend to be written entirely without spaces, these soft phrase breaks are a useful indicator for Nominatim on how to break up the query.
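As a rough illustration of the idea, the sketch below inserts ':' between the prefecture, the municipality and the remainder of an unspaced Japanese address. It is a simplified stand-in, not the splitter added in this PR: the pattern `ADDRESS_RE` and the function `add_soft_breaks` are hypothetical, and the regular expression is deliberately naive (it assumes a two- or three-character prefecture name ending in 都, 道, 府 or 県, followed by a municipality ending in 市, 区, 町 or 村).

```python
import re

SOFT_PHRASE = ":"  # break sign for a likely phrase boundary

# Assumption: a naive pattern for illustration only.
#   group 1: prefecture, 2-3 characters followed by one of 都道府県
#   group 2: municipality, shortest run ending in one of 市区町村
#   group 3: everything that remains
ADDRESS_RE = re.compile(r"^(.{2,3}?[都道府県])(.+?[市区町村])(.+)$")


def add_soft_breaks(query: str) -> str:
    """Insert soft phrase breaks between recognised address parts."""
    match = ADDRESS_RE.match(query)
    if match is None:
        # Not recognised as a Japanese address: leave the query untouched.
        return query
    return SOFT_PHRASE.join(match.groups())


if __name__ == "__main__":
    print(add_soft_breaks("東京都千代田区丸の内1"))  # 東京都:千代田区:丸の内1
```

Queries that do not match the pattern pass through unchanged, so a preprocessor along these lines would only add hints and never remove information from the query.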

Testing against a subset of Japanese addresses from the Overture POI corpus, Nominatim can recognise about 3 times as many addresses with the new phrase splitter. The main remaining obstacle is now missing breaks between numbers and letters.

Many thanks to @miku0 for all her hard work on this.

lonvia added 4 commits January 6, 2025 17:10
Also enables parsing of PART breaks.
All punctuation will be converted to '-'. Soft breaks ':' may be
added by preprocessors. The break signs are only used during
query analysis and are ignored during import token analysis.
Code adapted from GSOC code by @miku.
@lonvia lonvia merged commit f8337be into osm-search:master Jan 9, 2025
8 checks passed
@lonvia lonvia deleted the additional-breaks branch January 9, 2025 12:59