Introduce new break types and phrase splitting for Japanese addresses #3629

lonvia · 2025-01-09T09:10:08Z

This PR is based on and supersedes #3158.

This PR enables two new types of breaks between terms for query parsing. PART (-) breaks indicate that two terms are connected by some kind of punctuation and may therefore form a single word together. SOFT PHRASE (:) breaks can be introduced by preprocessors and indicate that a phrase break at this point is highly likely.
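
To illustrate how the break markers separate terms, here is a minimal sketch that splits an already normalised query string into terms and records the break that follows each one. The `BreakType` enum and `split_terms` function are names made up for this example; they are not taken from the Nominatim code and only mirror the markers described above, plus plain spaces and commas.

```python
import re
from enum import Enum


class BreakType(Enum):
    """Illustrative break markers (hypothetical names, not Nominatim's API)."""
    WORD = " "         # ordinary break between words
    PART = "-"         # terms joined by punctuation, may form one word
    SOFT_PHRASE = ":"  # phrase break is highly likely here
    PHRASE = ","       # explicit phrase break


# The capture group makes re.split() keep the separators in its result.
BREAK_RE = re.compile(r"([ \-:,])")


def split_terms(query):
    """Return (term, break after term) pairs; the last term has break None."""
    pieces = BREAK_RE.split(query)
    terms = []
    # pieces alternates term, separator, term, separator, ..., term
    for i in range(0, len(pieces), 2):
        brk = BreakType(pieces[i + 1]) if i + 1 < len(pieces) else None
        terms.append((pieces[i], brk))
    return terms


if __name__ == "__main__":
    for term, brk in split_terms("tokyo:chiyoda marunouchi-1"):
        print(term, brk.name if brk else "END")
```

A real query analyser would additionally assign penalties to each break type; the sketch only shows how the markers partition the query.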

The Japanese phrase splitter makes use of soft phrase breaks to indicate breaks between the address parts. As Japanese addresses tend to be written entirely without spaces, these soft phrase breaks are a useful indicator for Nominatim on how to break up the query.
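
The idea can be sketched roughly as follows. This is a simplified illustration, not the preprocessor added by this PR: the regular expression, the suffix lists and the function name `add_soft_breaks` are assumptions chosen for the example. It looks for a prefecture suffix and a municipality suffix and inserts a `:` after each.

```python
import re

# Assumed pattern for illustration only:
#   group 1: two or three characters ending in a prefecture suffix
#   group 2: shortest following run ending in a municipality suffix
#   group 3: the rest of the address
ADDRESS_RE = re.compile(
    r"(...??[都道府県])"
    r"(.+?[市区町村])"
    r"(.+)"
)


def add_soft_breaks(text):
    """Insert ':' between the address parts if the whole text matches."""
    match = ADDRESS_RE.fullmatch(text)
    if match is None:
        # Not recognisable as a Japanese address: leave it untouched.
        return text
    return ":".join(match.groups())


if __name__ == "__main__":
    print(add_soft_breaks("東京都千代田区丸の内"))
    print(add_soft_breaks("北海道札幌市中央区"))
```

Because the output is only a hint, a wrong guess by such a pattern does not rule out other ways of breaking up the query; it merely makes a phrase break at the marked position more likely.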

In tests against a subset of Japanese addresses from the Overture POI corpus, Nominatim recognises about three times as many addresses with the new phrase splitter. The main remaining obstacle is missing breaks between numbers and letters.

Many thanks to @miku0 for all her hard work on this.

lonvia added 4 commits January 6, 2025 17:10

add SOFT_PHRASE break and enable parsing

499110f

Also enables parsing of PART breaks.

add inner word break penalty

d984100

keep break indicators [:-] during normalisation

86ad9ef

All punctuation will be converted to '-'. Soft breaks (':') may be added by preprocessors. The break signs are only used during query analysis and are ignored during import token analysis.

add japanese phrase preprocessing

efc09a5

Code adapted from GSOC code by @miku.
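
As a rough illustration of the normalisation step described in the third commit, the sketch below maps every punctuation character to `-` while leaving `:` in place. It is an assumption-laden simplification: Nominatim's actual normalisation is driven by its configured normalisation rules, and the function name `normalize_breaks` and the use of `unicodedata` are choices made only for this example.

```python
import re
import unicodedata


def normalize_breaks(text):
    """Map punctuation to '-' and keep ':' as the soft phrase marker."""
    out = []
    for ch in text:
        if ch == ":":
            # Soft phrase breaks added by preprocessors are preserved.
            out.append(ch)
        elif unicodedata.category(ch).startswith("P"):
            # Any Unicode punctuation becomes a PART break marker.
            out.append("-")
        else:
            out.append(ch)
    # Collapse runs of '-' so adjacent punctuation yields a single break.
    return re.sub(r"-{2,}", "-", "".join(out))


if __name__ == "__main__":
    print(normalize_breaks("foo/bar:baz"))
```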

lonvia mentioned this pull request Jan 9, 2025

Added a module to split Japanese words #3158

Closed

lonvia merged commit f8337be into osm-search:master Jan 9, 2025
8 checks passed

lonvia deleted the additional-breaks branch January 9, 2025 12:59