Added a module to split Japanese words #3158

Closed
wants to merge 2 commits into from

Conversation

miku0
Contributor

@miku0 miku0 commented Aug 19, 2023

This code splits Japanese addresses into three parts based on administrative divisions: cities, municipalities, and everything below them.
Nominatim applies ICU (International Components for Unicode) transliteration to user-entered addresses and then splits them into meaningful words. The debug output below shows an example; a single query produces many candidates.
image
Fig. 1 Example of debug output.

To make this split more accurate, when the string contains large administrative divisions (prefecture and city), the algorithm separates them in advance and places a "," marker between the split words.
In the program this "," is mapped to BreakType.SOFT_PHRASE, and words that span such a node are penalized and therefore get a lower search priority.
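
The pre-separation step described above might look roughly like the following. This is a minimal sketch, not the code from this pull request: the regular expression, the function name `presplit_japanese_address`, and the `SOFT_PHRASE_MARKER` constant are assumptions made for illustration, and the pattern ignores many real-world cases (for example, city names that themselves contain 市).

```python
import re

# Marker placed between the pre-separated parts (the "," from the description above).
SOFT_PHRASE_MARKER = ","

# Hypothetical, heavily simplified pattern:
#   pref: two or three characters followed by one of 都/道/府/県
#   city: the shortest following run that ends in 市/区/町/村
#   rest: whatever remains
_ADDRESS_PATTERN = re.compile(
    r"^(?P<pref>.{2,3}?[都道府県])?"
    r"(?P<city>.+?[市区町村])?"
    r"(?P<rest>.*)$"
)


def presplit_japanese_address(address: str) -> str:
    """Insert a soft-phrase marker after the prefecture and city, if present."""
    match = _ADDRESS_PATTERN.match(address)
    if match is None:
        return address
    parts = [match.group(name) for name in ("pref", "city", "rest")]
    # Skip groups that did not match so no empty parts are joined.
    return SOFT_PHRASE_MARKER.join(p for p in parts if p)


print(presplit_japanese_address("大阪府大阪市大阪"))  # 大阪府,大阪市,大阪
print(presplit_japanese_address("大阪市大阪"))        # 大阪市,大阪
```

Input without a recognizable prefecture or city suffix is returned unchanged, so the step only affects strings that look like Japanese addresses.
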
The node relationships are as follows:
(1)--da->(2)--ban->(3)--shi->(4)--da->(5)--ban->(6)
||                           ^ ^                ||
|+----------大阪市-----------+ +------大阪------+|
+-------------------大阪市大阪-------------------+

As a result of this change, "大阪市大阪", which spans a SOFT_PHRASE break, receives a higher penalty and therefore a lower search priority than "大阪市", the name of a city (the penalty is the fifth value from the left).
image
Fig. 2 Before the change.
image
Fig. 3 After the change.
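
To illustrate how a soft-phrase break can lower the priority of a word that spans it, here is a small sketch of the node graph above. It is not Nominatim's implementation: the `Break` enum, the `candidate_spans` function, and the penalty value of 0.5 are all hypothetical, chosen only to show the idea that a span crossing a soft-phrase node gets an extra penalty.

```python
import enum
from typing import Dict, List, Tuple


class Break(enum.Enum):
    """Simplified break types for this sketch (not Nominatim's BreakType)."""
    TOKEN = "token"
    SOFT_PHRASE = "soft_phrase"


# Assumed extra penalty per soft-phrase break crossed by a span.
SOFT_PHRASE_PENALTY = 0.5


def candidate_spans(breaks: Dict[int, Break],
                    last_node: int) -> List[Tuple[int, int, float]]:
    """Return (start_node, end_node, penalty) for every span of the graph.

    `breaks` maps each interior node number to the break type at that node.
    A span is penalized once for every soft-phrase node strictly inside it.
    """
    spans = []
    for start in range(1, last_node):
        for end in range(start + 1, last_node + 1):
            crossed = sum(
                1 for node in range(start + 1, end)
                if breaks.get(node) is Break.SOFT_PHRASE
            )
            spans.append((start, end, crossed * SOFT_PHRASE_PENALTY))
    return spans


# Nodes 1..6 as in the diagram: da, ban, shi, da, ban,
# with the soft-phrase break at node 4.
breaks = {2: Break.TOKEN, 3: Break.TOKEN, 4: Break.SOFT_PHRASE, 5: Break.TOKEN}
penalties = {(s, e): p for s, e, p in candidate_spans(breaks, last_node=6)}

print(penalties[(1, 4)])  # 大阪市: 0.0
print(penalties[(4, 6)])  # 大阪: 0.0
print(penalties[(1, 6)])  # 大阪市大阪: 0.5
```

In this sketch the spans for "大阪市" and "大阪" start or end at the soft-phrase node and are not penalized, while "大阪市大阪" passes through it and is.
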

@lonvia
Member

lonvia commented Aug 20, 2023

The failing tests may not be related to your code. I see the same failures in an unrelated change. I'm investigating.

@lonvia
Member

lonvia commented Aug 20, 2023

Can you please rebase your code on master? This should make the CI errors go away.

Member

@lonvia lonvia left a comment

Looks mostly good. Just two minor comments from my side.

for p in phrases)))
return normalized

def split_key_japanese_phrases(
Member

You probably forgot to delete the old code here.

@@ -29,6 +29,7 @@ class BreakType(enum.Enum):
""" Break created as a result of tokenization.
This may happen in languages without spaces between words.
"""
SOFT_PHRASE = ':'
Member

Can you please add documentation for this new type, just as is done in the lines above?

Contributor Author

Thank you so much for your help and comments.
I added the documentation.

@miku0 miku0 force-pushed the soft_phrase-final branch from 8f43956 to dfbacf4 Compare August 21, 2023 01:06
@lonvia
Member

lonvia commented Jan 9, 2025

Superseded by #3629.

@lonvia lonvia closed this Jan 9, 2025