❓ Per-field tokenization (question for the community) ❓ #2006
Comments
Could you please elaborate on what sorts of tokenization settings might be available on a per-field basis and some of the use cases/advantages for it? |
I think all the available tokenization settings would become per-field in this case:
etc.
Inviting @superkelvint as I know he knows a lot about it. |
Just to make sure: the mentioned performance reduction would only apply to tables where this feature is used, not on all tables regardless of tokenization model? |
The performance reduction mentioned would likely apply only to tables that utilize this feature. We would do our best to maintain the current level of performance in other aspects. |
Common fields which require non-fulltext treatment include:
- Numeric codes and identifiers
- IDs and part numbers
- Internet
- Legal
|
Perhaps also important to mention that for users planning to migrate from Lucene/Solr/Elasticsearch (like myself), not being able to specify analyzers per-field makes migrating extremely difficult because we are used to having this flexibility in Lucene-based systems and have therefore used this feature extensively. Granted, Manticore does provide some support for this in the form of numeric, boolean, date field types. But that is very basic compared to Lucene, and applications would very likely have to lose functionality when migrating to Manticore which is a difficult pill to swallow. |
I came here to open a feature request for this specific feature (but spotted this post).
|
Do you mean you used |
We have a large amount of data indexed and only have the resources (and requirement) to infix certain (short) selected fields. I tested swapping one of the indexes from I was presuming this was due to |
Please make sure it actually worked for you. Here's an example showing
Same with
The point is that you can't enable
So could it be that you thought that |
For electronic component search, use min_infix_len = 2, expand_keywords = 1 and dict = keywords, and add the special characters to the charset_table. You may also need to URL-encode special characters before putting them in the URL. |
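The URL-encoding step mentioned above can be sketched in Python with the standard library. This is only an illustration: the part numbers are made up, and how the encoded term is then passed to the search endpoint depends on your setup.

```python
from urllib.parse import quote


def encode_term(term):
    """Percent-encode a search term, including '/' (safe="" leaves nothing reserved unescaped)."""
    return quote(term, safe="")


# Hypothetical part numbers containing characters that are special in URLs.
for part in ["LM317T/NOPB", "A+B#7", "ABC-12345-S-BL"]:
    print(part, "->", encode_term(part))
```

Hyphens are left untouched because they are unreserved in URLs, while characters such as /, + and # are escaped.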
Since the beginning, Sphinx and Manticore have not offered per-field tokenization settings (except for morphology_skip_fields and infix/prefix_fields), and it seems that there hasn't been much concern about this. On the other hand, if Manticore were to introduce this functionality, it would simplify certain use cases that require different tokenization, such as fields containing values like ABC-12345-S-BL.
It would be interesting to know if the community considers it important to implement per-field tokenization settings in Manticore, similar to how it works in Elasticsearch and Solr, allowing for the specification of tokenization settings for each field.
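To make the idea concrete, here is a minimal Python sketch of what per-field tokenization means for a value like ABC-12345-S-BL. The field names, the two analyzers and their rules are hypothetical and only illustrate the concept; they are not Manticore, Elasticsearch or Solr settings.

```python
import re


def tokenize_text(value):
    """Full-text style: lowercase, then split on anything that is not an ASCII letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", value.lower()) if t]


def tokenize_code(value):
    """Identifier style: keep the whole value as a single lowercase token."""
    stripped = value.strip().lower()
    return [stripped] if stripped else []


# Hypothetical mapping of field name to analyzer.
FIELD_ANALYZERS = {
    "description": tokenize_text,
    "part_number": tokenize_code,
}


def analyze(document):
    """Apply each field's own analyzer, falling back to the full-text one."""
    return {
        field: FIELD_ANALYZERS.get(field, tokenize_text)(value)
        for field, value in document.items()
    }


doc = {"description": "Blue widget, size S", "part_number": "ABC-12345-S-BL"}
tokens = analyze(doc)
print(tokens["description"])
print(tokens["part_number"])
print(tokenize_text("ABC-12345-S-BL"))
```

With a single table-wide tokenizer the part number is broken into fragments such as "s" and "bl", whereas the per-field analyzer keeps it searchable as one exact token while the description is still treated as ordinary text.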
Furthermore, I'm curious how those who have been using Manticore for years have addressed this issue. I'm going to personally ask some Manticore users to provide feedback.