❓ Per-field tokenization (question for the community) ❓ #2006

Open
sanikolaev opened this issue Mar 26, 2024 · 11 comments

Comments

@sanikolaev
Collaborator

Since the beginning, Sphinx and Manticore have not offered per-field tokenization settings (except for morphology_skip_fields and infix/prefix_fields), and it seems that there hasn't been much concern about this. On the other hand, if Manticore were to introduce this functionality, it would simplify certain use cases that require different tokenization, such as:

  • Storing titles/descriptions along with SKU numbers (e.g., ABC-12345-S-BL).
  • Managing titles/descriptions and email/IP addresses in the same table.
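To illustrate why a single table-wide rule is awkward here, the following is a minimal Python sketch, not Manticore's actual tokenizer. The `tokenize` helper and its `extra_chars` parameter are hypothetical; they only mimic the idea of adding characters to a charset.

```python
import re

def tokenize(text, extra_chars=""):
    """Split text into lowercase tokens of letters, digits, and any extra characters.

    Simplified model for illustration only; real tokenizers do much more.
    """
    pattern = "[a-z0-9" + re.escape(extra_chars) + "]+"
    return re.findall(pattern, text.lower())

title = "Blue cotton T-shirt"
sku = "ABC-12345-S-BL"

# Hyphen treated as a separator: the SKU breaks into fragments.
print(tokenize(sku))         # ['abc', '12345', 's', 'bl']

# Hyphen treated as a token character: the SKU stays whole...
print(tokenize(sku, "-"))    # ['abc-12345-s-bl']

# ...but the same rule applied to the title also glues "t-shirt" together.
print(tokenize(title, "-"))  # ['blue', 'cotton', 't-shirt']
```

With one rule for the whole table, either the SKU or the title ends up tokenized in a way that does not suit it.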

It would be interesting to know if the community considers it important to implement per-field tokenization settings in Manticore, similar to how it works in Elasticsearch and Solr, allowing for the specification of tokenization settings for each field.

Furthermore, I'm curious how those who have been using Manticore for years have addressed this issue. I'm going to personally ask some Manticore users to provide feedback.

@sanikolaev sanikolaev pinned this issue Mar 26, 2024
@nickchomey

Could you please elaborate on what sorts of tokenization settings might be available on a per-field basis and some of the use cases/advantages for it?

@sanikolaev
Collaborator Author

I think all the available tokenization settings would become per-field in this case:

  • charset_table
  • morphology
  • blend_chars
  • ignore_chars
  • stopwords
  • exceptions
  • wordforms

etc.
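As a rough picture of what "per-field" could mean, here is a hedged Python sketch. The `FIELD_SETTINGS` structure and the `analyze` function are hypothetical and do not describe any existing or planned Manticore syntax; the keys only loosely echo the option names above.

```python
import re

# Hypothetical per-field settings; nothing here is real Manticore configuration.
FIELD_SETTINGS = {
    "title": {"extra_chars": "", "ignore_chars": "'", "stopwords": {"the", "a", "of"}},
    "email": {"extra_chars": "@.-_", "ignore_chars": "", "stopwords": set()},
}

def analyze(field, text):
    """Tokenize text using the settings registered for the given field."""
    cfg = FIELD_SETTINGS[field]
    text = text.lower()
    # Ignored characters are dropped without splitting the token.
    for ch in cfg["ignore_chars"]:
        text = text.replace(ch, "")
    # Letters and digits are always token characters; extra ones are per field.
    pattern = "[a-z0-9" + re.escape(cfg["extra_chars"]) + "]+"
    return [t for t in re.findall(pattern, text) if t not in cfg["stopwords"]]

print(analyze("title", "The King's Speech"))     # ['kings', 'speech']
print(analyze("email", "John.Doe@Example.com"))  # ['john.doe@example.com']
print(analyze("title", "John.Doe@Example.com"))  # ['john', 'doe', 'example', 'com']
```

The same input produces different tokens depending on which field's settings are applied, which is the behavior under discussion.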

> some of the use cases/advantages for it?

Inviting @superkelvint as I know he knows a lot about it.

@unterninja

Just to make sure: the mentioned performance reduction would only apply to tables where this feature is used, not on all tables regardless of tokenization model?

@sanikolaev
Collaborator Author

> the mentioned performance reduction would only apply to tables where this feature is used

The performance reduction mentioned would likely apply only to tables that utilize this feature. We would do our best to maintain the current level of performance in other aspects.

@superkelvint

superkelvint commented Mar 28, 2024

Common fields which require non-fulltext treatment include:

Numeric Codes and Identifiers

  • ISBNs: Unique identifiers for books that should be searchable in their entirety.
  • SSNs (Social Security Numbers): For applications that require identity verification, SSNs need exact match searching without tokenization.
  • Vehicle Identification Numbers (VINs): Each VIN is unique to a specific vehicle and must be searched precisely.

IDs and Part numbers

  • Model Numbers: "Model XR-2000" should remain unaltered for exact model searches.
  • SKUs: e.g. "ELEC-12345-BLU", "SHOE-98765-M-8"
  • ASIN (Amazon Standard Identification Numbers): Unique blocks of letters and/or numbers for identifying items on Amazon. e.g. B0825K99RP
  • Parts Numbers: "6E5-45371-01"
  • Electronic Component Identifiers: Unique codes used for electronic components in manufacturing and assembly, like resistors, capacitors, and integrated circuits, e.g. "ATMEGA328P-PU"

Internet

  • IP addresses
  • URLs
  • email addresses
  • Twitter hashtags and @ mentions: "#ThrowbackThursday" needs to be indexed as a single token for hashtag-based searches, and "@username" should be searchable as a distinct token to find mentions of specific users.
  • File system paths: c:\Users\MyDocuments or /home/user/documents

Legal

  • Legal Terms: "Ex post facto" should not be stemmed to preserve its specific legal context.
  • Case Names: "Roe v. Wade, 410 U.S. 113" must be tokenized as a whole entity for precise legal reference searching.
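The common thread in these examples is that the whole value should act as one token. Below is a minimal Python sketch of that idea, similar in spirit to keyword-style analyzers in Lucene-based systems. The helper names and the toy inverted index are assumptions made for illustration, not how any particular engine is implemented.

```python
from collections import defaultdict

def keyword_tokens(value):
    """Treat the whole field value as a single token (trimmed and lowercased)."""
    return [value.strip().lower()]

def build_index(docs, field):
    """Build a toy inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in keyword_tokens(doc[field]):
            index[token].add(doc_id)
    return index

docs = {
    1: {"part": "6E5-45371-01"},
    2: {"part": "ATMEGA328P-PU"},
    3: {"part": "6E5-45371-02"},
}
index = build_index(docs, "part")

# The full part number matches exactly one document.
print(sorted(index.get("6e5-45371-01", set())))  # [1]

# A fragment matches nothing, because the value was never split.
print(sorted(index.get("45371", set())))         # []
```

With full-text tokenization, a search for "45371" would match documents 1 and 3, which is usually not what an exact part-number lookup wants.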

@superkelvint

Perhaps also important to mention that for users planning to migrate from Lucene/Solr/Elasticsearch (like myself), not being able to specify analyzers per-field makes migrating extremely difficult because we are used to having this flexibility in Lucene-based systems and have therefore used this feature extensively.

Granted, Manticore does provide some support for this in the form of numeric, boolean, and date field types. But that is very basic compared to Lucene, and applications would very likely have to lose functionality when migrating to Manticore, which is a difficult pill to swallow.

@ChrisHSandN

I came here to open a feature request for this specific feature (but spotted this post).

  • Our use case for Manticore means we want only a subset of our fields expanded with infixes.
  • We have always used dict=crc (since the early days of Sphinx), but reading the Manticore docs recently made dict=keywords sound appealing (extra wildcard characters, smaller indexes, etc.).
  • It was therefore disappointing to find enabling this option disabled the ability to specify infix_fields option.

@sanikolaev
Collaborator Author

@ChrisHSandN

> It was therefore disappointing to find enabling this option disabled the ability to specify infix_fields option

Do you mean you used infix_fields, not just as a resource/performance optimization with dict=crc, but to make queries to some fields not run in infix mode (with probably expand_keywords=1)? If so, it shouldn't be a big deal (at least seems so to me, I'd need to check with the devs) to add support for it for the dict=keywords mode.

@ChrisHSandN

@sanikolaev

> Do you mean you used infix_fields, not just as a resource/performance optimization with dict=crc

We have a large amount of data indexed and only have the resources (and requirement) to infix certain (short) selected fields.

I tested swapping one of the indexes from dict=crc to dict=keywords, and total .sp* file space increased 40% from 3.2GB to 4.5GB (.spa + .spi went from 0.26GB to 0.46GB; as we are memory limited, these are the main limitation).

I was presuming this was due to dict=keywords infixing all the fields?
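For reference, the relative growth implied by the figures above can be checked with a few lines of Python. Only the four sizes come from the comment; the `pct_increase` helper is just illustrative arithmetic.

```python
def pct_increase(before, after):
    """Relative growth from before to after, in percent."""
    return (after - before) / before * 100

# Total .sp* size: 3.2GB -> 4.5GB
print(round(pct_increase(3.2, 4.5), 1))    # 40.6

# .spa + .spi size: 0.26GB -> 0.46GB
print(round(pct_increase(0.26, 0.46), 1))  # 76.9
```

So while the total grew by roughly 40%, the memory-relevant .spa + .spi files grew by roughly 77%.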

@sanikolaev
Collaborator Author

@ChrisHSandN

> we want only a subset of our fields expanded with infixes.
> We have always used dict=crc

Please make sure it actually worked for you. Here's an example showing infix_fields doesn't take effect with dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='crc' infix_fields='f'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('@f abc*')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

The same with dict=keywords and min_infix_len works fine:

mysql> drop table if exists t; create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'
--------------

Query OK, 0 rows affected, 1 warning (0.01 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t where match('@f abc*')
--------------

+------+--------+------+
| id   | f      | f2   |
+------+--------+------+
|    1 | abcdef |      |
+------+--------+------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 1ms ---

The point is that you can't enable min_infix_len for dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' min_infix_len='2';
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text, f2 text) dict='crc' min_infix_len='2'
--------------

ERROR 1064 (42000): error adding table 't': RT tables support prefixes and infixes with only dict=keywords

So could it be that you thought infix_fields worked for you, but it actually didn't: infix search wasn't in effect at all, and you didn't notice it?

@chongshengdz

chongshengdz commented Dec 28, 2024

For electronic component search, use min_infix_len = 2, expand_keywords = 1, and dict = keywords, and add the special characters to the charset_table. You may need to URL-encode special characters when passing them in a URL.
Don't use dict=crc; the index size will be huge.
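As a small sketch of the URL-encoding step, assuming the search term is sent from a Python client: `encode_term` is a hypothetical helper, and "LM317/A+#2" is a made-up part number used only to show special characters.

```python
from urllib.parse import quote, unquote

def encode_term(term):
    """Percent-encode a search term for safe use inside a URL.

    safe="" also encodes "/", which quote() leaves untouched by default.
    """
    return quote(term, safe="")

# Letters, digits, and hyphens pass through unchanged.
print(encode_term("ATMEGA328P-PU"))  # ATMEGA328P-PU

# "/", "+", and "#" would otherwise be misread as URL syntax.
print(encode_term("LM317/A+#2"))     # LM317%2FA%2B%232

# Decoding restores the original term.
print(unquote("LM317%2FA%2B%232"))   # LM317/A+#2
```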
