allow stop words #315

setop · 2023-04-07T06:54:29Z

setop
Apr 7, 2023

Would it be feasible to introduce "stop words", that is, words that are ignored when comparing strings?
This would cover common stop words like "and", "this", and so on, but also custom stop words: in my domain, for example, "application" is not relevant when comparing strings.

maxbachmann · 2023-04-12T18:25:56Z

maxbachmann
Apr 12, 2023
Maintainer

There is no built-in method for this. You should be able to filter out these words as a preprocessing step.

0 replies

setop · 2023-04-12T18:59:33Z

setop
Apr 12, 2023
Author

I should be more precise: I'm thinking of introducing stop words where a tokenizer is used, as in the methods token_sort_ratio and token_set_ratio.
If it is done in a preprocessing step, tokenization will happen twice. In addition, the first pass will be in Python whereas the second one is in C++, which is much faster.

3 replies

maxbachmann Apr 12, 2023
Maintainer

How do you intend to call rapidfuzz given an imaginary scorer function token_sort_with_stop_word_ratio?

setop Apr 12, 2023
Author

Or it could be a kwarg of an existing function: token_set_ratio(..., stopwords=List[str])

maxbachmann Apr 12, 2023
Maintainer

Since you're concerned about the performance, I assume you will apply it to a large dataset. It would be helpful to know:

how your input is structured + how large the datasets you're working with are
what output you expect
how often this is done
how you currently plan to use the library to do this

This would help me get a better understanding of what would be a good solution for your problem.

The scorer function is just an imaginary name for a scorer which does what you would like it to do, so you can use it in a code example :)

setop · 2023-04-14T08:43:16Z

setop
Apr 14, 2023
Author

Sure, let me explain my use case.

For an information system, I'm trying to link applications and computing resources.
Applications have names in the applications referential. There are about five hundred of them.
Computing resources have a description in the CMDB, which is supposed to mention the application name, but not strictly. There are about five thousand of them.

Currently, I do:

for each resource:
  for each application:
    score = token_set_ratio(<resource description>, <application name>)
  take best score
  if best score > threshold:
    link(resource, application with best score)

This works. The process is fast (a few seconds, less than a minute), but the results are so-so. When investigating, I saw many false positives due to matches on irrelevant words like "application" or "server". I then pre-processed the input using string replacement, but that is quite tedious and would require tokenization to be done properly.

Since functions like token_set_ratio might already tokenize the input, and do so in the fast part of the process, I was thinking that this could be a good place to add a stop-word filtering feature.

1 reply

maxbachmann Apr 14, 2023
Maintainer

You are right that by placing the workload inside the scorer you could, in theory, save tokenizing + rejoining. However, you would then perform the preprocessing step of removing the unwanted tokens len(resources) x len(applications) times. If you make this a preprocessing function instead, you only need to do the full preprocessing len(resources) + len(applications) times. While this is more work per call, it is likely less work overall. I would implement your function in the following way:

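from rapidfuzz import fuzz, process

# stop_words, applications, resources, threshold and link() come from your own code
proc_applications = [remove_stop_words(x, stop_words) for x in applications]
for resource in resources:
    proc_resource = remove_stop_words(resource, stop_words)
    # extractOne returns (choice, score, index), or None if nothing reaches score_cutoff
    match = process.extractOne(proc_resource, proc_applications, scorer=fuzz.token_set_ratio, score_cutoff=threshold, processor=None)
    if match:
        link(resource, applications[match[2]])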

A couple of notes on this implementation:

as noted above, this removes the stop words only once for each element, which should be faster
this calls process.extractOne with the scorer instead of calling the scorer directly. This has the advantage that process.extractOne can save Python calls and reduce duplicated work. E.g. for fuzz.token_set_ratio it will tokenize the query only once.
it passes score_cutoff into process.extractOne, which might allow it to exit early or choose a more efficient implementation
it passes processor=None, since this will become the default for all functions in rapidfuzz v3.0. In case you actually want to run utils.default_process on your elements, you should run it ahead of time, similar to remove_stop_words, so the work is not duplicated.
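
To make the first note concrete, here is a small sketch that only counts how often the stop-word filter runs in each layout. It does not call rapidfuzz at all. The sizes are scaled down from the thread (50 applications and 500 resources instead of 500 and 5000), the sample strings are invented, and the counter argument is a hypothetical helper added purely for illustration.

```python
def remove_stop_words(sentence, stop_words, counter):
    """Drop stop words and record that one filtering pass happened."""
    counter[0] += 1
    return " ".join(w for w in sentence.split() if w not in stop_words)


stop_words = {"application", "server"}
# Invented sample data, scaled down from the sizes mentioned in the thread.
applications = [f"app{i} application" for i in range(50)]
resources = [f"server hosting app{i % 50}" for i in range(500)]

# Layout A: filtering happens inside a (hypothetical) scorer,
# so both strings are filtered again for every pair.
in_scorer = [0]
for resource in resources:
    for application in applications:
        remove_stop_words(resource, stop_words, in_scorer)
        remove_stop_words(application, stop_words, in_scorer)

# Layout B: every string is filtered exactly once, before any scoring.
up_front = [0]
proc_applications = [remove_stop_words(a, stop_words, up_front) for a in applications]
proc_resources = [remove_stop_words(r, stop_words, up_front) for r in resources]

print(in_scorer[0], up_front[0])  # 50000 550
```

The per-pair count grows with the product of the two list sizes, while the up-front count grows with their sum. How much wall-clock time that saves depends on how expensive the filter is relative to the scorer, so it is worth measuring on real data.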

I would try to implement it this way and, as a first step, just use a pure Python function for remove_stop_words, e.g.

def remove_stop_words(sentence: str, stop_words: set[str]) -> str:
    # Using a set here is only fine because token_set_ratio is going to remove
    # duplicated words and sort the words anyway, so this implementation is
    # not compatible with all scorers.
    words = set(sentence.split())
    words = words.difference(stop_words)
    return " ".join(words)

If this turns out to be too slow, you can still think about a faster way to achieve this.
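
The comment in remove_stop_words points out that the set-based version only suits scorers that deduplicate and sort tokens themselves. For scorers where word order or repeated words matter, a variant that keeps the original order might look like the sketch below. The function name remove_stop_words_ordered and the case-insensitive matching are assumptions made for illustration; they are not part of rapidfuzz or of the suggestion above.

```python
def remove_stop_words_ordered(sentence: str, stop_words: set[str]) -> str:
    """Drop stop words, keeping the remaining words in their original order."""
    # Assumption: stop words are matched case-insensitively.
    lowered = {w.lower() for w in stop_words}
    # Duplicates are kept on purpose, so order-sensitive scorers see the
    # same sequence minus the stop words.
    return " ".join(w for w in sentence.split() if w.lower() not in lowered)


print(remove_stop_words_ordered("Billing Application server billing",
                                {"application", "server"}))
# Billing billing
```

Like the set-based version, this splits on whitespace only, so punctuation attached to a word (for example "server,") would not match a stop word unless the input is normalized first.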
