Improves search to handle smaller search terms. #4735

Open
wants to merge 10 commits into
base: develop
from

Conversation

medha-14
Contributor

@medha-14 medha-14 commented Jan 3, 2025

Description

Fixes #4734

Type of change

Please add a line in the relevant section of CHANGELOG.md to document the change (include PR #) - note reverse order of PR #s. If necessary, also add to the list of breaking changes.

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist:

  • No style issues: $ pre-commit run (or $ nox -s pre-commit) (see CONTRIBUTING.md for how to set this up to run automatically when committing locally, in just two lines of code)
  • All tests pass: $ python -m pytest (or $ nox -s tests)
  • The documentation builds: $ python -m pytest --doctest-plus src (or $ nox -s doctests)

You can run integration tests, unit tests, and doctests together at once, using $ nox -s quick.

Further checks:

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@medha-14 medha-14 requested a review from a team as a code owner January 3, 2025 11:10

codecov bot commented Jan 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.24%. Comparing base (a7253b8) to head (977f962).
Report is 24 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4735      +/-   ##
===========================================
+ Coverage    99.22%   99.24%   +0.01%     
===========================================
  Files          303      303              
  Lines        23070    23262     +192     
===========================================
+ Hits         22891    23086     +195     
+ Misses         179      176       -3     

☔ View full report in Codecov by Sentry.

@medha-14
Contributor Author

medha-14 commented Jan 6, 2025

Could I get a review on this one?

@kratman
Contributor

kratman commented Jan 6, 2025

@medha-14 Sorry, there is a bit of a backlog due to the upcoming release and everyone coming back from vacation. Don't worry, we will review this shortly.

@medha-14
Contributor Author

medha-14 commented Jan 6, 2025

Thank you for the update! I just thought you missed this one.

Member

@agriyakhetarpal agriyakhetarpal left a comment

Thanks! I think min_similarity could be slightly higher here because the threshold here is too low and can lead to false positives in the search. But, at the same time, being able to resolve potential typos in the search query would require a lower threshold. This makes me feel that we could expose it with a sensible default, but I don't yet know what a sensible default would be. It should be higher than the current 40%, though. In-line comment about this below:

@@ -163,14 +185,24 @@ def search(self, keys: str | list[str], print_values: bool = False):
search_keys = [k.strip().lower() for k in keys if k.strip()]

known_keys = list(self.keys())
known_keys.sort()

min_similarity = 0.4
Member

min_similarity is also defined above, so it gets defined twice. How about making it an argument for _find_matches()?

Follow-up question: do you think it would make sense to expose it publicly for users through search() as well?
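
To make the suggestion concrete, here is a rough sketch of what threading the threshold through could look like. It is not the PR's code: `find_matches` and the standalone `search` below are hypothetical stand-ins for `_find_matches()` and the `search()` method, and the substring-plus-ratio filter is an assumption about how the helper decides what to keep.

```python
import difflib


def find_matches(search_key, known_keys, min_similarity=0.4):
    """Hypothetical stand-in for the PR's _find_matches() helper.

    Keeps known keys that contain search_key as a substring and whose
    difflib similarity ratio to search_key is at least min_similarity.
    """
    matches = []
    for key in known_keys:
        key_lower = key.lower()
        if search_key in key_lower:
            ratio = difflib.SequenceMatcher(None, search_key, key_lower).ratio()
            if ratio >= min_similarity:
                matches.append(key)
    return matches


def search(keys, known_keys, min_similarity=0.4):
    """Hypothetical public entry point that forwards min_similarity."""
    if isinstance(keys, str):
        keys = [keys]
    # Same normalisation as in the diff context above
    search_keys = [k.strip().lower() for k in keys if k.strip()]
    ordered_keys = sorted(known_keys)
    return {
        k: find_matches(k, ordered_keys, min_similarity=min_similarity)
        for k in search_keys
    }
```

With this shape the value is defined once, as a default argument, and a user who wants stricter or looser matching can pass `min_similarity` without touching the helper.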

@medha-14
Contributor Author

medha-14 commented Jan 12, 2025

> Thanks! I think min_similarity could be slightly higher here because the threshold here is too low and can lead to false positives in the search. But, at the same time, being able to resolve potential typos in the search query would require a lower threshold. This makes me feel that we could expose it with a sensible default, but I don't yet know what a sensible default would be. It should be higher than the current 40%, though. In-line comment about this below:

Sorry for the delayed response. I think it's important to clarify that min_similarity only applies to substring matches, i.e. when the search_key is found within one of the known keys. Typos and queries with no substring match are handled independently by difflib.get_close_matches(), which uses its own cutoff threshold. Setting min_similarity too high would make it difficult to get even the relevant matches.
E.g., even with a threshold of 0.5, a search for conc would not match concentration, because their similarity ratio is approximately 0.47, which falls below the threshold.
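
For reference, a minimal sketch of the arithmetic above, using only the standard-library difflib. The candidate list and the misspelled query are made up for illustration and are not taken from the actual parameter names; the cutoff of 0.6 is simply difflib's documented default, not necessarily the value used in this PR.

```python
import difflib

# Substring case: ratio = 2 * matched_chars / total_chars = 2 * 4 / (4 + 13)
ratio = difflib.SequenceMatcher(None, "conc", "concentration").ratio()
print(round(ratio, 2))  # 0.47

passes_current = ratio >= 0.4  # passes the current threshold
passes_raised = ratio >= 0.5   # would be rejected at 0.5

# Typo case: no substring match, so get_close_matches() and its own
# cutoff decide, independently of min_similarity.
candidates = ["concentration", "temperature"]  # illustrative keys only
close = difflib.get_close_matches("concentraton", candidates, n=3, cutoff=0.6)
print(close)  # ['concentration']
```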

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve search for improved handling of short or incomplete search terms
4 participants