Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace deprecated 'punkt' with 'punkt_tab' #83

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

emielsteerneman
Copy link

Problem

The NLTK package punkt has been deprecated, resulting in an error when calling a BM25Tokenizer

image

Solution

Replace punkt with the new punkt_tab

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

This might be a breaking change. NLTK 3.8.1 and lower use punkt whereas NLTK 3.8.2 and above will use punkt_tab. The pyproject.toml file references nltk = "^3.6.5", meaning it will install NLTK 3.8.2 if possible, thus breaking. Introducing this breaking change on a patch version is something that the NLTK maintainers not should have done, but alas.

Another fix would be to freeze the NLTK version.

Test Plan

I tried it locally and it fixed my issue.

Sorry, something went wrong.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@emielsteerneman
Copy link
Author

#81

@emielsteerneman
Copy link
Author

The maintainers suggest upgrading NLTK to at least version 3.9.1

@isaaccs
Copy link

isaaccs commented Sep 3, 2024

if someone doesn't upgrade their version of nltk, their pinecone will break.
I think you should either keep the punk package and look for either punk or punk_tab, or force the nltk version to be at least 3.9 to prevent the code from breaking.

@emielsteerneman
Copy link
Author

@acatav

@isaaccs
Copy link

isaaccs commented Sep 17, 2024

@izellevy

@guidev
Copy link

guidev commented Sep 24, 2024

Any update on this? A library like Pinecone should avoid introducing security risks, especially considering the issue mentioned here: nltk/nltk#3266 (comment). It would be best to fully migrate to punkt_tab and enforce a minimum NLTK version to prevent breakage and vulnerabilities.

@RubenCata
Copy link

@igiloh-pinecone

@vinod-canihelp
Copy link

Just ran into this error on production. Is there a reason why the fix is not in yet? Any workarounds while this is getting delayed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants