Improvement/data cleaning class #70

Open
wants to merge 41 commits into base: master

Conversation

marianelamin
Collaborator

@marianelamin marianelamin commented Apr 22, 2021

Catching up.
The data cleaning class was created 5 months ago to deal with tweets. It offers several methods that can be applied directly to a str or a pd.Series to remove punctuation, hashtags, links, mentions, etc.
More details in issue #35.

Changes in this PR:

  • src/c4v/data/data_sampler.py
    Use the data cleaner utility when sampling the data.
    Apply the Black formatter.
  • src/c4v/data/data_cleaner.py
    Add methods to "clean" texts in various ways (removing links, hashtags, emojis, punctuation, extra white space, tags or mentions, and Spanish accents, plus trimming).
  • tests/data/test_data_cleaner.py
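
For reviewers who want a quick picture of the kind of cleaning described above, here is a minimal sketch using only the standard library. It is not the code in data_cleaner.py: every function name and regular expression below is a hypothetical stand-in chosen for illustration, and emoji removal is left out to keep it short.

```python
import re
import string
import unicodedata

# Hypothetical patterns for illustration; the real data_cleaner.py may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")

# ASCII punctuation plus the Spanish opening marks.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation + "¡¿")


def remove_links(text: str) -> str:
    """Drop http(s) and www links."""
    return URL_RE.sub("", text)


def remove_hashtags(text: str) -> str:
    """Drop hashtags such as '#agua' (symbol and word)."""
    return HASHTAG_RE.sub("", text)


def remove_mentions(text: str) -> str:
    """Drop mentions such as '@usuario' (symbol and handle)."""
    return MENTION_RE.sub("", text)


def remove_accents(text: str) -> str:
    """Strip combining marks, e.g. 'á' -> 'a'. This also turns 'ñ' into 'n'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def remove_punctuation(text: str) -> str:
    """Delete punctuation characters listed in PUNCTUATION_TABLE."""
    return text.translate(PUNCTUATION_TABLE)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of white space into one space and trim both ends."""
    return WHITESPACE_RE.sub(" ", text).strip()


def clean_text(text: str) -> str:
    """Apply every step. Links, hashtags and mentions go first because
    stripping punctuation earlier would break their patterns."""
    steps = (
        remove_links,
        remove_hashtags,
        remove_mentions,
        remove_accents,
        remove_punctuation,
        normalize_whitespace,
    )
    for step in steps:
        text = step(text)
    return text
```

Under these assumptions, the same function could be applied to a pd.Series with `series.map(clean_text)`, which is one way to support both input types mentioned in the description.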

This utility can grow as the needs of the cleaning phase evolve.
Feedback is encouraged!

@marianelamin marianelamin linked an issue Apr 22, 2021 that may be closed by this pull request
12 tasks
@marianelamin marianelamin requested a review from dieko95 April 23, 2021 02:15
@marianelamin marianelamin self-assigned this Apr 23, 2021
@marianelamin marianelamin added the enhancement New feature or request label Apr 23, 2021
@marianelamin marianelamin marked this pull request as ready for review April 23, 2021 02:16
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cleaning data before BPE
1 participant