Handle Full Text Search phrases outside of individual segments. #179

Open
NotJoeMartinez opened this issue Sep 13, 2024 · 1 comment

Comments

@NotJoeMartinez
Owner

The current method of returning keyword matches doesn't work well when the user searches for a phrase that spans separate rows of the transcript, and it provides little context. For example, if the query is foo bar pancake yeet and the rows are ['foo bar ', 'pancake', 'yeet'], the current method returns three separate rows. If we create a dedicated table for the full transcript and combine it with the FTS5 snippet function, we can let the user control the number of words around their search keyword and still provide precise time stamps. We still need a way to find the time stamps of the returned snippet, given a snippet of arbitrary length.
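A rough sketch of the full-transcript idea, assuming Python's sqlite3 module is built with FTS5. The table name full_transcripts, its columns, and the helper snippet_search are hypothetical and only illustrate how the built-in snippet() function could be used; this is not the project's actual schema.

```python
import sqlite3


def snippet_search(transcript, query, max_tokens=8):
    """Return FTS5 snippets of at most max_tokens tokens around each match."""
    con = sqlite3.connect(":memory:")
    # Hypothetical table: one row per video holding the whole transcript.
    con.execute(
        "CREATE VIRTUAL TABLE full_transcripts USING fts5(video_id UNINDEXED, text)"
    )
    con.execute("INSERT INTO full_transcripts VALUES (?, ?)", ("vid1", transcript))
    # snippet(table, column index, open mark, close mark, ellipsis, max tokens)
    rows = con.execute(
        "SELECT snippet(full_transcripts, 1, '[', ']', '...', ?) "
        "FROM full_transcripts WHERE full_transcripts MATCH ?",
        (max_tokens, query),
    ).fetchall()
    con.close()
    return [r[0] for r in rows]


# Join the per-segment rows from the example above into one transcript.
segments = ["foo bar ", "pancake", "yeet"]
transcript = " ".join(s.strip() for s in segments)

# Double quotes make this an FTS5 phrase query that crosses the row boundary.
result = snippet_search(transcript, '"bar pancake"')
print(result)
```

The max_tokens argument is what would let the user control how many words of context surround the match; FTS5 accepts values from 1 to 64.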

NotJoeMartinez added a commit that referenced this issue Sep 13, 2024
Added credit in the changelog for help with #179
@NotJoeMartinez
Owner Author

Proposed solution for finding the time stamps of a snippet, as referenced in #178:

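def find_phrase_indexes(phrase, arr):
    """Return [first_row, last_row] index pairs for each occurrence of phrase."""
    # Flatten the rows into one word list; marks[k] is the row index of word k.
    marks = []
    full_text = []
    for i, row in enumerate(arr):
        # row[2] holds the segment text.
        for word in row[2].strip().split():
            marks.append(i)
            full_text.append(word)

    phrase_words = phrase.split()
    n = len(phrase_words)
    ans = []
    if n == 0:
        return ans

    # Compare the phrase against every window of n consecutive words, so a
    # partial match followed by a mismatch is never counted as a hit.
    for start in range(len(full_text) - n + 1):
        if full_text[start:start + n] == phrase_words:
            ans.append([marks[start], marks[start + n - 1]])
    return ans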
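
The index pairs returned above still have to be turned into time stamps. A minimal sketch of that last step, assuming each row is a (start_time, stop_time, text) tuple; only the text at index 2 comes from the proposal, the other two positions and the helper name spans_to_timestamps are hypothetical.

```python
def spans_to_timestamps(spans, rows):
    """Turn [first_row, last_row] index pairs into (start, stop) time stamps."""
    # Assumed row layout: (start_time, stop_time, text).
    # The phrase starts when its first row starts and ends when its last row stops.
    return [(rows[first][0], rows[last][1]) for first, last in spans]


rows = [
    ("00:00:01.000", "00:00:02.500", "foo bar "),
    ("00:00:02.500", "00:00:03.200", "pancake"),
    ("00:00:03.200", "00:00:04.000", "yeet"),
]

# [0, 2] is the pair find_phrase_indexes gives for "foo bar pancake yeet"
# on these rows: the phrase starts in row 0 and ends in row 2.
stamps = spans_to_timestamps([[0, 2]], rows)
print(stamps)
```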
