scanning query strings with offsets #12
base: master
Conversation
…t possible to search for prefixes at a given offset in the query, thereby avoiding having to create a string slice for every single character when scanning strings
Hi Florian, I unfortunately can't merge this pull request as-is, for the following reasons:
Hi Mikhail,

Thanks for your comments and for taking the time! To reply to them: Sorry, I was not aware of the Py2.x issues - I have stopped using Py2.x.

Regarding the use case and the latter two points, the whole change is about getting maximum speed/efficiency. If you want to scan a long string (say, a book or an article) for the contents of the FSA, you have to create a slice of the string ("string[offset:]") at every char you want to scan from. To avoid this, I wanted to create a method where you always pass in the same string, but scan for a match at a given offset. Also, I actually only need to scan for matches at token boundaries, not at every possible character, so I use your DAFSA to scan for matches only at very specific offsets, like so (sketched more fully below):

```python
def YieldLongestMatches(dafsa, input_string):
```

However, as you also noticed, my current implementation leads to the repeated encoding of the same string over and over again to UTF-8 - indeed still very inefficient. Given how UTF-8 works (single-byte characters have a leading 0 bit, continuation bytes of a multi-byte sequence start with 10, and lead bytes with 110, 1110, and so forth), the principal problem you outline does not occur, however: only the whole string gets encoded/decoded, and only a correctly encodable prefix should be able to match, so the possible problem you mention should never arise.

Therefore, overall, I need to make a few more changes to my request - namely, to accept "bytes" in the method signature, so the query is encoded to UTF-8 only once.
Is that OK with you, too? I hope you understand my issue, and the enormous speedup this seemingly minor change brings when scanning large text collections. But if you do not like it, no hard feelings :). Let me know if you are still interested in me contributing this change, and if so, whether this change (using "bytes" in the method signature) would work in Py2.x! Cheers,
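A minimal sketch of the YieldLongestMatches scanner referenced above, assuming whitespace token boundaries and that a DAWG-style `prefixes()` returns matches shortest-first (both are assumptions, not part of the original message):

```python
def YieldLongestMatches(dafsa, input_string):
    """Yield (offset, longest_match) pairs at token boundaries only."""
    offset, length = 0, len(input_string)
    while offset < length:
        # Without an offset parameter, every probe needs a fresh
        # slice - the per-position copy this change tries to avoid.
        matches = dafsa.prefixes(input_string[offset:])
        if matches:
            yield offset, matches[-1]  # longest match, if shortest-first
        # advance to the next token boundary (here: past the next space)
        next_space = input_string.find(' ', offset)
        offset = length if next_space == -1 else next_space + 1
```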
Hi Florian,
Thanks for the explanation, your use case is clearer now. Just an idea: I wonder if it is possible to bring memoryviews to the API instead of "bytes + offset". They look like a more standard/pythonic way to take slices without memory copies. You'll then use memoryview slices as keys. Also, what about creating a longest_prefix method? It could be much faster than prefixes because it probably won't have to construct all the keys. I implemented this method for datrie (see https://github.com/kmike/datrie/blob/master/src/datrie.pyx#L291 ), but was too lazy (= waiting for someone else) to implement it for DAWG.
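A standalone illustration of the memoryview idea; the `dafsa.prefixes(...)` call in the trailing comment assumes a hypothetical memoryview-aware API:

```python
text = "the quick brown fox"
data = text.encode('utf8')   # encode the whole input once
view = memoryview(data)

# Slicing a memoryview yields a new view over the same buffer;
# unlike data[4:], no bytes are copied.
suffix = view[4:]
assert suffix.tobytes() == b"quick brown fox"

# A memoryview-aware prefixes() could then be called as:
# matches = dafsa.prefixes(view[offset:])
```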
I see. I think that if we go this way (not memoryviews), it's worth mentioning in the source code (as a comment).
I wonder if it is possible to have a single "prefixes" (and/or "longest_prefix") method that supports both unicode and memoryviews and does a dispatch based on argument type (I think it should return unicode regardless of argument type). I think that a simple C-level type check shouldn't cause slowdowns (it didn't for …).
Python 2.6+ has a "bytes" type which is an alias for "str", so I think this will work without issues.
Thanks for working on this pull request and not just making changes that work for your particular problem! I'm very interested in library improvements, and I think your use case is valid. I'm not sure about the "bytes + offset" API because it looks like a workaround for Python's copy behavior; if we take it further, all C/C++ extensions will eventually need to add "start" and "stop" arguments to every method accepting bytes just to avoid Python-level slices, and that looks inelegant and wrong. But if we can't make memoryviews work, or if they turn out to be inefficient (say, 2x slower for your problem), I'm fine with "bytes + offset".
Hi Mikhail,

Thanks for your replies! First, to warn you in time: my wife is about to give birth, so my replies may become slow.

Regarding the issue of memoryviews, I must say I do miss a Java-like …

Your suggestion of adding a longest-prefix method is probably a significant improvement. I am less convinced we always should return unicode/str (2.x/3.x), though. So I propose to add the following two methods to conform with the way …:

`rawPrefixes(bytes[, int[, int]]) -> [ bytes ]`

Then, we might consider updating the current prefixes method accordingly:

`prefixes(str[, int[, int]]) -> [ str ]`

I could call the longest-prefix-only methods "rawLongestPrefix" and "longestPrefix".

Cheers,
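A sketch of how that proposed pair might look; the bodies, the offset semantics, and the `_byte_prefixes` helper are assumptions for illustration:

```python
class DAWG:
    def rawPrefixes(self, key, start=0, end=None):
        """key is UTF-8 bytes; start/end are byte offsets; returns bytes."""
        if end is None:
            end = len(key)
        # _byte_prefixes is a hypothetical internal lookup over bytes
        return list(self._byte_prefixes(key, start, end))

    def prefixes(self, key, start=0, end=None):
        """key is str; start/end are char offsets; returns str."""
        # For str the offsets are character offsets, so the slice
        # happens before encoding (one copy, made once per query).
        raw = self.rawPrefixes(key[start:end].encode('utf8'))
        return [p.decode('utf8') for p in raw]
```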
Sorry, I forgot about your wish for having a single, unified API. So:

`prefixes(seq:object[, start:int[, end:int]]) -> object`

I'd move the current "prefixes" to "_prefixes" and add the new dispatching method under the old name. And then all that once more for a new "prefix" (or "longestPrefix", if you prefer).
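A minimal sketch of that dispatch; the `_prefixes`/`_raw_prefixes` names are hypothetical, and returning the argument's own type is one reading of the `-> object` signature:

```python
def prefixes(self, seq, start=0, end=None):
    # Route on the argument type; results come back in the same
    # type as the input (str in, str out; bytes in, bytes out).
    if isinstance(seq, str):
        return self._prefixes(seq, start, end)      # existing str path
    if isinstance(seq, (bytes, memoryview)):
        return self._raw_prefixes(seq, start, end)  # new bytes path
    raise TypeError("expected str, bytes or memoryview")
```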
Hi Florian, Congratulations! Say hello to the new Leitner from me :)
I was talking about http://docs.python.org/dev/library/stdtypes.html#memory-views. Cython has built-in support for memoryviews (see http://docs.cython.org/src/userguide/memoryviews.html ), including a simple syntax. The idea is to wrap the whole input_string (encoded to bytes) into a memoryview and do the slicing in Python (in your code), and to add support for memoryviews as keys (and maybe binary keys as well) to the library, so it won't need an 'offset' parameter. The slices have a 'tobytes' method, so if these slices are needed in Python land, it is possible to use them as bytes.

I don't think you need a new type unless you want to slice "str" directly (instead of "bytes"). Copyless "str" slicing can be more complex than "bytes" slicing because there is nothing in the stdlib to help, and it is not possible to compute the byte offset for a given char offset in variable-width encodings without iterating over the string.

Good point about startswith; now I'm not opposed to start/end arguments. But for str they should work on char offsets, and this could be harder to implement. For byte keys they could also be useful (and easier to implement), but I don't know what is better, start/end or memoryviews. I actually like start/end for simplicity (and they'll work in Python 2.6, unlike memoryviews).
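A standalone example of why char offsets and byte offsets diverge under UTF-8 (illustration only, not library code):

```python
s = "naïve text"
b = s.encode('utf8')

# 'ï' is one character but two UTF-8 bytes, so the two offset
# systems drift apart after it:
assert s[5:] == " text"                   # char offset 5: after "naïve"
assert b[5:].decode('utf8') == "e text"   # byte offset 5: still inside it
```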
Cool!
I think of this in a slightly different way: most methods should always return results of the same type regardless of their arguments; returning results of different types will lead to problems and hard-to-debug errors (the exception is generic functions like "max"). So by this logic, if we add support for "bytes" as keys, we should still return unicode, or we should create another function that always returns bytes. Also, the encoding is currently done by the library, and it assumes utf8 in various places, so there is no need to force the user to always decode results from utf8 (as e.g. you do in the example). When would the user possibly need values of prefixes/keys/etc. as bytes, given that the result is text data encoded to utf8? Another point is that decoding from utf8 is faster from Cython, because Cython optimizes it by using the C API utf8-decoding function instead of a method call on the "bytes" object.

That said, I now like your suggestion of raw_prefixes, which would accept bytes (and maybe memoryviews) as input and provide start/end arguments. But should it return bytes? When would this feature be useful?

A unified "prefixes" method with start/end arguments looks like a bad idea, because start/end would mean different things for bytes and unicode (byte offset vs char offset). If we go with start/end (not memoryviews), I'd prefer adding a raw_prefixes method and leaving prefixes as-is. The start/end API is not an issue for memoryviews, so with memoryviews we can have a single "prefixes" method that supports unicode, bytes and memoryviews as keys.
I'd prefer longer names, "longest_prefix" and "raw_longest_prefix" :) Based on the method name, I'd expect a "prefix" method to raise an exception if several prefixes are found; the purpose of "longest_prefix" is clear. The API of DAWG was modelled after PyTrie (https://bitbucket.org/gsakkis/pytrie/src/1bbd6dec97df4a8f3ae0f0cf6b91586c1556e932/pytrie.py?at=default ), and I think it is better to continue following it, because some other packages are also trying to follow it (all my other similar packages: datrie, marisa-trie, hat-trie).
Added an `offset` kw-parameter to `dawg.prefixes()` (and `.iterprefixes()`) to make it possible to iteratively scan a string for prefixes without having to take a slice for each prefix query. I.e., before vs. new behaviour, sketched below:
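A minimal sketch of the before/after contrast, assuming a `dawg` object and that the new argument is passed as `offset=`:

```python
query = "some long input text"
offset = 5

# before: a fresh slice (and thus a copy) for every scan position
matches = dawg.prefixes(query[offset:])

# new behaviour: pass the offset instead; no slice, no copy
matches = dawg.prefixes(query, offset=offset)
```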