unicodecsv is kind of slow; but maybe unavoidable? #46

Open
NelsonMinar opened this issue Feb 26, 2015 · 5 comments
Comments

@NelsonMinar

Thank you so much for unicodecsv, it's been a big help for me in Python 2. Not to sound ungrateful, but...

unicodecsv seems fairly slow. Some benchmarking suggests it's about 5-6x slower than the plain Py2 csv module. Of course it's doing more work, decoding bytes to strings! But for comparison the Py3 csv module (which does decoding) is only 2-3x slower than Py2. Is there room for improvement in unicodecsv?

I did some profiling and code reading and didn't see any obvious way unicodecsv could be made faster. So maybe there's no real way to optimize it. But I wanted to file the issue both to document what I learned and to get a second opinion.

My benchmark code and results are at https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/
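For readers who want to try a comparison of this kind themselves, here is a minimal sketch. It is not the benchmark from the blog post linked above. It assumes Python 3 and only contrasts parsing already-decoded text with parsing bytes that have to be decoded first. The helper names (`make_csv_text`, `count_rows_text`, `count_rows_bytes`) are made up for illustration.

```python
import csv
import io
import timeit


def make_csv_text(n_rows=1000, n_cols=5):
    """Build a small in-memory CSV document containing non-ASCII text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for i in range(n_rows):
        writer.writerow(["caf\u00e9 %d-%d" % (i, j) for j in range(n_cols)])
    return buf.getvalue()


def count_rows_text(text):
    """Parse text that is already decoded; no decoding cost is paid here."""
    return sum(1 for _ in csv.reader(io.StringIO(text)))


def count_rows_bytes(data, encoding="utf-8"):
    """Parse raw bytes, decoding them on the fly while reading."""
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return sum(1 for _ in csv.reader(wrapper))


if __name__ == "__main__":
    text = make_csv_text()
    data = text.encode("utf-8")
    cases = [
        ("pre-decoded text", count_rows_text, text),
        ("bytes + decode", count_rows_bytes, data),
    ]
    for label, fn, arg in cases:
        seconds = timeit.timeit(lambda: fn(arg), number=20)
        print("%-18s %.4f s" % (label, seconds))
```

Timings will vary by machine, Python version, and input, so treat any ratio this prints as a rough indication only.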

@kengruven

I don't see the file you were using, but for the 1M line CSV file I was playing with today, I found that the isinstance() calls in UnicodeReader#next were taking around 50% of the runtime. And unless the dialect requests QUOTE_NONNUMERIC, that check is never going to match anything.

I've submitted a pull request, /pull/47, which avoids the isinstance() call in that case. It's still about 3x slower than the built-in (ASCII) 'csv' module, but it's significantly faster than before.
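The idea can be sketched roughly as follows. This is not the actual code from the pull request, just an illustration under some assumptions: rows arrive as lists of byte strings (as Python 2's csv module would yield them), with floats mixed in only when the dialect uses QUOTE_NONNUMERIC, since that is the mode in which csv converts unquoted fields to float. The function names are hypothetical.

```python
import csv


def decode_rows_checked(rows, encoding="utf-8", errors="strict"):
    """Slower path: test every field for float before decoding it."""
    for row in rows:
        yield [
            value if isinstance(value, float) else value.decode(encoding, errors)
            for value in row
        ]


def decode_rows_fast(rows, encoding="utf-8", errors="strict"):
    """Faster path: no isinstance() call, every field is assumed to be bytes."""
    for row in rows:
        yield [value.decode(encoding, errors) for value in row]


def make_decoder(quoting):
    """Pick the decoder once per reader instead of checking once per field."""
    if quoting == csv.QUOTE_NONNUMERIC:
        return decode_rows_checked
    return decode_rows_fast
```

Choosing the decoder once, based on the dialect's quoting mode, moves the type check out of the per-field inner loop for the common case where floats cannot appear.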

@NelsonMinar
Author

I noticed a fair amount of time spent in isinstance too but assumed it was unavoidable. Sounds like your code is a good improvement if it works!

I spent some time looking at the speed of Python Unicode decoding and am more confused than ever as to exactly what's going on with the larger speed issue. https://nelsonslog.wordpress.com/2015/02/26/python-file-reading-benchmarks/

@jdunck
Owner

jdunck commented Mar 3, 2015

@NelsonMinar thanks for the detailed benchmarking. I'll leave this open as a reminder to do other optimization work, but I've merged #47.

@jdunck
Owner

jdunck commented Mar 11, 2015

I've just released 0.11.0, which includes changes in #47.

@jdunck jdunck closed this as completed Mar 11, 2015
@jdunck jdunck reopened this Mar 11, 2015
@NelsonMinar
Author

Nice, thanks for the update! I just tested it and it makes my benchmark run in 70-80% of the time it used to. Very nice improvement for a simple change. Detailed timings: https://nelsonslog.wordpress.com/2015/03/11/unicodecsv-0-11-0-speed-improvement/
