Problem with loading canFam3ToHg38.over.chain #9
Comments
Hmm. Although canFam3ToHg38 is a large file with 38M blocks, I don't see how those wouldn't fit in 32 GB (I'm also trying to load it on a machine with even more memory and it seems to be stuck). I'll need to try a more decent interval-tree implementation (e.g. something with rebalancing), but I'm not sure when I'll find the time to look into it.
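One common way to cut per-object memory in a pure-Python interval structure is `__slots__`, which removes the per-instance `__dict__`. This is only an illustration of the general idea, not the change actually made in pyliftover; the class names are hypothetical:

```python
class IntervalNode:
    """A plain node: each instance carries a __dict__, costing extra memory."""

    def __init__(self, start, end, data):
        self.start, self.end, self.data = start, end, data


class SlimIntervalNode:
    """Same fields, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("start", "end", "data")

    def __init__(self, start, end, data):
        self.start, self.end, self.data = start, end, data
```

With tens of millions of nodes, the saving of a dict per instance adds up to gigabytes.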
So, apparently, the problem is not in the algorithm but indeed in memory consumption (a 64 GB machine could handle the file). I added a couple of changes that should considerably reduce memory consumption. It still takes forever (several hours) to load, though. Try the newer version and see whether it helps. In principle, once you load the file (if it succeeds), you could try pickling the resulting LiftOver object; I suspect unpickling it would be faster than re-reading and re-indexing everything. There's also a new flag for the …
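The pickle-once, reload-many-times idea suggested above can be sketched as a generic caching helper. `build_fn` stands in for the expensive construction, e.g. `lambda: LiftOver("canFam3ToHg38.over.chain")`; the helper itself is hypothetical, not part of pyliftover:

```python
import os
import pickle


def load_or_build(cache_path, build_fn):
    """Load a previously pickled object, or build it once and pickle it.

    build_fn is any expensive zero-argument constructor; on the first call
    its result is written to cache_path, and later calls just unpickle it.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return obj
```

Note that unpickling still has to deserialize the whole index, so it helps most when the original parsing (not raw I/O) dominates load time.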
Thanks for your answer. I have been thinking about the possibility of a speed-up using something like Cython or Numba. Have you tried that already, or do you have any thoughts on this?
I tried using ncls (https://github.com/hunt-genes/ncls) and numpy. It seems to work. For canFam3ToHg38.over.chain, it takes about 2 minutes and 7.3 GB of memory to load the chain file. If you are interested, the modified file is at https://drive.google.com/open?id=1OA65_8EUrk9zZi-iQqMKYQASEAN27mA1.
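The core reason an array-based approach like ncls/numpy beats a pure-Python tree here is that chain blocks on a chromosome are sorted and non-overlapping, so a lookup reduces to binary search over flat arrays. A minimal sketch of that idea (toy coordinates, not the actual ncls API or the modified file above):

```python
import numpy as np

# Toy chain blocks on one chromosome, sorted and non-overlapping:
# source positions starts[i]..ends[i] map to target block tstarts[i].
starts = np.array([100, 500, 1200])
ends = np.array([300, 900, 1500])
tstarts = np.array([2100, 2700, 4000])


def map_position(pos):
    """Return the lifted target position, or None if pos is between blocks."""
    i = np.searchsorted(starts, pos, side="right") - 1
    if i >= 0 and pos < ends[i]:
        return int(tstarts[i] + (pos - starts[i]))
    return None
```

Three contiguous numpy arrays cost a few bytes per block instead of a Python object per node, which is where the memory drop from 32+ GB to ~7 GB plausibly comes from.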
No, I haven't tried either Numba or Cython, because the initial aim was to have a super-simple pure-Python tool (also, Numba did not even exist when this was first written, I believe). And it does work fine for the "common" use cases (i.e. hg18-to-hg37, etc.); that canFam3 mapping is more of an exception in terms of the amount of re-mapping. Things like Numba/Cython/Psyco/PyPy are in general reasonable directions to look for speed-ups, though I suspect getting a considerable effect might not be straightforward. Also, I'd prefer to have a tool that does not require compilation or heavy dependencies. I tried running the current version on a Linux server with 32 GB of memory, and the results are as follows:
Thus, if you just need to get things done, for now I'd suggest trying a Linux machine (for some reason it runs faster than on Windows; also note that Python 3.7 seems to be about 1.5-2x the speed of 2.7) and, in case you need to load the file multiple times, pre-pickling it.
(I somehow missed your last comment before posting mine; I saw the email notification just now.) Nice, I'll check the ncls option and will probably replace my data structure with it if it is faster.
I have a Windows system with 32 GB of RAM, and LiftOver("canFam3ToHg38.over.chain") consumed all of the memory and did not finish. Running cProfile.run on the same command with a portion of the chain file showed that most of the time was spent in the add_interval function and another related function. Is there a way to work around this?
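For anyone wanting to reproduce this kind of measurement, the standard-library profiling pattern looks like the sketch below. `hot_function` is a stand-in for the expensive chain-file indexing; substitute the actual `LiftOver(...)` call on a truncated chain file:

```python
import cProfile
import io
import pstats


def hot_function():
    # Stand-in for the expensive chain-file indexing work.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
hot_function()
profiler.disable()

# Print the top entries sorted by cumulative time spent.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by `"cumulative"` surfaces the call that dominates overall load time, which is how add_interval showed up in the report above.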