
Speed improvements for sampling #25

Open
feribg opened this issue Mar 5, 2019 · 2 comments
feribg commented Mar 5, 2019

The code in the book for estimating uniqueness and building the indicator matrix is quite crude and assumes a relatively small number of signals (and/or bars). The slowdown most likely comes from the memory footprint of the dense matrix, which pushes the machine into swap (a dense 800K x 10K float64 matrix alone is ~64 GB). Switching to sparse matrices fixes the problem even for large numbers of signals and bars:

import numpy as np
import scipy.sparse
from scipy.sparse import csr_matrix
from tqdm import tqdm

def getIndMatrixSparse(barIx, t1):
    # Indicator matrix: one row per bar, one column per label/signal
    rows = barIx[(barIx >= t1.index[0]) & (barIx <= t1.max())]
    cols = t1
    indM = csr_matrix((len(rows), len(cols)), dtype=np.float64)
    with tqdm(total=len(cols)) as pbar:
        for i, (entry, exit_) in enumerate(t1.items()):
            # flag every bar between the label's entry and exit as active
            # (assigning into CSR raises a SparseEfficiencyWarning but works)
            offsets = rows.searchsorted([entry, exit_])
            indM[offsets[0]:offsets[1], i] = 1.
            pbar.update(1)
    return indM


def getAvgUniquenessSparse(indM):
    # Average uniqueness per label from the indicator matrix
    c = indM.sum(axis=1)  # concurrency: number of labels active on each bar
    u = csr_matrix(indM.multiply(1/c))  # uniqueness; multiply() returns COO, so cast back to CSR
    filtered = u.multiply(u > 0)
    # sparse workaround for a per-column mean that ignores the 0's,
    # equivalent to df.mean(skipna=True) on the dense frame
    (x, y, z) = scipy.sparse.find(filtered)  # row indices, column indices, nonzero values
    countings = np.bincount(y)               # nonzero entries per column
    sums = np.bincount(y, weights=z)         # sum of uniqueness per column
    averages = sums/countings
    return averages
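
To see what the bincount workaround computes, here is a toy example (a minimal sketch, values made up): for each column it averages only the stored nonzero entries, which is exactly the skipna mean.

import numpy as np
import scipy.sparse

# Toy uniqueness matrix: 3 bars x 2 labels; zeros mean "label not active on that bar"
u = scipy.sparse.csr_matrix([[0.5, 0.0],
                             [0.5, 1.0],
                             [0.0, 1.0]])
x, y, z = scipy.sparse.find(u)  # y = column index of each nonzero, z = its value
print(np.bincount(y, weights=z) / np.bincount(y))  # [0.5 1.0], per-column mean of nonzeros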

I can open a PR, although I can't rerun the notebook, so I'm not sure where and how to add it. I couldn't find a way to divide directly by a csr_matrix; multiplying by 1/c returns a coo_matrix, so it needs another conversion back to CSR before the mean calculation. If someone is better versed in scipy.sparse, I'm happy to improve on it. Right now it does the average-uniqueness calc on an 800K x 10K matrix in about 30 ms on my laptop.
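
For reference, a minimal usage sketch (the bar index and horizons here are made up, just to smoke-test the two functions):

import pandas as pd

barIx = pd.date_range('2019-01-01', periods=1000, freq='min')  # hypothetical bar timestamps
entries = barIx[::10]
t1 = pd.Series(entries + pd.Timedelta(minutes=15), index=entries)  # overlapping 15-bar labels

indM = getIndMatrixSparse(barIx, t1)
avgU = getAvgUniquenessSparse(indM)
print(indM.shape, avgU.mean())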


Ta-nu-ki commented Mar 7, 2019 via email


feribg commented Mar 7, 2019

What's the intuition behind it? Because the docs read:

Disadvantages of the LIL format:
- arithmetic operations LIL + LIL are slow (consider CSR or CSC)
- slow column slicing (consider CSC)

and the vast majority of the code I shared is basically either a column-slicing op or an arithmetic op (in getAvgUniquenessSparse).
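
If the suggestion is about construction speed rather than the math, the usual compromise would be to build in LIL (cheap incremental assignment, no SparseEfficiencyWarning) and convert to CSR once before the arithmetic. An untested sketch:

import numpy as np
from scipy.sparse import lil_matrix

def getIndMatrixLil(barIx, t1):
    # Same logic as getIndMatrixSparse, but LIL is designed for incremental assignment
    rows = barIx[(barIx >= t1.index[0]) & (barIx <= t1.max())]
    indM = lil_matrix((len(rows), len(t1)), dtype=np.float64)
    for i, (entry, exit_) in enumerate(t1.items()):
        offsets = rows.searchsorted([entry, exit_])
        indM[offsets[0]:offsets[1], i] = 1.
    return indM.tocsr()  # hand back CSR so getAvgUniquenessSparse stays fast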
