
Speed improvements for sampling #25

Open
feribg opened this issue Mar 5, 2019 · 2 comments
feribg commented Mar 5, 2019

The code in the book for estimating uniqueness and building the indicator matrix is quite crude and assumes a relatively small number of signals (and/or bars). The slowdown most likely comes from the memory footprint of the dense matrix, which pushes the machine into swap (a dense 800K x 10K float64 matrix alone is ~64 GB). Switching to sparse matrices fixes the problem even for large numbers of signals and bars:

import numpy as np
import scipy.sparse
from scipy.sparse import csr_matrix
from tqdm import tqdm

def getIndMatrixSparse(barIx, t1):
    # Indicator matrix: one row per bar, one column per label/signal
    rows = barIx[(barIx >= t1.index[0]) & (barIx <= t1.max())]
    cols = t1
    indM = csr_matrix((len(rows), len(cols)), dtype=np.float64)
    with tqdm(total=len(cols)) as pbar:
        for i, (entry, exit_) in enumerate(t1.items()):
            # flag every bar between the label's entry and exit as active
            # (assigning into CSR raises a SparseEfficiencyWarning but works)
            offsets = rows.searchsorted([entry, exit_])
            indM[offsets[0]:offsets[1], i] = 1.
            pbar.update(1)
    return indM


def getAvgUniquenessSparse(indM):
    # Average uniqueness per label from the indicator matrix
    c = indM.sum(axis=1)  # concurrency: number of labels active on each bar
    u = csr_matrix(indM.multiply(1/c))  # uniqueness; multiply() returns COO, so cast back to CSR
    filtered = u.multiply(u > 0)
    # sparse workaround for a per-column mean that ignores the 0's,
    # equivalent to df.mean(skipna=True) on the dense frame
    (x, y, z) = scipy.sparse.find(filtered)  # row indices, column indices, nonzero values
    countings = np.bincount(y)               # nonzero entries per column
    sums = np.bincount(y, weights=z)         # sum of uniqueness per column
    averages = sums/countings
    return averages
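
To see what the bincount workaround computes, here is a toy example (a minimal sketch, values made up): for each column it averages only the stored nonzero entries, which is exactly the skipna mean.

import numpy as np
import scipy.sparse

# Toy uniqueness matrix: 3 bars x 2 labels; zeros mean "label not active on that bar"
u = scipy.sparse.csr_matrix([[0.5, 0.0],
                             [0.5, 1.0],
                             [0.0, 1.0]])
x, y, z = scipy.sparse.find(u)  # y = column index of each nonzero, z = its value
print(np.bincount(y, weights=z) / np.bincount(y))  # [0.5 1.0], per-column mean of nonzeros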

I can open a PR, although I can't rerun the notebook, so I'm not sure where and how to add it. I couldn't find a way to divide directly by a csr_matrix; multiplying by 1/c returns a coo_matrix, so it needs another conversion back to CSR before the mean calculation. If someone is better versed in scipy.sparse, I'm happy to improve on it. Right now it does the average-uniqueness calc on an 800K x 10K matrix in about 30 ms on my laptop.
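
For reference, a minimal usage sketch (the bar index and horizons here are made up, just to smoke-test the two functions):

import pandas as pd

barIx = pd.date_range('2019-01-01', periods=1000, freq='min')  # hypothetical bar timestamps
entries = barIx[::10]
t1 = pd.Series(entries + pd.Timedelta(minutes=15), index=entries)  # overlapping 15-bar labels

indM = getIndMatrixSparse(barIx, t1)
avgU = getAvgUniquenessSparse(indM)
print(indM.shape, avgU.mean())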


Ta-nu-ki commented Mar 7, 2019 via email


feribg commented Mar 7, 2019

What's the intuition behind it? Because the docs read:

Disadvantages of the LIL format:
- arithmetic operations LIL + LIL are slow (consider CSR or CSC)
- slow column slicing (consider CSC)

and the vast majority of the code I shared is basically either a column-slicing op or an arithmetic op (in getAvgUniquenessSparse).
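
If the suggestion is about construction speed rather than the math, the usual compromise would be to build in LIL (cheap incremental assignment, no SparseEfficiencyWarning) and convert to CSR once before the arithmetic. An untested sketch:

import numpy as np
from scipy.sparse import lil_matrix

def getIndMatrixLil(barIx, t1):
    # Same logic as getIndMatrixSparse, but LIL is designed for incremental assignment
    rows = barIx[(barIx >= t1.index[0]) & (barIx <= t1.max())]
    indM = lil_matrix((len(rows), len(t1)), dtype=np.float64)
    for i, (entry, exit_) in enumerate(t1.items()):
        offsets = rows.searchsorted([entry, exit_])
        indM[offsets[0]:offsets[1], i] = 1.
    return indM.tocsr()  # hand back CSR so getAvgUniquenessSparse stays fast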
