
Data validation for data transfers #166

Closed
camilolaiton opened this issue Feb 15, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments

@camilolaiton (Contributor)

We noticed that one of the SmartSPIM datasets was not uploaded correctly to AWS from the VAST system when we launched multiple uploads at the same time, and no errors were reported. Therefore, we need a way to verify that the number of tiles/images in each subfolder matches what the collect function returns.

It is important to mention that the datasets contain additional files, so the number of collected files will be the sum of:
n_collected_files = # of tiles in the dataset + # of images in derivatives + # of metadata files
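
A minimal sketch of what such a check could look like, comparing the length of the collected file list against a direct count on disk. The extension set and function names here are illustrative, not the repo's actual API:

```python
from pathlib import Path

# Extensions we expect to upload; adjust to the real include list.
INCLUDE_EXTS = {".png", ".tiff", ".tif", ".json", ".txt"}

def count_expected_files(dataset_dir: str) -> int:
    """Count tiles, derivative images, and metadata files directly on disk."""
    root = Path(dataset_dir)
    return sum(1 for p in root.rglob("*") if p.is_file() and p.suffix in INCLUDE_EXTS)

def validate_collection(dataset_dir: str, collected_filepaths: list) -> None:
    """Fail loudly if the collected file list does not match the on-disk count."""
    expected = count_expected_files(dataset_dir)
    collected = len(collected_filepaths)
    if collected != expected:
        raise ValueError(
            f"Collected {collected} files but found {expected} on disk for {dataset_dir}"
        )
```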

Here are some screenshots of the current tracebacks and the number of collected files for a dataset:

[screenshot 1: traceback]

[screenshot 2: number of collected files]

@miketaormina

Is there a benefit to using os.walk over pathlib.Path.rglob in the above-linked collect_filepaths? Something like
filepaths = [i for i in folder.rglob('*') if i.suffix in include_exts]
The exclude logic should also be straightforward with a check based on Path.relative_to().
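
A rough sketch of how that could look, assuming include_exts is a set of extensions and exclude_dirs is a placeholder list of directory names to skip (not the repo's actual parameter):

```python
from pathlib import Path

def collect_filepaths_rglob(folder, include_exts, exclude_dirs=()):
    """Recursively collect files by extension, skipping excluded directory names."""
    folder = Path(folder)
    filepaths = []
    for p in folder.rglob("*"):
        if not p.is_file() or p.suffix not in include_exts:
            continue
        # Skip files whose path (relative to the root) passes through an excluded directory.
        if any(part in exclude_dirs for part in p.relative_to(folder).parts[:-1]):
            continue
        filepaths.append(p)
    return filepaths

# Example call (hypothetical arguments):
# collect_filepaths_rglob("/path/to/dataset", {".png", ".json"}, exclude_dirs=("MIP",))
```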

@carshadi (Member) commented Feb 16, 2023

The traceback occurs with every job and appears to be a (harmless?) bug related to shutting down the cluster in dask_mpi (dask/dask-mpi#94). If there is an issue with the file collection, that traceback is probably unrelated. FWIW, I have not seen the missing-data issues that @camilolaiton has seen. Camilo, you mentioned that you might have been using an old version of this repository; does it make sense to check whether that is the cause before we start changing things? Are we sure this is a bug? (Although I agree that validation is needed.)

@camilolaiton (Contributor, Author) commented Feb 16, 2023

Yeah, I'm testing that right now with 8 datasets. It is important to mention that this issue has only happened this week (I was uploading 12 datasets, 1 job per dataset); it did not happen before when uploading more than 10 datasets with dask_mpi, so it might have been an isolated case. When I updated the repo, some other files changed but the s3_upload.py script did not. However, I do believe we need to perform the validation just to make sure we're correctly uploading all the tiles (and other files) of a dataset; it is good that this is happening now so we can address it.

@miketaormina

Not sure where you ended up on how necessary this is, but this is how I was checking that an acquisition completed. It simply checks that each tile folder contains the same number of images and returns a list of the folders that differ from the first one. I think it should work on the VAST-formatted ones, as long as you point it to the SmartSPIM folder that contains the channel folders (in that case the "MIP" check is unnecessary):

from pathlib import Path

def verify_n_files(exp_dir: Path):
    """Check that every tile folder contains the same number of images.

    Returns (all_clear, offending_dirs), where offending_dirs lists the tile
    folders whose image count differs from the first tile folder found.
    """
    offending_dirs = []

    # Tile folders live three levels down; skip anything under a "MIP" channel folder.
    all_tiles = [i for i in exp_dir.glob('*/*/*')
                 if i.is_dir() and 'MIP' not in i.parent.parent.name]
    if not all_tiles:
        return (False, offending_dirs)

    # Use the first tile folder as the reference image count.
    n_ims = len(list(all_tiles[0].glob('*png*')))
    for t in all_tiles:
        n = len(list(t.glob('*png*')))
        if n != n_ims:
            offending_dirs.append(t)

    all_clear = len(offending_dirs) == 0
    return (all_clear, offending_dirs)
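
For example, pointed at the folder that contains the channel folders (hypothetical path):

```python
ok, bad_dirs = verify_n_files(Path("/path/to/SmartSPIM_dataset"))
if not ok:
    print("Tile folders with a mismatched image count:")
    for d in bad_dirs:
        print(d)
```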

@camilolaiton (Contributor, Author) commented Mar 17, 2023

In #176 I added a script that validates the SmartSPIM datasets before uploading data. The problem of the missing tiles (for the SmartSPIM at least) had two causes.

The first was that the acquisition software did not output the expected number of tiles in the stacks for some brains, something that did not happen before and that I believe was introduced in the latest updates. The second was that we were going through an update of our VAST system, and on some nodes the HPC did not have the old VAST system path available via the "path hack" that IT established to access the old mount. This was solved by migrating our services to the new VAST system.

Lastly, I believe we need to add validation scripts for all our datasets before uploading @carshadi @dyf @sharmishtaa.

PS: We also added the option to stop the jobs whenever we don't find any files to upload, which addresses a bug we had in an earlier version.
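
As a rough illustration of that guard (not the actual change in #176; the function and parameter names are assumed):

```python
def check_collected_files(collected_filepaths, dataset_dir):
    """Abort the upload job early when the collector finds nothing to upload."""
    if not collected_filepaths:
        # Raising here stops the job instead of silently "uploading" an empty dataset.
        raise FileNotFoundError(f"No files to upload were found in {dataset_dir}")
```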

@camilolaiton (Contributor, Author)

Closing due to inactivity. Feel free to reopen if necessary.
