
Data validation for data transfers #166

Closed
camilolaiton opened this issue Feb 15, 2023 · 6 comments
Labels: bug (Something isn't working)

Comments

@camilolaiton (Contributor)

We noticed that one of the SmartSPIM datasets was not uploaded correctly to AWS from the VAST system when we launched multiple uploads at the same time, and no errors were reported. Therefore, we need a way to verify that the number of tiles/images in each subfolder matches what the collect function returns.

It is important to mention that the datasets contain additional files, so the number of collected files will be the sum of:
n_collected_files = # of tiles in the dataset + # of images in derivatives + # of metadata files
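
A minimal sketch of what such a check could look like, comparing the length of the collected file list against a direct count on disk. The extension set and function names here are illustrative, not the repo's actual API:

```python
from pathlib import Path

# Extensions we expect to upload; adjust to the real include list.
INCLUDE_EXTS = {".png", ".tiff", ".tif", ".json", ".txt"}

def count_expected_files(dataset_dir: str) -> int:
    """Count tiles, derivative images, and metadata files directly on disk."""
    root = Path(dataset_dir)
    return sum(1 for p in root.rglob("*") if p.is_file() and p.suffix in INCLUDE_EXTS)

def validate_collection(dataset_dir: str, collected_filepaths: list) -> None:
    """Fail loudly if the collected file list does not match the on-disk count."""
    expected = count_expected_files(dataset_dir)
    collected = len(collected_filepaths)
    if collected != expected:
        raise ValueError(
            f"Collected {collected} files but found {expected} on disk for {dataset_dir}"
        )
```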

Here are some screenshots of the current tracebacks and the number of collected files for a dataset:

[screenshot 1: traceback]

[screenshot 2: number of collected files]

@miketaormina

Is there a benefit to using os.walk over pathlib.Path.rglob in the above-linked collect_filepaths? Something like
filepaths = [i for i in folder.rglob('*') if i.suffix in include_exts]
The exclude logic should also be straightforward with a check based on Path.relative_to().
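
A rough sketch of how that could look, assuming include_exts is a set of extensions and exclude_dirs is a placeholder list of directory names to skip (not the repo's actual parameter):

```python
from pathlib import Path

def collect_filepaths_rglob(folder, include_exts, exclude_dirs=()):
    """Recursively collect files by extension, skipping excluded directory names."""
    folder = Path(folder)
    filepaths = []
    for p in folder.rglob("*"):
        if not p.is_file() or p.suffix not in include_exts:
            continue
        # Skip files whose path (relative to the root) passes through an excluded directory.
        if any(part in exclude_dirs for part in p.relative_to(folder).parts[:-1]):
            continue
        filepaths.append(p)
    return filepaths

# Example call (hypothetical arguments):
# collect_filepaths_rglob("/path/to/dataset", {".png", ".json"}, exclude_dirs=("MIP",))
```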

@carshadi (Member) commented Feb 16, 2023

The traceback occurs with every job and appears to be a (harmless?) bug related to shutting down the cluster in dask_mpi (dask/dask-mpi#94). If there is an issue with the file collection, that traceback is probably unrelated. FWIW, I have not seen the missing-data issues that @camilolaiton has seen. Camilo, you mentioned that you might have been using an old version of this repository; does it make sense to check whether that is the cause before we start changing things? Are we sure this is a bug? (Although I agree that validation is needed.)

@camilolaiton (Contributor, Author) commented Feb 16, 2023

Yeah, I'm testing that right now with 8 datasets. It is important to mention that this issue has only happened this week (I was uploading 12 datasets, 1 job per dataset); it did not happen before when uploading more than 10 datasets with dask_mpi, so it might have been an isolated case. When I updated the repo, some other files changed but the s3_upload.py script did not. However, I do believe we need to perform the validation just to make sure we're correctly uploading all the tiles (and other files) of a dataset; it is good that this is happening now so we can address it.

@miketaormina

Not sure where you ended up on how necessary this is, but this is how I was checking that an acquisition completed. It simply checks that each tile folder contains the same number of images and returns a list of the folders that differ from the first one. I think it should work on the VAST-formatted ones, as long as you point it to the SmartSPIM folder that contains the channel folders (in that case the "MIP" check is unnecessary):

from pathlib import Path

def verify_n_files(exp_dir: Path):
    """Check that every tile folder contains the same number of images.

    Returns (all_clear, offending_dirs), where offending_dirs lists the tile
    folders whose image count differs from the first tile folder found.
    """
    offending_dirs = []

    # Tile folders live three levels down; skip anything under a "MIP" channel folder.
    all_tiles = [i for i in exp_dir.glob('*/*/*')
                 if i.is_dir() and 'MIP' not in i.parent.parent.name]
    if not all_tiles:
        return (False, offending_dirs)

    # Use the first tile folder as the reference image count.
    n_ims = len(list(all_tiles[0].glob('*png*')))
    for t in all_tiles:
        n = len(list(t.glob('*png*')))
        if n != n_ims:
            offending_dirs.append(t)

    all_clear = len(offending_dirs) == 0
    return (all_clear, offending_dirs)
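
For example, pointed at the folder that contains the channel folders (hypothetical path):

```python
ok, bad_dirs = verify_n_files(Path("/path/to/SmartSPIM_dataset"))
if not ok:
    print("Tile folders with a mismatched image count:")
    for d in bad_dirs:
        print(d)
```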

@camilolaiton (Contributor, Author) commented Mar 17, 2023

In #176 I added a script that validates the SmartSPIM datasets before uploading data. The problem of the missing tiles (for the SmartSPIM at least) had two causes.

The first was that the acquisition software did not output the expected number of tiles in the stacks for some brains, something that did not happen before and that I believe was introduced in the latest updates. The second was that we were going through an update of our VAST system, and on some nodes the HPC did not have the old VAST system path available via the "path hack" that IT established to access the old mount. This was solved by migrating our services to the new VAST system.

Lastly, I believe we need to add validation scripts for all our datasets before uploading @carshadi @dyf @sharmishtaa.

PS: We also added the option to stop the jobs whenever we don't find any files to upload, which addresses a bug we had in an earlier version.
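
As a rough illustration of that guard (not the actual change in #176; the function and parameter names are assumed):

```python
def check_collected_files(collected_filepaths, dataset_dir):
    """Abort the upload job early when the collector finds nothing to upload."""
    if not collected_filepaths:
        # Raising here stops the job instead of silently "uploading" an empty dataset.
        raise FileNotFoundError(f"No files to upload were found in {dataset_dir}")
```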

@camilolaiton (Contributor, Author)

Closing due to inactivity. Feel free to reopen if necessary.
