Data validation for data transfers #166
Comments
Is there a benefit to using …
The traceback occurs with every job and appears to be a (harmless?) bug related to shutting down the cluster in dask_mpi (dask/dask-mpi#94). If there is an issue with the file collection, that traceback is probably unrelated. FWIW, I have not seen the issues with missing data that @camilolaiton has seen. Camilo, you mentioned that you might have been using an old version of this repository? Does it make sense to check whether that is the issue before we start changing things? Are we sure this is a bug? (Although I agree that validation is needed.)
Yeah, I'm testing that right now with 8 datasets. It is important to mention that this issue has only happened this week (I was uploading 12 datasets, 1 job per dataset), not before, when I uploaded more than 10 datasets with dask-mpi (might it have been an isolated case?). I updated the repo and some other files were updated, but not the …
Not sure where you ended up on how necessary this is, but this is how I was checking that an acquisition completed. It just checks that each tile folder contains the same number of images and returns a list of those that differ from the first one. I think it should work on the VAST-formatted ones, as long as you point it to the …
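For reference, a minimal sketch of a check along those lines might look like the following. The folder layout, the `.tiff` extension, and the function name are assumptions for illustration, not the script referenced in the comment above.

```python
from pathlib import Path


def find_incomplete_tiles(acquisition_dir: str, ext: str = ".tiff") -> list[Path]:
    """Return tile folders whose image count differs from the first folder's count.

    Assumes each immediate subdirectory of `acquisition_dir` is one tile folder.
    """
    tile_dirs = sorted(p for p in Path(acquisition_dir).iterdir() if p.is_dir())
    if not tile_dirs:
        return []
    # Count images per tile folder, then use the first folder as the reference.
    counts = {d: sum(1 for f in d.iterdir() if f.suffix == ext) for d in tile_dirs}
    expected = counts[tile_dirs[0]]
    return [d for d, n in counts.items() if n != expected]


if __name__ == "__main__":
    mismatched = find_incomplete_tiles("/path/to/acquisition")  # hypothetical path
    if mismatched:
        print("Tile folders with unexpected image counts:", mismatched)
```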
In #176 I added a script that validates the SmartSPIM datasets before uploading data. The problem of the missing tiles (for the SmartSPIM at least) had two causes. The first was that the acquisition software did not output the expected number of tiles in the stacks for some brains, something that had not happened before and that I believe was introduced in the latest updates. The second was that we were in the middle of an update to our VAST system, and some HPC nodes did not have the old VAST path available through the "path hack" that IT established to access the old mount. This was already solved by migrating our services to the new VAST system. Lastly, I believe we need to add validation scripts for all our datasets before uploading @carshadi @dyf @sharmishtaa. P.S.: We also added the option to stop the jobs whenever we don't find any files to upload, which fixes a bug we had in an earlier version.
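As a rough illustration of that kind of guard (the function names and layout here are hypothetical, not the code added in #176), the upload job could abort before submitting anything when the collection step comes back empty:

```python
from pathlib import Path


def collect_files(dataset_path: str) -> list[Path]:
    # Placeholder collection step: recursively list every regular file in the dataset.
    return [p for p in Path(dataset_path).rglob("*") if p.is_file()]


def run_upload(dataset_path: str) -> None:
    files = collect_files(dataset_path)
    if not files:
        # Stop the job instead of launching an empty transfer.
        raise RuntimeError(f"No files found under {dataset_path}; aborting upload.")
    print(f"Would upload {len(files)} files from {dataset_path}")
```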
Closing due to inactivity. Feel free to reopen if necessary. |
We noticed that one of the SmartSPIM datasets was not correctly uploaded to AWS from the VAST system when we launched multiple uploads at the same time, and no errors were reported. Therefore, we need a way to verify that the number of tiles/images in each subfolder matches what the collect function returns.
It is important to mention that there are additional files inside the datasets, so the number of collected files will be the sum of:
n_collected_files = (# of tiles in the dataset) + (# of images in derivatives) + (# of metadata files)
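A hedged sketch of that cross-check is below. The directory names (`SmartSPIM`, `derivatives`) and the idea that metadata files sit at the top level of the dataset are assumptions about the layout, and `n_collected_files` would come from the actual collect function.

```python
from pathlib import Path


def count_files(folder: Path) -> int:
    """Count regular files under a folder, recursively (0 if the folder is missing)."""
    return sum(1 for p in folder.rglob("*") if p.is_file())


def validate_collection(dataset_dir: str, n_collected_files: int) -> None:
    """Compare the collected file count against the on-disk breakdown above."""
    root = Path(dataset_dir)
    n_tiles = count_files(root / "SmartSPIM")          # assumed tile location
    n_derivatives = count_files(root / "derivatives")  # assumed derivatives location
    n_metadata = sum(1 for p in root.glob("*") if p.is_file())  # top-level metadata files
    expected = n_tiles + n_derivatives + n_metadata
    if n_collected_files != expected:
        raise ValueError(
            f"Collected {n_collected_files} files but expected {expected} "
            f"(tiles={n_tiles}, derivatives={n_derivatives}, metadata={n_metadata})"
        )
```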
Here are some screenshots of the current tracebacks and the # of collected files for a dataset: