feat: implement a new `validate` command #220
This base is still missing some features. Still not implemented as a CLI switch. Refer to #47.
Supports many more features. Simpler codebase. Straight to the point. Logs many more errors. Still in development: missing some features.
Missing cleanup and checking against the genotype file.
Further tasks: testing, optimizing, bug-catching.
mad props, @ayimany! This is a very well written PR. The code is super clean and easy to follow. I'm excited to try it out and will let you know once I finish reviewing. Thanks again for doing this! This will tremendously help many users of haptools.
…ools into impl-validate-command
since we may want the validate command to validate other kinds of files besides hap files in the future
Great work! Thanks for taking this project on, @ayimany. Your code is super thorough and well-considered.
I left some suggestions. Most of them are small things related to the logging or the tests.
In addition to the suggested changes, it might also be a good idea to write docstrings for all functions and classes. You can look at the `sim_phenotype.py` module for examples of how to do this. We follow the conventions of numpydoc outlined here:
https://numpydoc.readthedocs.io/en/latest/format.html#documenting-classes
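For reference, a numpydoc-style docstring might look like the sketch below. The function `count_version_declarations` is hypothetical and not part of this PR, and the tab-separated `# version` line layout is an assumption about the `.hap` header; only the docstring layout is the point.

```python
def count_version_declarations(lines):
    """
    Count the version declarations found in the lines of a .hap file.

    Hypothetical helper, shown only to illustrate the docstring layout.

    Parameters
    ----------
    lines : list[str]
        The raw lines of the file, in order, without trailing newlines.

    Returns
    -------
    int
        The number of lines whose first two tab-separated fields are
        "#" and "version" (an assumed layout for the version line).
    """
    return sum(1 for line in lines if line.split("\t")[:2] == ["#", "version"])
```

The same section headers (`Parameters`, `Returns`, and, for classes, `Attributes`) apply to the classes in this module.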
self.errc += 1

for i in range(
    HapFileValidator.KEY_HAPLOTYPE, HapFileValidator.KEY_VARIANT + 1
If you add a global variable that defines the line types (as I suggested in the comment above this one), can you also use the length of the variable instead of `KEY_VARIANT` here? That way, it'll be flexible if we add more line types.
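Something along these lines is what I have in mind. This is only a sketch: `LINE_TYPES` and `bucket_lines` are made-up names, and the real validator stores its data differently.

```python
# Hypothetical module-level constant listing the recognized line types.
LINE_TYPES = ("H", "R", "V")


def bucket_lines(lines):
    """Group tab-separated lines by their type, one bucket per entry in LINE_TYPES."""
    # The number of buckets follows the constant, so adding a new line
    # type only requires extending LINE_TYPES.
    buckets = [[] for _ in range(len(LINE_TYPES))]
    for line in lines:
        key = line.split("\t", 1)[0]
        if key in LINE_TYPES:
            buckets[LINE_TYPES.index(key)].append(line)
    return buckets
```

With this, the loop bounds never mention a specific line type, so nothing else has to change when a type is added.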
    self, var_ids: list[str], underscores_to_semicolons: bool = False
):
    ids: set[tuple[str, Line]] = set()
    for chrom, dt in self.vrids.items():
I forgot to tell you! In addition to checking whether 'Variant' IDs are in the PVAR file, we should also check that 'Repeat' IDs are in there.
We should probably create another parameter (called `--repeat-pvar`) to allow the user to specify a PVAR file for repeats.
Also, we might need to verify that the files aren't file descriptors? Just in case someone is trying to use process substitution?
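One possible way to do the file-descriptor check, as a rough sketch: process substitution usually hands the program a path such as `/dev/fd/63` that resolves to a pipe, so testing for a regular file with the standard library would catch it. The helper name `is_regular_file` is hypothetical.

```python
import stat
from pathlib import Path


def is_regular_file(path: Path) -> bool:
    """Return True only if the path resolves to a regular file.

    Pipes (as produced by process substitution), sockets, devices,
    directories and missing paths all yield False.
    """
    try:
        mode = path.stat().st_mode  # follows symlinks
    except OSError:
        return False
    return stat.S_ISREG(mode)
```

Whether a pipe should be rejected outright or only trigger a warning is a separate decision; the sketch just detects the case.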
…ools into impl-validate-command
Resolves #47
Overview
This branch introduces a new sub-command that validates the structure of `.hap` files.
Usage and Documentation
It requires no new dependencies.
It can be invoked through:
haptools validate [--sorted/--not-sorted] [--genotypes <filename.pgen>] [--verbosity] <filename.hap>
Implementation Details
The code in this implementation is concentrated in the `haptools/val_hapfile.py` module.
The classes and functions that make up this module are the following:
HapFileIO (Class)
This class consists of a set of methods made to validate the existence and readability of the provided `.hap` file. It also cleans up the file prior to reading its content.
It requires a filename and an optional logger. The filename should be a `.hap` file.
The main method that should be used to verify if the file exists is `HapFileIO::validate_existence`. Note that `click` arguments already check for existence as well.
The method that should be used to filter and read the content within the `.hap` file is `HapFileIO::lines`. It takes in a filename as a `Path` and a boolean `sorted` to determine if the file should be sorted. It will produce a `Line` object per line.
Sorting actually occurs whether `sorted` is `True` or not. What `sorted` does is remove unnecessary lines based on their positions which would otherwise be necessary if sorted.
Line (Class)
Stores information about a line, such as its number and content, for future use.
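To illustrate the idea of carrying a line's number alongside its content, here is a minimal sketch. `LineSketch` and `number_lines` are hypothetical stand-ins and do not reproduce the fields or behavior of the actual `Line` class; skipping blank lines is an assumption about the cleanup step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineSketch:
    """Simplified stand-in for a line record: original position plus text."""

    number: int   # 1-based line number in the original file
    content: str  # text of the line without the trailing newline


def number_lines(text):
    """Pair each non-blank line with its original line number."""
    return [
        LineSketch(number, raw)
        for number, raw in enumerate(text.splitlines(), start=1)
        if raw.strip()
    ]
```

Keeping the original number means error messages can still point at the right place in the file after blank lines are dropped.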
HapFileValidator (Class)
The validator will have to use a `HapFileIO` as its main source for reading the file's content. This is done through a method and not the constructor (which only accepts a logger).
To load the data into the validator, use `HapFileValidator::extract_and_store_content`, which takes in the `HapFileIO` to be used and an optional boolean to determine whether the file should be sorted or not.
All of the following methods in the class test for different aspects of the `.hap` file.
validate_version_declarations
Determines if the version declaration is present, repeated or invalid
validate_column_additions
Determines if the extra column declarations are correctly formatted and well-formed. If so, they are added to the list of registered extra columns
validate_columns_fulfill_minreqs
Validates if all columns fulfill the minimum requirements
validate_haplotypes
Validates the haplotype row format
validate_repeats
Validates the repeat row format
validate_variants
Validates the variant row format
store_ids
Stores the IDs of each haplotype, repeat and variant for future use. Should be called before any ID validation methods
validate_variant_ids
Validates the ID presence of each variant. Each needs to be unique per haplotype and not collide with chromosome IDs
validate_extra_fields
Makes sure that the added extra fields conform to their addition signature
reorder_extra_fields
Parses the order[H|R|V] lines and reorders the extra fields if they are valid
compare_haps_to_pvar
Compares the variants in the `.hap` file to those in the `.pvar` file
is_hapfile_valid (Function)
Performs all of the possible checks available in the `HapFileValidator` class. Returns a boolean which is `True` when there are no errors or warnings
Tests
Only one case hasn't been fully tested, due to OS limitations and the way they handle their file permissions: validating whether the user has enough permissions to read the `.hap` file.
test_generated_haplotypes
Tests the dummy `.hap` generated by the `haptools` test suite
test_with_empty_lines
Tests a `.hap` with empty lines
test_with_out_of_header_metas_sorted
Tests a sorted `.hap` with meta lines out of the header
test_with_out_of_header_metas_unsorted
Tests an unsorted `.hap` with meta lines out of the header
test_with_10_extras_reordered
Tests a `.hap` file with 10 extra columns
test_with_unexistent_reorders
Tests a `.hap` with an `order[H|R|V]` which mentions a non-existent extra column
test_with_unexistent_fields
Tests a `.hap` with a data line that is not an `H`, `R` or `V`
test_with_inadequate_version
Tests a `.hap` with an incorrectly formatted version
test_with_no_version
Tests a `.hap` with no present version
test_with_multiple_versions
Tests a `.hap` with several versions present
test_with_inadequate_version_columns
Tests a `.hap` with a version column of only 2 fields
test_with_invalid_column_addition_column_count
Tests a `.hap` with an extra column declaration of invalid column count
test_with_invalid_column_addition_types
Tests a `.hap` with a column addition for a type which is not `H`, `R` or `V`
test_with_invalid_column_addition_data_types
Tests a `.hap` with a column addition of unrecognized data type (not `s`, `d` or `.nf`)
test_with_insufficient_columns
Tests a `.hap` with insufficient mandatory columns
test_with_inconvertible_starts
Tests a `.hap` with start positions that can't be converted to integers
test_with_inconvertible_ends
Tests a `.hap` with end positions that can't be converted to integers
test_with_inconvertible_starts_var
Tests a `.hap` with start positions that can't be converted to integers in variants
test_with_inconvertible_ends_var
Tests a `.hap` with end positions that can't be converted to integers in variants
test_valhap_with_start_after_end
Tests a `.hap` with the start position placed after the end position
test_is_directory
Tests a validation command with a filename that points to a directory
test_with_variant_id_of_chromosome
Tests a `.hap` with a variant whose ID is the same as a chromosome ID
test_with_hrid_of_chromosome
Tests a `.hap` with a haplotype or repeat with the same ID as a chromosome
test_with_unexistent_col_in_order
Tests a `.hap` with an `order[H|R|V]` field that references a non-existent extra column name
test_with_unassociated_haplotype
Tests a `.hap` with a haplotype that does not have at least one matching repeat
test_with_unrecognizable_allele
Tests a `.hap` with a variant whose allele is not `G`, `C`, `T` or `A`
test_with_duplicate_ids
Tests a `.hap` with duplicate IDs for `H` and `R` fields
test_with_duplicate_vids_per_haplotype
Tests a `.hap` with duplicate IDs for variants with the same haplotype association
test_with_excol_of_wrong_type
Tests a `.hap` with a data line which contains an extra column of `d` data type but receives `s`
test_with_multiple_order_defs
Tests a `.hap` with multiple `order[H|R|V]` of the same type
test_with_insufficient_excols_in_reorder
Tests a `.hap` with an `order[H|R|V]` that does not reference all extra columns
test_with_variant_inexistent_haplotype_id
Tests a `.hap` with a variant that references a non-existent haplotype
test_with_missing_variant_in_pvar
Tests a `.hap` along with a `.pvar` file which is missing an ID present in the `.hap`
test_unreadable_hapfile
Passes a non-existent file to the validator
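As an illustration of the kind of condition that `test_valhap_with_start_after_end` exercises, here is a self-contained sketch. It is not the code in `val_hapfile.py`: the function `find_start_after_end` is hypothetical, and it assumes tab-separated `H`, `R` and `V` lines with the start and end positions in the third and fourth columns.

```python
def find_start_after_end(lines):
    """Return (line_number, content) pairs whose start position exceeds the end.

    Assumes tab-separated H, R and V lines with start and end in the
    third and fourth columns; other lines are ignored.
    """
    offending = []
    for number, content in enumerate(lines, start=1):
        fields = content.split("\t")
        if fields[0] not in ("H", "R", "V") or len(fields) < 4:
            continue
        try:
            start, end = int(fields[2]), int(fields[3])
        except ValueError:
            # Positions that are not integers belong to a different
            # error class and are skipped here.
            continue
        if start > end:
            offending.append((number, content))
    return offending
```

A test in this style would build a small list of lines with one bad haplotype and assert that only that line is reported.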
Future work
It would be wise to document the code further.
Optimizing this command would also help, although we should first evaluate how frequently it will be used and how large the input files usually are, in order to determine how severe the issue is.
Checklist