Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: a validate subcommand to check whether a .hap file is valid #47

Open
28 of 33 tasks
aryarm opened this issue Jun 10, 2022 · 3 comments · May be fixed by #220
Open
28 of 33 tasks

feat: a validate subcommand to check whether a .hap file is valid #47

aryarm opened this issue Jun 10, 2022 · 3 comments · May be fixed by #220
Labels
enhancement New feature or request

Comments

@aryarm
Copy link
Member

aryarm commented Jun 10, 2022

Important: Please familiarize yourself with the .hap file specification before reading this issue!


Originating from item 2 in the "Future Work" section of PR #43:

It would be useful to have a validate command that simply validates the .hap file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.

At first, this command should reject unsorted .hap files, but at some point we should also add a --no-sort parameter to support unsorted files, since those are also technically valid input.

For each violation of the standard, it would be nice if the validate subcommand reported the exact line that contains the issue, and ideally, it would quote the problematic part of the line, as well.

We should probably also add an optional argument that specifies the subcommand that this .hap file will be used as input for. That way, we can import its custom Haplotype class and acquire the expected extra field types from there.

Here are some rules it should check the .hap file follows:

  • are the line types all supported?
  • do H and R lines have at least four fields and do V lines have at least 5?
  • are there any extra fields in a line besides those that have been declared?
  • sorted lines (with the option to disable this check when the --no-sort flag is set)
  • can each of the values in the file be properly cast to the expected type?
    • for example, can the "Start Position" and "End Position" fields be cast to integers?
    • and can the values in each extra field be properly cast to the expected type?
  • is the start position less than the end position for each of the H, R, and V lines?
    • and for each H line: Is its start position $\le$ all of the start positions of all of the V lines that belong to it?
    • and for each H line: Is its end position $\ge$ all of the end positions of all of the V lines that belong to it?
  • are there any haplotypes with IDs that are the same as some chromosome names?
    • this is not allowed in our format b/c otherwise it would break the bgzip and tabix indexing
  • do the variant alleles contain only As, Cs, Gs, and Ts?
  • do the variant IDs match those in the genotypes file? (we should check this quickly without loading the genotypes if at all possible - potentially using pysam.tabix_iterator?)
    • We could make this an optional check by having it only happen when a --genotypes parameter specifying the path to the genotypes file is present.
    • If pysam.tabix_iterator works for this, then we should consider adding read_variants() and read_samples() methods to the GenotypesVCF class.
    • we should also check that the requested allele is present in the genotypes file
  • are the haplotype IDs unique? An H line can never have the same ID as an R line, but an H (or R) line can have the same ID as a V line
  • are the variant IDs within each haplotype unique?
  • are any fields empty (ie do they evaluate to the empty string)?
  • are any lines empty or completely blank?
  • do any of the variant lines refer to a haplotype that simply doesn't appear within the file?
  • are all haplotypes associated with at least one variant line?

And here are some rules for the header of the .hap file:

  • is there a version declared in the header of the file? (this is considered a best practice)
    • also, is the version string in a valid format?
    • and is the version up to date? (we can check this by using Haplotypes.check_version())
  • are all metadata names recognized? to check this, we should check whether there are any lines with a #, followed by a tab, followed by a recognized metadata name: currently, "version", "orderH", "orderV", and "orderR"
  • for "orderH", "orderV", and "orderR" metadata, are all of the extra fields declared there also declared in the header later on?
  • unless --no-sort is specified, do the metadata lines appear before the extra field declarations?
    • this is considered a best practice
  • extra fields are properly declared
    • is the declared type a valid python format specification? at the moment, we just support 's', 'd', and 'f'
    • do the extra lines in the header have all of the required fields?
    • is the order of the extra fields properly declared within a metadata line? (this is considered a best practice)
  • are there any extra field declarations for unrecognized line types (ie ones that aren't H, R, or V)?
    • these are technically ok, but it is considered best practice to exclude them
    • basically, are there any lines with a # followed immediately by a symbol other than "H", "R", or "V"?
@aryarm aryarm changed the title create a validate subcommand to check whether a .hap file is valid feat: a validate subcommand to check whether a .hap file is valid Jun 10, 2022
@aryarm
Copy link
Member Author

aryarm commented Jun 17, 2022

we could also use this subcommand to validate other kinds of files like .map or .dat files, @mlamkin7

In that case, maybe the best way to structure this code would be to create a Validator class with methods inside of it for each of the file formats? Or a Validator abstract class that gets implemented for each of the file formats?

Update: After some discussion, we decided not to use this to validate other kinds of files, after all.

@aryarm aryarm added the enhancement New feature or request label Jun 17, 2022
@aryarm
Copy link
Member Author

aryarm commented Sep 6, 2022

probably the best way to start on this PR would be to create a validate.py module with a single validate_haps() function. The validate_haps() function can simply make a call to a new Haplotypes.validate() method that we create.

Inside that function, you can iterate over the .hap file line by line and verify that everything looks correct. As an example of how to do this, you can take a look at what I did in the Haplotypes.__iter__() method:

with hook_compressed(self.fname, mode="rt") as haps:
self.log.info("Not taking advantage of indexing.")
header_lines = []
for line in haps:
line = line.rstrip("\n")
line_type = self._line_type(line)
if line[0] == "#":
# store header for later
try:
header_lines.append(line)
except AttributeError:
# this happens when we encounter a line beginning with a #
# after already having seen a valid line type (like H or V)
# These are usually just comment lines, so we can ignore it
pass
else:
if header_lines:
metas, extras = self.check_header(header_lines)
types = self._get_field_types(extras, metas.get("order"))
header_lines = None
self.log.info("Finished reading header.")
if line_type == "H":

@aryarm
Copy link
Member Author

aryarm commented Jul 12, 2023

also note:
we could create an option to the .hap file to allow whitespace (ie consecutive spaces) to be converted to tabs (field delimiters) or allow for another option that can parse whitespace as a field delimiter when reading the hap file

that way, we can account for situations where the user might have a text editor that automatically inserts spaces when the tab key is pressed

update: sed 's/^[ \t]*//;s/[ \t]*$//;s/[[:blank:]]\{2,\}/\t/g' can also be used to remove trailing/leading whitespace and convert consecutive whitespace to a tab

ayimany pushed a commit that referenced this issue Jul 14, 2023
This base is still missing some features
Still not implemented as a cli switch
Refer to #47
@ayimany ayimany linked a pull request Jul 27, 2023 that will close this issue
3 tasks
@ayimany ayimany linked a pull request Jul 27, 2023 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant