Merge branch 'V0.1.6' into main
ziadbkh committed Aug 14, 2024
2 parents 5681397 + 2284b47 commit fef1bda
Showing 11 changed files with 290 additions and 246 deletions.
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "mgikit"
version = "0.1.5"
version = "0.1.6"
edition = "2021"
authors = ["Ziad Al Bkhetan <[email protected]>"]
repository = "https://github.com/sagc-bioinformatics/mgikit"
Binary file removed bins/mgikit-V0.1.5.zip
Binary file not shown.
366 changes: 186 additions & 180 deletions docs/pages/demultiplex.md

Large diffs are not rendered by default.

30 changes: 27 additions & 3 deletions docs/pages/mgikit-multiqc.md
@@ -7,17 +7,18 @@ type: guides
---

## mgikit Reports

The demultiplex command generates multiple reports with file names that start with the flowcell and lane being demultiplexed.
A MultiQC HTML report can be generated from these reports using the [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin, as described in the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).

1. `flowcell.L0*.mgikit.info`

This report contains the number of reads per sample respectively to each possible mismatch. It has (2 + allowed mismatches during demultiplexing) columns.
This report contains the number of reads per sample for each possible number of mismatches.
For example:

| **sample** | **0-mismatches** | **1-mismatches** |
|:------------:|:------------------:|:------------------:|
| S01 | 3404 | 5655 |
| :--------: | :--------------: | :--------------: |
| S01 | 3404 | 5655 |

This means that there was only one mismatch allowed during this execution and the sample S01 has 3404 reads with indexes matching perfectly and 5655 reads with indexes that differ by 1 base compared to the indexes provided in the sample sheet.
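
The interpretation above can be checked with a few lines of code. The following is a minimal sketch, not part of mgikit: it assumes each data row of the `.info` report is tab-separated (sample ID followed by one count per mismatch level), and the names `InfoRow`, `parse_info_row`, and `perfect_fraction` are hypothetical.

```rust
/// One row of the info report: a sample ID followed by read counts,
/// where counts[k] is the number of reads with k index mismatches.
#[derive(Debug, PartialEq)]
struct InfoRow {
    sample: String,
    counts: Vec<u64>,
}

/// Parse a row such as "S01\t3404\t5655".
/// The tab delimiter is an assumption; adjust it to the real file layout.
fn parse_info_row(line: &str) -> Option<InfoRow> {
    let mut fields = line.trim().split('\t');
    let sample = fields.next()?.to_string();
    if sample.is_empty() {
        return None;
    }
    let counts = fields
        .map(|f| f.trim().parse::<u64>())
        .collect::<Result<Vec<_>, _>>()
        .ok()?;
    if counts.is_empty() {
        return None;
    }
    Some(InfoRow { sample, counts })
}

/// Fraction of a sample's reads whose indexes matched perfectly (0 mismatches).
fn perfect_fraction(row: &InfoRow) -> Option<f64> {
    let total: u64 = row.counts.iter().sum();
    if total == 0 {
        None
    } else {
        Some(row.counts[0] as f64 / total as f64)
    }
}
```

For the example row, `perfect_fraction` divides 3404 by the 9059 reads assigned to S01, so a little over a third of the reads matched perfectly.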

@@ -27,8 +28,31 @@ This file is used for the mgikit plugin to visualise quality control reports thr

This file contains summary information related to the cluster count and quality scores, summarised for each sample as well as at the whole lane scale. This file is used for the mgikit plugin to visualise quality control reports through MultiQC.

**Report content:**

- **Lane statistics columns**

1. `Run ID-Lane`: run ID and lane number.
2. `Mb Total Yield`: total number of bases, in millions.
3. `M Total Clusters`: total number of reads, in millions.
4. `% bases ≥ Q30`: percentage of all bases with a quality score of 30 or higher.
5. `Mean Quality`: the average quality score of the bases.
6. `% Perfect Index`: percentage of all reads with perfectly matching indices.

- **Sample general info**

1. `Sample ID`: sample ID taken from the sample sheet.
2. `M Clusters`: total number of reads, in millions.
3. `Mb Yield ≥ Q30`: total number of bases with a quality score of 30 or higher, in millions.
4. `% R1 Yield ≥ Q30`: percentage of bases with a quality score of 30 or higher, calculated only for forward reads.
5. `% R2 Yield ≥ Q30`: percentage of bases with a quality score of 30 or higher, calculated only for reverse reads.
6. `% R3 Yield ≥ Q30`: percentage of bases with a quality score of 30 or higher, calculated only for indices.
7. `% Perfect Index`: percentage of all reads with perfectly matching indices.
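
As a rough illustration of how the Q30 percentage and mean quality columns can be derived, here is a minimal sketch. It is not mgikit's implementation: it assumes Phred+33 encoded quality strings and counts a base as Q30 when its score is 30 or higher, and the function name `quality_summary` is hypothetical.

```rust
/// Summarise FASTQ quality strings.
/// Returns (percentage of bases with score >= 30, mean quality score),
/// or None when there are no bases.
/// Assumption: qualities are Phred+33 encoded ASCII.
fn quality_summary(quality_lines: &[&str]) -> Option<(f64, f64)> {
    let mut bases: u64 = 0;
    let mut q30_bases: u64 = 0;
    let mut score_sum: u64 = 0;
    for line in quality_lines {
        for byte in line.bytes() {
            // Convert the ASCII character to a Phred score.
            let score = byte.saturating_sub(33) as u64;
            bases += 1;
            score_sum += score;
            if score >= 30 {
                q30_bases += 1;
            }
        }
    }
    if bases == 0 {
        return None;
    }
    let pct_q30 = 100.0 * q30_bases as f64 / bases as f64;
    let mean_quality = score_sum as f64 / bases as f64;
    Some((pct_q30, mean_quality))
}
```

Restricting the input to the quality strings of forward reads, reverse reads, or indices gives per-read-type values analogous to the R1, R2, and R3 columns.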

3. `flowcell.L0*.mgikit.sample_stats`

This file contains the information in the above-mentioned reports but in a simpler format. It is used to merge the reports from multiple lanes into one report for the whole run.

4. `flowcell.L0*.mgikit.undetermined_barcode.complete`

This report contains the undetermined barcodes including their frequency.
71 changes: 34 additions & 37 deletions docs/pages/reformat.md
@@ -14,69 +14,67 @@ This command should be used for each sample separately (either paired-end or sin

## Command arguments

+ **`-f or --read1`**: the path to the forward reads fastq file for both paired-end and single-end input data.
- **`-f or --read1`**: the path to the forward reads fastq file for both paired-end and single-end input data.

+ **`-r or --read2`**: the path to the reverse reads fastq file.
- **`-r or --read2`**: the path to the reverse reads fastq file.

+ **`-i or --input`**: the path to the directory that contains the input fastq files.
- **`-i or --input`**: the path to the directory that contains the input fastq files.

Either `-i` or `-f/-r`, `-f` should be provided for a run.
{% include callout.html type="note" content="For each run, provide either `-i`, or `-f` alone (single-end), or `-f` and `-r` together (paired-end)." %}

+ **`-o or --output`**: The path the output directory.
- **`-o or --output`**: The path to the output directory.

The tool will create the directory if it does not exist
or overwrite the content if the directory exists and the parameter `--force` is used. The tool will exit
with an error if the directory exists, and `--force` is not used. If this parameter is not provided, the tools
will create a directory (in the working directory) with a name based on the date and time
of the run as follows `mgiKit_Y-m-dTHMS`. where `Y`, `m`, `d`, `H`, `M`, and `S` are the date and time format.
The tool will create the directory if it does not exist
or overwrite the content if the directory exists and the parameter `--force` is used. The tool will exit
with an error if the directory exists and `--force` is not used. If this parameter is not provided, the tool
will create a directory (in the working directory) with a name based on the date and time
of the run, as follows: `mgiKit_Y-m-dTHMS`, where `Y`, `m`, `d`, `H`, `M`, and `S` are the date and time components.

+ **`--reports`**: The path of the output reports directory.
- **`--reports`**: The path of the output reports directory.

By default, the tool writes the files of the run reports in the same output directory as the
demultiplexed fastq files (`-o` or `--output` parameter). This parameter is used to write the reports in
a different folder as specified with this parameter.
By default, the tool writes the files of the run reports in the same output directory as the

+ **`--lane`**: Lane number such as `L01`.
demultiplexed fastq files (`-o` or `--output` parameter). This parameter is used to write the reports in
a different folder as specified with this parameter.

This parameter is used to provide the lane number when the parameter `-i` or `--input` is not
provided. The lane number is used for QC reports and it is mandatory when Illumina format is
requested for file naming.
- **`--lane`**: Lane number such as `L01`.

+ **`--instrument`**: The id of the sequncing machine.
This parameter is used to provide the lane number when the parameter `-i` or `--input` is not

This parameter is used to provide the instrument id when the parameter `-i` or `--input`
is not provided. The parameter is mandatory when Illumina format is requested for read header and
file naming.
provided. The lane number is used for QC reports and it is mandatory when Illumina format is requested for file naming.

+ **`--run`**: The run id. It is taken from Bioinf.csv as the date and time of starting the run.
- **`--instrument`**: The id of the sequencing machine.

This parameter is used to provide the run id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.
This parameter is used to provide the instrument id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.

+ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data then written once to the disk. Smaller buffers will need less memory but makes the tool slower. Largeer buffers need more memory.
- **`--run`**: The run id. It is taken from Bioinf.csv as the date and time of starting the run.

+ **`--compression-level`**: The level of compression (between 0 and 12). 0 is fast but no compression, 12 is slow but high compression. [default: 1]
This parameter is used to provide the run id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.

+ **`--force`**: this flag is to force the run and overwrite the existing output directory if exists.
- **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample, which is filled with data and then written to the disk in one operation. Smaller buffers need less memory but make the tool slower. Larger buffers need more memory.

+ **`--flexible`**: By default, the tool will calculate the length of the first read and its all parts and use this information in the analysis for a quicker determination of the read boundaries. `--flexible` option, will make the tool determine the read boundaries based on the `new line` character (`\n`).
- **`--compression-level`**: The level of compression (between 0 and 12). 0 is fast but no compression, 12 is slow but high compression. [default: 1]

+ **`--info-file`**: The name of the info file that contains the run information. Only needed when using the `--input` parameter. [default: BioInfo.csv]
- **`--force`**: This flag forces the run and overwrites the existing output directory if it exists.

+ **`--disable-illumina`**: reads will be left as is and only quality reports will be generated.
- **`--flexible`**: By default, the tool will calculate the length of the first read and all its parts and use this information in the analysis for a quicker determination of the read boundaries. The `--flexible` option makes the tool determine the read boundaries based on the `new line` character (`\n`).

+ **`--umi-length`**: The length of UMI expected at the end of the read (r1 for single-end, or r2 for paired-end) [Default: 0].
- **`--info-file`**: The name of the info file that contains the run information. Only needed when using the `--input` parameter. [default: BioInfo.csv]

+ **`--report-level`**: The level of reporting. 0 no reports will be generated, 1 data quality and demultiplexing reports. 2: all reports (reports on data quality, demultiplexing, undetermined and ambiguous barcodes).[default: 2]
- **`--disable-illumina`**: reads will be left as is and only quality reports will be generated.

+ **`--sample-index`**: The index of the sample in the sample sheet. It is required for file naming. [default: 1]
- **`--umi-length`**: The length of UMI expected at the end of the read (r1 for single-end, or r2 for paired-end) [Default: 0].

- **`--report-level`**: The level of reporting. 0: no reports will be generated. 1: data quality and demultiplexing reports. 2: all reports (reports on data quality, demultiplexing, undetermined and ambiguous barcodes). [default: 2]

- **`--sample-index`**: The index of the sample in the sample sheet. It is required for file naming. [default: 1]

- **`--barcode`**: The barcode of the specific sample to calculate the mismatches for the reports. If not provided, no mismatches will be calculated.

+ **`--barcode`**: The barcode of the specific sample to calculate the mismatches for the reports. If not provided, no mismatches will be calculated.
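
The interaction between `-o`/`--output` and `--force` described in the argument list can be summarised as a small decision table. This is only a sketch of the documented behaviour, not the tool's own code; the enum `OutputAction` and the function `decide_output_action` are hypothetical names.

```rust
/// What to do with the output directory, following the documented rules.
#[derive(Debug, PartialEq)]
enum OutputAction {
    /// The directory does not exist yet, so it is created.
    Create,
    /// The directory exists and `--force` was given, so its content is overwritten.
    Overwrite,
    /// The directory exists and `--force` was not given, so the run stops with an error.
    Abort,
}

/// Map (directory exists?, `--force` given?) to the documented outcome.
fn decide_output_action(dir_exists: bool, force: bool) -> OutputAction {
    match (dir_exists, force) {
        (false, _) => OutputAction::Create,
        (true, true) => OutputAction::Overwrite,
        (true, false) => OutputAction::Abort,
    }
}
```

In a real program, `dir_exists` would typically come from `std::path::Path::exists` on the requested output path.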

## Usage Examples

**1. Demultiplexing a run with dual indexes (i7 and i5)**


```bash
target/release/mgikit reformat \
-f testing_data/input/extras_test/FC01_L01_sample1_1.fq.gz \
Expand All @@ -85,4 +83,3 @@ target/release/mgikit reformat \
--sample-index 1 \
--info-file testing_data/input/extras_test/BioInfo.csv
```

7 changes: 2 additions & 5 deletions docs/pages/report.md
@@ -15,11 +15,8 @@ if the run has multiple lanes, there will be lane-specific reports. The reports

## Command arguments

+ **`--qc-report`**: The path to the QC report, you can add multiple paths by reusing the same parameter. For example, `--qc-report file1 --qc-report file2`. This argument takes multiple values and is mandatory. The tool expects here the reports generated for each lane in the run and you also can combine the reports generated from multiple runs for the same samples.

- **`--qc-report`**: The path to the QC report. You can add multiple paths by reusing the same parameter, for example `--qc-report file1 --qc-report file2`. This argument takes multiple values and is mandatory. The tool expects the reports generated for each lane in the run; you can also combine the reports generated from multiple runs for the same samples.

+ **`-o or --output`**: The path and prefix of output files. The tools will create two files at the same path with the same prefix and end with `.info` and `.general`.
- **`-o or --output`**: The path and prefix of the output files. The tool will create two files at the same path with the same prefix, ending with `.info` and `.general`.
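
Conceptually, combining per-lane reports means aggregating per-sample values across the supplied files. The sketch below illustrates this for read counts only. It assumes counts are simply summed per sample ID, which is an assumption rather than a description of how mgikit merges reports, and `merge_lane_counts` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Sum per-sample read counts over several lanes.
/// Each lane is a list of (sample ID, read count) pairs.
/// Assumption: merging is a plain per-sample sum.
fn merge_lane_counts(lanes: &[Vec<(&str, u64)>]) -> BTreeMap<String, u64> {
    let mut merged = BTreeMap::new();
    for lane in lanes {
        for (sample, reads) in lane {
            // Samples missing from a lane simply contribute nothing.
            *merged.entry(sample.to_string()).or_insert(0) += *reads;
        }
    }
    merged
}
```

A `BTreeMap` keeps the merged samples sorted by ID, which makes the combined output deterministic.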

## Usage Examples


27 changes: 13 additions & 14 deletions docs/pages/template.md
@@ -6,33 +6,32 @@ toc: true
type: guides
---

This command is used to detect the location and form of the indexes within the read barcode. It simply goes through a small number of the reads and investigates the number of matches with the indexes in the sample sheet within each possible location in the read barcode and considers the indexes as is and their reverse complementary.
This command is used to detect the location and form of the indexes within the read barcode. It goes through a small number of reads and counts the matches with the indexes in the sample sheet at each possible location in the read barcode, considering both the indexes as given and their reverse complements.

It reports matches for all possible combinations and uses the read template that had the maximum number of matches. This process happens for each sample individually and therefore, the best matching template for each sample will be reported.
It reports matches for all possible combinations and uses the read template that had the maximum number of matches. This process happens for each sample individually and therefore, the best matching template for each sample will be reported.

Using this comprehensive scan, the tool can detect the templates for mixed libraries.
Using this comprehensive scan, the tool can detect the templates for mixed libraries.
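
The scan described above can be sketched for a single index as follows. This is a simplified illustration, not the tool's implementation: it uses exact matching only, handles one index at a time, and assumes ASCII `A`/`C`/`G`/`T` sequences. The names `reverse_complement` and `best_index_placement` are hypothetical.

```rust
/// Reverse complement of a DNA sequence; any other character becomes 'N'.
fn reverse_complement(seq: &str) -> String {
    seq.bytes()
        .rev()
        .map(|b| match b {
            b'A' => 'T',
            b'C' => 'G',
            b'G' => 'C',
            b'T' => 'A',
            _ => 'N',
        })
        .collect()
}

/// Try every offset in the read barcodes and both orientations of the index,
/// and return (offset, is_reverse_complement, matching reads) for the
/// placement with the most exact matches, or None if nothing matches.
fn best_index_placement(barcodes: &[&str], index: &str) -> Option<(usize, bool, usize)> {
    let rc = reverse_complement(index);
    let len = index.len();
    let max_len = barcodes.iter().map(|b| b.len()).max()?;
    if len == 0 || len > max_len {
        return None;
    }
    let mut best: Option<(usize, bool, usize)> = None;
    for offset in 0..=(max_len - len) {
        for (is_rc, form) in [(false, index), (true, rc.as_str())] {
            // Count the reads whose barcode carries this form at this offset.
            let hits = barcodes
                .iter()
                .filter(|b| b.get(offset..offset + len) == Some(form))
                .count();
            if hits > 0 && best.map_or(true, |(_, _, h)| hits > h) {
                best = Some((offset, is_rc, hits));
            }
        }
    }
    best
}
```

Running this per sample and keeping the best placement for each one mirrors the per-sample reporting described above.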

## Parameters

**Fastq input file**

+ **`-f or --read1`**: the path to the forward reads fastq file for both paired-end and single-end input data.
- **`-f or --read1`**: the path to the forward reads fastq file for both paired-end and single-end input data.

+ **`-r or --read2`**: the path to the reverse reads fastq file.
- **`-r or --read2`**: the path to the reverse reads fastq file.

+ **`-s or --sample-sheet`**: the path to the sample sheet file.
- **`-s or --sample-sheet`**: the path to the sample sheet file.

This is the same format as above, but only sample_id and i7 are required. i5 is required for dual indexes data.
This is the same format as above, but only sample_id and i7 are required. i5 is required for dual indexes data.

+ **`-o or --output`**: The path and prefix of output files. The tools will create two files at the same path with the same prefix and end with `_template.tsv` and `_details.tsv`.
- **`-o or --output`**: The path and prefix of the output files. The tool will create two files at the same path with the same prefix, ending with `_template.tsv` and `_details.tsv`.

+ **`--testing-reads`**: The number of reads to be investigated to check and detect the templates. The default is 5,000 reads. A Larger number increases the performance time.
- **`--testing-reads`**: The number of reads to be investigated to check and detect the templates. The default is 5,000 reads. A larger number increases the run time.

+ **`--barcode-length`**: The length of the read barcode at the end of the read2 in paired-end or read1 in single end to be investigated. By default, the barcode length is set to be the length difference between read2 and read1.
- **`--barcode-length`**: The length of the read barcode to be investigated at the end of read2 (paired-end) or read1 (single-end). By default, the barcode length is set to the length difference between read2 and read1.

+ **`--no-umi`**: If the barcode contains extra base pairs other than the indexes, the tool considers the longest as an umi. If this parameter is enabled, the tool will ignore all extra base pairs in the barcode and trim them from the read.
- **`--no-umi`**: If the barcode contains extra base pairs other than the indexes, the tool considers the longest stretch of them to be a UMI. If this parameter is enabled, the tool will ignore all extra base pairs in the barcode and trim them from the read.

+ **`--popular-template`**: by default, the tool reports the template that matches the maximum number of reads to each corresponding sample. If this option is enabled, the tool will use the most frequent template across all samples as the final template for all samples.

+ **`--max-umi-length`**: if barcode length is not provided, the tool will set the barcode length to the length difference between read2 and read1. If the barcode length is greater than the sum of indexes lengths and this parameter, the tool will stop. The default is 10 bp. You can disable this parameter by either providing a large number or providing the barcode length (`--barcode-length`) parameter manually.
- **`--popular-template`**: by default, the tool reports the template that matches the maximum number of reads to each corresponding sample. If this option is enabled, the tool will use the most frequent template across all samples as the final template for all samples.

- **`--max-umi-length`**: if barcode length is not provided, the tool will set the barcode length to the length difference between read2 and read1. If the barcode length is greater than the sum of indexes lengths and this parameter, the tool will stop. The default is 10 bp. You can disable this parameter by either providing a large number or providing the barcode length (`--barcode-length`) parameter manually.
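
The relationship between `--barcode-length` and `--max-umi-length` can be written out as a short check. This is a minimal sketch of the documented rule for paired-end data, not the tool's code; the function name `resolve_barcode_length` and the error messages are hypothetical.

```rust
/// Decide the barcode length for paired-end data.
/// - If `barcode_length` is given, it is used as is and the UMI limit is skipped.
/// - Otherwise the length is inferred as read2 length minus read1 length and
///   rejected when it exceeds the sum of the index lengths plus `max_umi_length`.
fn resolve_barcode_length(
    read1_len: usize,
    read2_len: usize,
    barcode_length: Option<usize>,
    index_lengths: &[usize],
    max_umi_length: usize,
) -> Result<usize, String> {
    if let Some(len) = barcode_length {
        return Ok(len);
    }
    let inferred = read2_len
        .checked_sub(read1_len)
        .ok_or_else(|| String::from("read2 is shorter than read1"))?;
    let limit = index_lengths.iter().sum::<usize>() + max_umi_length;
    if inferred > limit {
        return Err(format!(
            "inferred barcode length {} exceeds limit {}",
            inferred, limit
        ));
    }
    Ok(inferred)
}
```

For example, with two 8 bp indexes and the default limit of 10 bp, an inferred barcode of up to 26 bp is accepted, while a longer one stops the run unless the length is given explicitly.
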
1 change: 1 addition & 0 deletions src/main.rs
@@ -528,6 +528,7 @@ fn main() {
)

)
.arg_required_else_help(true)
.get_matches();

