Command-Line Interface

bge-toolkit

bge-toolkit qc

bge-toolkit qc concordance

Aggregations

bge-toolkit qc concordance lets you specify grouping variables and aggregation variables when computing global or variant concordance (but not sample concordance). Be aware that memory requirements, compute time, and cost are proportional to the number of grouping and aggregation variables you select. You can change the driver and worker memory with the global flags --driver-memory and --worker-memory.
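As a rough illustration of how the work grows, the sketch below counts the grouping/aggregation pairs for a given selection. It assumes every selected grouping variable is paired with every selected aggregation variable, which matches how the per-combination output tables are described in the Notes; the helper name `combinations_for` is hypothetical and is not part of bge-toolkit.

```python
from itertools import product

# Variable names as listed in this section of the documentation.
GROUPING = ["Exome MAF", "Exome MAC", "Imputation MAF", "Imputation MAC"]
AGGREGATION = ["QualApprox", "INFO", "DP", "GQ", "MAX_GP"]


def combinations_for(grouping, aggregation):
    """Return every (grouping, aggregation) pair for the selected variables.

    Hypothetical helper: assumes each grouping variable is crossed with
    each aggregation variable.
    """
    return list(product(grouping, aggregation))


# Two grouping variables crossed with three aggregation variables.
pairs = combinations_for(["Exome MAF", "Imputation MAF"], ["DP", "GQ", "MAX_GP"])
print(len(pairs))

# Selecting everything gives the largest number of pairs.
all_pairs = combinations_for(GROUPING, AGGREGATION)
print(len(all_pairs))
```

Under this assumption, selecting fewer variables is the most direct way to reduce memory use and cost.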

The following aggregations are implemented:

Grouping

Exome MAF
Exome MAC
Imputation MAF
Imputation MAC

Aggregation

QualApprox
INFO
DP (computed as the sum of AD)
GQ
MAX_GP (computed as the maximum value of GP)

See this section for more details about the bins that are computed.
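The two derived aggregation fields, DP (the sum of AD) and MAX_GP (the maximum of GP), can be illustrated for a single genotype entry. This is a plain-Python sketch rather than the toolkit's own implementation; the function names are hypothetical, and returning a missing value when the input is missing is an assumption.

```python
from typing import Optional, Sequence


def dp_from_ad(ad: Optional[Sequence[int]]) -> Optional[int]:
    """DP computed as the sum of the allelic depths (AD).

    Assumption: a missing AD yields a missing DP.
    """
    return None if ad is None else sum(ad)


def max_gp(gp: Optional[Sequence[float]]) -> Optional[float]:
    """MAX_GP computed as the largest genotype probability in GP.

    Assumption: a missing GP yields a missing MAX_GP.
    """
    return None if gp is None else max(gp)


# A heterozygous call with 12 reference reads and 8 alternate reads.
example_dp = dp_from_ad([12, 8])

# Genotype probabilities for hom-ref, het, and hom-var.
example_max_gp = max_gp([0.01, 0.04, 0.95])

print(example_dp, example_max_gp)
```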

Notes

The following output files are written to --output-dir:

Global Concordance

global_concordance_table.ht (Contains the raw concordance data that can be loaded in Python with bge_toolkit.qc.ConcordanceTable.load())
global-results/*.tsv (Contains formatted tables for each combination of grouping and aggregation variable)
global-plots/nonref-conc/*.png (Contains figures of non-ref concordance for each grouping variable across aggregation variables)
global-plots/f1-score/*.png (Contains figures of f1-score for each grouping variable across aggregation variables)
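The documentation does not define the two plotted metrics, so the sketch below uses one common definition of each purely for orientation. It assumes genotypes are coded as alternate-allele counts (0, 1, 2), that non-ref concordance is the share of matching genotype pairs among pairs where at least one call is non-reference, and that the F1 score is computed from true-positive, false-positive, and false-negative counts. The toolkit's exact definitions may differ, and both function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def nonref_concordance(pairs: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Share of identical calls among pairs with at least one non-ref call.

    Each pair is (exome_alt_count, imputed_alt_count). Assumed definition.
    """
    informative = [(e, i) for e, i in pairs if e > 0 or i > 0]
    if not informative:
        return None  # no non-reference calls to evaluate
    return sum(e == i for e, i in informative) / len(informative)


def f1_score(tp: int, fp: int, fn: int) -> Optional[float]:
    """Standard F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return None  # undefined when there are no positives at all
    return 2 * tp / denom


genotype_pairs = [(0, 0), (1, 1), (2, 2), (1, 0), (0, 1), (1, 2)]
example_conc = nonref_concordance(genotype_pairs)
example_f1 = f1_score(tp=8, fp=2, fn=2)
print(example_conc, example_f1)
```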

Variant Concordance

variant_overlaps.ht (Contains a Hail table with the list of overlapping variants)
variant_conc.ht (Contains the raw concordance data that can be loaded in Python with bge_toolkit.qc.ConcordanceTable.load())

Sample Concordance

sample_overlaps.ht (Contains a Hail table with the list of overlapping samples)
sample_overlaps.tsv (Contains a TSV file with two columns for whether samples appear in the exome and imputation datasets)
sample_conc.ht (Contains the raw concordance data that can be loaded in Python with bge_toolkit.qc.ConcordanceTable.load())
sample-results/ALL_ALL.tsv (Contains the global concordance for each sample in the dataset)

Examples

Compute global concordance stratified by Exome MAF for “chr1”, with MatrixTables as input files.

$ bge-toolkit \
     --driver-cores 2 \
     qc concordance \
     --exome "gs://my-bucket/exome.mt" \
     --imputation "gs://my-bucket/imputation.mt" \
     --output-dir "gs://my-bucket/concordance/" \
     --contig "chr1" \
     --EXOME-MAF \
     --global-conc

Compute global concordance stratified by Exome MAF and Imputation MAF with PLINK input files for a downsampled dataset.

$ bge-toolkit \
     --driver-cores 2 \
     qc concordance \
     --exome "gs://my-bucket/exome-plink" \
     --imputation "gs://my-bucket/imputation-plink" \
     --output-dir "gs://my-bucket/concordance/" \
     --downsample-variants 0.1 \
     --downsample-samples 0.1 \
     --EXOME-MAF \
     --IMPUTATION-MAF \
     --global-conc

Compute variant, sample, and global concordance statistics with VCF input files.

$ bge-toolkit \
     --driver-cores 2 \
     qc concordance \
     --exome "gs://my-bucket/exome.vcf.bgz" \
     --imputation "gs://my-bucket/imputation.vcf.bgz" \
     --output-dir "gs://my-bucket/concordance/" \
     --variant-conc \
     --sample-conc \
     --global-conc

bge-toolkit qc sample-qc

Examples

$ bge-toolkit qc sample-qc \
    --exome "gs://MY-BUCKET/data.mt" \
    --output-dir "gs://MY-BUCKET/test-sample-qc/080525-v2/" \
    --exome-regions "gs://MY-BUCKET/Twist_Alliance_Clinical_Research_Exome_Covered_Targets_hg38-34.9MB.bed" \
    --low-complexity-regions "gs://MY-BUCKET/LCRFromHengHg38.bed" \
    --dragen \
    --reported-sex-path "gs://MY-BUCKET/metadata.ht" \
    --reported-sex-col "reported_sex" \
    --chimera-rate-path "gs://MY-BUCKET/metadata.ht" \
    --chimera-rate-col "CHIMERA"

Notes

The following output files are written to --output-dir:

sample_qc_stats.ht (Contains the raw sample qc data that can be loaded in Python with SampleQCResult.load())
pcs/pc1_pc2.png (Contains a plot of PC1 versus PC2 colored by ancestry population label)
pcs/pc1_pc3.png (Contains a plot of PC1 versus PC3 colored by ancestry population label)
pcs/pc2_pc3.png (Contains a plot of PC2 versus PC3 colored by ancestry population label)
qc/*_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged)
qc/*_density.png (Contains density plots of different QC metrics stratified by ancestry population label)
passing_sample_ids.tsv (A TSV file containing a list of sample IDs for samples that passed all QC metrics)
qc/*_pass_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged for passing samples only)
qc/*_pass_density.png (Contains density plots of different QC metrics stratified by ancestry population label for passing samples only)

The structure of sample_qc_stats.ht is as follows:

contamination:
    - charr: The CHARR statistic.
    - is_passing: ``charr`` is less than ``charr_thresh``.

chimera_reads:
    - chimera_rate: The rate of chimeric reads.
    - is_passing: The rate of chimeric reads is below ``threshold``.

sample_qc_metrics:
    - is_passing: Passes every statistic.
    - r_ti_tv: Transition / Transversion ratio.
    - n_singleton: Number of singletons.
    - n_insertion: Number of insertions.
    - n_deletion: Number of deletions.
    - n_transition: Number of transitions.
    - n_transversion: Number of transversions.
    - r_het_hom_var: Ratio of heterozygous calls to homozygous-variant calls.
    - r_insertion_deletion: Ratio of insertions to deletions.
    - fail_r_ti_tv: An outlier in ``r_ti_tv``.
    - fail_n_singleton: An outlier in ``n_singleton``.
    - fail_n_insertion: An outlier in ``n_insertion``.
    - fail_n_deletion: An outlier in ``n_deletion``.
    - fail_n_transition: An outlier in ``n_transition``.
    - fail_n_transversion: An outlier in ``n_transversion``.
    - fail_r_het_hom_var: An outlier in ``r_het_hom_var``.
    - fail_r_insertion_deletion: An outlier in ``r_insertion_deletion``.
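To make the relationship between the counts, the ratios, and ``is_passing`` concrete, here is a small plain-Python sketch. It assumes ``is_passing`` is true only when no ``fail_*`` flag is set, in line with "Passes every statistic", and that a ratio with a zero denominator is treated as missing. How outliers are detected is not described here, so the flags are supplied as inputs; the helper names are hypothetical.

```python
from typing import Dict, Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Ratio of two counts; missing when the denominator is zero (assumption)."""
    return None if denominator == 0 else numerator / denominator


def passes_all(fail_flags: Dict[str, bool]) -> bool:
    """True only if no fail_* flag is set (assumed meaning of is_passing)."""
    return not any(fail_flags.values())


counts = {"n_transition": 200, "n_transversion": 100,
          "n_insertion": 30, "n_deletion": 40}

r_ti_tv = safe_ratio(counts["n_transition"], counts["n_transversion"])
r_insertion_deletion = safe_ratio(counts["n_insertion"], counts["n_deletion"])

# Illustrative flags: one metric is an outlier, so the sample does not pass.
flags = {"fail_r_ti_tv": False, "fail_n_singleton": True}
is_passing = passes_all(flags)

print(r_ti_tv, r_insertion_deletion, is_passing)
```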

gq_fraction:
    - gq_fraction: The fraction of genotypes with GQ >= ``gq_thresh``.
    - is_passing: ``gq_fraction`` is greater than ``fraction_thresh``.
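The two-threshold logic can be sketched for a single sample as follows. The threshold values in the example are illustrative only, not the toolkit's defaults, and the sketch assumes every genotype passed in has a GQ value; the function name is hypothetical.

```python
from typing import Optional, Sequence


def gq_fraction(gqs: Sequence[int], gq_thresh: int) -> Optional[float]:
    """Fraction of genotypes whose GQ is at least gq_thresh."""
    if not gqs:
        return None  # no genotypes to evaluate
    return sum(gq >= gq_thresh for gq in gqs) / len(gqs)


# GQ values for one sample across five genotypes.
sample_gqs = [10, 25, 30, 40, 15]

# Illustrative thresholds (assumptions, not documented defaults).
fraction = gq_fraction(sample_gqs, gq_thresh=20)
is_passing = fraction is not None and fraction > 0.5

print(fraction, is_passing)
```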

sex_info:
    - is_female: The imputed sex. ``True`` is for females, ``False`` is for males.
    - is_passing: Either ``reported sex == imputed sex`` or ``True``.
    - sex_check: ``reported sex == imputed sex`` or ``Null``.
    - f_stat: Inbreeding coefficient on the non-PAR X chromosome.
    - n_called: Number of genotypes considered.
    - expected_homs: Expected number of homozygotes.
    - observed_homs: Observed number of homozygotes.
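The fields in this group are related by the usual inbreeding-coefficient formula, F = (observed_homs - expected_homs) / (n_called - expected_homs). The sketch below computes it and shows one reading of the ``sex_check`` and ``is_passing`` rules. It assumes that when the reported sex is unavailable, ``sex_check`` is missing and the sample passes by default; that interpretation and the function names are assumptions, not the toolkit's documented behavior.

```python
from typing import Optional, Tuple


def f_stat(observed_homs: float, expected_homs: float,
           n_called: int) -> Optional[float]:
    """Inbreeding coefficient from observed and expected homozygote counts."""
    denom = n_called - expected_homs
    if denom == 0:
        return None  # undefined when every call is expected to be homozygous
    return (observed_homs - expected_homs) / denom


def sex_check(reported_is_female: Optional[bool],
              imputed_is_female: Optional[bool]) -> Tuple[Optional[bool], bool]:
    """Return (sex_check, is_passing).

    Assumption: with no reported or imputed sex, sex_check is missing (None)
    and the sample passes by default.
    """
    if reported_is_female is None or imputed_is_female is None:
        return None, True
    match = reported_is_female == imputed_is_female
    return match, match


# Nearly all X-chromosome calls are homozygous, giving F close to 1.
example_f = f_stat(observed_homs=980, expected_homs=600.0, n_called=1000)
print(example_f)
print(sex_check(None, True))
print(sex_check(True, False))
```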

ancestry:
    - ancestry_pop: The ancestry population label.

pcs:
    - scores: An array with the top ``k`` principal components.

s: sample ID