Command-Line Interface ====================== bge-toolkit ----------- .. custom-typer:: bge_toolkit.cli.__main__:app :prog: bge-toolkit bge-toolkit qc -------------- .. custom-typer:: bge_toolkit.qc.cli:qc_app :prog: bge-toolkit qc bge-toolkit qc concordance -------------------------- .. custom-typer:: bge_toolkit.qc.cli:concordance_app :prog: bge-toolkit qc concordance ============ Aggregations ============ ``bge-toolkit qc concordance`` allows you to specify different grouping variables and aggregation variables when computing global or variant concordance (not sample concordance). Be aware the memory requirements / compute time / cost is proportional to the number of aggregations and grouping variables. You can change the driver and work memory with the global flags ``--worker-memory`` and ``--driver-memory``. The following aggregations are implemented: Grouping ~~~~~~~~ - Exome MAF - Exome MAC - Imputation MAF - Imputation MAC Aggregation ~~~~~~~~~~~ - QualApprox - INFO - DP (computed as the sum of AD) - GQ - MAX_GP (computed as the maximum value of GP) See :ref:`this section ` for more details about the bins that are computed. ===== Notes ===== The following output files are written to ``--output-dir``: 1. Global Concordance - global_concordance_table.ht (Contains the raw concordance data that can be loaded in Python with ``bge_toolkit.qc.ConcordanceTable.load()``) - global-results/\*.tsv (Contains formatted tables for each combination of grouping and aggregation variable) - global-plots/nonref-conc/\*.png (Contains figures of non-ref concordance for each grouping variable across aggregation variables) - global-plots/f1-score/\*.png (Contains figures of f1-score for each grouping variable across aggregation variables) 2. Variant Concordance - variant_overlaps.ht (Contains a Hail table with the list of overlapping variants) - variant_conc.ht (Contains the raw concordance data that can be loaded in Python with ``bge_toolkit.qc.ConcordanceTable.load()``) 3. Sample Concordance - sample_overlaps.ht (Contains a Hail table with the list of overlapping samples) - sample_overlaps.tsv (Contains a TSV file with two columns for whether samples appear in the exome and imputation datasets) - sample_conc.ht (Contains the raw concordance data that can be loaded in Python with ``bge_toolkit.qc.ConcordanceTable.load()``) - sample-results/ALL_ALL.tsv (Contains the global concordance for each sample in the dataset) ======== Examples ======== 1. Compute global concordance stratified by the Exome MAF with input files as MatrixTables for "chr1". .. code-block:: $ bge-toolkit \ --driver-cores 2 \ qc concordance \ --exome "gs://my-bucket/exome.mt" \ --imputation "gs://my-bucket/imputation.mt" \ --output-dir "gs://my-bucket/concordance/" \ --contig "chr1" \ --EXOME-MAF \ --global-conc 2. Compute global concordance stratified by Exome MAF and Imputation MAF with PLINK input files for a downsampled dataset. .. code-block:: $ bge-toolkit \ --driver-cores 2 \ qc concordance \ --exome "gs://my-bucket/exome-plink" \ --imputation "gs://my-bucket/imputation-plink" \ --output-dir "gs://my-bucket/concordance/" \ --downsample-variants 0.1 \ --downsample-samples 0.1 \ --EXOME-MAF \ --IMPUTATION-MAF \ --global-conc 3. Compute variant, sample, and global concordance statistics with VCF input files. .. code-block:: $ bge-toolkit \ --driver-cores 2 \ qc concordance \ --exome "gs://my-bucket/exome.vcf.bgz" \ --imputation "gs://my-bucket/imputation.vcf.bgz" \ --output-dir "gs://my-bucket/concordance/" \ --variant-conc \ --sample-conc \ --global-conc bge-toolkit qc sample-qc ------------------------ .. custom-typer:: bge_toolkit.qc.cli:sample_qc_app :prog: bge-toolkit qc sample-qc ======== Examples ======== .. code-block:: $ bge-toolkit qc sample-qc \ --exome "gs://MY-BUCKET/data.mt" \ --output-dir "gs://MY-BUCKET/test-sample-qc/080525-v2/" \ --exome-regions "gs://MY-BUCKET/Twist_Alliance_Clinical_Research_Exome_Covered_Targets_hg38-34.9MB.bed" \ --low-complexity-regions "gs://MY-BUCKET/LCRFromHengHg38.bed" \ --dragen \ --reported-sex-path "gs://MY-BUCKET/metadata.ht" \ --reported-sex-col "reported_sex" \ --chimera-rate-path "gs://MY-BUCKET/metadata.ht" \ --chimera-rate-col "CHIMERA" ===== Notes ===== The following output files are written to ``--output-dir``: - sample_qc_stats.ht (Contains the raw sample qc data that can be loaded in Python with :meth:`.SampleQCResult.load`) - pcs/pc1_pc2.png (Contains a plot of PC1 versus PC2 colored by ancestry population label) - pcs/pc1_pc3.png (Contains a plot of PC1 versus PC3 colored by ancestry population label) - pcs/pc2_pc3.png (Contains a plot of PC2 versus PC3 colored by ancestry population label) - qc/\*_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged) - qc/\*_density.png (Contains density plots of different QC metrics stratified by ancestry population label) - passing_sample_ids.tsv (A TSV file containing a list of sample IDs for samples that passed all QC metrics) - qc/\*_pass_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged for passing samples only) - qc/\*_pass_density.png (Contains density plots of different QC metrics stratified by ancestry population label for passing samples only) The structure of ``sample_qc_stats.ht`` is as follows: .. code-block:: contamination: - charr: The CHARR statistic. - is_passing: ``charr`` is less than the ``charr_thresh`` chimera_reads: - chimera_rate: The rate of chimera reads. - is_passing: The rate of chimera reads is below ``threshold``. sample_qc_metrics: - is_passing: Passes every statistic. - r_ti_tv: Transition / Transversion ratio. - n_singleton: Number of singletons. - n_insertion: Number of insertions. - n_deletion: Number of deletions. - n_transition: Number of transitions. - n_transversion: Number of transversions. - r_het_hom_var: Ratio of heterozygotes to number of homozygote variants. - r_insertion_deletion: Ratio of insertions to deletions. - fail_r_ti_tv: An outlier in ``r_ti_tv``. - fail_n_singleton: An outlier in ``n_singleton``. - fail_n_insertion: An outlier in ``n_insertion``. - fail_n_deletion: An outlier in ``n_deletion``. - fail_n_transition: An outlier in ``n_transition``. - fail_n_transversion: An outlier in ``n_transversion``. - fail_r_het_hom_var: An outlier in ``r_het_hom_var``. - fail_r_insertion_deletion: An outlier in ``r_insertion_deletion``. gq_fraction: - gq_fraction: The percentage of genotypes with GQ >= ``gq_thresh``. - is_passing: ``gq_fraction`` is greater than the ``fraction_thresh`` sex_info: - is_female: The imputed sex. ``True`` is for females, ``False`` is for males. - is_passing: Either ``reported sex == imputed sex`` or ``True``. - sex_check: ``reported sex == imputed sex`` or ``Null``. - f_stat: Inbreeding coefficient on the non-PAR X chromosome. - n_called: Number of genotypes considered. - expected_homs: Expected number of homozygotes. - observed_homs: Observed number of heterozygotes. ancestry: - ancestry_pop: The ancestry population label. pcs: - scores: An array with the top ``k`` principal components. s: sample ID