Command-Line Interface

bge-toolkit

Usage: bge-toolkit [OPTIONS] COMMAND [ARGS]...

Command line interface for the BGE-Toolkit.

Options

  --app-name TEXT
      A name for this pipeline. In the Spark backend, this becomes the Spark application name. In the Batch backend, this is a prefix for the name of every Batch. [default: None]
  --master TEXT
      Spark Backend only. URL identifying the Spark leader (master) node or `local[N]` for local clusters. [default: None]
  --local TEXT
      Spark Backend only. Local-mode core limit indicator. Must either be `local[N]` where N is a positive integer or `local[*]`. The latter indicates Spark should use all cores available. `local[*]` does not respect most containerization CPU limits. This option is only used if `master` is unset and `spark.master` is not set in the Spark configuration. [default: local[*]]
  --log TEXT
      Local path for Hail log file. Does not currently support distributed file systems like Google Storage, S3, or HDFS. [default: None]
  --quiet
      Print fewer log messages.
  --append
      Append to the end of the log file.
  --min-block-size INTEGER
      Minimum file block size in MB. [default: 0]
  --branching-factor INTEGER
      Branching factor for tree aggregation. [default: 50]
  --tmp-dir TEXT
      Networked temporary directory. Must be a network-visible file path. Defaults to /tmp in the default scheme. [default: None]
  --default-reference TEXT
      Default reference genome. [default: None]
  --idempotent
      If ``True``, calling this function is a no-op if Hail has already been initialized.
  --global-seed INTEGER
      Global random seed. [default: None]
  --spark-conf TEXT
      Spark backend only. Spark configuration parameters in json format. [default: None]
  --skip-logging-configuration
      Spark Backend only. Skip logging configuration in java and python.
  --local-tmpdir TEXT
      Local temporary directory. Used on driver and executor nodes. Must use the file scheme. Defaults to TMPDIR, or /tmp. [default: None]
  --optimizer-iterations INTEGER
      [default: None]
  --backend TEXT
      The backend to use. Can be one of local, spark, or batch [default: None]
  --driver-cores INTEGER
      Batch backend only. Number of cores to use for the driver process. May be 1, 2, 4, or 8. [default: 1]
  --driver-memory TEXT
      Batch backend only. Memory tier to use for the driver process. May be standard or highmem. [default: standard]
  --worker-cores INTEGER
      Batch backend only. Number of cores to use for the worker processes. May be 1, 2, 4, or 8. [default: 1]
  --worker-memory TEXT
      Batch backend only. Memory tier to use for the worker processes. May be standard or highmem. [default: standard]
  --gcs-requester-pays-configuration TEXT
      If a string is provided, configure the Google Cloud Storage file system to bill usage to the project identified by that string. If a tuple is provided, configure the Google Cloud Storage file system to bill usage to the specified project for buckets specified in the list. [default: None]
  --regions TEXT
      List of regions to run jobs in when using the Batch backend. Use :data:`.ANY_REGION` to specify any region is allowed or use `None` to use the underlying default regions from the hailctl environment configuration. For example, use `hailctl config set batch/regions region1,region2` to set the default regions to use. [default: None]
  --gcs-bucket-allow-list TEXT
      A json list of buckets that Hail should be permitted to read from or write to, even if their default policy is to use "cold" storage. [default: None]
  --copy-spark-log-on-error
      Spark backend only. If `True`, copy the log from the spark driver node to `tmp_dir` on error.
  --install-completion
      Install completion for the current shell.
  --show-completion
      Show completion for the current shell, to copy it or customize the installation.
  --help
      Show this message and exit.

Commands

  qc
      Run QC-related tools on BGE datasets.

bge-toolkit qc

Usage: bge-toolkit qc [OPTIONS] COMMAND [ARGS]...

Run QC-related tools on BGE datasets.

Options

  --install-completion
      Install completion for the current shell.
  --show-completion
      Show completion for the current shell, to copy it or customize the installation.
  --help
      Show this message and exit.

Commands

  concordance
      Run concordance on BGE exome and imputation datasets.
  sample-qc
      Compute sample QC statistics from an exome dataset.

bge-toolkit qc concordance

Usage: bge-toolkit qc concordance [OPTIONS] COMMAND [ARGS]...

Run concordance on BGE exome and imputation datasets.

Options (* marks a required option)

  * --exome TEXT
      Exome dataset to compare to. [default: None] [required]
  * --imputation TEXT
      Imputation dataset to compare to. [default: None] [required]
  * --output-dir TEXT
      Output directory. [default: None] [required]
  --EXOME-MAF
      Bin concordance counts by Exome Minor Allele Frequency. (Global+Variant)
  --EXOME-MAC
      Bin concordance counts by Exome Minor Allele Counts. (Global+Variant)
  --IMPUTATION-MAF
      Bin concordance counts by Imputation Minor Allele Frequency. (Global+Variant)
  --IMPUTATION-MAC
      Bin concordance counts by Imputation Minor Allele Counts. (Global+Variant)
  --INFO
      Bin concordance counts by imputation INFO score. (Global+Variant)
  --DP
      Bin concordance counts by exome genotype DP. (Global+Variant)
  --GQ
      Bin concordance counts by exome genotype GQ. (Global+Variant)
  --MAX-GP
      Bin concordance counts by maximum value of imputation GP. (Global+Variant)
  --QUAL-APPROX
      Bin concordance counts by variant Approx Qual Score. (Global+Variant)
  --sample-list TEXT
      Filter to samples listed in the file. [default: None]
  --variant-list TEXT
      Filter to variants listed in the file. [default: None]
  --contig TEXT
      Filter to variants in the contig. [default: None]
  --n-samples INTEGER
      Filter to the first N overlapping samples. [default: None]
  --n-variants INTEGER
      Filter to the first N variants. [default: None]
  --downsample-samples FLOAT
      Downsample to X fraction of samples. [default: None]
  --downsample-variants FLOAT
      Downsample to X fraction of variants. [default: None]
  --join-type [outer|inner]
      Join type. [default: inner]
  --variant-conc
      Run variant concordance.
  --sample-conc
      Run sample concordance.
  --global-conc
      Run global concordance.
  --log-path TEXT
      Log file path to write to. [default: None]
  --n-exome-partitions INTEGER
      Number of partitions to load the exome dataset with if it's a MatrixTable. [default: None]
  --n-imp-partitions INTEGER
      Number of partitions to load the imputation dataset with if it's a MatrixTable. [default: None]
  --install-completion
      Install completion for the current shell.
  --show-completion
      Show completion for the current shell, to copy it or customize the installation.
  --help
      Show this message and exit.

Aggregations

bge-toolkit qc concordance allows you to specify different grouping variables and aggregation variables when computing global or variant concordance (not sample concordance). Be aware that memory requirements, compute time, and cost all increase with the number of grouping and aggregation variables. You can change the driver and worker memory with the global flags --driver-memory and --worker-memory.
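Because --driver-memory and --worker-memory are options of the top-level bge-toolkit command, they go before qc concordance, in the same position as --driver-cores in the examples on this page. The following is a minimal Python sketch that assembles such a command line without running it. The helper name build_concordance_command and the bucket paths are hypothetical; highmem is the non-default memory tier listed in the top-level help, where both options are described as Batch backend only.

```python
import shlex


def build_concordance_command(exome, imputation, output_dir,
                              driver_memory="standard",
                              worker_memory="standard",
                              stratify_flags=()):
    """Assemble (but do not run) a bge-toolkit qc concordance command line.

    Hypothetical helper for illustration; it is not part of bge-toolkit.
    """
    # Global options belong to the top-level command, so they precede "qc".
    argv = [
        "bge-toolkit",
        "--driver-memory", driver_memory,
        "--worker-memory", worker_memory,
        "qc", "concordance",
        "--exome", exome,
        "--imputation", imputation,
        "--output-dir", output_dir,
    ]
    # Each grouping or aggregation flag adds to memory use and cost.
    argv.extend(stratify_flags)
    argv.append("--global-conc")
    return argv


command = build_concordance_command(
    "gs://my-bucket/exome.mt",
    "gs://my-bucket/imputation.mt",
    "gs://my-bucket/concordance/",
    driver_memory="highmem",
    worker_memory="highmem",
    stratify_flags=["--EXOME-MAF", "--INFO"],
)
# Print a shell-ready string that can be pasted into a terminal.
print(shlex.join(command))
```

Building the argument list first makes it easy to review exactly which stratification flags a run will use before paying for it.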

The following aggregations are implemented:

Grouping

  • Exome MAF

  • Exome MAC

  • Imputation MAF

  • Imputation MAC

Aggregation

  • QualApprox

  • INFO

  • DP (computed as the sum of AD)

  • GQ

  • MAX_GP (computed as the maximum value of GP)

See this section for more details about the bins that are computed.

Notes

The following output files are written to --output-dir:

  1. Global Concordance

  • global_concordance_table.ht (Contains the raw concordance data that can be loaded in Python with bge_toolkit.qc.ConcordanceTable.load())

  • global-results/*.tsv (Contains formatted tables for each combination of grouping and aggregation variable)

  • global-plots/nonref-conc/*.png (Contains figures of non-ref concordance for each grouping variable across aggregation variables)

  • global-plots/f1-score/*.png (Contains figures of f1-score for each grouping variable across aggregation variables)

  2. Variant Concordance

  • variant_overlaps.ht (Contains a Hail table with the list of overlapping variants)

  • variant_conc.ht (Contains the raw concordance data that can be loaded in Python with bge_toolkit.qc.ConcordanceTable.load())

  3. Sample Concordance

  • sample_overlaps.ht (Contains a Hail table with the list of overlapping samples)

  • sample_overlaps.tsv (Contains a TSV file with two columns for whether samples appear in the exome and imputation datasets)

  • sample_conc.ht (Contains the raw concordance data that can be loaded in Python with bge_toolkit.qc.ConcordanceTable.load())

  • sample-results/ALL_ALL.tsv (Contains the global concordance for each sample in the dataset)
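When scripting around a run, it can help to know in advance which paths to look for. The following is a minimal sketch that maps the three concordance flags to the outputs listed above. The function name expected_outputs is hypothetical, the glob patterns are copied from the list rather than expanded, and the sketch assumes that each group of outputs is written only when its flag is set.

```python
from posixpath import join

# Output names copied from the list above, grouped by the flag that
# (by assumption) produces them.
OUTPUTS_BY_FLAG = {
    "--global-conc": [
        "global_concordance_table.ht",
        "global-results/*.tsv",
        "global-plots/nonref-conc/*.png",
        "global-plots/f1-score/*.png",
    ],
    "--variant-conc": [
        "variant_overlaps.ht",
        "variant_conc.ht",
    ],
    "--sample-conc": [
        "sample_overlaps.ht",
        "sample_overlaps.tsv",
        "sample_conc.ht",
        "sample-results/ALL_ALL.tsv",
    ],
}


def expected_outputs(output_dir, flags):
    """Return the paths or glob patterns expected under output_dir."""
    paths = []
    for flag in flags:
        for name in OUTPUTS_BY_FLAG[flag]:
            # posixpath.join keeps forward slashes, as gs:// URLs require.
            paths.append(join(output_dir, name))
    return paths


for path in expected_outputs("gs://my-bucket/concordance/",
                             ["--variant-conc", "--sample-conc"]):
    print(path)
```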

Examples

  1. Compute global concordance stratified by Exome MAF for “chr1”, with MatrixTables as input files.

$ bge-toolkit \
     --driver-cores 2 \
     qc concordance \
     --exome "gs://my-bucket/exome.mt" \
     --imputation "gs://my-bucket/imputation.mt" \
     --output-dir "gs://my-bucket/concordance/" \
     --contig "chr1" \
     --EXOME-MAF \
     --global-conc

  2. Compute global concordance stratified by Exome MAF and Imputation MAF with PLINK input files for a downsampled dataset.

$ bge-toolkit \
     --driver-cores 2 \
     qc concordance \
     --exome "gs://my-bucket/exome-plink" \
     --imputation "gs://my-bucket/imputation-plink" \
     --output-dir "gs://my-bucket/concordance/" \
     --downsample-variants 0.1 \
     --downsample-samples 0.1 \
     --EXOME-MAF \
     --IMPUTATION-MAF \
     --global-conc

  3. Compute variant, sample, and global concordance statistics with VCF input files.

$ bge-toolkit \
     --driver-cores 2 \
     qc concordance \
     --exome "gs://my-bucket/exome.vcf.bgz" \
     --imputation "gs://my-bucket/imputation.vcf.bgz" \
     --output-dir "gs://my-bucket/concordance/" \
     --variant-conc \
     --sample-conc \
     --global-conc

bge-toolkit qc sample-qc

Usage: bge-toolkit qc sample-qc [OPTIONS] COMMAND [ARGS]...

Compute sample QC statistics from an exome dataset.

Options (* marks a required option)

  * --exome TEXT
      Exome dataset called with GATK to filter out low quality samples. [default: None] [required]
  * --output-dir TEXT
      Output directory. [default: None] [required]
  * --exome-regions TEXT
      BED file with list of exome intervals. [default: None] [required]
  * --low-complexity-regions TEXT
      BED file with list of exome intervals. [default: None] [required]
  --gatk
      GATK was used to generate the callset.
  --dragen
      DRAGEN was used to generate the callset.
  --hq-sites-dp-thresh INTEGER
      DP threshold for marking a genotype high quality. [default: 10]
  --hq-sites-gq-thresh INTEGER
      GQ threshold for marking a genotype high quality. [default: 20]
  --hq-sites-ab-thresh INTEGER
      AB threshold for marking a heterozygous genotype high quality. [default: 0.2]
  --hq-sites-maf-thresh INTEGER
      MAF threshold for marking a variant high quality. [default: 0.01]
  --hq-sites-mac-thresh INTEGER
      MAC threshold for marking a variant high quality. [default: 10]
  --hq-sites-call-rate-thresh INTEGER
      Call rate threshold for marking a variant high quality. [default: 0.95]
  --pre-pruned-variants TEXT
      Path to Hail Table with variant sites in the same reference genome as the exome data. [default: None]
  --r2-thresh FLOAT
      r2 threshold for pruning variants when computing PCs and relatedness. [default: 0.2]
  --bp-window-size INTEGER
      BP window size for pruning variants when computing PCs and relatedness [default: 1000000]
  --reported-sex-path TEXT
      Path to Hail table where reported sex field is located. [default: None]
  --reported-sex-col TEXT
      Column name with the reported sex coded as True for female and False for male [default: None]
  --sex-check-male-fhet-thresh FLOAT
      Sex check male Fhet threshold. [default: 0.8]
  --sex-check-female-fhet-thresh FLOAT
      Sex check female Fhet threshold. [default: 0.2]
  --pcs-k INTEGER
      Number of principal components to compute. [default: 10]
  --ancestry-pop-loadings TEXT
      path to gnomAD loadings table [default: gs://gcp-public-data--gnomad/release/4.0/pca…]
  --ancestry-onnx-rf TEXT
      path to gnomAD onnx file [default: gs://gcp-public-data--gnomad/release/4.0/pca…]
  --ancestry-pcs-k INTEGER
      Number of principal components to use. This is dependent on the gnomAD data. [default: 20]
  --ancestry-min-prob FLOAT
      Minimum probability to be called for an ancestry population. [default: 0.75]
  --chimera-rate-path TEXT
      Path to Hail Table with chimera read rates. [default: None]
  --chimera-rate-col TEXT
      Column name of chimera read rates in the Hail Table. [default: None]
  --chimera-threshold FLOAT
      Maximum rate of chimera reads for a sample to be considered passing. [default: 0.05]
  --contamination-charr-thresh FLOAT
      Contamination rate threshold as computed using CHARR. [default: 0.05]
  --contamination-min-af FLOAT
      Minimum AF when computing contamination rating using CHARR. [default: 0.05]
  --contamination-max-af FLOAT
      Maximum AF when computing contamination rating using CHARR. [default: 0.95]
  --contamination-min-dp INTEGER
      Minimum DP when computing contamination rating using CHARR. [default: 10]
  --contamination-max-dp INTEGER
      Maximum DP when computing contamination rating using CHARR. [default: 100]
  --contamination-min-gq INTEGER
      Minimum GQ when computing contamination rating using CHARR. [default: 20]
  --contamination-ref-af-path TEXT
      Path to Hail Table with contamination rates per sample. [default: None]
  --contamination-ref-af-col-name TEXT
      Column name of contamination rate in Hail Table. [default: None]
  --coverage-gq-thresh INTEGER
      Minimum GQ for considering a variant well-genotyped in an individual. [default: 20]
  --coverage-fraction-thresh FLOAT
      Minimum number of genotypes meeting the GQ threshold for being considered passing. [default: 0.9]
  --sample-list TEXT
      Filter to samples listed in the file. [default: None]
  --variant-list TEXT
      Filter to variants listed in the file. [default: None]
  --contig TEXT
      Filter to variants in the contig. [default: None]
  --n-samples INTEGER
      Filter to the first N overlapping samples. [default: None]
  --n-variants INTEGER
      Filter to the first N variants. [default: None]
  --downsample-samples FLOAT
      Downsample to X fraction of samples. [default: None]
  --downsample-variants FLOAT
      Downsample to X fraction of variants. [default: None]
  --log-path TEXT
      Log file path to write to. [default: None]
  --n-partitions INTEGER
      Number of partitions to load the dataset with if it's a MatrixTable. [default: None]
  --reference-genome TEXT
      Reference genome string to use when loading BED file intervals. [default: None]
  --install-completion
      Install completion for the current shell.
  --show-completion
      Show completion for the current shell, to copy it or customize the installation.
  --help
      Show this message and exit.

Examples

$ bge-toolkit qc sample-qc \
    --exome "gs://MY-BUCKET/data.mt" \
    --output-dir "gs://MY-BUCKET/test-sample-qc/080525-v2/" \
    --exome-regions "gs://MY-BUCKET/Twist_Alliance_Clinical_Research_Exome_Covered_Targets_hg38-34.9MB.bed" \
    --low-complexity-regions "gs://MY-BUCKET/LCRFromHengHg38.bed" \
    --dragen \
    --reported-sex-path "gs://MY-BUCKET/metadata.ht" \
    --reported-sex-col "reported_sex" \
    --chimera-rate-path "gs://MY-BUCKET/metadata.ht" \
    --chimera-rate-col "CHIMERA"

Notes

The following output files are written to --output-dir:

  • sample_qc_stats.ht (Contains the raw sample qc data that can be loaded in Python with SampleQCResult.load())

  • pcs/pc1_pc2.png (Contains a plot of PC1 versus PC2 colored by ancestry population label)

  • pcs/pc1_pc3.png (Contains a plot of PC1 versus PC3 colored by ancestry population label)

  • pcs/pc2_pc3.png (Contains a plot of PC2 versus PC3 colored by ancestry population label)

  • qc/*_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged)

  • qc/*_density.png (Contains density plots of different QC metrics stratified by ancestry population label)

  • passing_sample_ids.tsv (A TSV file containing a list of sample IDs for samples that passed all QC metrics)

  • qc/*_pass_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged for passing samples only)

  • qc/*_pass_density.png (Contains density plots of different QC metrics stratified by ancestry population label for passing samples only)

The structure of sample_qc_stats.ht is as follows:

contamination:
    - charr: The CHARR statistic.
    - is_passing: ``charr`` is less than ``charr_thresh``.

chimera_reads:
    - chimera_rate: The rate of chimera reads.
    - is_passing: The rate of chimera reads is below ``threshold``.

sample_qc_metrics:
    - is_passing: Passes every statistic.
    - r_ti_tv: Transition / Transversion ratio.
    - n_singleton: Number of singletons.
    - n_insertion: Number of insertions.
    - n_deletion: Number of deletions.
    - n_transition: Number of transitions.
    - n_transversion: Number of transversions.
    - r_het_hom_var: Ratio of heterozygous calls to homozygous-variant calls.
    - r_insertion_deletion: Ratio of insertions to deletions.
    - fail_r_ti_tv: An outlier in ``r_ti_tv``.
    - fail_n_singleton: An outlier in ``n_singleton``.
    - fail_n_insertion: An outlier in ``n_insertion``.
    - fail_n_deletion: An outlier in ``n_deletion``.
    - fail_n_transition: An outlier in ``n_transition``.
    - fail_n_transversion: An outlier in ``n_transversion``.
    - fail_r_het_hom_var: An outlier in ``r_het_hom_var``.
    - fail_r_insertion_deletion: An outlier in ``r_insertion_deletion``.

gq_fraction:
    - gq_fraction: The fraction of genotypes with GQ >= ``gq_thresh``.
    - is_passing: ``gq_fraction`` is greater than ``fraction_thresh``.

sex_info:
    - is_female: The imputed sex. ``True`` is for females, ``False`` is for males.
    - is_passing: ``reported sex == imputed sex``, or ``True`` when that comparison is ``Null``.
    - sex_check: ``reported sex == imputed sex`` or ``Null``.
    - f_stat: Inbreeding coefficient on the non-PAR X chromosome.
    - n_called: Number of genotypes considered.
    - expected_homs: Expected number of homozygotes.
    - observed_homs: Observed number of homozygotes.

ancestry:
    - ancestry_pop: The ancestry population label.

pcs:
    - scores: An array with the top ``k`` principal components.

s: sample ID
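
The threshold-based is_passing fields above can be illustrated in plain Python. This is a sketch of the rules as described on this page, not the toolkit's implementation. The SampleStats container and check_sample function are hypothetical, the default thresholds are taken from the --contamination-charr-thresh, --chimera-threshold, and --coverage-fraction-thresh options, and the sketch assumes the sex comparison is Null (None) whenever the reported or imputed sex is missing. The outlier flags in sample_qc_metrics are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleStats:
    """Per-sample values mirroring fields of sample_qc_stats.ht (hypothetical)."""
    charr: float
    chimera_rate: float
    gq_fraction: float
    imputed_is_female: Optional[bool]
    reported_is_female: Optional[bool] = None


def check_sample(stats, charr_thresh=0.05, chimera_threshold=0.05,
                 fraction_thresh=0.9):
    """Return the per-check pass flags described in the structure above."""
    # Assumption: the comparison is Null (None) when either sex is missing.
    if stats.reported_is_female is None or stats.imputed_is_female is None:
        sex_check = None
    else:
        sex_check = stats.reported_is_female == stats.imputed_is_female
    return {
        "contamination": stats.charr < charr_thresh,
        "chimera_reads": stats.chimera_rate < chimera_threshold,
        "gq_fraction": stats.gq_fraction > fraction_thresh,
        "sex_check": sex_check,
        # A Null sex check does not fail the sample.
        "sex_info": True if sex_check is None else sex_check,
    }


example = SampleStats(charr=0.01, chimera_rate=0.02, gq_fraction=0.95,
                      imputed_is_female=True, reported_is_female=False)
print(check_sample(example))
```

In this example the sample passes the contamination, chimera, and coverage checks but fails the sex check, because the reported and imputed sex disagree.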