Sample QC Reference

This is the Python API documentation for the concordance component of the BGE Toolkit.

Use import bge_toolkit.qc to access this functionality.

sample_qc

If you want to use the default implementation as described in Sealock et. al. “Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses”, then you can use the function sample_qc() with Hail MatrixTable and Table inputs.

An example of using the Python API is:

>>> from bge_toolkit.qc import sample_qc
>>> import hail as hl
>>> out_dir = "gs://my-bucket/my-sample-qc-output/"
>>> exome_regions_path = "gs://jigold-batch-tmp-ezxyx/Twist_Alliance_Clinical_Research_Exome_Covered_Targets_hg38-34.9MB.bed"
>>> low_complexity_regions_path = "gs://jigold-batch-tmp-ezxyx/LCRFromHengHg38.bed"
>>> exome_regions = hl.import_locus_intervals(exome_regions_path, reference_genome='GRCh38')
>>> low_complexity_regions = hl.import_locus_intervals(low_complexity_regions_path, reference_genome='GRCh38')
>>> exo = hl.read_matrix_table("gs://my-bucket/my-data.mt")
>>> sample_qc_stats = sample_qc(mt=exo, out_dir=out_dir, is_gatk=False, exome_regions=exome_regions, low_complexity_regions=low_complexity_regions)
>>> sample_qc_stats.write(f'{out_dir}sample_qc.ht')
>>> sample_qc_stats.pcs.export(f'{out_dir}sample_pcs.tsv')
>>> sample_qc_stats.plot_pcs(1, 2)
>>> sample_qc_stats.plot_gq_fraction_density()
>>> sample_qc_stats.plot_qc_metric_boxplots()['r_ti_tv']

The following output files are written to out_dir:

  • sample_qc_stats.ht (Contains the raw sample qc data that can be loaded in Python with SampleQCResult.load())

  • pcs/pc1_pc2.png (Contains a plot of PC1 versus PC2 colored by ancestry population label)

  • pcs/pc1_pc3.png (Contains a plot of PC1 versus PC3 colored by ancestry population label)

  • pcs/pc2_pc3.png (Contains a plot of PC2 versus PC3 colored by ancestry population label)

  • qc/*_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged)

  • qc/*_density.png (Contains density plots of different QC metrics stratified by ancestry population label)

  • passing_sample_ids.tsv (A TSV file containing a list of sample IDs for samples that passed all QC metrics)

  • qc/*_pass_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged for passing samples only)

  • qc/*_pass_density.png (Contains density plots of different QC metrics stratified by ancestry population label for passing samples only)

The structure of sample_qc_stats.ht is as follows:

contamination:
    - charr: The CHARR statistic.
    - is_passing: ``charr`` is less than the ``charr_thresh``

chimera_reads:
    - chimera_rate: The rate of chimera reads.
    - is_passing: The rate of chimera reads is below ``threshold``.

sample_qc_metrics:
    - is_passing: Passes every statistic.
    - r_ti_tv: Transition / Transversion ratio.
    - n_singleton: Number of singletons.
    - n_insertion: Number of insertions.
    - n_deletion: Number of deletions.
    - n_transition: Number of transitions.
    - n_transversion: Number of transversions.
    - r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.
    - r_insertion_deletion: Ratio of insertions to deletions.
    - fail_r_ti_tv: An outlier in ``r_ti_tv``.
    - fail_n_singleton: An outlier in ``n_singleton``.
    - fail_n_insertion: An outlier in ``n_insertion``.
    - fail_n_deletion: An outlier in ``n_deletion``.
    - fail_n_transition: An outlier in ``n_transition``.
    - fail_n_transversion: An outlier in ``n_transversion``.
    - fail_r_het_hom_var: An outlier in ``r_het_hom_var``.
    - fail_r_insertion_deletion: An outlier in ``r_insertion_deletion``.

gq_fraction:
    - gq_fraction: The percentage of genotypes with GQ >= ``gq_thresh``.
    - is_passing: ``gq_fraction`` is greater than the ``fraction_thresh``

sex_info:
    - is_female: The imputed sex. ``True`` is for females, ``False`` is for males.
    - is_passing: Either ``reported sex == imputed sex`` or ``True``.
    - sex_check: ``reported sex == imputed sex`` or ``Null``.
    - f_stat: Inbreeding coefficient on the non-PAR X chromosome.
    - n_called: Number of genotypes considered.
    - expected_homs: Expected number of homozygotes.
    - observed_homs: Observed number of heterozygotes.

ancestry:
    - ancestry_pop: The ancestry population label.

pcs:
    - scores: An array with the top ``k`` principal components.

s: sample ID
bge_toolkit.qc.sample_qc(*, mt, out_dir, is_gatk, exome_regions, low_complexity_regions, hq_sites_dp_thresh=10, hq_sites_gq_thresh=20, hq_sites_ab_thresh=0.2, hq_sites_maf_thresh=0.01, hq_sites_mac_thresh=10, hq_sites_call_rate_thresh=0.95, pre_pruned_variants=None, r2_thresh=0.2, bp_window_size=1000000, sex_reported_is_female=None, sex_check_male_fhet_thresh=0.8, sex_check_female_fhet_thresh=0.2, pcs_k=10, ancestry_pca_loadings='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht', ancestry_onnx_rf='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.RF_fit.onnx', ancestry_num_pcs=20, ancestry_min_prob=0.75, chimera_rate=None, chimera_threshold=0.05, contamination_charr_thresh=0.05, contamination_min_af=0.05, contamination_max_af=0.95, contamination_min_dp=10, contamination_max_dp=100, contamination_min_gq=20, contamination_ref_af=None, coverage_gq_thresh=20, coverage_fraction_thresh=0.9, log=None, make_plots=True)

Run the default implementation of Sample QC.

Parameters:
  • mt (Union(MatrixTable, str)) – Hail MatrixTable or path to MatrixTable to QC.

  • out_dir (str) – Path to output directory to write sample QC results to.

  • is_gatk (bool) – True if data was called with GATK.

  • exome_regions (Table) – Table of locus intervals specifying valid exome regions.

  • low_complexity_regions (Table) – Table of locus intervals specifying low complexity regions to remove.

  • hq_sites_dp_thresh (int) – For finding high quality sites, the minimum DP threshold for which a genotype is kept. DP is the sum of AD if is_gatk=False.

  • hq_sites_gq_thresh (int) – For finding high quality sites, the minimum GQ threshold for a kept genotype.

  • hq_sites_ab_thresh (float) – For finding high quality sites, the allowable allelic balance range for a kept heterozygote genotype.

  • hq_sites_maf_thresh (float) – For finding high quality sites, the minimum minor allele frequency as computed by hl.variant_qc in order for a variant to be considered to be common.

  • hq_sites_mac_thresh (int) – For finding high quality sites, the minimum number of minor alleles observed as computed by hl.variant_qc in order for a variant to be considered common.

  • hq_sites_call_rate_thresh (float) – For finding high quality sites, the minimum call rate of a variant as computed by hl.variant_qc in order for a variant to be considered.

  • pre_pruned_variants (Optional(Table)) – For subsetting data to LD pruned sites, the list of pre-pruned variants to use.

  • r2_thresh (float) – For subsetting data to LD pruned sites, the R2 threshold to use when running hl.ld_prune.

  • bp_window_size (int) – For subsetting data to LD pruned sites, the window for considering variants to prune in base pairs.

  • sex_reported_is_female (Optional(BooleanExpression)) – For running a sex check, the reported sex as a Boolean Expression keyed by the Sample ID where True is Female and False is Male.

  • sex_check_male_fhet_thresh (float) – For running a sex check, the minimum Fstat threshold for calling a sample “Male”.

  • sex_check_female_fhet_thresh (float) – For running a sex check, the maximum Fstat threshold for calling a sample “Female”.

  • pcs_k (int) – For computing principal components, the number of principal components to compute.

  • ancestry_pca_loadings (str) – For computing ancestry pop labels, path to the gnomAD PCA loadings Hail Table.

  • ancestry_onnx_rf (str) – For computing ancestry pop labels, path to the gnomAD random forest model.

  • ancestry_num_pcs (int) – For computing ancestry pop labels, number of PCs to use when applying the random forest model. This is dependent on the onnx model.

  • ancestry_min_prob (float) – For computing ancestry pop labels, the minimum probability for a predicted population before giving a sample that label.

  • chimera_rate (Optional(NumericExpression)) – For filtering samples based on chimera rate, the chimera rate as a NumericExpression keyed by the sample ID.

  • chimera_threshold (float) – For filtering samples based on chimera rate, the maximum threshold to use at which a sample is considered passing.

  • contamination_charr_thresh (float) – For filtering samples based on contamination rate, the threshold for which a sample is considered to be passing from CHARR.

  • contamination_min_af (float) – For filtering samples based on contamination rate, the minimum allele frequency when using the compute_charr function in Hail.

  • contamination_max_af (float) – For filtering samples based on contamination rate, the maximum allele frequency when using the compute_charr function in Hail.

  • contamination_min_dp (int) – For filtering samples based on contamination rate, the minimum depth when using the compute_charr function in Hail.

  • contamination_max_dp (int) – For filtering samples based on contamination rate, the maximum depth when using the compute_charr function in Hail.

  • contamination_min_gq (int) – For filtering samples based on contamination rate, the minimum GQ when using the compute_charr function in Hail.

  • contamination_ref_af (Optional[NumericExpression]) – A float row field on the MatrixTable with the reference allele frequency when using the compute_charr function in Hail.

  • coverage_gq_thresh (int) – For filtering samples based on GQ coverage stats, the minimum GQ for a genotype to be considered passing.

  • coverage_fraction_thresh (float) – For filtering samples based on GQ coverage stats, the minimum rate at which GQ is above the minimum threshold.

  • log_path (Optional[str]) – Log file path to write to.

  • hail_init_kwargs (Optional[dict]) – Keyword arguments to hl.init().

  • log (Optional[logging.Logger]) – Logging object to log to.

  • make_plots (bool) – Save default QC plots to out_dir.

Returns:

A SampleQCResult object.

High Level Functions

The sample_qc() function is implemented in terms of the following three lower-level functions:

bge_toolkit.qc.select_high_quality_common_sites(mt, is_gatk, *, low_complexity_regions, exome_regions, dp_thresh=10, gq_thresh=20, ab_thresh=0.2, maf_thresh=0.01, mac_thresh=10, call_rate_thresh=0.95)

Select high quality sites for downstream sample QC.

Parameters:
  • mt (MatrixTable) – The MatrixTable to select high quality sites from.

  • is_gatk (bool) – True if the data was called using GATK. False if the data was called using DRAGEN.

  • low_complexity_regions (Table) – A Hail Table with intervals representing regions of the genome to filter out.

  • exome_regions (Table) – A Hail Table with intervals representing regions of the genome to keep.

  • dp_thresh (int) – The minimum DP threshold for which a genotype is kept. DP is the sum of AD if is_gatk=False.

  • gq_thresh (int) – The minimum GQ threshold for a kept genotype.

  • ab_thresh (float) – The allowable allelic balance range for a kept heterozygote genotype.

  • maf_thresh (float) – The minimum minor allele frequency as computed by hl.variant_qc in order for a variant to be considered to be common.

  • mac_thresh (int) – The minimum number of minor alleles observed as computed by hl.variant_qc in order for a variant to be considered common.

  • call_rate_thresh (float) – The minimum call rate of a variant as computed by hl.variant_qc in order for a variant to be considered.

Returns:

A Table with locus and alleles for common, LD-pruned high quality sites.

bge_toolkit.qc.infer_ancestry(mt, *, is_gatk=True, dp_thresh=10, gq_thresh=20, ab_thresh=0.2, pca_loadings='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht', onnx_rf='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.RF_fit.onnx', requester_pays_config=None, num_pcs=20, min_prob=0.75, log=None)

Use a Random Forest Model derived from gnomAD data to infer ancestry population labels.

Does basic QC automatically: - Removes genotypes that do not meet the DP or GQ threshold - Removes hetrozygote genotypes that do not meet the AB threshold - Removes variants with low call rates

The result is a Hail Table with the following fields:
  • mt.col_key: The column key of the input MatrixTable

  • ancestry: * ancestry_pop: The ancestry population label.

Parameters:
  • mt (MatrixTable) – The MatrixTable with the samples to infer ancestry from. This dataset should be filtered to high quality variants.

  • is_gatk (bool) – True if data was called with GATK.

  • dp_thresh (int) – The minimum DP threshold for which a genotype is kept. DP is the sum of AD if is_gatk=False.

  • gq_thresh (int) – The minimum GQ threshold for a kept genotype.

  • ab_thresh (float) – The allowable allelic balance range for a kept heterozygote genotype.

  • pca_loadings (str) – Path to the gnomAD PCA loadings Hail Table.

  • onnx_rf (str) – Path to the gnomAD random forest model.

  • num_pcs (int) – Number of PCs to use when applying the random forest model.

  • min_prob (float) – The minimum probability for a predicted population before giving a sample that label.

  • log (Optional[logging.Logger]) – An optional logging object.

Returns:

sample ID and a nested struct called ancestry with one field ancestry_pop.

Return type:

A Hail Table with two columns

bge_toolkit.qc.calculate_sample_qc_stats(*, mt, high_quality_sites, pre_pruned_variants=None, ld_prune_r2_thresh=0.2, ld_prune_bp_window_size=1000000, chimera_rate=None, chimera_threshold=0.05, contamination_charr_thresh=0.05, contamination_min_af=0.05, contamination_max_af=0.95, contamination_min_dp=10, contamination_max_dp=100, contamination_min_gq=20, contamination_ref_AF=None, pcs_k=10, ancestry_pop_labels=None, coverage_gq_thresh=20, coverage_fraction_thresh=0.9, sex_reported_is_female=None, sex_male_fhet_thresh=0.8, sex_female_fhet_thresh=0.2, log=None)

Calculate Sample QC statistics from a Hail MatrixTable.

Parameters:
  • mt (MatrixTable) – A Hail MatrixTable to compute sample QC statistics on.

  • high_quality_sites (Table) – A Hail Table with the list of high quality sites (common variants, in exome regions, not in low complexity regions).

  • pre_pruned_variants (Optional[Table]) – The path to a table with LD pruned variants.

  • ld_prune_r2_thresh (float) – The LD-Prune R2 value when filtering out variants in linkage disequilibrium.

  • ld_prune_bp_window_size (int) – The window size for LD pruning variants.

  • chimera_rate (NumericExpression) – A numeric expression containing the chimeric read rate.

  • chimera_threshold (float) – The maximum threshold for which to consider a sample passing.

  • contamination_charr_thresh (float) – The threshold for which a sample is considered to be passing.

  • contamination_min_af (float) – The minimum allele frequency when using the compute_charr function in Hail.

  • contamination_max_af (float) – The maximum allele frequency when using the compute_charr function in Hail.

  • contamination_min_dp (int) – The minimum depth when using the compute_charr function in Hail.

  • contamination_max_dp (int) – The maximum depth when using the compute_charr function in Hail.

  • contamination_min_gq (int) – The minimum GQ when using the compute_charr function in Hail.

  • contamination_ref_AF (Float64Expression) – A float row field on the MatrixTable with the reference allele frequency when using the compute_charr function in Hail.

  • pcs_k (int) – The number of principal components to compute.

  • ancestry_pop_labels (Optional(Table)) – A Hail Table with ancestry population labels such as projected from gnomAD. There should be a column ancestry with one field ancestry_pop.

  • coverage_gq_thresh (int) – The minimum GQ for a genotype to be considered passing.

  • coverage_fraction_thresh (float) – The minimum rate at which GQ is above the minimum threshold.

  • sex_reported_is_female (Optional[hl.BooleanExpression]) – A field from a dataset where True means the sample is “Female” and False means “Male”.

  • sex_male_fhet_thresh (float) – The minimum Fstat threshold for calling a sample “Male”.

  • sex_female_fhet_thresh (float) – The maximum Fstat threshold for calling a sample “Female”.

  • log (Optional[logging.Logger]) – An optional logging object.

Returns:

A SampleQCResult with plotting and exporting functionality.

Lower Level Functions

select_high_quality_common_sites(), infer_ancestry(), and calculate_sample_qc_stats() are implemented in terms of these lower level functions which you can use to build your own QC pipeline:

bge_toolkit.qc.filter_to_high_quality_exome_intervals(mt, low_complexity_regions, exome_regions, is_gatk=True)

Filter a MatrixTable to the exome with no low complexity regions.

Parameters:
  • mt (MatrixTable) – A Hail MatrixTable to filter the sites.

  • low_complexity_regions (Table) – A Table of intervals to filter out. Use import_locus_intervals from a BED file to generate the table.

  • exome_regions (Table) – A Table of intervals to filter to. Use import_locus_intervals from a BED file to generate the table.

  • is_gatk (bool) – True if the dataset was called using GATK.

Result:

A filtered MatrixTable to only exome regions without the low complexity regions of the genome.

bge_toolkit.qc.filter_contamination_rate(*, mt, charr_thresh=0.05, min_af=0.05, max_af=0.95, min_dp=10, max_dp=100, min_gq=20, ref_AF=None)

Identify samples with high contamination rates.

Uses the Hail compute_charr function to compute the estimate of contamination rate.

The result is a Hail Table with the following fields:
  • mt.col_key: The column key of the input MatrixTable

  • contamination: * charr: The CHARR statistic. * is_passing: charr is less than the charr_thresh

Parameters:
  • mt (MatrixTable) – A Hail MatrixTable to compute contamination statistics from.

  • charr_thresh (float) – The threshold for which a sample is considered to be passing.

  • min_af (float) – The minimum allele frequency when using the compute_charr function in Hail.

  • max_af (float) – The maximum allele frequency when using the compute_charr function in Hail.

  • min_dp (int) – The minimum depth when using the compute_charr function in Hail.

  • max_dp (int) – The maximum depth when using the compute_charr function in Hail.

  • min_gq (int) – The minimum GQ when using the compute_charr function in Hail.

  • ref_AF (Float64Expression) – A float row field on the MatrixTable with the reference allele frequency when using the compute_charr function in Hail.

Returns:

A Hail Table with the contamination statistic in a new column “contamination” that is a struct with one field “charr”.

bge_toolkit.qc.filter_chimeric_reads(*, chimera_rate, threshold)

Determine whether a sample passes based on chimeric read percentages.

The result is a Hail Table with the following fields:
  • ds.col_key: The column key of the input expression.

  • chimera_reads: * chimera_rate: The rate of chimera reads. * is_passing: The rate of chimera reads is below threshold.

Parameters:
  • chimera_rate (NumericExpression) – A numeric expression containing the chimeric read rate.

  • threshold (float) – The maximum threshold for which to consider a sample passing.

Returns:

A Hail Table with the chimeric read data along with an annotation for whether it is passing.

bge_toolkit.qc.filter_genotypes(mt, is_gatk, *, dp_thresh=10, gq_thresh=20, ab_thresh=0.2)

Filter genotypes that are of poor quality from an exome dataset.

If the dataset was called using DRAGEN, DP is the sum of AD and only heterozygote and homozygous alternate calls are subject to the filter on read depth.

Parameters:
  • mt (MatrixTable) – A MatrixTable to filter the genotypes of.

  • is_gatk (bool) – True if the dataset was called using GATK.

  • dp_thresh (int) – The minimum DP for a kept genotype. DP is computed as the sum of the “AD” field.

  • gq_thresh (int) – The minimum GQ threshold for a kept genotype.

  • ab_thresh (float) – The allowable allelic balance range for a kept heterozygote genotype.

Returns:

A MatrixTable with poor quality genotypes removed.

bge_toolkit.qc.filter_to_high_quality_common_sites(mt, *, maf_thresh=0.01, mac_thresh=10, call_rate_thresh=0.95)

Filter a MatrixTable to common, LD pruned sites.

Parameters:
  • mt (MatrixTable) – The MatrixTable to subset to filter to common, LD pruned sites.

  • maf_thresh (float) – The minimum minor allele frequency as computed by hl.variant_qc in order for a variant to be considered to be common.

  • mac_thresh (int) – The minimum number of minor alleles observed as computed by hl.variant_qc in order for a variant to be considered common.

  • call_rate_thresh (float) – The minimum call rate of a variant as computed by hl.variant_qc in order for a variant to be considered.

Returns:

A Hail MatrixTable subsetted to common, LD-pruned sites.

bge_toolkit.qc.compute_pcs(mt, *, k=10)

Compute principal components.

Only computed from autosomal variants.

The result is a Hail Table with the following fields:
  • mt.col_key: The column key of the input MatrixTable.

  • pcs: * scores: An array with the top k principal components.

Parameters:
  • mt (MatrixTable) – A Hail MatrixTable to compute principal components from. Should be LD pruned and filtered to high quality common variants.

  • k (int) – The number of principal components to compute.

Returns:

A Hail Table with the principal components.

bge_toolkit.qc.estimate_call_rate_through_gq_stats(mt, gq_thresh=20, fraction_thresh=0.9)

Flag samples with the proportion of calls with GQ less than a threshold being too low.

Only computed for autosomal variants.

The result is a Hail Table with the following fields:
  • mt.col_key: The column key of the input MatrixTable

  • gq_fraction: * gq_fraction: The percentage of genotypes with GQ >= gq_thresh. * is_passing: gq_fraction is greater than the fraction_thresh

Parameters:
  • mt (MatrixTable) – The MatrixTable to compute GQ stats from.

  • gq_thresh (int) – The minimum GQ for a genotype to be considered passing.

  • fraction_thresh (float) – The minimum rate at which GQ is above the minimum threshold.

Returns:

A Hail Table with the struct gq_fraction and include in the column gq_fraction.

bge_toolkit.qc.impute_and_check_sex(mt, maf_threshold=0.0, male_fhet_thresh=0.8, female_fhet_thresh=0.2, maf=None, reported_is_female=None)

Impute and Check Sex.

The result is a Hail Table with the following fields:
  • mt.col_key: The column key of the input MatrixTable

  • sex_info: * is_female: The imputed sex. True is for females, False is for males. * is_passing: Either reported sex == imputed sex or True. * sex_check: reported sex == imputed sex or Null. * f_stat: Inbreeding coefficient on the non-PAR X chromosome. * n_called: Number of genotypes considered. * expected_homs: Expected number of homozygotes. * observed_homs: Observed number of heterozygotes.

Parameters:
  • mt (MatrixTable) – A Hail MatrixTable to impute sex from.

  • maf_threshold (float) – The minor allele frequency threshold in order for a variant to be included in the analysis.

  • male_fhet_thresh (float) – The minimum Fstat threshold for calling a sample “Male”.

  • female_fhet_thresh (float) – The maximum Fstat threshold for calling a sample “Female”.

  • maf (Optional[hl.NumericExpression]) – A field defining the alternate allele frequency for each row. If

  • None

  • call. (AAF will be computed from)

  • reported_is_female (Optional[hl.BooleanExpression]) – A field from a dataset where True means the sample is “Female” and False means “Male”.

Returns:

A Hail Table with sex information annotated.

bge_toolkit.qc.identify_outliers_in_qc_stats(*, mt, ancestry_pop=None)

Identify outliers in QC statistics.

All statistics computed with hl.sample_qc.

The result is a Hail Table with the following fields:
  • mt.col_key: The column key of the input MatrixTable

  • sample_qc_metrics: * is_passing: Passes every statistic. * r_ti_tv: Transition / Transversion ratio. * n_singleton: Number of singletons. * n_insertion: Number of insertions. * n_deletion: Number of deletions. * n_transition: Number of transitions. * n_transversion: Number of transversions. * r_het_hom_var: Ratio of heterozygotes to number of homozygote variants. * r_insertion_deletion: Ratio of insertions to deletions. * fail_r_ti_tv: An outlier in r_ti_tv. * fail_n_singleton: An outlier in n_singleton. * fail_n_insertion: An outlier in n_insertion. * fail_n_deletion: An outlier in n_deletion. * fail_n_transition: An outlier in n_transition. * fail_n_transversion: An outlier in n_transversion. * fail_r_het_hom_var: An outlier in r_het_hom_var. * fail_r_insertion_deletion: An outlier in r_insertion_deletion.

Parameters:
  • mt (MatrixTable) – A Hail MatrixTable to compute sample QC statistics on. This should be LD-pruned, high quality variants.

  • ancestry_pop (Optional[StringExpression]) – A Hail Expression with the ancestry population label for each sample for computing outliers based on ancestry-stratified statistics.

Returns:

A Hail Table with sample QC statistics as well as outliers flagged for each metric.

SampleQCResult

A SampleQCResult is an object that provides a wrapper for interacting with the sample QC results by making plots and exploring the metrics.

class bge_toolkit.qc.SampleQCResult(table)

Bases: object

An interface for working with Sample QC Results.

property ancestry_pop

A HailExpression that contains the imputed ancestry population.

The result is a StringExpression.

See infer_ancestry() for the implementation and more details.

Returns:

A Hail StringExpression with the ancestry population label.

property chimera_reads

A HailExpression that contains the chimeric rate results.

The result is a StructExpression with the following fields:

  • chimera_rate: The rate of chimera reads.

  • is_passing: The rate of chimera reads is below threshold.

See filter_chimeric_reads() for the implementation and more details.

Returns:

A Hail StructExpression with the schema defined above.

close()

Unpersist any temporary tables.

property contamination

A HailExpression that contains the contamination results.

The result is a StructExpression with the following fields:

  • charr: The CHARR statistic.

  • is_passing: charr is less than the charr_thresh

See filter_contamination_rate() for the implementation and more details.

Returns:

A Hail StructExpression with the schema defined above.

export(path, exprs, named_exprs, *, delimiter='\t', missing='NA', header=True)

Export the sample QC results to a file.

Examples

>>> sample_qc_results.export('gs://my-bucket/analysis/sample_qc.tsv',
...                          exprs=[sample_qc_results.relatedness.is_passing],
...                          named_exprs={'r_ti_tv': sample_qc_results.qc_metrics.r_ti_tv,
...                                       'PC1': sample_qc_results.pcs.scores[0]})
Parameters:
  • path (str) – The file to export data to.

  • exprs (List[hl.Expression]) – A list of expressions to export.

  • named_exprs (Dict[str, hl.Expression]) – Named expressions to export.

  • delimiter (str) – The delimiter of the resulting exported file.

  • missing (str) – The value to use for missing values.

  • header (bool) – Include a header line.

filter_to(filter=None)

Filter the Sample QC results to a user-specified set of samples based on a BooleanExpression.

Parameters:

filter (hl.BooleanExpression) – A BooleanExpression using expressions derived from this SampleQCResult to identify selected samples.

Returns:

A SampleQCResult with only filtered samples.

property gq_fraction

A HailExpression that contains the GQ fraction results as a proxy for call rate.

The result is a StructExpression with the following fields:

  • gq_fraction: The percentage of genotypes with GQ >= gq_thresh.

  • is_passing: gq_fraction is greater than the fraction_thresh

See estimate_call_rate_through_gq_stats() for the implementation and more details.

Returns:

A Hail StructExpression with the schema defined above.

static load(path)

Load a SampleQCResult from a Hail Table on disk.

Parameters:

path (str) – The path at which the Hail Table resides.

Returns:

A SampleQCResult object.

passing_samples(is_passing=None)

Filter the Sample QC results to passing samples.

The default implementation for defining a passing sample is if it meets the following criteria:
  1. contamination.is_passing is True

  2. chimera_reads.is_passing is True or is missing.

  3. relatedness.is_passing is True for unrelated samples.

  4. sex_info.is_passing is True or is missing.

  5. gq_fraction.is_passing is True.

  6. qc_metrics.is_passing is True meaning the sample is not an outlier based on ancestry stratified QC metrics.

Parameters:

is_passing (hl.BooleanExpression) – A BooleanExpression using expressions derived from this SampleQCResult.

Returns:

A SampleQCResult with only passing samples.

property pcs

A HailExpression that contains the principal components.

The result is a StructExpression with the following fields:

  • scores: An array with the top k principal components.

See compute_pcs() for the implementation and more details.

Returns:

A Hail StructExpression with the schema defined above.

plot_chimera_reads()

Plot chimera reads as a boxplot.

Colors samples that fail the threshold as a different color.

Returns:

A plotnine ggplot object.

plot_gq_fraction_density()

Plot GQ fraction as a density plot.

Returns:

A plotnine ggplot object.

plot_pcs(pc_i, pc_j)

Plot principal components.

Parameters:
  • pc_i (int) – This is a 1-indexed principal component index. 1=First principal component.

  • pc_j (int) – This is a 1-indexed principal component index. 1=First principal component.

Returns:

A plotnine ggplot object.

plot_qc_metric_boxplots()

Plot QC metrics as boxplots.

Possible metrics:
  • r_ti_tv: Transition / Transversion ratio.

  • n_singleton: Number of singletons.

  • n_insertion: Number of insertions.

  • n_deletion: Number of deletions.

  • n_transition: Number of transitions.

  • n_transversion: Number of transversions.

  • r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.

  • r_insertion_deletion: Ratio of insertions to deletions.

Returns:

A Dictionary of metric name to plotnine ggplot object.

plot_qc_metric_density()

Plot QC metrics as density plots.

Possible metrics:
  • r_ti_tv: Transition / Transversion ratio.

  • n_singleton: Number of singletons.

  • n_insertion: Number of insertions.

  • n_deletion: Number of deletions.

  • n_transition: Number of transitions.

  • n_transversion: Number of transversions.

  • r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.

  • r_insertion_deletion: Ratio of insertions to deletions.

Returns:

A Dictionary of metric name to plotnine ggplot object.

property qc_metrics

A HailExpression that contains the sample qc metrics results with outliers flagged.

The result is a StructExpression with the following fields:

  • is_passing: Passes every statistic.

  • r_ti_tv: Transition / Transversion ratio.

  • n_singleton: Number of singletons.

  • n_insertion: Number of insertions.

  • n_deletion: Number of deletions.

  • n_transition: Number of transitions.

  • n_transversion: Number of transversions.

  • r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.

  • r_insertion_deletion: Ratio of insertions to deletions.

  • fail_r_ti_tv: An outlier in r_ti_tv.

  • fail_n_singleton: An outlier in n_singleton.

  • fail_n_insertion: An outlier in n_insertion.

  • fail_n_deletion: An outlier in n_deletion.

  • fail_n_transition: An outlier in n_transition.

  • fail_n_transversion: An outlier in n_transversion.

  • fail_r_het_hom_var: An outlier in r_het_hom_var.

  • fail_r_insertion_deletion: An outlier in r_insertion_deletion.

See identify_outliers_in_qc_stats() for the implementation and more details.

Returns:

A Hail StructExpression with the schema defined above.

property sample_id

A HailExpression that contains the key of the sample qc results.

The key in most cases is a single field s with the sample ID.

Returns:

A Hail StructExpression with the key of the sample QC results.

property sex_info

A HailExpression that contains the imputed sex and sex check results.

The result is a StructExpression with the following fields:

  • is_female: The imputed sex. True is for females, False is for males.

  • is_passing: Either reported sex == imputed sex or True.

  • sex_check: reported sex == imputed sex or Null.

  • f_stat: Inbreeding coefficient on the non-PAR X chromosome.

  • n_called: Number of genotypes considered.

  • expected_homs: Expected number of homozygotes.

  • observed_homs: Observed number of heterozygotes.

See impute_and_check_sex() for the implementation and more details.

Returns:

A Hail StructExpression with the schema defined above.

property table

Underlying Hail Table containing the Sample QC statistics.

Examples:

>>> t = qc_result.table
>>> t = t.annotate(my_field=...)
Returns:

A Hail Table with Sample QC statistics.

write(path, *, overwrite=True)

Write the underlying table to disk.

Parameters:
  • path (str) – The path to write the Hail Table to.

  • overwrite (bool) – Whether to overwrite the file if it already exists.