Sample QC Reference
This is the Python API documentation for the concordance component of the BGE Toolkit.
Use import bge_toolkit.qc to access this functionality.
sample_qc
If you want to use the default implementation as described in Sealock et. al. “Tutorial: guidelines for quality filtering of whole-exome and whole-genome sequencing data for population-scale association analyses”,
then you can use the function sample_qc() with Hail MatrixTable and Table inputs.
An example of using the Python API is:
>>> from bge_toolkit.qc import sample_qc
>>> import hail as hl
>>> out_dir = "gs://my-bucket/my-sample-qc-output/"
>>> exome_regions_path = "gs://jigold-batch-tmp-ezxyx/Twist_Alliance_Clinical_Research_Exome_Covered_Targets_hg38-34.9MB.bed"
>>> low_complexity_regions_path = "gs://jigold-batch-tmp-ezxyx/LCRFromHengHg38.bed"
>>> exome_regions = hl.import_locus_intervals(exome_regions_path, reference_genome='GRCh38')
>>> low_complexity_regions = hl.import_locus_intervals(low_complexity_regions_path, reference_genome='GRCh38')
>>> exo = hl.read_matrix_table("gs://my-bucket/my-data.mt")
>>> sample_qc_stats = sample_qc(mt=exo, out_dir=out_dir, is_gatk=False, exome_regions=exome_regions, low_complexity_regions=low_complexity_regions)
>>> sample_qc_stats.write(f'{out_dir}sample_qc.ht')
>>> sample_qc_stats.pcs.export(f'{out_dir}sample_pcs.tsv')
>>> sample_qc_stats.plot_pcs(1, 2)
>>> sample_qc_stats.plot_gq_fraction_density()
>>> sample_qc_stats.plot_qc_metric_boxplots()['r_ti_tv']
The following output files are written to out_dir:
sample_qc_stats.ht (Contains the raw sample qc data that can be loaded in Python with
SampleQCResult.load())pcs/pc1_pc2.png (Contains a plot of PC1 versus PC2 colored by ancestry population label)
pcs/pc1_pc3.png (Contains a plot of PC1 versus PC3 colored by ancestry population label)
pcs/pc2_pc3.png (Contains a plot of PC2 versus PC3 colored by ancestry population label)
qc/*_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged)
qc/*_density.png (Contains density plots of different QC metrics stratified by ancestry population label)
passing_sample_ids.tsv (A TSV file containing a list of sample IDs for samples that passed all QC metrics)
qc/*_pass_boxplot.png (Contains boxplots of different QC metrics stratified by ancestry population label with outliers flagged for passing samples only)
qc/*_pass_density.png (Contains density plots of different QC metrics stratified by ancestry population label for passing samples only)
The structure of sample_qc_stats.ht is as follows:
contamination:
- charr: The CHARR statistic.
- is_passing: ``charr`` is less than the ``charr_thresh``
chimera_reads:
- chimera_rate: The rate of chimera reads.
- is_passing: The rate of chimera reads is below ``threshold``.
sample_qc_metrics:
- is_passing: Passes every statistic.
- r_ti_tv: Transition / Transversion ratio.
- n_singleton: Number of singletons.
- n_insertion: Number of insertions.
- n_deletion: Number of deletions.
- n_transition: Number of transitions.
- n_transversion: Number of transversions.
- r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.
- r_insertion_deletion: Ratio of insertions to deletions.
- fail_r_ti_tv: An outlier in ``r_ti_tv``.
- fail_n_singleton: An outlier in ``n_singleton``.
- fail_n_insertion: An outlier in ``n_insertion``.
- fail_n_deletion: An outlier in ``n_deletion``.
- fail_n_transition: An outlier in ``n_transition``.
- fail_n_transversion: An outlier in ``n_transversion``.
- fail_r_het_hom_var: An outlier in ``r_het_hom_var``.
- fail_r_insertion_deletion: An outlier in ``r_insertion_deletion``.
gq_fraction:
- gq_fraction: The percentage of genotypes with GQ >= ``gq_thresh``.
- is_passing: ``gq_fraction`` is greater than the ``fraction_thresh``
sex_info:
- is_female: The imputed sex. ``True`` is for females, ``False`` is for males.
- is_passing: Either ``reported sex == imputed sex`` or ``True``.
- sex_check: ``reported sex == imputed sex`` or ``Null``.
- f_stat: Inbreeding coefficient on the non-PAR X chromosome.
- n_called: Number of genotypes considered.
- expected_homs: Expected number of homozygotes.
- observed_homs: Observed number of heterozygotes.
ancestry:
- ancestry_pop: The ancestry population label.
pcs:
- scores: An array with the top ``k`` principal components.
s: sample ID
- bge_toolkit.qc.sample_qc(*, mt, out_dir, is_gatk, exome_regions, low_complexity_regions, hq_sites_dp_thresh=10, hq_sites_gq_thresh=20, hq_sites_ab_thresh=0.2, hq_sites_maf_thresh=0.01, hq_sites_mac_thresh=10, hq_sites_call_rate_thresh=0.95, pre_pruned_variants=None, r2_thresh=0.2, bp_window_size=1000000, sex_reported_is_female=None, sex_check_male_fhet_thresh=0.8, sex_check_female_fhet_thresh=0.2, pcs_k=10, ancestry_pca_loadings='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht', ancestry_onnx_rf='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.RF_fit.onnx', ancestry_num_pcs=20, ancestry_min_prob=0.75, chimera_rate=None, chimera_threshold=0.05, contamination_charr_thresh=0.05, contamination_min_af=0.05, contamination_max_af=0.95, contamination_min_dp=10, contamination_max_dp=100, contamination_min_gq=20, contamination_ref_af=None, coverage_gq_thresh=20, coverage_fraction_thresh=0.9, log=None, make_plots=True)
Run the default implementation of Sample QC.
- Parameters:
mt (Union(MatrixTable, str)) – Hail MatrixTable or path to MatrixTable to QC.
out_dir (str) – Path to output directory to write sample QC results to.
is_gatk (bool) – True if data was called with GATK.
exome_regions (Table) – Table of locus intervals specifying valid exome regions.
low_complexity_regions (Table) – Table of locus intervals specifying low complexity regions to remove.
hq_sites_dp_thresh (int) – For finding high quality sites, the minimum DP threshold for which a genotype is kept. DP is the sum of AD if
is_gatk=False.hq_sites_gq_thresh (int) – For finding high quality sites, the minimum GQ threshold for a kept genotype.
hq_sites_ab_thresh (float) – For finding high quality sites, the allowable allelic balance range for a kept heterozygote genotype.
hq_sites_maf_thresh (float) – For finding high quality sites, the minimum minor allele frequency as computed by hl.variant_qc in order for a variant to be considered to be common.
hq_sites_mac_thresh (int) – For finding high quality sites, the minimum number of minor alleles observed as computed by hl.variant_qc in order for a variant to be considered common.
hq_sites_call_rate_thresh (float) – For finding high quality sites, the minimum call rate of a variant as computed by hl.variant_qc in order for a variant to be considered.
pre_pruned_variants (Optional(Table)) – For subsetting data to LD pruned sites, the list of pre-pruned variants to use.
r2_thresh (float) – For subsetting data to LD pruned sites, the R2 threshold to use when running
hl.ld_prune.bp_window_size (int) – For subsetting data to LD pruned sites, the window for considering variants to prune in base pairs.
sex_reported_is_female (Optional(BooleanExpression)) – For running a sex check, the reported sex as a Boolean Expression keyed by the Sample ID where
Trueis Female andFalseis Male.sex_check_male_fhet_thresh (float) – For running a sex check, the minimum Fstat threshold for calling a sample “Male”.
sex_check_female_fhet_thresh (float) – For running a sex check, the maximum Fstat threshold for calling a sample “Female”.
pcs_k (int) – For computing principal components, the number of principal components to compute.
ancestry_pca_loadings (str) – For computing ancestry pop labels, path to the gnomAD PCA loadings Hail Table.
ancestry_onnx_rf (str) – For computing ancestry pop labels, path to the gnomAD random forest model.
ancestry_num_pcs (int) – For computing ancestry pop labels, number of PCs to use when applying the random forest model. This is dependent on the onnx model.
ancestry_min_prob (float) – For computing ancestry pop labels, the minimum probability for a predicted population before giving a sample that label.
chimera_rate (Optional(NumericExpression)) – For filtering samples based on chimera rate, the chimera rate as a NumericExpression keyed by the sample ID.
chimera_threshold (float) – For filtering samples based on chimera rate, the maximum threshold to use at which a sample is considered passing.
contamination_charr_thresh (float) – For filtering samples based on contamination rate, the threshold for which a sample is considered to be passing from CHARR.
contamination_min_af (float) – For filtering samples based on contamination rate, the minimum allele frequency when using the
compute_charrfunction in Hail.contamination_max_af (float) – For filtering samples based on contamination rate, the maximum allele frequency when using the
compute_charrfunction in Hail.contamination_min_dp (int) – For filtering samples based on contamination rate, the minimum depth when using the
compute_charrfunction in Hail.contamination_max_dp (int) – For filtering samples based on contamination rate, the maximum depth when using the
compute_charrfunction in Hail.contamination_min_gq (int) – For filtering samples based on contamination rate, the minimum GQ when using the
compute_charrfunction in Hail.contamination_ref_af (Optional[NumericExpression]) – A float row field on the MatrixTable with the reference allele frequency when using the
compute_charrfunction in Hail.coverage_gq_thresh (int) – For filtering samples based on GQ coverage stats, the minimum GQ for a genotype to be considered passing.
coverage_fraction_thresh (float) – For filtering samples based on GQ coverage stats, the minimum rate at which GQ is above the minimum threshold.
log_path (Optional[str]) – Log file path to write to.
hail_init_kwargs (Optional[dict]) – Keyword arguments to hl.init().
log (Optional[logging.Logger]) – Logging object to log to.
make_plots (bool) – Save default QC plots to
out_dir.
- Returns:
A
SampleQCResultobject.
High Level Functions
The sample_qc() function is implemented in terms of the following three lower-level functions:
- bge_toolkit.qc.select_high_quality_common_sites(mt, is_gatk, *, low_complexity_regions, exome_regions, dp_thresh=10, gq_thresh=20, ab_thresh=0.2, maf_thresh=0.01, mac_thresh=10, call_rate_thresh=0.95)
Select high quality sites for downstream sample QC.
- Parameters:
mt (MatrixTable) – The MatrixTable to select high quality sites from.
is_gatk (bool) –
Trueif the data was called using GATK.Falseif the data was called using DRAGEN.low_complexity_regions (Table) – A Hail Table with intervals representing regions of the genome to filter out.
exome_regions (Table) – A Hail Table with intervals representing regions of the genome to keep.
dp_thresh (int) – The minimum DP threshold for which a genotype is kept. DP is the sum of AD if
is_gatk=False.gq_thresh (int) – The minimum GQ threshold for a kept genotype.
ab_thresh (float) – The allowable allelic balance range for a kept heterozygote genotype.
maf_thresh (float) – The minimum minor allele frequency as computed by hl.variant_qc in order for a variant to be considered to be common.
mac_thresh (int) – The minimum number of minor alleles observed as computed by hl.variant_qc in order for a variant to be considered common.
call_rate_thresh (float) – The minimum call rate of a variant as computed by hl.variant_qc in order for a variant to be considered.
- Returns:
A Table with locus and alleles for common, LD-pruned high quality sites.
- bge_toolkit.qc.infer_ancestry(mt, *, is_gatk=True, dp_thresh=10, gq_thresh=20, ab_thresh=0.2, pca_loadings='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.pca_loadings.ht', onnx_rf='gs://gcp-public-data--gnomad/release/4.0/pca/gnomad.v4.0.RF_fit.onnx', requester_pays_config=None, num_pcs=20, min_prob=0.75, log=None)
Use a Random Forest Model derived from gnomAD data to infer ancestry population labels.
Does basic QC automatically: - Removes genotypes that do not meet the DP or GQ threshold - Removes hetrozygote genotypes that do not meet the AB threshold - Removes variants with low call rates
- The result is a Hail Table with the following fields:
mt.col_key: The column key of the input MatrixTable
ancestry: * ancestry_pop: The ancestry population label.
- Parameters:
mt (MatrixTable) – The MatrixTable with the samples to infer ancestry from. This dataset should be filtered to high quality variants.
is_gatk (bool) – True if data was called with GATK.
dp_thresh (int) – The minimum DP threshold for which a genotype is kept. DP is the sum of AD if
is_gatk=False.gq_thresh (int) – The minimum GQ threshold for a kept genotype.
ab_thresh (float) – The allowable allelic balance range for a kept heterozygote genotype.
pca_loadings (str) – Path to the gnomAD PCA loadings Hail Table.
onnx_rf (str) – Path to the gnomAD random forest model.
num_pcs (int) – Number of PCs to use when applying the random forest model.
min_prob (float) – The minimum probability for a predicted population before giving a sample that label.
log (Optional[logging.Logger]) – An optional logging object.
- Returns:
sample ID and a nested struct called
ancestrywith one fieldancestry_pop.- Return type:
A Hail Table with two columns
- bge_toolkit.qc.calculate_sample_qc_stats(*, mt, high_quality_sites, pre_pruned_variants=None, ld_prune_r2_thresh=0.2, ld_prune_bp_window_size=1000000, chimera_rate=None, chimera_threshold=0.05, contamination_charr_thresh=0.05, contamination_min_af=0.05, contamination_max_af=0.95, contamination_min_dp=10, contamination_max_dp=100, contamination_min_gq=20, contamination_ref_AF=None, pcs_k=10, ancestry_pop_labels=None, coverage_gq_thresh=20, coverage_fraction_thresh=0.9, sex_reported_is_female=None, sex_male_fhet_thresh=0.8, sex_female_fhet_thresh=0.2, log=None)
Calculate Sample QC statistics from a Hail MatrixTable.
- Parameters:
mt (MatrixTable) – A Hail MatrixTable to compute sample QC statistics on.
high_quality_sites (Table) – A Hail Table with the list of high quality sites (common variants, in exome regions, not in low complexity regions).
pre_pruned_variants (Optional[Table]) – The path to a table with LD pruned variants.
ld_prune_r2_thresh (float) – The LD-Prune R2 value when filtering out variants in linkage disequilibrium.
ld_prune_bp_window_size (int) – The window size for LD pruning variants.
chimera_rate (NumericExpression) – A numeric expression containing the chimeric read rate.
chimera_threshold (float) – The maximum threshold for which to consider a sample passing.
contamination_charr_thresh (float) – The threshold for which a sample is considered to be passing.
contamination_min_af (float) – The minimum allele frequency when using the
compute_charrfunction in Hail.contamination_max_af (float) – The maximum allele frequency when using the
compute_charrfunction in Hail.contamination_min_dp (int) – The minimum depth when using the
compute_charrfunction in Hail.contamination_max_dp (int) – The maximum depth when using the
compute_charrfunction in Hail.contamination_min_gq (int) – The minimum GQ when using the
compute_charrfunction in Hail.contamination_ref_AF (Float64Expression) – A float row field on the MatrixTable with the reference allele frequency when using the
compute_charrfunction in Hail.pcs_k (int) – The number of principal components to compute.
ancestry_pop_labels (Optional(Table)) – A Hail Table with ancestry population labels such as projected from gnomAD. There should be a column
ancestrywith one fieldancestry_pop.coverage_gq_thresh (int) – The minimum GQ for a genotype to be considered passing.
coverage_fraction_thresh (float) – The minimum rate at which GQ is above the minimum threshold.
sex_reported_is_female (Optional[hl.BooleanExpression]) – A field from a dataset where
Truemeans the sample is “Female” andFalsemeans “Male”.sex_male_fhet_thresh (float) – The minimum Fstat threshold for calling a sample “Male”.
sex_female_fhet_thresh (float) – The maximum Fstat threshold for calling a sample “Female”.
log (Optional[logging.Logger]) – An optional logging object.
- Returns:
A
SampleQCResultwith plotting and exporting functionality.
Lower Level Functions
select_high_quality_common_sites(), infer_ancestry(), and calculate_sample_qc_stats()
are implemented in terms of these lower level functions which you can use to build your own QC pipeline:
- bge_toolkit.qc.filter_to_high_quality_exome_intervals(mt, low_complexity_regions, exome_regions, is_gatk=True)
Filter a MatrixTable to the exome with no low complexity regions.
- Parameters:
mt (MatrixTable) – A Hail MatrixTable to filter the sites.
low_complexity_regions (Table) – A Table of intervals to filter out. Use import_locus_intervals from a BED file to generate the table.
exome_regions (Table) – A Table of intervals to filter to. Use import_locus_intervals from a BED file to generate the table.
is_gatk (bool) – True if the dataset was called using GATK.
- Result:
A filtered MatrixTable to only exome regions without the low complexity regions of the genome.
- bge_toolkit.qc.filter_contamination_rate(*, mt, charr_thresh=0.05, min_af=0.05, max_af=0.95, min_dp=10, max_dp=100, min_gq=20, ref_AF=None)
Identify samples with high contamination rates.
Uses the Hail
compute_charrfunction to compute the estimate of contamination rate.- The result is a Hail Table with the following fields:
mt.col_key: The column key of the input MatrixTable
contamination: * charr: The CHARR statistic. * is_passing:
charris less than thecharr_thresh
- Parameters:
mt (MatrixTable) – A Hail MatrixTable to compute contamination statistics from.
charr_thresh (float) – The threshold for which a sample is considered to be passing.
min_af (float) – The minimum allele frequency when using the
compute_charrfunction in Hail.max_af (float) – The maximum allele frequency when using the
compute_charrfunction in Hail.min_dp (int) – The minimum depth when using the
compute_charrfunction in Hail.max_dp (int) – The maximum depth when using the
compute_charrfunction in Hail.min_gq (int) – The minimum GQ when using the
compute_charrfunction in Hail.ref_AF (Float64Expression) – A float row field on the MatrixTable with the reference allele frequency when using the
compute_charrfunction in Hail.
- Returns:
A Hail Table with the contamination statistic in a new column “contamination” that is a struct with one field “charr”.
- bge_toolkit.qc.filter_chimeric_reads(*, chimera_rate, threshold)
Determine whether a sample passes based on chimeric read percentages.
- The result is a Hail Table with the following fields:
ds.col_key: The column key of the input expression.
chimera_reads: * chimera_rate: The rate of chimera reads. * is_passing: The rate of chimera reads is below
threshold.
- Parameters:
chimera_rate (NumericExpression) – A numeric expression containing the chimeric read rate.
threshold (float) – The maximum threshold for which to consider a sample passing.
- Returns:
A Hail Table with the chimeric read data along with an annotation for whether it is passing.
- bge_toolkit.qc.filter_genotypes(mt, is_gatk, *, dp_thresh=10, gq_thresh=20, ab_thresh=0.2)
Filter genotypes that are of poor quality from an exome dataset.
If the dataset was called using DRAGEN,
DPis the sum ofADand only heterozygote and homozygous alternate calls are subject to the filter on read depth.- Parameters:
mt (MatrixTable) – A MatrixTable to filter the genotypes of.
is_gatk (bool) – True if the dataset was called using GATK.
dp_thresh (int) – The minimum DP for a kept genotype. DP is computed as the sum of the “AD” field.
gq_thresh (int) – The minimum GQ threshold for a kept genotype.
ab_thresh (float) – The allowable allelic balance range for a kept heterozygote genotype.
- Returns:
A MatrixTable with poor quality genotypes removed.
- bge_toolkit.qc.filter_to_high_quality_common_sites(mt, *, maf_thresh=0.01, mac_thresh=10, call_rate_thresh=0.95)
Filter a MatrixTable to common, LD pruned sites.
- Parameters:
mt (MatrixTable) – The MatrixTable to subset to filter to common, LD pruned sites.
maf_thresh (float) – The minimum minor allele frequency as computed by hl.variant_qc in order for a variant to be considered to be common.
mac_thresh (int) – The minimum number of minor alleles observed as computed by hl.variant_qc in order for a variant to be considered common.
call_rate_thresh (float) – The minimum call rate of a variant as computed by hl.variant_qc in order for a variant to be considered.
- Returns:
A Hail MatrixTable subsetted to common, LD-pruned sites.
- bge_toolkit.qc.compute_pcs(mt, *, k=10)
Compute principal components.
Only computed from autosomal variants.
- The result is a Hail Table with the following fields:
mt.col_key: The column key of the input MatrixTable.
pcs: * scores: An array with the top
kprincipal components.
- Parameters:
mt (MatrixTable) – A Hail MatrixTable to compute principal components from. Should be LD pruned and filtered to high quality common variants.
k (int) – The number of principal components to compute.
- Returns:
A Hail Table with the principal components.
- bge_toolkit.qc.estimate_call_rate_through_gq_stats(mt, gq_thresh=20, fraction_thresh=0.9)
Flag samples with the proportion of calls with GQ less than a threshold being too low.
Only computed for autosomal variants.
- The result is a Hail Table with the following fields:
mt.col_key: The column key of the input MatrixTable
gq_fraction: * gq_fraction: The percentage of genotypes with GQ >=
gq_thresh. * is_passing:gq_fractionis greater than thefraction_thresh
- Parameters:
mt (MatrixTable) – The MatrixTable to compute GQ stats from.
gq_thresh (int) – The minimum GQ for a genotype to be considered passing.
fraction_thresh (float) – The minimum rate at which GQ is above the minimum threshold.
- Returns:
A Hail Table with the struct
gq_fractionandincludein the columngq_fraction.
- bge_toolkit.qc.impute_and_check_sex(mt, maf_threshold=0.0, male_fhet_thresh=0.8, female_fhet_thresh=0.2, maf=None, reported_is_female=None)
Impute and Check Sex.
- The result is a Hail Table with the following fields:
mt.col_key: The column key of the input MatrixTable
sex_info: * is_female: The imputed sex.
Trueis for females,Falseis for males. * is_passing: Eitherreported sex == imputed sexorTrue. * sex_check:reported sex == imputed sexorNull. * f_stat: Inbreeding coefficient on the non-PAR X chromosome. * n_called: Number of genotypes considered. * expected_homs: Expected number of homozygotes. * observed_homs: Observed number of heterozygotes.
- Parameters:
mt (MatrixTable) – A Hail MatrixTable to impute sex from.
maf_threshold (float) – The minor allele frequency threshold in order for a variant to be included in the analysis.
male_fhet_thresh (float) – The minimum Fstat threshold for calling a sample “Male”.
female_fhet_thresh (float) – The maximum Fstat threshold for calling a sample “Female”.
maf (Optional[hl.NumericExpression]) – A field defining the alternate allele frequency for each row. If
None
call. (AAF will be computed from)
reported_is_female (Optional[hl.BooleanExpression]) – A field from a dataset where
Truemeans the sample is “Female” andFalsemeans “Male”.
- Returns:
A Hail Table with sex information annotated.
- bge_toolkit.qc.identify_outliers_in_qc_stats(*, mt, ancestry_pop=None)
Identify outliers in QC statistics.
All statistics computed with
hl.sample_qc.- The result is a Hail Table with the following fields:
mt.col_key: The column key of the input MatrixTable
sample_qc_metrics: * is_passing: Passes every statistic. * r_ti_tv: Transition / Transversion ratio. * n_singleton: Number of singletons. * n_insertion: Number of insertions. * n_deletion: Number of deletions. * n_transition: Number of transitions. * n_transversion: Number of transversions. * r_het_hom_var: Ratio of heterozygotes to number of homozygote variants. * r_insertion_deletion: Ratio of insertions to deletions. * fail_r_ti_tv: An outlier in
r_ti_tv. * fail_n_singleton: An outlier inn_singleton. * fail_n_insertion: An outlier inn_insertion. * fail_n_deletion: An outlier inn_deletion. * fail_n_transition: An outlier inn_transition. * fail_n_transversion: An outlier inn_transversion. * fail_r_het_hom_var: An outlier inr_het_hom_var. * fail_r_insertion_deletion: An outlier inr_insertion_deletion.
- Parameters:
mt (MatrixTable) – A Hail MatrixTable to compute sample QC statistics on. This should be LD-pruned, high quality variants.
ancestry_pop (Optional[StringExpression]) – A Hail Expression with the ancestry population label for each sample for computing outliers based on ancestry-stratified statistics.
- Returns:
A Hail Table with sample QC statistics as well as outliers flagged for each metric.
SampleQCResult
A SampleQCResult is an object that provides a wrapper for interacting
with the sample QC results by making plots and exploring the metrics.
- class bge_toolkit.qc.SampleQCResult(table)
Bases:
objectAn interface for working with Sample QC Results.
- property ancestry_pop
A HailExpression that contains the imputed ancestry population.
The result is a StringExpression.
See
infer_ancestry()for the implementation and more details.- Returns:
A Hail StringExpression with the ancestry population label.
- property chimera_reads
A HailExpression that contains the chimeric rate results.
The result is a StructExpression with the following fields:
chimera_rate: The rate of chimera reads.
is_passing: The rate of chimera reads is below
threshold.
See
filter_chimeric_reads()for the implementation and more details.- Returns:
A Hail StructExpression with the schema defined above.
- close()
Unpersist any temporary tables.
- property contamination
A HailExpression that contains the contamination results.
The result is a StructExpression with the following fields:
charr: The CHARR statistic.
is_passing:
charris less than thecharr_thresh
See
filter_contamination_rate()for the implementation and more details.- Returns:
A Hail StructExpression with the schema defined above.
- export(path, exprs, named_exprs, *, delimiter='\t', missing='NA', header=True)
Export the sample QC results to a file.
Examples
>>> sample_qc_results.export('gs://my-bucket/analysis/sample_qc.tsv', ... exprs=[sample_qc_results.relatedness.is_passing], ... named_exprs={'r_ti_tv': sample_qc_results.qc_metrics.r_ti_tv, ... 'PC1': sample_qc_results.pcs.scores[0]})
- Parameters:
path (str) – The file to export data to.
exprs (List[hl.Expression]) – A list of expressions to export.
named_exprs (Dict[str, hl.Expression]) – Named expressions to export.
delimiter (str) – The delimiter of the resulting exported file.
missing (str) – The value to use for missing values.
header (bool) – Include a header line.
- filter_to(filter=None)
Filter the Sample QC results to a user-specified set of samples based on a BooleanExpression.
- Parameters:
filter (hl.BooleanExpression) – A BooleanExpression using expressions derived from this
SampleQCResultto identify selected samples.- Returns:
A
SampleQCResultwith only filtered samples.
- property gq_fraction
A HailExpression that contains the GQ fraction results as a proxy for call rate.
The result is a StructExpression with the following fields:
gq_fraction: The percentage of genotypes with GQ >=
gq_thresh.is_passing:
gq_fractionis greater than thefraction_thresh
See
estimate_call_rate_through_gq_stats()for the implementation and more details.- Returns:
A Hail StructExpression with the schema defined above.
- static load(path)
Load a
SampleQCResultfrom a Hail Table on disk.- Parameters:
path (str) – The path at which the Hail Table resides.
- Returns:
A
SampleQCResultobject.
- passing_samples(is_passing=None)
Filter the Sample QC results to passing samples.
- The default implementation for defining a passing sample is if it meets the following criteria:
contamination.is_passingis Truechimera_reads.is_passingis True or is missing.relatedness.is_passingis True for unrelated samples.sex_info.is_passingis True or is missing.gq_fraction.is_passingis True.qc_metrics.is_passingis True meaning the sample is not an outlier based on ancestry stratified QC metrics.
- Parameters:
is_passing (hl.BooleanExpression) – A BooleanExpression using expressions derived from this
SampleQCResult.- Returns:
A
SampleQCResultwith only passing samples.
- property pcs
A HailExpression that contains the principal components.
The result is a StructExpression with the following fields:
scores: An array with the top
kprincipal components.
See
compute_pcs()for the implementation and more details.- Returns:
A Hail StructExpression with the schema defined above.
- plot_chimera_reads()
Plot chimera reads as a boxplot.
Colors samples that fail the threshold as a different color.
- Returns:
A plotnine ggplot object.
- plot_gq_fraction_density()
Plot GQ fraction as a density plot.
- Returns:
A plotnine ggplot object.
- plot_pcs(pc_i, pc_j)
Plot principal components.
- Parameters:
pc_i (int) – This is a 1-indexed principal component index. 1=First principal component.
pc_j (int) – This is a 1-indexed principal component index. 1=First principal component.
- Returns:
A plotnine ggplot object.
- plot_qc_metric_boxplots()
Plot QC metrics as boxplots.
- Possible metrics:
r_ti_tv: Transition / Transversion ratio.
n_singleton: Number of singletons.
n_insertion: Number of insertions.
n_deletion: Number of deletions.
n_transition: Number of transitions.
n_transversion: Number of transversions.
r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.
r_insertion_deletion: Ratio of insertions to deletions.
- Returns:
A Dictionary of metric name to plotnine ggplot object.
- plot_qc_metric_density()
Plot QC metrics as density plots.
- Possible metrics:
r_ti_tv: Transition / Transversion ratio.
n_singleton: Number of singletons.
n_insertion: Number of insertions.
n_deletion: Number of deletions.
n_transition: Number of transitions.
n_transversion: Number of transversions.
r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.
r_insertion_deletion: Ratio of insertions to deletions.
- Returns:
A Dictionary of metric name to plotnine ggplot object.
- property qc_metrics
A HailExpression that contains the sample qc metrics results with outliers flagged.
The result is a StructExpression with the following fields:
is_passing: Passes every statistic.
r_ti_tv: Transition / Transversion ratio.
n_singleton: Number of singletons.
n_insertion: Number of insertions.
n_deletion: Number of deletions.
n_transition: Number of transitions.
n_transversion: Number of transversions.
r_het_hom_var: Ratio of heterozygotes to number of homozygote variants.
r_insertion_deletion: Ratio of insertions to deletions.
fail_r_ti_tv: An outlier in
r_ti_tv.fail_n_singleton: An outlier in
n_singleton.fail_n_insertion: An outlier in
n_insertion.fail_n_deletion: An outlier in
n_deletion.fail_n_transition: An outlier in
n_transition.fail_n_transversion: An outlier in
n_transversion.fail_r_het_hom_var: An outlier in
r_het_hom_var.fail_r_insertion_deletion: An outlier in
r_insertion_deletion.
See
identify_outliers_in_qc_stats()for the implementation and more details.- Returns:
A Hail StructExpression with the schema defined above.
- property sample_id
A HailExpression that contains the key of the sample qc results.
The key in most cases is a single field s with the sample ID.
- Returns:
A Hail StructExpression with the key of the sample QC results.
- property sex_info
A HailExpression that contains the imputed sex and sex check results.
The result is a StructExpression with the following fields:
is_female: The imputed sex.
Trueis for females,Falseis for males.is_passing: Either
reported sex == imputed sexorTrue.sex_check:
reported sex == imputed sexorNull.f_stat: Inbreeding coefficient on the non-PAR X chromosome.
n_called: Number of genotypes considered.
expected_homs: Expected number of homozygotes.
observed_homs: Observed number of heterozygotes.
See
impute_and_check_sex()for the implementation and more details.- Returns:
A Hail StructExpression with the schema defined above.
- property table
Underlying Hail Table containing the Sample QC statistics.
Examples:
>>> t = qc_result.table >>> t = t.annotate(my_field=...)
- Returns:
A Hail Table with Sample QC statistics.
- write(path, *, overwrite=True)
Write the underlying table to disk.
- Parameters:
path (str) – The path to write the Hail Table to.
overwrite (bool) – Whether to overwrite the file if it already exists.