Concordance Reference
This is the Python API documentation for the concordance component of the BGE Toolkit.
Use import bge_toolkit.qc to access this functionality.
JointCallSet
A JointCallSet is an object that contains operations performed on the
combination of an exome Hail MatrixTable and an imputation Hail MatrixTable.
- class bge_toolkit.qc.JointCallSet(exo, imp)
Bases:
object
A Python object representing the join between two MatrixTables.
The JointCallSet supports two operations:
Concordance
Merge (Not Implemented Yet)
Examples
# Necessary imports
>>> import hail as hl
>>> from bge_toolkit.qc import ALL_AGG, ALL_GROUP, JointCallSet, Statistic
# Create the joint callset
>>> exo = 'gs://my-bucket/exome_joint_called.mt'
>>> imp = 'gs://my-bucket/imputed_with_glimpse.mt'
>>> joint_callset = JointCallSet(exo, imp)
# Define a custom binning function
>>> def mac_bin(s: hl.StructExpression):
...     mac_bin = (hl.case()
...                .when(s.AC <= 1, '1')
...                .when(s.AC <= 5, '2-5')
...                .when(s.AC <= 10, '6-10')
...                .default('10+'))
...     return mac_bin
# Import a binning function
>>> from bge_toolkit.common import dp_bin
# Specify aggregation variables
>>> exo_row_group_by_var = {'exo_mac': mac_bin}
>>> exo_entry_agg_fields = {'DP': dp_bin}
>>> exo_col_group_by_var = {'pop': lambda s: s.pop}
# Compute global concordance and make a plot
>>> global_conc = joint_callset.global_concordance(exo_row_group_by_var=exo_row_group_by_var,
...                                                exo_col_group_by_var=exo_col_group_by_var,
...                                                exo_entry_agg_fields=exo_entry_agg_fields)
>>> global_view = global_conc.get_view(group_names=['exo_mac', 'pop'],
...                                    agg_names=['DP'],
...                                    ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> global_view.plot(statistic=Statistic.NONREF_CONCORDANCE)
>>> global_df = global_view.result()
# Compute variant concordance and export results for a specific variant
>>> variant_conc = joint_callset.variant_concordance(exo_col_group_by_var=exo_col_group_by_var,
...                                                  exo_entry_agg_fields=exo_entry_agg_fields)
>>> variant_view = variant_conc.get_view(key=dict(locus=hl.Locus('chr1', 3452452, reference_genome='GRCh38'), alleles=['A', 'G']),
...                                      group_names=[ALL_GROUP, 'pop'],
...                                      agg_names=[ALL_AGG, 'DP'])
>>> variant_view.export('gs://my-bucket/results.tsv')
# Compute sample concordance and get the results for all samples
>>> sample_conc = joint_callset.sample_concordance()
>>> sample_view = sample_conc.get_view()
>>> sample_view.result()
# When done, close to unpersist any cached data
>>> joint_callset.close()
- Parameters:
exo (hl.MatrixTable) – A MatrixTable that was generated from whole-exome sequencing.
imp (hl.MatrixTable) – A MatrixTable that was generated from imputation.
- close()
Clean up any cached datasets before exiting.
- full_outer_join(join_type=JoinType.outer)
Return a Hail MatrixTable with a full outer join.
- Parameters:
join_type (JoinType) – How to join the two datasets. One of “inner” or “outer”.
- Returns:
Equivalent of running hl.experimental.full_outer_join_mt()
- Return type:
hl.MatrixTable
- global_concordance(*, exo_row_group_by_var=None, imp_row_group_by_var=None, exo_col_group_by_var=None, imp_col_group_by_var=None, exo_row_agg_fields=None, imp_row_agg_fields=None, exo_col_agg_fields=None, imp_col_agg_fields=None, exo_entry_agg_fields=None, imp_entry_agg_fields=None, join_type=JoinType.inner)
Compute the global concordance with various grouping and aggregation variables.
Examples
# Import shortcuts for ungrouped groupings and aggregations
>>> from bge_toolkit.qc.utils import ALL_AGG, ALL_GROUP
# Define a custom binning function
>>> def mac_bin(s: hl.StructExpression):
...     mac_bin = (hl.case()
...                .when(s.AC <= 1, '1')
...                .when(s.AC <= 5, '2-5')
...                .when(s.AC <= 10, '6-10')
...                .default('10+'))
...     return mac_bin
# Or import a binning function
>>> from bge_toolkit.common import dp_bin
# Specify aggregation variables
>>> exo_row_group_by_var = {'exo_mac': mac_bin}
>>> exo_entry_agg_fields = {'DP': dp_bin}
>>> exo_col_group_by_var = {'pop': lambda s: s.pop}
# Compute global concordance and make a plot
>>> global_conc = joint_callset.global_concordance(exo_row_group_by_var=exo_row_group_by_var,
...                                                exo_col_group_by_var=exo_col_group_by_var,
...                                                exo_entry_agg_fields=exo_entry_agg_fields)
>>> global_view = global_conc.get_view(group_names=['exo_mac', 'pop'],
...                                    agg_names=['DP', ALL_AGG],
...                                    ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> global_view.plot()
Notes
The total ungrouped concordance can be accessed with ALL_GROUP and ALL_AGG.
- Parameters:
exo_row_group_by_var (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from a grouping variable name to a function that specifies how to take a struct representing mt.left_row (exome) in a full outer join and convert it to a binned variable for grouping.
imp_row_group_by_var (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from a grouping variable name to a function that specifies how to take a struct representing mt.right_row (imputation) in a full outer join and convert it to a binned variable for grouping.
exo_col_group_by_var (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from a grouping variable name to a function that specifies how to take a struct representing mt.left_col (exome) in a full outer join and convert it to a binned variable for grouping.
imp_col_group_by_var (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from a grouping variable name to a function that specifies how to take a struct representing mt.right_col (imputation) in a full outer join and convert it to a binned variable for grouping.
exo_row_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.left_row (exome) in a full outer join and convert it to a binned variable for aggregation.
imp_row_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.right_row (imputation) in a full outer join and convert it to a binned variable for aggregation.
exo_col_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.left_col (exome) in a full outer join and convert it to a binned variable for aggregation.
imp_col_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.right_col (imputation) in a full outer join and convert it to a binned variable for aggregation.
exo_entry_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.left_entry (exome) in a full outer join and convert it to a binned variable for aggregation.
imp_entry_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.right_entry (imputation) in a full outer join and convert it to a binned variable for aggregation.
join_type (JoinType) – Specify which variants to calculate concordance from. Can be one of “inner” or “outer”. “outer” is substantially more expensive.
- Returns:
A table with the global concordance results.
- Return type:
ConcordanceTable
- property n_sample_overlaps
Get the number of overlapping samples between the exome and imputation datasets.
- Returns:
The number of overlaps.
- Return type:
int
- property n_variant_overlaps
Get the number of overlapping variants between the exome and imputation datasets.
- Returns:
The number of overlaps.
- Return type:
int
- sample_concordance(*, join_type=JoinType.inner)
Compute the global sample concordance.
Examples
# Import shortcuts for ungrouped groupings and aggregations
>>> from bge_toolkit.qc.utils import ALL_AGG, ALL_GROUP
# Compute sample concordance and get the results for all samples
>>> sample_conc = joint_callset.sample_concordance()
>>> sample_view = sample_conc.get_view()
>>> sample_view.result()
- Parameters:
join_type (JoinType) – Specify which variants to calculate concordance from. Can be one of “inner” or “outer”.
- Returns:
A table with the concordance results for all samples.
- Return type:
ConcordanceTable
- sample_overlaps()
Get a Hail Table with the union of all samples along with their membership in each dataset.
- Returns:
A table of all samples with two columns: EXOME and IMPUTATION. Both columns are booleans that indicate a sample’s membership in each dataset.
- Return type:
hl.Table
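The membership table described above can be pictured with plain Python sets. This is a minimal sketch, not the toolkit's implementation: the sample IDs are made up, and sample_membership is a hypothetical helper that only mirrors the EXOME and IMPUTATION boolean columns.

```python
def sample_membership(exome_samples, imputation_samples):
    """Return {sample_id: {'EXOME': bool, 'IMPUTATION': bool}} for the union of samples."""
    exome = set(exome_samples)
    imputation = set(imputation_samples)
    return {
        s: {'EXOME': s in exome, 'IMPUTATION': s in imputation}
        for s in sorted(exome | imputation)
    }


# Hypothetical sample IDs, for illustration only.
membership = sample_membership(['S1', 'S2', 'S3'], ['S2', 'S3', 'S4'])

# Samples flagged in both columns are the overlapping samples.
n_overlaps = sum(m['EXOME'] and m['IMPUTATION'] for m in membership.values())

for sample_id, flags in membership.items():
    print(sample_id, flags)
print('overlapping samples:', n_overlaps)
```

In this picture, the count of rows where both flags are true plays the role of n_sample_overlaps.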
- variant_concordance(*, exo_col_group_by_var=None, imp_col_group_by_var=None, exo_col_agg_fields=None, imp_col_agg_fields=None, exo_entry_agg_fields=None, imp_entry_agg_fields=None, join_type=JoinType.inner)
Compute the variant concordance with various grouping and aggregation variables.
Examples
# Import shortcuts for ungrouped groupings and aggregations
>>> from bge_toolkit.qc.utils import ALL_AGG, ALL_GROUP
# Define a custom binning function
>>> def dp_bin(s: hl.StructExpression):
...     x = (hl.case()
...          .when(s.DP <= 10, 10)
...          .when(s.DP <= 20, 20)
...          .when(s.DP <= 30, 30)
...          .when(s.DP <= 40, 40)
...          .when(s.DP <= 50, 50)
...          .when(s.DP <= 60, 60)
...          .when(s.DP <= 70, 70)
...          .when(s.DP <= 80, 80)
...          .when(s.DP <= 90, 90)
...          .default(100))
...     return x
# Specify aggregation variables
>>> exo_entry_agg_fields = {'DP': dp_bin}
>>> exo_col_group_by_var = {'pop': lambda s: s.pop}
# Compute variant concordance and export results for a specific variant
>>> variant_conc = joint_callset.variant_concordance(exo_col_group_by_var=exo_col_group_by_var,
...                                                  exo_entry_agg_fields=exo_entry_agg_fields)
>>> variant_view = variant_conc.get_view(key=dict(locus=hl.Locus('chr1', 3452452, reference_genome='GRCh38'), alleles=['A', 'G']),
...                                      group_names=[ALL_GROUP, 'pop'],
...                                      agg_names=[ALL_AGG, 'DP'])
>>> variant_view.export('gs://my-bucket/results.tsv')
Notes
The total ungrouped concordance can be accessed with ALL_GROUP and ALL_AGG.
- Parameters:
exo_col_group_by_var (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from a grouping variable name to a function that specifies how to take a struct representing mt.left_col (exome) in a full outer join and convert it to a binned variable for grouping.
imp_col_group_by_var (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from a grouping variable name to a function that specifies how to take a struct representing mt.right_col (imputation) in a full outer join and convert it to a binned variable for grouping.
exo_col_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.left_col (exome) in a full outer join and convert it to a binned variable for aggregation.
imp_col_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.right_col (imputation) in a full outer join and convert it to a binned variable for aggregation.
exo_entry_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.left_entry (exome) in a full outer join and convert it to a binned variable for aggregation.
imp_entry_agg_fields (Optional[Dict[str, Callable[[hl.StructExpression], hl.Expression]]]) – An optional mapping from an aggregation variable name to a function that specifies how to take a struct representing mt.right_entry (imputation) in a full outer join and convert it to a binned variable for aggregation.
join_type (JoinType) – Specify which variants to calculate concordance from. Can be one of “inner” or “outer”.
- Returns:
A table with the concordance results for all variants.
- Return type:
ConcordanceTable
- variant_overlaps()
Get a Hail Table with all overlapping variants.
- Returns:
A table with all overlapping variants.
- Return type:
hl.Table
JoinType
An Enum that specifies what type of join to do. Inner joins are substantially cheaper than outer joins!
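The difference between the two join types can be illustrated with ordinary dictionaries keyed by variant. This is only a sketch under stated assumptions: join_variants, the variant keys, and the genotype labels are hypothetical, and the toolkit itself joins Hail MatrixTables rather than Python dicts.

```python
def join_variants(exome, imputation, how='inner'):
    """Join two {variant_key: genotype} dicts; a side with no record is filled with None."""
    if how == 'inner':
        keys = exome.keys() & imputation.keys()   # variants present in both datasets
    elif how == 'outer':
        keys = exome.keys() | imputation.keys()   # variants present in either dataset
    else:
        raise ValueError(f"unknown join type: {how!r}")
    return {k: (exome.get(k), imputation.get(k)) for k in sorted(keys)}


# Hypothetical variants keyed by (contig, position, ref, alt).
exome = {('chr1', 100, 'A', 'G'): 'HET', ('chr1', 200, 'C', 'T'): 'HOMREF'}
imputation = {('chr1', 200, 'C', 'T'): 'HET', ('chr1', 300, 'G', 'A'): 'HOMALT'}

inner = join_variants(exome, imputation, how='inner')
outer = join_variants(exome, imputation, how='outer')

print(len(inner), 'row(s) from the inner join')
print(len(outer), 'row(s) from the outer join')
```

The outer join keeps variants that appear in only one dataset, so it produces at least as many rows as the inner join; the one-sided rows carry a missing value for the other dataset.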
Selectors
- bge_toolkit.qc.ALL_GROUP
Use this global variable to select all variants and samples (no grouping).
- bge_toolkit.qc.ALL_AGG
Use this global variable to select all variants and samples for aggregation (no grouping).
Concordance
A stand-alone Python function that computes concordance identically to the CLI command:
$ bge-toolkit qc concordance
- bge_toolkit.qc.concordance(*, exome_path, imputation_path, out_dir, exome_maf=False, exome_mac=False, imputation_maf=False, imputation_mac=False, info_score=False, dp=False, gq=False, max_gp=False, qual_approx=False, sample_list=None, variant_list=None, contig=None, n_samples=None, n_variants=None, downsample_samples=None, downsample_variants=None, join_type=JoinType.inner, run_variants=False, run_samples=False, run_global=False, log_path=None, hail_init_kwargs=None, log=None, n_exome_partitions=None, n_imp_partitions=None)
Run the default implementation of concordance used in the CLI.
- Parameters:
exome_path (str) – Exome dataset to compare to.
imputation_path (str) – Imputation dataset to compare to.
out_dir (str) – Output directory.
exome_maf (bool) – Bin concordance counts by Exome Minor Allele Frequency.
exome_mac (bool) – Bin concordance counts by Exome Minor Allele Counts.
imputation_maf (bool) – Bin concordance counts by Imputation Minor Allele Frequency.
imputation_mac (bool) – Bin concordance counts by Imputation Minor Allele Counts.
info_score (bool) – Bin concordance counts by imputation INFO score.
dp (bool) – Bin concordance counts by exome genotype DP.
gq (bool) – Bin concordance counts by exome genotype GQ.
max_gp (bool) – Bin concordance counts by maximum value of imputation GP.
qual_approx (bool) – Bin concordance counts by variant Approx Qual Score.
sample_list (Optional[str]) – Filter to samples listed in the file.
variant_list (Optional[str]) – Filter to variants listed in the file.
contig (Optional[List[str]]) – Filter to variants in the listed contigs.
n_samples (Optional[int]) – Filter to the first N overlapping samples.
n_variants (Optional[int]) – Filter to the first N variants.
downsample_samples (Optional[float]) – Downsample to X fraction of samples.
downsample_variants (Optional[float]) – Downsample to X fraction of variants.
join_type (JoinType) – Join type.
run_variants (bool) – Run variant concordance.
run_samples (bool) – Run sample concordance.
run_global (bool) – Run global concordance.
log_path (Optional[str]) – Log file path to write to.
hail_init_kwargs (Optional[dict]) – Keyword arguments to hl.init().
log (Optional[logging.Logger]) – Logging object to log to.
n_exome_partitions (Optional[int]) – Number of partitions to load the exome dataset with if it’s a MatrixTable.
n_imp_partitions (Optional[int]) – Number of partitions to load the imputation dataset with if it’s a MatrixTable.
ConcordanceTable
A ConcordanceTable is an object that provides an interface for viewing concordance
results.
- class bge_toolkit.qc.ConcordanceTable(table)
Bases:
object
Result of running a concordance operation on a JointCallSet.
- agg_vars()
A mapping of the aggregation variable names and their types.
Example
>>> concordance_table.agg_vars()
{'DP': hl.tint32, 'INFO': hl.tfloat64}
- Returns:
A dictionary with the aggregation variable names and the corresponding types of their values.
- Return type:
Dict[str, HailType]
- close()
Unpersist any cached views.
- get_view(*, key=None, group_names=None, agg_names=None, ordering=None)
Subset the concordance results to a particular key, set of grouping variables, and set of aggregation variables.
Examples
>>> global_view = glob_conc_table.get_view(group_names=['exo_mac'],
...                                        agg_names=['DP', 'GQ'],
...                                        ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> global_view.plot()
>>> global_view.export('gs://my-bucket/my-file.tsv', sep='\t')
>>> variant_view = variant_conc_table.get_view(key=dict(locus=hl.Locus('chr1', 3423242, reference_genome='GRCh38'), alleles=['A', 'G']))
>>> sample_view = sample_conc_table.get_view(key=dict(s="NA12878"))
- Parameters:
key (Optional[Dict[str, Any]]) – The key to subset the data to. For a global concordance result, this is None. For a variant concordance result, this is the locus and alleles. For sample concordance, this is the sample ID.
group_names (Optional[List[str]]) – The group by variables to select. The default is all variants and all samples or no grouping.
agg_names (Optional[List[str]]) – The aggregation variables to select. The default is all genotypes.
ordering (Optional[Dict[str, List[str]]]) – A mapping from group name or agg name to an ordered list of the factors for that variable.
- Returns:
A slice of the data that can be exported or plotted.
- Return type:
ConcordanceView
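The ordering argument matters because bin labels such as '10+' and '2-5' are strings, and plain string sorting does not put them in numeric order. The sketch below shows the idea in plain Python; order_rows and the row layout are hypothetical and do not reflect how ConcordanceView stores its data.

```python
def order_rows(rows, ordering):
    """Sort row dicts by the position of each value in its ordered factor list."""
    def sort_key(row):
        return tuple(factors.index(row[name]) for name, factors in ordering.items())
    return sorted(rows, key=sort_key)


# Hypothetical rows keyed by a single grouping variable.
rows = [{'exo_mac': '10+'}, {'exo_mac': '1'}, {'exo_mac': '6-10'}, {'exo_mac': '2-5'}]
ordering = {'exo_mac': ['1', '2-5', '6-10', '10+']}

lexicographic = sorted(row['exo_mac'] for row in rows)
ordered = [row['exo_mac'] for row in order_rows(rows, ordering)]

print('string sort:  ', lexicographic)
print('with ordering:', ordered)
```

With an explicit ordering, the bins appear in the sequence listed in the mapping rather than in string order.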
- group_vars()
A mapping of the group by variable names and their types.
Example
>>> concordance_table.group_vars()
{'exome_mac': hl.tint32, 'exome_maf': hl.tfloat64}
- Returns:
A dictionary with the group by variable names and the corresponding types of their values.
- Return type:
Dict[str, HailType]
- static load(path)
Load a ConcordanceTable from a Hail Table containing the raw concordance output.
Examples
>>> conc_table = ConcordanceTable.load('gs://my-bucket/global-conc.ht')
- Parameters:
path (str) – The path of the Hail Table to import.
- Returns:
A class for interacting with the results of a concordance operation.
- Return type:
ConcordanceTable
- write(output_path, overwrite=True)
Write the underlying data to a Hail Table.
Example
>>> concordance_table.write('/tmp/my-output/my-data.ht')
- Parameters:
output_path (str) – Output path to write the underlying data to.
overwrite (bool) – Overwrite any existing files.
ConcordanceView
A ConcordanceView is an object that provides an interface for viewing concordance
results for a specific grouping and set of aggregation variables.
- class bge_toolkit.qc.ConcordanceView(*, table, group_vars, agg_vars, ordering)
Bases:
object
- export(path, fields=None, **kwargs)
Export the underlying Pandas dataframe to a file.
Notes
The **kwargs are passed to the “to_csv” method of a pandas DataFrame.
Examples
>>> view = conc_table.get_view(group_names=['exo_mac'],
...                            agg_names=['DP', 'GQ'],
...                            ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> view.export('gs://my-bucket/results.tsv', sep='\t')
The exported results.tsv contains all combinations of group variables and aggregation variables.
- Parameters:
path (str) – Path to export results to.
fields (Optional[Dict[str, str]]) – Fields to export along with the new field name.
kwargs (dict) – Optional kwargs to pass to “pd.DataFrame.to_csv()”
- export_all(output_dir, *args, fields=None, **kwargs)
Export all possible combinations of group and agg variables.
Examples
>>> view = conc_table.get_view(group_names=['exo_mac'],
...                            agg_names=['DP', 'GQ'],
...                            ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> view.export_all('gs://my-bucket/results/', sep='\t')
There is one table exported for each combination of grouping variable and aggregation variable. The files have the structure “GROUP_AGG.tsv”. For example, the view above would generate the following files:
exo_mac_DP.tsv
exo_mac_GQ.tsv
- Parameters:
output_dir (str) – Directory to write all files to.
fields (Optional[Dict[str, str]]) – Fields to export along with the new field name
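The “GROUP_AGG.tsv” naming scheme described for export_all can be reproduced with a short sketch. The helper export_all_filenames is hypothetical and only illustrates how one file name arises per combination of grouping and aggregation variable; it does not write any files.

```python
from itertools import product


def export_all_filenames(group_names, agg_names):
    """Return one 'GROUP_AGG.tsv' file name per (group, agg) combination."""
    return [f"{group}_{agg}.tsv" for group, agg in product(group_names, agg_names)]


names = export_all_filenames(['exo_mac'], ['DP', 'GQ'])
for name in names:
    print(name)
```

For the view in the example (one grouping variable, two aggregation variables), this yields the two file names listed above.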
- fields = ['concordance_rate', 'nonref_concordance_rate', 'f1_score', 'n_total', 'n_concordant', 'n_discordant,n_hets', 'n_hom_alts', 'het_to_hom_alt', 'hom_ref_to_het', 'het_to_hom_ref', 'MISSING_MISSING', 'MISSING_NOCALL', 'MISSING_HOMREF', 'MISSING_HET', 'MISSING_HOMALT', 'NOCALL_MISSING', 'NOCALL_NOCALL', 'NOCALL_HOMREF', 'NOCALL_HET', 'NOCALL_HOMALT', 'HOMREF_MISSING', 'HOMREF_NOCALL', 'HOMREF_HOMREF', 'HOMREF_HET', 'HOMREF_HOMALT', 'HET_MISSING', 'HET_NOCALL', 'HET_HOMREF', 'HET_HET', 'HET_HOMALT', 'HOMALT_MISSING', 'HOMALT_NOCALL', 'HOMALT_HOMREF', 'HOMALT_HET', 'HOMALT_HOMALT']
A list of all possible fields in the view.
- plot(*, path=None, statistic=Statistic.NONREF_CONCORDANCE, **kwargs)
Create a plot of the data that features a facet per aggregation variable in the view.
Notes
Only implemented for one grouping variable in the view. Otherwise, use plot_all or create a new view with one grouping variable.
Examples
>>> view = conc_table.get_view(group_names=['exo_mac'],
...                            agg_names=['DP', 'GQ'],
...                            ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> p = view.plot()
>>> p.save(filename='plot.png')
- Parameters:
path (Optional[str]) – An optional path to write the figure to.
statistic (Statistic) – A statistic to use as the plotting variable. One of Statistic.NONREF_CONCORDANCE or Statistic.F1_SCORE.
kwargs – Pass-through arguments to plotnine.ggplot.save().
- Returns:
A plot object.
- Return type:
plotnine.ggplot
- plot_all(*, out_dir=None, statistic=Statistic.NONREF_CONCORDANCE, **kwargs)
Create plots that feature a facet per aggregation variable in the view.
- Parameters:
out_dir (Optional[str]) – An optional directory to save plots to
statistic (Optional[Statistic]) – A statistic to use as the plotting variable.
kwargs – Keyword arguments to pass through to plotnine.ggplot.save().
- Returns:
A dictionary mapping a group name to the corresponding plot object.
- Return type:
Dict[str, plotnine.ggplot]
- result()
Generate a pandas DataFrame of the underlying view.
Examples
>>> view = conc_table.get_view(group_names=['exo_mac'],
...                            agg_names=['DP', 'GQ'],
...                            ordering={'exo_mac': ['1', '2-5', '6-10', '10+']})
>>> df = view.result()
>>> df.head(5)
Statistic
An Enum that specifies what type of concordance statistic to compute.
- class bge_toolkit.qc.Statistic(*values)
The statistic to use.
- nonref_concordance_rate
Non-reference concordance rate.
- concordance_rate
Concordance rate.
- f1_score
F1 score.
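To make the statistics concrete, the sketch below derives two rates from genotype-pair counts labelled like the ConcordanceView fields (for example 'HET_HOMALT'). The formulas are assumptions based on common definitions, not the toolkit's documented ones: only pairs where both genotypes are called are counted, non-reference concordance excludes HOMREF_HOMREF pairs, and the first label is assumed to be the exome genotype. The function concordance_rates and the counts are hypothetical.

```python
CALLED = ('HOMREF', 'HET', 'HOMALT')


def concordance_rates(counts):
    """Compute assumed concordance rates from {'LEFT_RIGHT': n} genotype-pair counts."""
    n_total = n_concordant = n_nonref = n_nonref_concordant = 0
    for left in CALLED:
        for right in CALLED:
            n = counts.get(f'{left}_{right}', 0)
            n_total += n
            if left == right:
                n_concordant += n
            # Assumed definition: a pair is non-reference if either genotype is not HOMREF.
            if left != 'HOMREF' or right != 'HOMREF':
                n_nonref += n
                if left == right:
                    n_nonref_concordant += n
    return {
        'concordance_rate': n_concordant / n_total if n_total else None,
        'nonref_concordance_rate': n_nonref_concordant / n_nonref if n_nonref else None,
    }


# Made-up counts, for illustration only.
counts = {'HOMREF_HOMREF': 90, 'HET_HET': 6, 'HOMALT_HOMALT': 2,
          'HET_HOMREF': 1, 'HOMREF_HET': 1}
rates = concordance_rates(counts)
print(rates)
```

Under these assumptions, homozygous-reference agreement dominates the overall rate, while the non-reference rate isolates how well variant genotypes agree.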
bge_toolkit.common
Binning Functions
- bge_toolkit.common.mac_bin(s)
MAC bin function.
Assumes the field “AC” exists in the input StructExpression.
Categories: [‘1’, ‘2-5’, ‘6-10’, ‘10+’]
- bge_toolkit.common.maf_bin(s)
MAF bin function.
Assumes the field “AF” exists in the input StructExpression.
Categories: [‘<1%’, ‘1-2%’, ‘2-5%’, ‘>5%’]
- bge_toolkit.common.gq_bin(s)
GQ bin function.
Assumes the field “GQ” exists in the input StructExpression.
Categories: [‘10’, ‘20’, ‘30’, ‘40’, ‘50’, ‘60’, ‘70’, ‘80’, ‘90’]
- bge_toolkit.common.dp_bin(s)
DP bin function.
Assumes the field “DP” exists in the input StructExpression.
Categories: [‘10’, ‘20’, ‘30’, ‘40’, ‘50’, ‘60’, ‘70’, ‘80’, ‘90’]
- bge_toolkit.common.max_gp_bin(s)
Max GP bin function.
Assumes the field “MAX_GP” exists in the input StructExpression.
Categories: [‘0.1’, ‘0.2’, ‘0.3’, ‘0.4’, ‘0.5’, ‘0.6’, ‘0.7’, ‘0.8’, ‘0.9’, ‘0.95’, ‘1.0’]
- bge_toolkit.common.qual_approx_bin(s)
Approx Qual bin function.
Assumes the field “QUALapprox” exists in the input StructExpression.
Categories: [‘10’, ‘20’, ‘30’, ‘40’, ‘50’, ‘60’, ‘70’, ‘80’, ‘90’]
- bge_toolkit.common.info_score_bin(s)
INFO score bin function.
Assumes the field “INFO” exists in the input StructExpression.
Categories: [‘0.1’, ‘0.2’, ‘0.3’, ‘0.4’, ‘0.5’, ‘0.6’, ‘0.7’, ‘0.8’, ‘0.9’, ‘0.95’, ‘1.0’]
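The binning functions above operate on Hail StructExpressions. As a plain-Python analogue, the sketch below applies the same thresholds as the custom mac_bin shown in the JointCallSet examples to ordinary integers. The name mac_bin_py and the allele counts are hypothetical, and the built-in bge_toolkit.common.mac_bin may handle edge cases differently.

```python
from collections import Counter


def mac_bin_py(ac):
    """Map an integer allele count to one of the categories '1', '2-5', '6-10', '10+'."""
    if ac <= 1:
        return '1'
    if ac <= 5:
        return '2-5'
    if ac <= 10:
        return '6-10'
    return '10+'


# Made-up allele counts, for illustration only.
allele_counts = [1, 3, 5, 6, 10, 11, 250]
bins = [mac_bin_py(ac) for ac in allele_counts]
bin_sizes = Counter(bins)

print(bins)
print(dict(bin_sizes))
```

As in the hl.case() chain, the conditions are checked in order, so each value falls into the first bin whose upper bound it does not exceed.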