Case Control Allele Frequency (AF) Estimation R Package
This repository contains the source code for the CaseControlAF R package which can be used to reconstruct the allele frequency (AF) for cases and controls separately given commonly available summary statistics.
The package contains two functions:
See full documentation, vignettes, and examples here: (https://wolffha.github.io/CCAFE_documentation/)
To install this package using BioConductor (Not yet available):
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("CCAFE")
To download this package using devtools in R:
require(devtools)
devtools::install_github("https://github.com/wolffha/CCAFE/")
Use this function when you have the following statistics (for each variant)
data: a dataframe with a row for each variant and columns for OR and total AF
N_case: an integer for the number of case samples
N_control: an integer for the number of control samples
OR_colname: a string containing the exact column name in ‘data’ with the OR
AF_total_colname: a string containing the exact column name in ‘data’ with the total AF
Returns a dataframe with two columns: AF_case and AF_control. The number of rows is equal to the number of variants.
Use this function when you have the following statistics (for each variant)
Code adapted from ReACt GroupFreq function available here: (https://github.com/Paschou-Lab/ReAct/blob/main/GrpPRS_src/CountConstruct.c)
data: a dataframe where each row is a variant and columns for the OR, SE, chromosome, and position
N_case: an integer for the number of case samples
N_control: an integer for the number of control samples
OR_colname: a string containing the exact column name in data with the odds ratios
SE_colname: a string containing the exact column name in data with the standard errors
position_colname: a string containing the exact column name in data with the positions of the variants
chromosome_colname: a string containing the exact column name in data with the chromosome of the variants. Note, sex chromosomes can be either characters (‘X’, ‘x’, ‘Y’, ‘y’) or numeric where X=23 and Y=24
sex_chromosomes: boolean, TRUE if variants from sex chromosome(s) are included in the dataset
do_correction: boolean, TRUE if data is provided to correct the estimates using proxy MAFs
remove_sex_chromosomes: boolean, TRUE if variants on sex chromosomes should be removed. This is only necessary if sex_chromosomes == TRUE and the number of XX/XY individuals per case and control sample is NOT known
CaseControl_SE has the following optional inputs:
If sex_chromosomes == TRUE and remove_sex_chromosomes == FALSE, then the following inputs are required:
N_XX_case: the number of XX chromosome case individuals
N_XX_control: the number of XX chromosome control individuals
N_XY_case: the number of XY chromosome case individuals
N_XY_control: the number of XY chromosome control individuals
If do_correction == TRUE, then data must be provided that includes harmonized data with proxy MAFs
correction_data: a dataframe with the following EXACT column names: CHR, POS, proxy_MAF, containing data for variants harmonized between the observed and proxy datasets
Returns the data dataframe with three additional columns with names: MAF_case, MAF_control and MAF_total containing the estimated minor allele frequency in the cases, controls, and total sample. The number of rows is equal to the number of variants. If proxyMAFs_colname is not NA, will include three additional columns containing the adjusted estimated MAFs (MAF_case_adj, MAF_control_adj, MAF_total_adj)
NOTE: This method assumes we are estimating the minor allele frequency (MAF)
See full documentation, vignettes, and examples here: (https://wolffha.github.io/CCAFE_documentation/)