The gcatest package provides an implementation of the
Genotype Conditional Association Test (GCAT) (Song, Hao, and Storey 2015). GCAT is a test for
genetic association that is powered by Logistic Factor Analysis (LFA)
(Hao, Song, and Storey 2016). LFA is a
method of modeling population structure in a genome wide association
study. GCAT performs a test for association between each SNP and a trait
(either quantitative or binary). We have shown that GCAT is robust to
confounding from population structure.
We include a sample dataset with the package. sim_geno
is a simulated genotype matrix. sim_trait is a simulated
trait. There are 10,000 SNPs and 1,000 individuals. The first five SNPs
are associated with the trait. This simulations were done under the
Pritchard-Stephens-Donnelly model with \(K=3\), with Dirichlet parameter \(\alpha=0.1\) and variance allotment in the
trait corresponding to \(5\%\) genetic,
\(5\%\) environmental, and \(90\%\) noise. This dataset is simulated
under identical parameters as the PSD simulation in Figure 2 of the
paper (Song, Hao, and Storey 2015), except
that we have adjusted the size of the simulation to be appropriate for a
small demo.
gcatThe first step of gcat is to estimate the logistic
factors:
Then, we call the gcat function, which returns a vector
of p-values:
We can look at the p-values for the associated SNPs:
And also plot the histogram of the unassociated SNPs:
library(ggplot2)
dat <- data.frame(p = gcat_p[6:10000])
ggplot(dat, aes(p, after_stat(density))) + geom_histogram(binwidth=1/20) + theme_bw()The genio package provides the function
read_plink for parsing PLINK binary genotypes (extension:
.bed) into an R object of the format needed for the
gcat function. A BEDMatrix object (from the
eponymous function and package) is also supported, and can result in
reduced memory usage (at a small runtime penalty).