| Title: | Bayesian Screening and Variable Selection |
|---|---|
| Description: | Performs Bayesian variable screening and selection for ultra-high dimensional linear regression models.Also contains an user friendly web application to perform multi trait GWAS. |
| Authors: | Dongjin Li [aut, cre], Debarshi Chakraborty [aut], Somak Dutta [aut], Vivekananda Roy [ctb] |
| Maintainer: | Dongjin Li <[email protected]> |
| License: | GPL-3 |
| Version: | 4.0.0 |
| Built: | 2026-06-09 00:13:52 UTC |
| Source: | https://github.com/dongjli/bravo |
Fits SVEN on (x, y) using the supplied tuned parameters.
basic.sven.model(x, y, params, seed = 2441139)basic.sven.model(x, y, params, seed = 2441139)
x |
Cleaned SNP matrix. |
y |
Phenotype vector. |
params |
Named numeric vector with lambda and w. |
seed |
Random seed (default 2441139). |
Perform Bayesian iterated screening in Gaussian regression models
bits(X, y, lam = 1, w = 0.5, pp = FALSE, max.var = nrow(X), verbose = TRUE)bits(X, y, lam = 1, w = 0.5, pp = FALSE, max.var = nrow(X), verbose = TRUE)
X |
An |
y |
The response vector of length |
lam |
The slab precision parameter. Default: |
w |
The prior inclusion probability of each variable. Default: |
pp |
Boolean: If |
max.var |
The maximum number of variables to be included. |
verbose |
If |
A list with components
model.pp |
An integer vector of the screened model. |
postprobs |
The sequence of posterior probabilities until the last included variable. |
lam |
The value of lam, the slab precision parameter. |
w |
The value of w, the prior inclusion probability. |
Wang, R., Dutta, S., Roy, V. (2021) Bayesian iterative screening in ultra-high dimensional settings. https://arxiv.org/abs/2107.10175
n=50; p=100; TrueBeta <- c(rep(5,3),rep(0,p-3)) rho <- 0.6 x1 <- matrix(rnorm(n*p), n, p) X <- sqrt(1-rho)*x1 + sqrt(rho)*rnorm(n) y <- 0.5 + X %*% TrueBeta + rnorm(n) res<-bits(X,y, pp=TRUE) res$model.pp # the vector of screened model res$postprobs # the log (unnormalized) posterior probabilities corresponding to the model.pp.n=50; p=100; TrueBeta <- c(rep(5,3),rep(0,p-3)) rho <- 0.6 x1 <- matrix(rnorm(n*p), n, p) X <- sqrt(1-rho)*x1 + sqrt(rho)*rnorm(n) y <- 0.5 + X %*% TrueBeta + rnorm(n) res<-bits(X,y, pp=TRUE) res$model.pp # the vector of screened model res$postprobs # the log (unnormalized) posterior probabilities corresponding to the model.pp.
Runs SVEN and UNITE for one trait.
bwas(x, y, params, bigx, ehits = 20)bwas(x, y, params, bigx, ehits = 20)
x |
Cleaned SNP matrix. |
y |
Phenotype vector. |
params |
Named numeric vector with lambda and w. |
bigx |
Full SNP matrix. |
ehits |
Expected number of hits (default 20). |
Times a single SVEN run on the given SNP matrix using a random phenotype.
calc.runtime(X)calc.runtime(X)
X |
SNP matrix. |
Removes duplicate SNPs and filters low MAF variants.
clean(SNPmat, MAF_threshold = 0.05)clean(SNPmat, MAF_threshold = 0.05)
SNPmat |
A sparse SNP matrix (dgCMatrix). |
MAF_threshold |
Minor Allele Frequency cutoff (default 0.05). |
Builds a grid of lambda and w tuning parameters for SVEN.
create_param_mat(x)create_param_mat(x)
x |
Cleaned SNP matrix. |
Opens the Sparse Matrix Converter GUI in your browser.
dense_to_sparse_converter()dense_to_sparse_converter()
Reads a numeric genotype file and converts it to a sparse matrix format.
dense2sparse(file.name, num.genotypes, separator, progress = TRUE)dense2sparse(file.name, num.genotypes, separator, progress = TRUE)
file.name |
Path to the numeric genotype file. Could be (and should be) gzipped. |
num.genotypes |
Maximum number of genotypes to read. An upper bound is OK. |
separator |
"\t" or "," etc that separates the entries in a line. |
progress |
Whether to show a progress bar (default TRUE). |
Computes False Discovery Rate using SNP correlation as proximity measure.
FDR_corrected(model, truth, x, threshold = 0.9)FDR_corrected(model, truth, x, threshold = 0.9)
model |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
x |
SNP matrix. |
threshold |
Correlation threshold (default 0.9). |
Computes False Discovery Rate using base-pair window as proximity measure.
FDR_WS(model, truth, mapmat, winsize = 1000)FDR_WS(model, truth, mapmat, winsize = 1000)
model |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
mapmat |
Map matrix with chromosome and position columns. |
winsize |
Window size in base pairs (default 1000). |
Computes False Positive Rate using SNP correlation as proximity measure.
FPR_corrected(model, truth, x, threshold = 0.9)FPR_corrected(model, truth, x, threshold = 0.9)
model |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
x |
SNP matrix. |
threshold |
Correlation threshold (default 0.9). |
Computes False Positive Rate using base-pair window as proximity measure.
FPR_WS(model, truth, mapmat, winsize = 1000)FPR_WS(model, truth, mapmat, winsize = 1000)
model |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
mapmat |
Map matrix with chromosome and position columns. |
winsize |
Window size in base pairs (default 1000). |
Computes Jaccard index between selected and true causal SNPs.
jcidx(vars, truth)jcidx(vars, truth)
vars |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
This function computes the marginal inclusion probabilities of all variables from a fitted "sven" object.
mip.sven(object, threshold = 0)mip.sven(object, threshold = 0)
object |
A fitted "sven" object |
threshold |
marginal inclusion probabilities above this threshold are stored. Default 0. |
The object returned is a data frame if the sven was run with a single matrix,
or a list of two data frames if sven was run with a list of two matrices.
The first column are the variable names (or numbers if column names of were absent).
Only the nonzero marginal inclusion probabilities are stored.
Somak Dutta
Maintainer:
Somak Dutta <[email protected]>
n <- 50; p <- 100; nonzero <- 3 trueidx <- 1:3 truebeta <- c(4,5,6) X <- matrix(rnorm(n*p), n, p) # n x p covariate matrix y <- 0.5 + X[,trueidx] %*% truebeta + rnorm(n) res <- sven(X=X, y=y) res$model.map # the MAP model mip.sven(res) Z <- matrix(rnorm(n*p), n, p) # another covariate matrix y2 = 0.5 + X[,trueidx] %*% truebeta + Z[,1:2] %*% c(-2,-2) + rnorm(n) res2 <- sven(X=list(X,Z), y=y2) mip.sven(res2) # two data frames, one for X and another for Zn <- 50; p <- 100; nonzero <- 3 trueidx <- 1:3 truebeta <- c(4,5,6) X <- matrix(rnorm(n*p), n, p) # n x p covariate matrix y <- 0.5 + X[,trueidx] %*% truebeta + rnorm(n) res <- sven(X=X, y=y) res$model.map # the MAP model mip.sven(res) Z <- matrix(rnorm(n*p), n, p) # another covariate matrix y2 = 0.5 + X[,trueidx] %*% truebeta + Z[,1:2] %*% c(-2,-2) + rnorm(n) res2 <- sven(X=list(X,Z), y=y2) mip.sven(res2) # two data frames, one for X and another for Z
Full training step: cleans SNP matrix and tunes SVEN hyperparameters.
parameter_selection( X, R2 = 0.5, betamax = 1, n.cores = max(1, parallel::detectCores() - 1), hitsize = "all", MAF_threshold = 0.05 )parameter_selection( X, R2 = 0.5, betamax = 1, n.cores = max(1, parallel::detectCores() - 1), hitsize = "all", MAF_threshold = 0.05 )
X |
Raw SNP matrix (dgCMatrix). |
R2 |
Heritability (default 0.5). |
betamax |
Maximum effect size magnitude (default 1). |
n.cores |
Number of cores (default: detectCores() - 1). |
hitsize |
One of "all", "small", "medium", "large" (default "all"). |
MAF_threshold |
MAF cutoff (default 0.05). |
Runs the full GWAS pipeline for the i-th trait column.
pipeline_single_trait(svenetics_trained_object, i, hitsize = NULL)pipeline_single_trait(svenetics_trained_object, i, hitsize = NULL)
svenetics_trained_object |
A trained svenetics object from parameter_selection(). |
i |
Trait index. |
hitsize |
One of "small", "medium", "large", or NULL. |
This function makes point predictions and computes prediction intervals from a fitted "sven" object.
## S3 method for class 'sven' predict( object, newdata, model = c("WAM", "MAP"), interval = c("none", "MC", "Z"), return.draws = FALSE, Nsim = 10000, level = 0.95, alpha = 1 - level, ... )## S3 method for class 'sven' predict( object, newdata, model = c("WAM", "MAP"), interval = c("none", "MC", "Z"), return.draws = FALSE, Nsim = 10000, level = 0.95, alpha = 1 - level, ... )
object |
A fitted "sven" object |
newdata |
Matrix of new values for |
model |
The model to be used to make predictions. Model "MAP" gives the predictions calculated using the MAP model; model "WAM" gives the predictions calculated using the WAM. Default: "WAM". |
interval |
Type of interval calculation. If |
return.draws |
only required if |
Nsim |
only required if |
level |
Confidence level of the interval. Default: 0.95. |
alpha |
Type one error rate. Default: 1- |
... |
Further arguments passed to or from other methods. |
The object returned depends on "interval" argument. If interval = "none", the object is an
vector of the point predictions; otherwise, the object is an
matrix with the point predictions in the first column and the lower and upper bounds
of prediction intervals in the second and third columns, respectively.
if return.draws is TRUE, a list with the following components is returned:
prediction |
vector or matrix as above |
mc.draws |
an |
Dongjin Li and Somak Dutta
Maintainer:
Dongjin Li <[email protected]>
Li, D., Dutta, S., Roy, V.(2020) Model Based Screening Embedded Bayesian Variable Selection for Ultra-high Dimensional Settings http://arxiv.org/abs/2006.07561
n = 80; p = 100; nonzero = 5 trueidx <- 1:5 nonzero.value <- c(0.50, 0.75, 1.00, 1.25, 1.50) TrueBeta = numeric(p) TrueBeta[trueidx] <- nonzero.value X <- matrix(rnorm(n*p), n, p) y <- 0.5 + X %*% TrueBeta + rnorm(n) res <- sven(X=X, y=y) newx <- matrix(rnorm(20*p), 20, p) # predicted values at a new data matrix using MAP model yhat <- predict(object = res, newdata = newx, model = "MAP", interval = "none") # 95% Monte Carlo prediction interval using WAM MC.interval <- predict(object = res, model = "WAM", newdata = newx, interval = "MC", level=0.95) # 95% Z-prediction interval using MAP model Z.interval <- predict(object = res, model = "MAP", newdata = newx, interval = "Z", level = 0.95)n = 80; p = 100; nonzero = 5 trueidx <- 1:5 nonzero.value <- c(0.50, 0.75, 1.00, 1.25, 1.50) TrueBeta = numeric(p) TrueBeta[trueidx] <- nonzero.value X <- matrix(rnorm(n*p), n, p) y <- 0.5 + X %*% TrueBeta + rnorm(n) res <- sven(X=X, y=y) newx <- matrix(rnorm(20*p), 20, p) # predicted values at a new data matrix using MAP model yhat <- predict(object = res, newdata = newx, model = "MAP", interval = "none") # 95% Monte Carlo prediction interval using WAM MC.interval <- predict(object = res, model = "WAM", newdata = newx, interval = "MC", level=0.95) # 95% Z-prediction interval using MAP model Z.interval <- predict(object = res, model = "MAP", newdata = newx, interval = "Z", level = 0.95)
Simulates a phenotype and evaluates all parameter combinations via Jaccard index.
run_all_params(k, x, R2 = 0.5, nspike = 20, betamax)run_all_params(k, x, R2 = 0.5, nspike = 20, betamax)
k |
Simulation replicate index. |
x |
Cleaned SNP matrix. |
R2 |
Heritability (default 0.5). |
nspike |
Number of causal SNPs to simulate (default 20). |
betamax |
Maximum effect size magnitude. |
SVEN is an approach to selecting variables with embedded screening using a Bayesian hierarchical model. It is also a variable selection method in the spirit of the stochastic shotgun search algorithm. However, by embedding a unique model based screening and using fast Cholesky updates, SVEN produces a highly scalable algorithm to explore gigantic model spaces and rapidly identify the regions of high posterior probabilities. It outputs the log (unnormalized) posterior probability of a set of best (highest probability) models. For more details, see Li et al. (2023, https://doi.org/10.1080/10618600.2022.2074428)
sven( X, y, w = NULL, lam = NULL, Ntemp = 10, Tmax = NULL, Miter = 50, wam.threshold = 0.5, log.eps = -16, L = 20, verbose = FALSE )sven( X, y, w = NULL, lam = NULL, Ntemp = 10, Tmax = NULL, Miter = 50, wam.threshold = 0.5, log.eps = -16, L = 20, verbose = FALSE )
X |
The |
y |
The response vector of length |
w |
The prior inclusion probability of each variable. Default: NULL, whence it is set as
|
lam |
The slab precision parameter. Default: NULL, whence it is set as |
Ntemp |
The number of temperatures. Default: 10. |
Tmax |
The maximum temperature. Default: |
Miter |
The number of iterations per temperature. Default: |
wam.threshold |
The threshold probability to select the covariates for WAM. A covariate will be included in WAM if its corresponding marginal inclusion probability is greater than the threshold. Default: 0.5. |
log.eps |
The tolerance to choose the number of top models. See detail. Default: -16. |
L |
The minimum number of neighboring models screened. Default: 20. |
verbose |
If |
SVEN is developed based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space as follows:
where is the submatrix of consisting of
those columns of for which and similarly, is the
subvector of corresponding to .
Degenerate spike priors on inactive variables and Gaussian slab priors on active
covariates makes the posterior
probability (up to a normalizing constant) of a model available in
explicit form (Li et al., 2020).
The variable selection starts from an empty model and updates the model according to the posterior probability of its neighboring models for some pre-specified number of iterations. In each iteration, the models with small probabilities are screened out in order to quickly identify the regions of high posterior probabilities. A temperature schedule is used to facilitate exploration of models separated by valleys in the posterior probability function, thus mitigate posterior multimodality associated with variable selection models. The default maximum temperature is guided by the asymptotic posterior model selection consistency results in Li et al. (2020).
SVEN provides the maximum a posteriori (MAP) model as well as the weighted average model
(WAM). WAM is obtained in the following way: (1) keep the best (highest probability)
distinct models with
where is chosen so that
;
(2) assign the weights
to the model ; (3) define the approximate marginal inclusion probabilities
for the th variable as
Then, the WAM is defined as the model containing variables with
. SVEN also provides all the top models which
are stored in an sparse matrix, along with their corresponding log (unnormalized)
posterior probabilities.
When X is a list with two matrices, say, W and Z, the above method is extended
to ncol(W)+ncol(Z) dimensional regression. However, the hyperparameters lam and w
are chosen separately for the two matrices, the default values being nrow(W)/ncol(W)^2
and nrow(Z)/ncol(Z)^2 for lam and sqrt(nrow(W))/ncol(W) and
sqrt(nrow(Z))/ncol(Z) for w.
The marginal inclusion probabities can be extracted by using the function mip.
A list with components
model.map |
A vector of indices corresponding to the selected variables in the MAP model. |
model.wam |
A vector of indices corresponding to the selected variables in the WAM. |
model.top |
A sparse matrix storing the top models. |
beta.map |
The ridge estimator of regression coefficients in the MAP model. |
beta.wam |
The ridge estimator of regression coefficients in the WAM. |
mip.map |
The marginal inclusion probabilities of the variables in the MAP model. |
mip.wam |
The marginal inclusion probabilities of the variables in the WAM. |
pprob.map |
The log (unnormalized) posterior probability corresponding to the MAP model. |
pprob.top |
A vector of the log (unnormalized) posterior probabilities corresponding to the top models. |
stats |
Additional statistics. |
Dongjin Li, Debarshi Chakraborty, and Somak Dutta
Maintainer:
Dongjin Li <[email protected]>
Li, D., Dutta, S., and Roy, V. (2023). Model based screening embedded Bayesian variable selection for ultra-high dimensional settings. Journal of Computational and Graphical Statistics, 32(1), 61-73.
[mip.sven()] for marginal inclusion probabilities, [predict.sven()](via [predict()]) for prediction for .
n <- 50; p <- 100; nonzero <- 3 trueidx <- 1:3 truebeta <- c(4,5,6) X <- matrix(rnorm(n*p), n, p) # n x p covariate matrix y <- 0.5 + X[,trueidx] %*% truebeta + rnorm(n) res <- sven(X=X, y=y) res$model.map # the MAP model Z <- matrix(rnorm(n*p), n, p) # another covariate matrix y2 = 0.5 + X[,trueidx] %*% truebeta + Z[,1:2] %*% c(-2,-2) + rnorm(n) res2 <- sven(X=list(X,Z), y=y2)n <- 50; p <- 100; nonzero <- 3 trueidx <- 1:3 truebeta <- c(4,5,6) X <- matrix(rnorm(n*p), n, p) # n x p covariate matrix y <- 0.5 + X[,trueidx] %*% truebeta + rnorm(n) res <- sven(X=X, y=y) res$model.map # the MAP model Z <- matrix(rnorm(n*p), n, p) # another covariate matrix y2 = 0.5 + X[,trueidx] %*% truebeta + Z[,1:2] %*% c(-2,-2) + rnorm(n) res2 <- sven(X=list(X,Z), y=y2)
Opens the SVENETICS GUI in your browser.
svenetics()svenetics()
Runs GWAS across all traits in parallel and saves results as CSVs.
Runs GWAS across all traits and saves results as CSVs.
svenetics_pipeline( svenetics_trained_object, traitfile, hitsizes = NULL, save_dir = "~/SVENETICS_RESULTS" )svenetics_pipeline( svenetics_trained_object, traitfile, hitsizes = NULL, save_dir = "~/SVENETICS_RESULTS" )
svenetics_trained_object |
A trained svenetics object from parameter_selection(). |
traitfile |
Data frame with sample IDs in column 1 and traits in remaining columns. |
hitsizes |
Character vector of hit sizes per trait, or NULL for "medium" across all. |
save_dir |
Directory path where results will be saved (default "~/SVENETICS_RESULTS"). |
Computes True Positive Rate using SNP correlation as proximity measure.
TPR_corrected(model, truth, x, threshold = 0.9)TPR_corrected(model, truth, x, threshold = 0.9)
model |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
x |
SNP matrix. |
threshold |
Correlation threshold (default 0.9). |
Computes True Positive Rate using base-pair window as proximity measure.
TPR_WS(model, truth, mapmat, winsize = 1000)TPR_WS(model, truth, mapmat, winsize = 1000)
model |
Integer vector of selected SNP indices. |
truth |
Integer vector of true causal SNP indices. |
mapmat |
Map matrix with chromosome and position columns. |
winsize |
Window size in base pairs (default 1000). |
Runs 100 parallel simulations and returns the optimal (lambda, w) pair.
tune.sven(x, R2 = 0.5, ehits = 20, betamax = 1, n.cores)tune.sven(x, R2 = 0.5, ehits = 20, betamax = 1, n.cores)
x |
Cleaned SNP matrix. |
R2 |
Heritability (default 0.5). |
ehits |
Expected number of causal SNPs. |
betamax |
Maximum effect size magnitude (default 1). |
n.cores |
Number of cores for parallel computation. |
Tunes SVEN parameters for small, medium, and large expected hit sizes.
tune.sven.all(x, R2 = 0.5, betamax = 1, n.cores)tune.sven.all(x, R2 = 0.5, betamax = 1, n.cores)
x |
Cleaned SNP matrix. |
R2 |
Heritability (default 0.5). |
betamax |
Maximum effect size magnitude (default 1). |
n.cores |
Number of cores for parallel computation. |
Propagates MIPs from SVEN hits to the full SNP matrix via LD.
unite.sven(x, bigx, basic_sven_object, y, ehits, threshold = 0)unite.sven(x, bigx, basic_sven_object, y, ehits, threshold = 0)
x |
Cleaned SNP matrix. |
bigx |
Full (uncleaned) SNP matrix. |
basic_sven_object |
Fitted SVEN object. |
y |
Phenotype vector. |
ehits |
Expected number of hits. |
threshold |
MIP threshold (default 0). |