cvrisk {mboost} | R Documentation |
Cross-validated estimation of the empirical risk for hyper-parameter selection.
cvrisk(object, folds = cv(model.weights(object)), grid = 1:mstop(object),
       papply = if (require("multicore")) mclapply else lapply,
       fun = NULL, ...)

cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)
object |
an object of class |
folds |
a weight matrix with as many rows as there are observations
and one column per cross-validation run. Can be computed
using function |
grid |
a vector of stopping parameters the empirical risk is to be evaluated for. |
papply |
(parallel) apply function. In the absence of package |
fun |
if |
weights |
a numeric vector of weights for the model to be cross-validated. |
type |
character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented. |
B |
number of folds, per default 25 for |
prob |
proportion (between 0 and 1) of observations to be included in the learning samples for subsampling. |
strata |
a factor of the same length as |
... |
additional arguments passed to |
The number of boosting iterations is a hyper-parameter of the
boosting algorithms implemented in this package. This function
computes honest, i.e., cross-validated, estimates of the empirical
risk for different stopping parameters mstop. These estimates can
be used to choose an appropriate number of boosting iterations.
Different forms of cross-validation can be applied, for example
10-fold cross-validation or bootstrapping. The weights are defined
via the folds matrix: each column holds the weights for one
cross-validation run, and zero weights correspond to test cases.
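The layout of such a weight matrix can be sketched in base R without mboost. This is only a toy illustration under stated assumptions: the names n, k, w and fold_id are hypothetical, and the code is not claimed to be how cv builds its matrices internally.

```r
## Toy illustration in base R; n, k, w and fold_id are hypothetical names.
set.seed(1)
n <- 12                      # number of observations
k <- 4                       # number of cross-validation runs (folds)
w <- rep(1, n)               # original model weights (all ones here)

## assign each observation to exactly one fold at random
fold_id <- sample(rep(seq_len(k), length.out = n))

## column j holds the weights for run j: 0 marks a test case,
## the original weight marks a learning case
folds <- sapply(seq_len(k), function(j) ifelse(fold_id == j, 0, w))

dim(folds)            # n rows, k columns
colSums(folds == 0)   # number of test cases per run
rowSums(folds == 0)   # how often each observation is a test case
```

A matrix of this shape could then be supplied as the folds argument, in the same way the examples pass a matrix drawn with rmultinom.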
If package multicore is available, cvrisk can easily be run in
parallel on the available cores/processors by specifying
papply = mclapply. The scheduling can be changed via the
corresponding arguments of mclapply (passed through the dot
arguments).
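Any function with the calling convention of lapply, i.e. (X, FUN, ...), can serve as papply. The following is a minimal sketch of such a wrapper. It assumes the parallel package that ships with current R as a stand-in for multicore; my_papply and the choice of two cores are hypothetical and not part of mboost.

```r
library("parallel")   # assumption: used here in place of package multicore

## hypothetical wrapper with the same (X, FUN, ...) convention as lapply
my_papply <- function(X, FUN, ...) {
    ## forking is not available on Windows, so use a single core there
    cores <- if (.Platform$OS.type == "unix") 2L else 1L
    mclapply(X, FUN, ..., mc.cores = cores)
}

## check that the wrapper is a drop-in replacement for lapply
res <- my_papply(1:4, function(i) i^2)
identical(res, lapply(1:4, function(i) i^2))

## with a fitted model one would then call, for example:
## cvrisk(model, papply = my_papply)
```

Fixing the number of cores inside the wrapper keeps the cvrisk call itself unchanged, whereas scheduling options can alternatively be passed through the dot arguments as described above.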
The function cv can be used to build an appropriate weight matrix
for use with cvrisk. If strata is defined, sampling is performed
in each stratum separately, thus preserving the distribution of
the strata variable in each fold.
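The idea of sampling within each stratum can be illustrated with a small base R sketch for the k-fold case. The strata factor and all variable names below are made up for illustration, and the code is not claimed to reproduce the internals of cv.

```r
## Toy stratified k-fold assignment in base R (hypothetical data and names).
set.seed(2)
strata <- factor(rep(c("a", "b"), times = c(8, 4)))  # 8 cases in a, 4 in b
k <- 4
n <- length(strata)

## draw fold labels separately within each stratum
fold_id <- integer(n)
for (s in levels(strata)) {
    idx <- which(strata == s)
    fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

## weight matrix: 0 marks a test case, 1 a learning case
folds <- sapply(seq_len(k), function(j) as.numeric(fold_id != j))

## test cases per stratum and fold: the 2:1 ratio of a to b is kept
tab <- table(strata, fold = fold_id)
tab
```

With cv itself, the same effect is requested through the strata argument rather than by hand.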
An object of class cvrisk (when fun wasn't specified), basically a
matrix containing estimates of the empirical risk for a varying
number of boosting iterations. plot and print methods are
available, as well as an mstop method.
Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2005), The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675–699.
AIC.mboost for AIC-based selection of the stopping iteration.
Use mstop to extract the optimal stopping iteration from a
cvrisk object.
data("bodyfat", package = "mboost")

### fit linear model to data
model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

### AIC-based selection of number of boosting iterations
maic <- AIC(model)
maic

### inspect coefficient path and AIC-based stopping criterion
par(mai = par("mai") * c(1, 1, 1, 1.8))
plot(model)
abline(v = mstop(maic), col = "lightgray")

### 10-fold cross-validation
cv10f <- cv(model.weights(model), type = "kfold")
cvm <- cvrisk(model, folds = cv10f, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)

### 25 bootstrap iterations (manually)
set.seed(290875)
n <- nrow(bodyfat)
bs25 <- rmultinom(25, n, rep(1, n)/n)
cvm <- cvrisk(model, folds = bs25, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)

### same by default
set.seed(290875)
cvrisk(model, papply = lapply)

### 25 bootstrap iterations (using cv)
set.seed(290875)
bs25_2 <- cv(model.weights(model), type = "bootstrap")
all(bs25 == bs25_2)

### trees
blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
cvtree <- cvrisk(blackbox, papply = lapply)
plot(cvtree)

### cvrisk in parallel modes:

## Not run:
## multicore only runs properly on unix systems
library("multicore")
cvrisk(model)
## End(Not run)

## Not run:
## infrastructure needs to be set up in advance
library("snow")
cl <- makePVMcluster(25) # e.g. to run cvrisk on 25 nodes via PVM
myApply <- function(X, FUN, cl, ...) {
    clusterEvalQ(cl, library("mboost")) # load mboost on nodes
    ## further set up steps as required
    clusterApplyLB(cl = cl, X, FUN, ...)
}
cvrisk(model, papply = myApply, cl = cl)
stopCluster(cl)
## End(Not run)