A random split into two halves: the left part is the training set, the right part is the validation set
Left panel shows single split; right panel shows multiple splits
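The single split and the repeated splits described in these captions can be sketched in a few lines. This is only an illustration under stated assumptions: the synthetic data, the straight-line model and the helper names `validation_split` and `validation_mse` are made up here and do not come from the original figures.

```python
import numpy as np

def validation_split(n_samples, rng):
    """Randomly split the indices 0..n_samples-1 into two halves."""
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    return perm[:half], perm[half:]  # (training indices, validation indices)

def validation_mse(x, y, train_idx, val_idx, degree=1):
    """Fit a polynomial on the training half, score it on the validation half."""
    coefs = np.polyfit(x[train_idx], y[train_idx], degree)
    residuals = y[val_idx] - np.polyval(coefs, x[val_idx])
    return float(np.mean(residuals ** 2))

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.5 * x + rng.normal(scale=1.0, size=100)

# A single random split gives one estimate of the validation error.
train_idx, val_idx = validation_split(len(x), rng)
single_mse = validation_mse(x, y, train_idx, val_idx)

# Repeating the random split shows how much that estimate can vary.
repeated_mse = [
    validation_mse(x, y, *validation_split(len(x), rng)) for _ in range(10)
]
print(single_mse, min(repeated_mse), max(repeated_mse))
```

The spread between the smallest and largest value in `repeated_mse` is the variability that the multiple-splits panel is meant to convey.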
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
Validation | Train | Train | Train | Train |
The use of the term bootstrap derives from the phrase to pull oneself up by one’s bootstraps, widely thought to be based on the 18th-century book “The Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe:
The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.
It is not the same as the term “bootstrap” used in computer science meaning to “boot” a computer from a set of core instructions, though the derivation is similar.
A graphical illustration of the bootstrap approach on a small sample containing \(n = 3\) observations. Each bootstrap data set contains \(n\) observations, sampled with replacement from the original data set. Each bootstrap data set is used to obtain an estimate of \(\alpha\).
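The resampling step in this caption can be sketched as follows. The caption does not define \(\alpha\), so the statistic below (the mean of the first column) is a placeholder assumption, as are the toy data values and the function name `bootstrap_estimates`.

```python
import numpy as np

def bootstrap_estimates(data, statistic, n_boot, rng):
    """Evaluate `statistic` on `n_boot` bootstrap data sets drawn from `data`."""
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw n row indices with replacement: rows may repeat or be left out.
        idx = rng.integers(0, n, size=n)
        estimates[b] = statistic(data[idx])
    return estimates

# Toy sample with n = 3 observations of (X, Y); values are for illustration.
data = np.array([[4.3, 2.4], [2.1, 1.1], [5.3, 2.8]])

def statistic(sample):
    # Placeholder for the estimator of alpha (assumption): mean of column X.
    return sample[:, 0].mean()

rng = np.random.default_rng(0)
estimates = bootstrap_estimates(data, statistic, n_boot=1000, rng=rng)

# The spread of the bootstrap estimates approximates the standard error.
standard_error = estimates.std(ddof=1)
print(standard_error)
```

Swapping `statistic` for the actual estimator of \(\alpha\) leaves the resampling loop unchanged.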
Figure 1. Association of negative control signatures with overall survival. In plots A-C the NKI cohort was split into two groups using a signature of post-prandial laughter (panel A), localization of skin fibroblasts (panel B), social defeat in mice (panel C). In panels A-C, the fraction of patients alive (overall survival, OS) is shown as a function of time for both groups. Hazard ratios (HR) between groups and their associated p-values are given in bottom-left corners. Panel D depicts p-values for association with outcome for all MSigDB c2 signatures and random signatures of identical size as MSigDB c2 signatures.
Figure 2. Most published signatures are not significantly better outcome predictors than random signatures of identical size. The x-axis denotes the p-value of association with overall survival. Red dots stand for published signatures, yellow shapes depict the distribution of p-values for 1000 random signatures of identical size, with the lower 5% quantiles shaded in green and the median shown as black line. Signatures are ordered by increasing sizes.
Figure 4. Most prognostic transcriptional signals are correlated with meta-PCNA. A) Each point denotes a signature. The x-axis depicts the absolute value of the correlation of the first principal component of the signatures with meta-PCNA, the y-axis depicts the hazard ratio for outcome association. Details of the analysis for each data point are available in the Supporting Information (Text S1). B) Distribution of the correlations of individual genes with meta-PCNA, for genes significantly associated with overall survival (red) and for all the genes spotted on the microarrays (black).
sklearn.model_selection.cross_val_score
estimator : estimator object implementing ‘fit’
X : array-like
y : array-like, optional, default: None
groups : array-like, with shape (n_samples,), optional
scoring : string, callable or None, optional, default: None
cv : int, cross-validation generator or an iterable, optional
n_jobs : integer, optional
sklearn.model_selection.KFold(n_splits=3, shuffle=False, random_state=None)
sklearn.model_selection.LeaveOneOut()
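A minimal sketch of how cross_val_score can be combined with these two splitters through its cv parameter. The synthetic regression data, the choice of LinearRegression and the mean-squared-error scoring are assumptions made for illustration; any estimator implementing fit would work in their place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic data, invented for illustration only.
X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)
model = LinearRegression()

# 5-fold cross-validation: one score per fold.
kfold_scores = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)

# Leave-one-out: one score per observation.
loo_scores = cross_val_score(
    model, X, y,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE values, so flip the sign to report the error.
print(len(kfold_scores), -kfold_scores.mean())
print(len(loo_scores), -loo_scores.mean())
```

Mean squared error is used here because a score such as R² is not defined on a single held-out observation, which is what each leave-one-out iteration evaluates.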
Both KFold and LeaveOneOut are used in a loop of the form for train_index, test_index in kf.split(X): where kf is the splitter object. Its get_n_splits method returns the number of iterations that will occur.
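The loop and get_n_splits can be tried on a tiny array. This is a sketch only: the array X and the variable names kfold_tests and loo_tests are made up here, and n_splits is passed explicitly rather than relying on a default.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Six observations with two features each; values are arbitrary.
X = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3)
print(kf.get_n_splits(X))  # 3 iterations

kfold_tests = []
for train_index, test_index in kf.split(X):
    # Each iteration yields integer index arrays into X.
    print("train:", train_index, "test:", test_index)
    kfold_tests.append(test_index.tolist())

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 6 iterations, one per observation

loo_tests = [test_index.tolist() for _, test_index in loo.split(X)]
```

Without shuffling, KFold takes consecutive blocks of rows as the test folds, so `kfold_tests` here is `[[0, 1], [2, 3], [4, 5]]`.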