armed.crossvalidation package#
Submodules#
armed.crossvalidation.grouped_cv module#
Custom scikit-learn K-fold splitters for grouped AND stratified splitting, created by hermidalc (https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-656094573), 12/28/2020.
- class armed.crossvalidation.grouped_cv.StratifiedGroupKFold(*args: Any, **kwargs: Any)#
Bases: _BaseKFold
Stratified K-Folds iterator variant with non-overlapping groups.
This cross-validation object is a variation of StratifiedKFold that returns stratified folds with non-overlapping groups. The folds are made by preserving the percentage of samples for each class.
The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds).
The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class.
Read more in the User Guide.
- Parameters:
n_splits (int, default=5) – Number of folds. Must be at least 2.
shuffle (bool, default=False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.
random_state (int or RandomState instance, default=None) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See Glossary.
Examples
>>> import numpy as np
>>> from sklearn.model_selection import StratifiedGroupKFold
>>> X = np.ones((17, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
>>> cv = StratifiedGroupKFold(n_splits=3)
>>> for train_idxs, test_idxs in cv.split(X, y, groups):
...     print("TRAIN:", groups[train_idxs])
...     print("      ", y[train_idxs])
...     print(" TEST:", groups[test_idxs])
...     print("      ", y[test_idxs])
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]
See also
StratifiedKFold
Takes class information into account to build folds which retain class distributions (for binary or multiclass classification tasks).
GroupKFold
K-fold iterator variant with non-overlapping groups.
- class armed.crossvalidation.grouped_cv.StratifiedGroupShuffleSplit(*args: Any, **kwargs: Any)#
Bases: StratifiedShuffleSplit
Stratified GroupShuffleSplit cross-validator.
Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.
This cross-validation object is a merge of GroupShuffleSplit and StratifiedShuffleSplit, which returns randomized folds stratified by group class. The folds are made by preserving the percentage of groups for each class.
Note: like the StratifiedShuffleSplit strategy, stratified random group splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
Read more in the User Guide.
- Parameters:
n_splits (int, default=5) – Number of re-shuffling & splitting iterations.
test_size (float, int, or None) – If float, should be between 0.0 and 1.0 and represent the proportion of groups to include in the test split (rounded up). If int, represents the absolute number of test groups. If None, the value is set to the complement of the train size. By default, the value is set to 0.1.
train_size (float, int, or None, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the groups to include in the train split. If int, represents the absolute number of train groups. If None, the value is automatically set to the complement of the test size.
random_state (int, RandomState instance or None, default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Examples
>>> import numpy as np
>>> from armed.crossvalidation.grouped_cv import StratifiedGroupShuffleSplit
>>> X = np.ones(shape=(15, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6])
>>> print(groups.shape)
(15,)
>>> sgss = StratifiedGroupShuffleSplit(n_splits=3, train_size=.7,
...                                    random_state=43)
>>> sgss.get_n_splits()
3
>>> for train_idx, test_idx in sgss.split(X, y, groups):
...     print("TRAIN:", groups[train_idx])
...     print("      ", y[train_idx])
...     print(" TEST:", groups[test_idx])
...     print("      ", y[test_idx])
TRAIN: [2 2 2 4 5 5 5 5 6 6]
       [1 1 1 0 1 1 1 1 0 0]
 TEST: [1 1 3 3 3]
       [0 0 1 1 1]
TRAIN: [1 1 2 2 2 3 3 3 4]
       [0 0 1 1 1 1 1 1 0]
 TEST: [5 5 5 5 6 6]
       [1 1 1 1 0 0]
TRAIN: [1 1 2 2 2 3 3 3 6 6]
       [0 0 1 1 1 1 1 1 0 0]
 TEST: [4 5 5 5 5]
       [0 1 1 1 1]
See also
GroupShuffleSplit
Shuffle-Group(s)-Out iterator.
StratifiedShuffleSplit
Stratified ShuffleSplit iterator.
- split(X, y, groups=None)#
Generate indices to split data into training and test set.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data (see the usage sketch after the Notes below).
y (array-like, shape (n_samples,)) – The target variable for supervised learning problems. Stratification is done based on the y labels.
groups (array-like, with shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.
- Yields:
train (ndarray) – The training set indices for that split.
test (ndarray) – The testing set indices for that split.
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
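Since y and groups fully determine the splits, a zeros array can stand in for X as noted above. A minimal usage sketch (the sample y and groups arrays are illustrative, not from the package):
>>> import numpy as np
>>> from armed.crossvalidation.grouped_cv import StratifiedGroupShuffleSplit
>>> y = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
>>> X_placeholder = np.zeros(len(y))  # y and groups alone drive the split
>>> sgss = StratifiedGroupShuffleSplit(n_splits=2, test_size=0.4,
...                                    random_state=0)
>>> for train_idx, test_idx in sgss.split(X_placeholder, y, groups):
...     # no group appears in both train and test
...     assert set(groups[train_idx]).isdisjoint(groups[test_idx])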
armed.crossvalidation.splitting module#
Classes for holding and splitting data with K-fold cross-validation
- class armed.crossvalidation.splitting.BasicKFoldUtil(x: numpy.ndarray, z: numpy.ndarray, y: numpy.ndarray, n_folds: int = 5, kfold_class=sklearn.model_selection.StratifiedKFold, seed=8)#
Bases: object
- create_folds() list #
Generate and store folds
- Returns:
list of tuples (x_train, z_train, y_train, x_val, z_val, y_val)
- Return type:
list
- get_fold(idx: int) tuple #
Get data for given fold
- Parameters:
idx (int) – fold number
- Returns:
(x_train, z_train, y_train, x_val, z_val, y_val)
- Return type:
tuple
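A usage sketch for BasicKFoldUtil (the data and shapes are illustrative; calling create_folds() before get_fold() is an assumption about the intended workflow, based on the method docs above):
>>> import numpy as np
>>> from armed.crossvalidation.splitting import BasicKFoldUtil
>>> rng = np.random.default_rng(8)
>>> x = rng.random((100, 4))        # features
>>> z = rng.integers(0, 3, 100)     # cluster/site labels (illustrative)
>>> y = rng.integers(0, 2, 100)     # binary targets used for stratification
>>> kfolds = BasicKFoldUtil(x, z, y, n_folds=5)
>>> folds = kfolds.create_folds()   # list of 5 (train, val) tuples
>>> x_train, z_train, y_train, x_val, z_val, y_val = kfolds.get_fold(0)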
- class armed.crossvalidation.splitting.NestedKFoldUtil(x: numpy.ndarray, z: numpy.ndarray, y: numpy.ndarray, n_folds_outer: int = 5, n_folds_inner: int = 5, kfold_class=sklearn.model_selection.StratifiedKFold, seed=8)#
Bases: BasicKFoldUtil
- create_folds() list #
Generate and store folds
- Returns:
list of dicts: {'outer': (x_train, z_train, y_train, x_val, z_val, y_val), 'inner': BasicKFoldUtil}
- Return type:
list
- get_fold(idx_outer: int, idx_inner: int | None = None) tuple #
Get data for given fold
- Parameters:
idx_outer (int) – Outer fold number
idx_inner (int, optional) – Inner fold number. If not provided, return the outer split.
- Returns:
(x_train, z_train, y_train, x_val, z_val, y_val). Returns the inner validation split if idx_inner is provided, or the outer split otherwise.
- Return type:
tuple
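A usage sketch for nested cross-validation (again with illustrative data; the two-level indexing follows the get_fold signature above):
>>> import numpy as np
>>> from armed.crossvalidation.splitting import NestedKFoldUtil
>>> rng = np.random.default_rng(8)
>>> x = rng.random((100, 4))
>>> z = rng.integers(0, 3, 100)
>>> y = rng.integers(0, 2, 100)
>>> nested = NestedKFoldUtil(x, z, y, n_folds_outer=5, n_folds_inner=3)
>>> folds = nested.create_folds()
>>> # Outer train/test split for outer fold 0
>>> x_tr, z_tr, y_tr, x_te, z_te, y_te = nested.get_fold(0)
>>> # Inner train/validation split 1 within outer fold 0 (e.g., for tuning)
>>> x_tr, z_tr, y_tr, x_val, z_val, y_val = nested.get_fold(0, 1)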
Module contents#
Custom cross-validation tools derived from those in sklearn