armed.crossvalidation package#

Submodules#

armed.crossvalidation.grouped_cv module#

Custom scikit-learn KFold splitters for grouped AND stratified splitting, created by hermidalc (https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-656094573), retrieved 12/28/2020.

class armed.crossvalidation.grouped_cv.StratifiedGroupKFold(*args: Any, **kwargs: Any)#

Bases: _BaseKFold

Stratified K-Folds iterator variant with non-overlapping groups.

This cross-validation object is a variation of StratifiedKFold that returns stratified folds with non-overlapping groups. The folds are made by preserving the percentage of samples for each class.

The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds).

The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class.

Read more in the User Guide.

Parameters:
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • shuffle (bool, default=False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

  • random_state (int or RandomState instance, default=None) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See Glossary.

Examples

>>> import numpy as np
>>> from armed.crossvalidation.grouped_cv import StratifiedGroupKFold
>>> X = np.ones((17, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
>>> cv = StratifiedGroupKFold(n_splits=3)
>>> for train_idxs, test_idxs in cv.split(X, y, groups):
...     print("TRAIN:", groups[train_idxs])
...     print("      ", y[train_idxs])
...     print(" TEST:", groups[test_idxs])
...     print("      ", y[test_idxs])
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]

See also

StratifiedKFold

Takes class information into account to build folds which retain class distributions (for binary or multiclass classification tasks).

GroupKFold

K-fold iterator variant with non-overlapping groups.
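Because the class implements the scikit-learn splitter interface, it can be passed as the cv argument to standard utilities such as cross_val_score. A minimal sketch with made-up data (the estimator and array shapes here are illustrative assumptions, not part of this package):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from armed.crossvalidation.grouped_cv import StratifiedGroupKFold

# Hypothetical data: 100 samples in 10 groups of 10 (e.g., subjects or sites)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)
print(scores.mean())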

class armed.crossvalidation.grouped_cv.StratifiedGroupShuffleSplit(*args: Any, **kwargs: Any)#

Bases: StratifiedShuffleSplit

Stratified GroupShuffleSplit cross-validator.

Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.

This cross-validation object is a merge of GroupShuffleSplit and StratifiedShuffleSplit, which returns randomized folds stratified by group class. The folds are made by preserving the percentage of groups for each class.

Note: like the StratifiedShuffleSplit strategy, stratified random group splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Read more in the User Guide.

Parameters:
  • n_splits (int, default=5) – Number of re-shuffling & splitting iterations.

  • test_size (float, int, or None, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of groups to include in the test split (rounded up). If int, represents the absolute number of test groups. If None, the value is set to the complement of the train size. By default, the value is set to 0.1.

  • train_size (float, int, or None, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the groups to include in the train split. If int, represents the absolute number of train groups. If None, the value is automatically set to the complement of the test size.

  • random_state (int, RandomState instance or None, default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Examples

>>> import numpy as np
>>> from armed.crossvalidation.grouped_cv import StratifiedGroupShuffleSplit
>>> X = np.ones(shape=(15, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6])
>>> print(groups.shape)
(15,)
>>> sgss = StratifiedGroupShuffleSplit(n_splits=3, train_size=.7,
...                                    random_state=43)
>>> sgss.get_n_splits()
3
>>> for train_idx, test_idx in sgss.split(X, y, groups):
...     print("TRAIN:", groups[train_idx])
...     print("      ", y[train_idx])
...     print(" TEST:", groups[test_idx])
...     print("      ", y[test_idx])
TRAIN: [2 2 2 4 5 5 5 5 6 6]
       [1 1 1 0 1 1 1 1 0 0]
 TEST: [1 1 3 3 3]
       [0 0 1 1 1]
TRAIN: [1 1 2 2 2 3 3 3 4]
       [0 0 1 1 1 1 1 1 0]
 TEST: [5 5 5 5 6 6]
       [1 1 1 1 0 0]
TRAIN: [1 1 2 2 2 3 3 3 6 6]
       [0 0 1 1 1 1 1 1 0 0]
 TEST: [4 5 5 5 5]
       [0 1 1 1 1]

See also

GroupShuffleSplit

Shuffle-Group(s)-Out iterator.

StratifiedShuffleSplit

Stratified ShuffleSplit iterator.

split(X, y, groups=None)#

Generate indices to split data into training and test sets.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

  • y (array-like, shape (n_samples,)) – The target variable for supervised learning problems. Stratification is done based on the y labels.

  • groups (array-like, shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
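As a small sketch of this, reusing the data from the class example above: with random_state fixed to an integer, two passes over split produce identical folds.

import numpy as np
from armed.crossvalidation.grouped_cv import StratifiedGroupShuffleSplit

X = np.ones((15, 2))
y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0])
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6])

sgss = StratifiedGroupShuffleSplit(n_splits=3, train_size=0.7, random_state=43)
first = [(tr.tolist(), te.tolist()) for tr, te in sgss.split(X, y, groups)]
second = [(tr.tolist(), te.tolist()) for tr, te in sgss.split(X, y, groups)]
assert first == second  # identical because random_state is a fixed integer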

armed.crossvalidation.splitting module#

Classes for holding data and generating K-fold cross-validation splits

class armed.crossvalidation.splitting.BasicKFoldUtil(x: numpy.ndarray, z: numpy.ndarray, y: numpy.ndarray, n_folds: int = 5, kfold_class=sklearn.model_selection.StratifiedKFold, seed=8)#

Bases: object

create_folds() → list#

Generate and store folds

Returns:

list of tuples (x_train, z_train, y_train, x_val, z_val, y_val)

Return type:

list

get_fold(idx: int) → tuple#

Get data for given fold

Parameters:

idx (int) – fold number

Returns:

(x_train, z_train, y_train, x_val, z_val, y_val)

Return type:

tuple
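A minimal usage sketch with synthetic placeholder arrays (the explicit create_folds call is an assumption; if the constructor already generates the folds, it is redundant):

import numpy as np
from armed.crossvalidation.splitting import BasicKFoldUtil

# Synthetic placeholders: features x, per-sample group/site matrix z, labels y
rng = np.random.default_rng(8)
x = rng.normal(size=(60, 4))
z = rng.integers(0, 2, size=(60, 3))
y = rng.integers(0, 2, size=60)

kfolds = BasicKFoldUtil(x, z, y, n_folds=5)
folds = kfolds.create_folds()  # list of (x_train, z_train, y_train, x_val, z_val, y_val)
x_train, z_train, y_train, x_val, z_val, y_val = kfolds.get_fold(0)

Note that the default kfold_class, sklearn.model_selection.StratifiedKFold, stratifies on y, so y should hold class labels.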

class armed.crossvalidation.splitting.NestedKFoldUtil(x: numpy.ndarray, z: numpy.ndarray, y: numpy.ndarray, n_folds_outer: int = 5, n_folds_inner: int = 5, kfold_class=sklearn.model_selection.StratifiedKFold, seed=8)#

Bases: BasicKFoldUtil

create_folds() → list#

Generate and store folds

Returns:

list of dicts:
{'outer': (x_train, z_train, y_train, x_val, z_val, y_val),
 'inner': BasicKFoldUtil}

Return type:

list

get_fold(idx_outer: int, idx_inner: int | None = None) → tuple#

Get data for given fold

Parameters:
  • idx_outer (int) – Outer fold number

  • idx_inner (int, optional) – Inner fold number. If not provided, return the outer split.

Returns:

(x_train, z_train, y_train, x_val, z_val, y_val). The inner validation split if idx_inner is provided, or the outer split otherwise.

Return type:

tuple
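A sketch of nested access with the same synthetic placeholders as above: the outer folds estimate generalization, while each outer fold carries a BasicKFoldUtil of inner folds, e.g. for hyperparameter tuning. As before, the explicit create_folds call is an assumption.

import numpy as np
from armed.crossvalidation.splitting import NestedKFoldUtil

# Synthetic placeholders, as in the BasicKFoldUtil sketch above
rng = np.random.default_rng(8)
x = rng.normal(size=(60, 4))
z = rng.integers(0, 2, size=(60, 3))
y = rng.integers(0, 2, size=60)

nested = NestedKFoldUtil(x, z, y, n_folds_outer=5, n_folds_inner=3)
nested.create_folds()  # assumption: folds are generated explicitly

# Outer train/validation split of outer fold 0
x_tr, z_tr, y_tr, x_val, z_val, y_val = nested.get_fold(0)
# Inner split 1 within outer fold 0
x_tri, z_tri, y_tri, x_vi, z_vi, y_vi = nested.get_fold(0, 1)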

Module contents#

Custom cross-validation tools derived from those in sklearn