armed.crossvalidation package#

Submodules#

armed.crossvalidation.grouped_cv module#

Custom scikit-learn KFold splitters for grouped AND stratified splitting, created by hermidalc (https://github.com/scikit-learn/scikit-learn/issues/13621#issuecomment-656094573), retrieved 12/28/2020.

class armed.crossvalidation.grouped_cv.StratifiedGroupKFold(*args: Any, **kwargs: Any)#

Bases: _BaseKFold

Stratified K-Folds iterator variant with non-overlapping groups.

This cross-validation object is a variation of StratifiedKFold that returns stratified folds with non-overlapping groups. The folds are made by preserving the percentage of samples for each class.

The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds).

The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class.

Read more in the User Guide.

Parameters:
  • n_splits (int, default=5) – Number of folds. Must be at least 2.

  • shuffle (bool, default=False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

  • random_state (int or RandomState instance, default=None) – When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See Glossary.

Examples

>>> import numpy as np
>>> from armed.crossvalidation.grouped_cv import StratifiedGroupKFold
>>> X = np.ones((17, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
>>> cv = StratifiedGroupKFold(n_splits=3)
>>> for train_idxs, test_idxs in cv.split(X, y, groups):
...     print("TRAIN:", groups[train_idxs])
...     print("      ", y[train_idxs])
...     print(" TEST:", groups[test_idxs])
...     print("      ", y[test_idxs])
TRAIN: [2 2 4 5 5 5 5 6 6 7]
       [1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 3 3 3 8 8]
       [0 0 1 1 1 0 0]
TRAIN: [1 1 3 3 3 4 5 5 5 5 8 8]
       [0 0 1 1 1 1 0 0 0 0 0 0]
 TEST: [2 2 6 6 7]
       [1 1 0 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]

See also

StratifiedKFold

Takes class information into account to build folds which retain class distributions (for binary or multiclass classification tasks).

GroupKFold

K-fold iterator variant with non-overlapping groups.
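Because the class implements the scikit-learn splitter interface, it can be passed as the cv argument to standard utilities such as cross_val_score. A minimal sketch with made-up data (the estimator and array shapes here are illustrative assumptions, not part of this package):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from armed.crossvalidation.grouped_cv import StratifiedGroupKFold

# Hypothetical data: 100 samples in 10 groups of 10 (e.g., subjects or sites)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)
print(scores.mean())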

class armed.crossvalidation.grouped_cv.StratifiedGroupShuffleSplit(*args: Any, **kwargs: Any)#

Bases: StratifiedShuffleSplit

Stratified GroupShuffleSplit cross-validator.

Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.

This cross-validation object is a merge of GroupShuffleSplit and StratifiedShuffleSplit, which returns randomized folds stratified by group class. The folds are made by preserving the percentage of groups for each class.

Note: like the StratifiedShuffleSplit strategy, stratified random group splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Read more in the User Guide.

Parameters:
  • n_splits (int, default=5) – Number of re-shuffling & splitting iterations.

  • test_size (float, int, or None, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of groups to include in the test split (rounded up). If int, represents the absolute number of test groups. If None, the value is set to the complement of the train size. By default, the value is set to 0.1.

  • train_size (float, int, or None, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the groups to include in the train split. If int, represents the absolute number of train groups. If None, the value is automatically set to the complement of the test size.

  • random_state (int, RandomState instance or None, default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Examples

>>> import numpy as np
>>> from armed.crossvalidation.grouped_cv import StratifiedGroupShuffleSplit
>>> X = np.ones(shape=(15, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6])
>>> print(groups.shape)
(15,)
>>> sgss = StratifiedGroupShuffleSplit(n_splits=3, train_size=.7,
...                                    random_state=43)
>>> sgss.get_n_splits()
3
>>> for train_idx, test_idx in sgss.split(X, y, groups):
...     print("TRAIN:", groups[train_idx])
...     print("      ", y[train_idx])
...     print(" TEST:", groups[test_idx])
...     print("      ", y[test_idx])
TRAIN: [2 2 2 4 5 5 5 5 6 6]
       [1 1 1 0 1 1 1 1 0 0]
 TEST: [1 1 3 3 3]
       [0 0 1 1 1]
TRAIN: [1 1 2 2 2 3 3 3 4]
       [0 0 1 1 1 1 1 1 0]
 TEST: [5 5 5 5 6 6]
       [1 1 1 1 0 0]
TRAIN: [1 1 2 2 2 3 3 3 6 6]
       [0 0 1 1 1 1 1 1 0 0]
 TEST: [4 5 5 5 5]
       [0 1 1 1 1]

See also

GroupShuffleSplit

Shuffle-Group(s)-Out iterator.

StratifiedShuffleSplit

Stratified ShuffleSplit iterator.

split(X, y, groups=None)#

Generate indices to split data into training and test sets.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

  • y (array-like, shape (n_samples,)) – The target variable for supervised learning problems. Stratification is done based on the y labels.

  • groups (array-like, shape (n_samples,)) – Group labels for the samples used while splitting the dataset into train/test set.

Yields:
  • train (ndarray) – The training set indices for that split.

  • test (ndarray) – The testing set indices for that split.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
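As a small sketch of this, reusing the data from the class example above: with random_state fixed to an integer, two passes over split produce identical folds.

import numpy as np
from armed.crossvalidation.grouped_cv import StratifiedGroupShuffleSplit

X = np.ones((15, 2))
y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0])
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6])

sgss = StratifiedGroupShuffleSplit(n_splits=3, train_size=0.7, random_state=43)
first = [(tr.tolist(), te.tolist()) for tr, te in sgss.split(X, y, groups)]
second = [(tr.tolist(), te.tolist()) for tr, te in sgss.split(X, y, groups)]
assert first == second  # identical because random_state is a fixed integer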

armed.crossvalidation.splitting module#

Classes for holding data and generating K-fold cross-validation splits

class armed.crossvalidation.splitting.BasicKFoldUtil(x: numpy.ndarray, z: numpy.ndarray, y: numpy.ndarray, n_folds: int = 5, kfold_class=sklearn.model_selection.StratifiedKFold, seed=8)#

Bases: object

create_folds() → list#

Generate and store folds

Returns:

list of tuples (x_train, z_train, y_train, x_val, z_val, y_val)

Return type:

list

get_fold(idx: int) → tuple#

Get data for given fold

Parameters:

idx (int) – fold number

Returns:

(x_train, z_train, y_train, x_val, z_val, y_val)

Return type:

tuple
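A minimal usage sketch with synthetic placeholder arrays (the explicit create_folds call is an assumption; if the constructor already generates the folds, it is redundant):

import numpy as np
from armed.crossvalidation.splitting import BasicKFoldUtil

# Synthetic placeholders: features x, per-sample group/site matrix z, labels y
rng = np.random.default_rng(8)
x = rng.normal(size=(60, 4))
z = rng.integers(0, 2, size=(60, 3))
y = rng.integers(0, 2, size=60)

kfolds = BasicKFoldUtil(x, z, y, n_folds=5)
folds = kfolds.create_folds()  # list of (x_train, z_train, y_train, x_val, z_val, y_val)
x_train, z_train, y_train, x_val, z_val, y_val = kfolds.get_fold(0)

Note that the default kfold_class, sklearn.model_selection.StratifiedKFold, stratifies on y, so y should hold class labels.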

class armed.crossvalidation.splitting.NestedKFoldUtil(x: numpy.ndarray, z: numpy.ndarray, y: numpy.ndarray, n_folds_outer: int = 5, n_folds_inner: int = 5, kfold_class=sklearn.model_selection.StratifiedKFold, seed=8)#

Bases: BasicKFoldUtil

create_folds() → list#

Generate and store folds

Returns:

list of dicts:
{'outer': (x_train, z_train, y_train, x_val, z_val, y_val),
 'inner': BasicKFoldUtil}

Return type:

list

get_fold(idx_outer: int, idx_inner: int | None = None) → tuple#

Get data for given fold

Parameters:
  • idx_outer (int) – Outer fold number

  • idx_inner (int, optional) – Inner fold number. If not provided, return the outer split.

Returns:

(x_train, z_train, y_train, x_val, z_val, y_val). The inner validation split if idx_inner is provided, or the outer split otherwise.

Return type:

tuple
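A sketch of nested access with the same synthetic placeholders as above: the outer folds estimate generalization, while each outer fold carries a BasicKFoldUtil of inner folds, e.g. for hyperparameter tuning. As before, the explicit create_folds call is an assumption.

import numpy as np
from armed.crossvalidation.splitting import NestedKFoldUtil

# Synthetic placeholders, as in the BasicKFoldUtil sketch above
rng = np.random.default_rng(8)
x = rng.normal(size=(60, 4))
z = rng.integers(0, 2, size=(60, 3))
y = rng.integers(0, 2, size=60)

nested = NestedKFoldUtil(x, z, y, n_folds_outer=5, n_folds_inner=3)
nested.create_folds()  # assumption: folds are generated explicitly

# Outer train/validation split of outer fold 0
x_tr, z_tr, y_tr, x_val, z_val, y_val = nested.get_fold(0)
# Inner split 1 within outer fold 0
x_tri, z_tri, y_tri, x_vi, z_vi, y_vi = nested.get_fold(0, 1)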

Module contents#

Custom cross-validation tools derived from those in sklearn