# Stratified K-fold vs ShuffleSplit
https://stackoverflow.com/questions/45969390/difference-between-stratifiedkfold-and-stratifiedshufflesplit-in-sklearn

In ShuffleSplit, the data is shuffled before every split. This means the test sets may overlap between splits.

In StratifiedKFold, the test sets do not overlap.

So the difference is: StratifiedKFold shuffles once (when shuffle=True) and then partitions the data, so its n_splits test sets never overlap, while StratifiedShuffleSplit reshuffles before each of its n_splits splits, so the test sets can overlap.

Note: both methods use stratified folds (which is why "stratified" appears in both names). This means each fold preserves the same percentage of samples of each class (label) as the original data. You can read more in the scikit-learn cross-validation documentation.
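As a quick check of that stratification property (this sketch is not part of the linked answer), the snippet below builds an imbalanced toy label vector, 70% class 0 and 30% class 1 (illustrative values), and prints the class counts in every StratifiedKFold test fold; each fold keeps the same 70/30 ratio as the full data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 70 samples of class 0, 30 of class 1 (illustrative values).
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((100, 1))  # features do not influence how the folds are chosen

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold holds 20 samples: 14 of class 0 and 6 of class 1,
    # i.e. the same 70/30 ratio as the full label vector.
    print(f"fold {fold}: class counts in test set = {np.bincount(y[test_idx])}")
```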
A minimal example with 10 samples, 5 per class:

```python
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

splits = 5

# Toy data: 10 samples, 5 of class 0 and 5 of class 1.
tx = range(10)
ty = [0] * 5 + [1] * 5

kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=splits, random_state=42, test_size=2)

print("KFold")
for train_index, test_index in kfold.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)

print("Shuffle Split")
for train_index, test_index in shufflesplit.split(tx, ty):
    print("TRAIN:", train_index, "TEST:", test_index)
```
Output:

```
KFold
TRAIN: [0 2 3 4 5 6 7 9] TEST: [1 8]
TRAIN: [0 1 2 3 5 7 8 9] TEST: [4 6]
TRAIN: [0 1 3 4 5 6 8 9] TEST: [2 7]
TRAIN: [1 2 3 4 6 7 8 9] TEST: [0 5]
TRAIN: [0 1 2 4 5 6 7 8] TEST: [3 9]

Shuffle Split
TRAIN: [8 4 1 0 6 5 7 2] TEST: [3 9]
TRAIN: [7 0 3 9 4 5 1 6] TEST: [8 2]
TRAIN: [1 2 5 6 4 8 9 0] TEST: [3 7]
TRAIN: [4 6 7 8 3 5 1 2] TEST: [9 0]
TRAIN: [7 2 6 5 4 3 0 9] TEST: [1 8]
```
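The overlap is easy to see in the output: with KFold every index lands in exactly one test set, while with Shuffle Split indices 3, 8 and 9 each appear in two test sets and 4, 5 and 6 never appear in any. A small sketch (not part of the linked answer) that counts test-set appearances for both splitters:

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

tx = range(10)
ty = [0] * 5 + [1] * 5

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
shufflesplit = StratifiedShuffleSplit(n_splits=5, random_state=42, test_size=2)

for name, cv in [("KFold", kfold), ("Shuffle Split", shufflesplit)]:
    # Count how many test sets each sample index appears in.
    counts = Counter(int(i) for _, test_index in cv.split(tx, ty) for i in test_index)
    # KFold: every index counted exactly once (the test sets partition the data).
    # Shuffle Split: counts can be 0 or 2 because each split is drawn independently.
    print(name, dict(sorted(counts.items())))
```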