ImageDataGenerator flow_from_dataframe not properly shuffling when using validation_split
#205 opened on April 26, 2019
Description
Hi,
I observe the following issue when using ImageDataGenerator with flow_from_dataframe: if the dataframe listing the images is sorted by class and validation_split is used, the image batches contain some classes only in the training subset and others only in the validation subset. In my opinion this bug is severe, because it can badly spoil ML training attempts.
I use TensorFlow 1.13.1, where the following code reproduces the issue:
import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator
Nc = 25 # number of classes
classes = list('class'+str(i) for i in range(Nc)) # generate class labels
# generate ordered class list
N = 100000 # number of images
data = list(classes[int(i // (N/Nc))] for i in range(N))
# DataFrame
file_df = pd.DataFrame(data={'class': data}, columns=['filename', 'class'])
file_df['filename'] = 'pic.jpg' # only single image, use "drop_duplicates=False" below
image_generator = ImageDataGenerator(validation_split=0.2)
val_imgs = image_generator.flow_from_dataframe(file_df,
    directory='/tf/data/', # change to the folder that contains pic.jpg
    subset='validation',
    shuffle=True,
    drop_duplicates=False, # needed because this example uses a single image file
    x_col='filename', y_col='class',
    batch_size=1000, target_size=(224,224), class_mode='categorical', seed=42)
imgs, class_indices = val_imgs.next()
print(np.mean(class_indices, axis=0)) # show distribution of classes
This code produces the following output:
[0.202 0.205 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0.199 0. 0. 0. 0. 0. 0.196 0.198 0. 0. 0. 0.
0. ]
Some of the classes are not present at all. Although this example uses "drop_duplicates=False", the same issue arises without this parameter on a real dataset of distinct image files.
The behavior can be avoided by pre-shuffling the dataframe before passing it to flow_from_dataframe:
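The nonzero columns in the output above are consistent with the validation subset simply being the first 20% of the rows in dataframe order. The sketch below simulates that in plain pandas, without TensorFlow. It rests on two assumptions that I have not checked against the library source: that the split is a contiguous leading slice, and that class indices follow the lexicographic order of the label strings. The names `n_val`, `val_part` and `train_part` are only for illustration.

```python
import pandas as pd

n_classes = 25
n_images = 100_000
classes = ['class' + str(i) for i in range(n_classes)]

# Same class-sorted label column as in the report above.
labels = [classes[i * n_classes // n_images] for i in range(n_images)]
df = pd.DataFrame({'class': labels})
df['filename'] = 'pic.jpg'

# Assumption: the validation subset is the leading 20% of the rows.
validation_split = 0.2
n_val = int(len(df) * validation_split)
val_part = df.iloc[:n_val]
train_part = df.iloc[n_val:]

val_classes = set(val_part['class'])
train_classes = set(train_part['class'])

# Assumption: class indices follow the sorted (lexicographic) label order.
sorted_labels = sorted(classes)
val_columns = sorted(sorted_labels.index(c) for c in val_classes)

print(sorted(val_classes))   # labels that end up in the validation slice
print(val_columns)           # their column positions in the one-hot output
print(val_classes & train_classes)  # labels shared by both subsets
```

Under these assumptions the validation slice holds only class0 to class4, which sort to columns 0, 1, 12, 18 and 19. Those are exactly the nonzero positions in the printed distribution, and no class is shared between the two subsets.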
file_df = file_df.sample(frac=1).reset_index(drop=True)
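To check that the workaround has the intended effect before starting a training run, one can shuffle with a fixed seed and count the classes on each side of the split. This is a minimal sketch in plain pandas, again assuming the validation subset is the leading 20% of the rows. The `random_state=42` argument is my addition for reproducibility and is not part of the original one-liner.

```python
import pandas as pd

n_classes = 25
n_images = 100_000

# Class-sorted dataframe as in the report above.
labels = ['class' + str(i * n_classes // n_images) for i in range(n_images)]
df = pd.DataFrame({'class': labels})
df['filename'] = 'pic.jpg'

# Pre-shuffle the rows; random_state makes the order reproducible.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Assumption: the validation subset is the leading 20% of the rows.
n_val = int(len(shuffled) * 0.2)
val_counts = shuffled.iloc[:n_val]['class'].value_counts()
train_counts = shuffled.iloc[n_val:]['class'].value_counts()

# Number of distinct classes on each side of the split.
print(len(val_counts), len(train_counts))
# Smallest and largest per-class count in the validation slice.
print(val_counts.min(), val_counts.max())
```

After shuffling, both slices should contain all 25 classes with roughly equal counts. If an exact per-class ratio matters, a stratified split (for example sampling 20% within each class) would be an alternative to a plain shuffle.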
Thanks for your feedback! Manuel