keras-team/keras-preprocessing

ImageDataGenerator flow_from_dataframe not properly shuffling when using validation_split

Open

#205 opened on Apr 26, 2019


Description

Hi,

I observe the following issue when using ImageDataGenerator with flow_from_dataframe: if the dataframe listing the images is sorted by class and a validation_split is used, some classes appear only in the training subset and others only in the validation subset. In my opinion this bug is severe, because it can badly distort ML training.

I use TensorFlow 1.13.1, where the following code reproduces the issue:

import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

Nc = 25 # number of classes
classes = ['class' + str(i) for i in range(Nc)] # generate class labels

# generate ordered class list
N = 100000 # number of images
data = list(classes[int(i // (N/Nc))] for i in range(N))

# DataFrame
file_df = pd.DataFrame(data={'class': data}, columns=['filename', 'class'])
file_df['filename'] = 'pic.jpg' # only single image, use "drop_duplicates=False" below

image_generator = ImageDataGenerator(validation_split=0.2)

val_imgs = image_generator.flow_from_dataframe(file_df, 
    directory='/tf/data/', # change to your folder where pic.jpg is present
    subset='validation', 
    shuffle=True,
    drop_duplicates=False, #needed as in this example only one image file is present
    x_col='filename', y_col='class',
    batch_size=1000, target_size=(224,224), class_mode='categorical', seed=42)

imgs, class_indices = next(val_imgs) # class_indices holds the one-hot labels of the batch
print(np.mean(class_indices, axis=0)) # fraction of the batch belonging to each class

This code produces the following output:

[0.202 0.205 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.199 0.    0.    0.    0.    0.    0.196 0.198 0.    0.    0.    0.
 0.   ]

where some of the classes are not present at all. Although "drop_duplicates=False" is used here, the same issue arises without this parameter on a real dataset of distinct image files.
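The pattern in the output can be reproduced without any images. The following is only a sketch of what seems to happen, under two assumptions that are not confirmed here: that the validation subset is taken as the first validation_split fraction of the rows before any shuffling, and that class indices follow the sorted order of the label strings. The names n_val, val_labels, train_labels and class_to_index are made up for this illustration.

```python
# Rebuild the class-sorted label column from the report (no images needed).
Nc = 25      # number of classes
N = 100000   # number of rows
classes = ['class' + str(i) for i in range(Nc)]
labels = [classes[i * Nc // N] for i in range(N)]

# ASSUMPTION: the validation subset is the first `validation_split`
# fraction of the rows, sliced off before shuffle=True has any effect.
validation_split = 0.2
n_val = int(validation_split * N)
val_labels = labels[:n_val]
train_labels = labels[n_val:]

# Classes that end up in each subset.
val_classes = sorted(set(val_labels))
train_classes = set(train_labels)

# ASSUMPTION: class indices follow the sorted order of the label strings,
# so 'class10' sorts before 'class2'.
class_to_index = {name: idx for idx, name in enumerate(sorted(classes))}
val_indices = sorted(class_to_index[name] for name in val_classes)

print(val_classes)   # classes present in the simulated validation subset
print(val_indices)   # their positions in the one-hot label vector
print(len(train_classes & set(val_classes)))  # classes shared by both subsets
```

Under these assumptions the simulated validation subset contains only class0 to class4, which land at positions 0, 1, 12, 18 and 19 of the label vector. These are exactly the non-zero entries in the output above, and no class is shared between the two subsets.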

This behavior can be avoided by pre-shuffling the dataframe before passing it to flow_from_dataframe:

file_df = file_df.sample(frac=1).reset_index(drop=True)
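To check that the workaround has the intended effect, here is a small sketch using pandas only. It assumes, as a simplification, that the validation subset corresponds to the first 20% of the rows. The fixed random_state=42 and the names shuffled_df, val_part, train_part and val_share are additions for this illustration and are not part of the one-liner above.

```python
import pandas as pd

# Same class-sorted dataframe as in the report.
Nc = 25      # number of classes
N = 100000   # number of rows
classes = ['class' + str(i) for i in range(Nc)]
file_df = pd.DataFrame({
    'filename': 'pic.jpg',
    'class': [classes[i * Nc // N] for i in range(N)],
})

# Pre-shuffle the rows; random_state is fixed here only for reproducibility.
shuffled_df = file_df.sample(frac=1, random_state=42).reset_index(drop=True)

# ASSUMPTION: model the validation subset as the first 20% of the rows.
n_val = int(0.2 * len(shuffled_df))
val_part = shuffled_df.iloc[:n_val]
train_part = shuffled_df.iloc[n_val:]

# Share of each class within the simulated validation subset.
val_share = val_part['class'].value_counts(normalize=True)

print(val_part['class'].nunique(), train_part['class'].nunique())
print(val_share.min(), val_share.max())
```

After shuffling, every class should appear in both parts, with a share close to 1/25 = 0.04 in the validation part. Note that a plain shuffle only balances the classes approximately; an exactly stratified split would need a per-class split instead.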

Thanks for your feedback! Manuel
