ImageDataGenerator flow_from_dataframe not properly shuffling when using validation_split · keras-team/keras-preprocessing#205


Labels: help wanted, image

Description

Hi,

I observe the following issue when using ImageDataGenerator with flow_from_dataframe: if the dataframe listing the images is sorted by class and validation_split is used, some classes appear only in the training subset and others only in the validation subset. In my opinion this bug is severe, since it can badly distort ML training and validation results.

I use TensorFlow 1.13.1, where the following code reproduces the issue:

import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

Nc = 25 # number of classes
classes = ['class' + str(i) for i in range(Nc)] # generate the class labels

# generate ordered class list
N = 100000 # number of images
data = list(classes[int(i // (N/Nc))] for i in range(N))

# DataFrame
file_df = pd.DataFrame(data={'class': data}, columns=['filename', 'class'])
file_df['filename'] = 'pic.jpg' # only single image, use "drop_duplicates=False" below

image_generator = ImageDataGenerator(validation_split=0.2)

val_imgs = image_generator.flow_from_dataframe(file_df, 
    directory='/tf/data/', # change to your folder where pic.jpg is present
    subset='validation', 
    shuffle=True,
    drop_duplicates=False, #needed as in this example only one image file is present
    x_col='filename', y_col='class',
    batch_size=1000, target_size=(224,224), class_mode='categorical', seed=42)

imgs, labels = next(val_imgs) # labels are one-hot encoded, shape (batch_size, Nc)
print(np.mean(labels, axis=0)) # fraction of the batch belonging to each class index

This code prints the following output:

[0.202 0.205 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.199 0.    0.    0.    0.    0.    0.196 0.198 0.    0.    0.    0.
 0.   ]

where most classes are absent (20 of the 25 columns are zero). Although drop_duplicates=False is used here, the same issue arises without this parameter on a real dataset of distinct image files.
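The five nonzero columns are consistent with the validation subset being a contiguous block of rows rather than a random sample. Below is a minimal sketch in plain pandas/NumPy that needs neither TensorFlow nor an image file. It rests on two assumptions that are not verified against the library source here: that the validation subset is the first 20% of the rows, and that class indices follow the lexicographic order of the label strings.

```python
import numpy as np
import pandas as pd

Nc = 25        # number of classes, as in the reproduction above
N = 100000     # number of rows
classes = ['class' + str(i) for i in range(Nc)]
labels = pd.Series([classes[i * Nc // N] for i in range(N)])

# Assumption: class indices follow the sorted (lexicographic) label order,
# so 'class10' is placed before 'class2'.
class_to_index = {name: idx for idx, name in enumerate(sorted(classes))}

# Assumption: the validation subset is the first 20% of rows, taken in order.
val_labels = labels.iloc[:N // 5]

# Fraction of the validation rows that fall on each class index.
counts = np.bincount(val_labels.map(class_to_index).to_numpy(), minlength=Nc)
distribution = counts / counts.sum()

nonzero_indices = np.flatnonzero(distribution).tolist()
present_classes = sorted(val_labels.unique())
print(nonzero_indices)   # [0, 1, 12, 18, 19]
print(present_classes)   # ['class0', 'class1', 'class2', 'class3', 'class4']
```

Under these assumptions the nonzero positions are 0, 1, 12, 18 and 19, the same columns as in the reported output. They correspond to class0 through class4, i.e. the first fifth of the sorted dataframe.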

This behavior can be avoided by shuffling the dataframe before passing it to flow_from_dataframe:

file_df = file_df.sample(frac=1).reset_index(drop=True)
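A seeded variant of this workaround can be checked on the label data alone. This is only a sketch: the `random_state` value is arbitrary, and the positional 20%/80% slicing is an assumption about how the subsets are formed, not something taken from the library.

```python
import pandas as pd

Nc = 25
N = 100000
classes = ['class' + str(i) for i in range(Nc)]
file_df = pd.DataFrame({'filename': 'pic.jpg',
                        'class': [classes[i * Nc // N] for i in range(N)]})

# Shuffle all rows once, with a fixed seed so the split is reproducible.
shuffled_df = file_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Assumption: validation takes the first 20% of rows, training the rest.
n_val = N // 5
val_part = shuffled_df.iloc[:n_val]
train_part = shuffled_df.iloc[n_val:]

val_classes = val_part['class'].nunique()
train_classes = train_part['class'].nunique()
val_share = val_part['class'].value_counts(normalize=True)
print(val_classes, train_classes)          # number of classes in each part
print(val_share.min(), val_share.max())    # both should be close to 1/25
```

If the subsets are indeed chosen by row position, the same shuffled dataframe should be passed to both the 'training' and the 'validation' generator. Shuffling separately for each call would let the two subsets overlap.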

Thanks for your feedback! Manuel

Contributor guide

Tech stack: Python, pandas, NumPy
Domain: machine learning, data
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: stale
Clarity: clear
Prerequisites: Python, Keras, pandas
Newbie friendliness: 45
Research direction: Investigate the flow_from_dataframe method in the Keras preprocessing source (likely in dataframe_iterator.py). The issue is that the validation split is applied before shuffling, so a class-sorted dataframe yields subsets that each contain only some of the classes. Ensure shuffling is performed before splitting. Use the reproduction code in the issue as a test case.