keras-team/keras-preprocessing

ImageDataGenerator flow_from_dataframe not properly shuffling when using validation_split

Open

#205 opened on April 26, 2019

Labels: help wanted, image

Description

Hi,

I observe the following issue when using ImageDataGenerator with flow_from_dataframe: if the dataframe listing the images is sorted by class and validation_split is used, some classes appear only in the training subset and others only in the validation subset, even with shuffle=True. In my opinion this bug is severe and can badly distort ML training results.

I use TensorFlow 1.13.1, where the following code reproduces the issue:

import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

Nc = 25 # number of classes
classes = list('class'+str(i) for i in range(Nc)) # generate the class labels

# generate ordered class list
N = 100000 # number of images
data = list(classes[int(i // (N/Nc))] for i in range(N))

# DataFrame
file_df = pd.DataFrame(data={'class': data}, columns=['filename', 'class'])
file_df['filename'] = 'pic.jpg' # only single image, use "drop_duplicates=False" below

image_generator = ImageDataGenerator(validation_split=0.2)

val_imgs = image_generator.flow_from_dataframe(file_df, 
    directory='/tf/data/', # change to your folder where pic.jpg is present
    subset='validation', 
    shuffle=True,
    drop_duplicates=False, #needed as in this example only one image file is present
    x_col='filename', y_col='class',
    batch_size=1000, target_size=(224,224), class_mode='categorical', seed=42)

imgs, class_indices = val_imgs.next()
print(np.mean(class_indices, axis=0)) # show distribution of classes

This code produces the following output:

[0.202 0.205 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.199 0.    0.    0.    0.    0.    0.196 0.198 0.    0.    0.    0.
 0.   ]

Some of the classes are not present at all. Although this example uses drop_duplicates=False, the same issue occurs without this parameter on a real dataset of distinct image files.
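The pattern of non-zero entries can be reproduced with pandas alone. The following is only a sketch of what seems to happen, not the library's actual code: it assumes that the validation subset is taken as the first 20% of rows in dataframe order before any shuffling, and that class indices follow the sorted order of the label strings. Both assumptions are consistent with the output above. The names val_df and nonzero_idx are illustrative.

```python
import pandas as pd

Nc = 25       # number of classes, as in the report
N = 100000    # number of rows, as in the report
classes = ['class' + str(i) for i in range(Nc)]
labels = [classes[i // (N // Nc)] for i in range(N)]  # ordered by class
file_df = pd.DataFrame({'filename': 'pic.jpg', 'class': labels})

# ASSUMPTION: the validation subset is a contiguous slice holding the
# first 20% of rows, taken before shuffling.
validation_split = 0.2
stop = int(validation_split * len(file_df))
val_df = file_df.iloc[:stop]

val_classes = sorted(val_df['class'].unique())
print(val_classes)
# ['class0', 'class1', 'class2', 'class3', 'class4']

# ASSUMPTION: class indices follow the sorted order of the label strings,
# so 'class10' comes before 'class2'.
index_of = {name: i for i, name in enumerate(sorted(classes))}
nonzero_idx = sorted(index_of[name] for name in val_classes)
print(nonzero_idx)
# [0, 1, 12, 18, 19]
```

Under these assumptions only five of the 25 classes reach the validation subset, and their indices match the five non-zero positions in the output above. Shuffling would then only reorder samples within that slice.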

This behavior can be avoided by pre-shuffling the dataframe with

file_df = file_df.sample(frac=1).reset_index(drop=True)
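A quick way to check that the pre-shuffle helps, again as a sketch using pandas only: it assumes the validation subset is a contiguous slice of the first 20% of rows. The random_state argument is added here only to make the check repeatable and is not part of the workaround above. The names shuffled_df, head and share are illustrative.

```python
import pandas as pd

Nc, N = 25, 100000
classes = ['class' + str(i) for i in range(Nc)]
file_df = pd.DataFrame({
    'filename': 'pic.jpg',
    'class': [classes[i // (N // Nc)] for i in range(N)],  # ordered by class
})

# Workaround from the report; random_state is added for repeatability only.
shuffled_df = file_df.sample(frac=1, random_state=42).reset_index(drop=True)

# ASSUMPTION: the validation subset is the first 20% of rows.
head = shuffled_df.iloc[:int(0.2 * len(shuffled_df))]
share = head['class'].value_counts(normalize=True)

print(head['class'].nunique())     # number of classes in the slice
print(share.min(), share.max())    # each share should be close to 1/25 = 0.04
```

After shuffling, every class should appear in the slice with a share near 0.04. If exact per-class proportions matter, a stratified split done before building the generators would be another option.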

Thanks for your feedback! Manuel
