ImageDataGenerator flow_from_dataframe not properly shuffling when using validation_split · keras-team/keras-preprocessing#205

(2 comments) (1 reaction) (0 assignees) Python (1,028 stars) (453 forks)

help wanted, image

Description

Hi,

I observe the following issue when using ImageDataGenerator with flow_from_dataframe: if the dataframe listing the images is sorted by class and a validation_split is used, some classes appear only in the training subset and others only in the validation subset, even with shuffle=True. In my opinion this bug is severe and can badly distort ML training results.

I use TensorFlow 1.13.1, where the following code reproduces the issue:

import pandas as pd
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

Nc = 25 # number of classes
classes = ['class' + str(i) for i in range(Nc)] # generate class labels

# generate ordered class list
N = 100000 # number of images
data = list(classes[int(i // (N/Nc))] for i in range(N))

# DataFrame
file_df = pd.DataFrame(data={'class': data}, columns=['filename', 'class'])
file_df['filename'] = 'pic.jpg' # only single image, use "drop_duplicates=False" below

image_generator = ImageDataGenerator(validation_split=0.2)

val_imgs = image_generator.flow_from_dataframe(file_df, 
    directory='/tf/data/', # change to your folder where pic.jpg is present
    subset='validation', 
    shuffle=True,
    drop_duplicates=False, #needed as in this example only one image file is present
    x_col='filename', y_col='class',
    batch_size=1000, target_size=(224,224), class_mode='categorical', seed=42)

imgs, class_indices = val_imgs.next()
print(np.mean(class_indices, axis=0)) # show distribution of classes

This code produces the following output

[0.202 0.205 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
 0.199 0.    0.    0.    0.    0.    0.196 0.198 0.    0.    0.    0.
 0.   ]

where some of the classes are not present at all. Although this example uses "drop_duplicates=False", the same issue arises without this parameter on a real dataset of distinct image files.
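The pattern of non-zero entries can be reproduced without any image files. The sketch below rests on two assumptions that are consistent with the output above but not confirmed against the library source: that the validation subset is simply the first validation_split fraction of rows in dataframe order, and that class indices follow the sorted order of the label strings (so 'class10' comes before 'class2').

```python
import numpy as np
import pandas as pd

Nc, N = 25, 100000
classes = ['class' + str(i) for i in range(Nc)]

# Same class-sorted dataframe as in the reproduction (4000 rows per class)
file_df = pd.DataFrame({'class': [classes[i * Nc // N] for i in range(N)]})

# Assumption: the validation subset is the first 20% of rows, unshuffled
n_val = int(0.2 * N)
val_df = file_df.iloc[:n_val]

# Assumption: class indices are assigned in sorted label order
class_order = sorted(classes)
dist = (val_df['class']
        .value_counts(normalize=True)
        .reindex(class_order, fill_value=0.0)
        .to_numpy())

# Positions of the classes that actually occur in the validation slice
print(np.nonzero(dist)[0])  # [ 0  1 12 18 19]
```

Under these assumptions the validation slice holds only class0 to class4, which land at indices 0, 1, 12, 18 and 19 after sorting, matching the positions of the non-zero entries in the batch above.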

This behavior can be avoided by pre-shuffling the dataframe with

file_df = file_df.sample(frac=1).reset_index(drop=True)

Thanks for your feedback! Manuel

Contributor guide

Tech stack: python, pandas, numpy
Domain: machine learning, data
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: stale
Clarity: clear
Prerequisites: Python, Keras, pandas
Beginner friendliness: 45
Research direction: Investigate the flow_from_dataframe method in the Keras preprocessing source (likely in dataframe_iterator.py). The issue is that validation_split is applied before shuffling, so a class-sorted dataframe produces subsets that each contain only some of the classes. Ensure shuffling is performed before splitting. Reference the reproduction code in the issue.
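As a rough illustration of the "shuffle before splitting" idea, here is a minimal pandas/NumPy sketch. The helper name shuffled_split and its signature are hypothetical and not part of Keras; it operates on a plain dataframe and is not a patch to the library.

```python
import numpy as np
import pandas as pd

def shuffled_split(df, validation_split, seed=None):
    """Hypothetical helper: permute the rows first, then split."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(df))           # random row order
    n_val = int(len(df) * validation_split)    # size of the validation subset
    val_df = df.iloc[order[:n_val]].reset_index(drop=True)
    train_df = df.iloc[order[n_val:]].reset_index(drop=True)
    return train_df, val_df

# Class-sorted dataframe, similar in shape to the one in the issue
labels = ['class' + str(i) for i in range(25) for _ in range(400)]
df = pd.DataFrame({'filename': 'pic.jpg', 'class': labels})

train_df, val_df = shuffled_split(df, validation_split=0.2, seed=42)

# Number of distinct classes in each subset
print(train_df['class'].nunique(), val_df['class'].nunique())
```

Because the permutation happens before the cut, both subsets draw from all classes. One possible use is to pass train_df and val_df to two separate flow_from_dataframe calls without validation_split, although a fix inside the library would presumably permute its internal row order instead.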