scikit-learn/scikit-learn

Add TimeSeriesCV and HomogeneousTimeSeriesCV

Open

#6,322 created on February 9, 2016

Labels: Moderate, New Feature, help wanted, module:model_selection

Description

I get asked this about once a day, so I think we should just add it. Many people work with time series, and adding cross-validation for them would be really easy. The standard strategy is described, for example, here

There are basically two cases: homogeneous time series (one sample every X seconds / days), or heterogeneous time series, where each sample has a time stamp.

For the homogeneous case, we can just put the first n_samples // n_folds samples in the first fold, and so on, so it's a very simple variation of KFold. Fixed in #6586.
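To make the homogeneous case concrete, here is a minimal sketch, not the implementation from #6586. The helper name `homogeneous_time_series_folds` is hypothetical, and the train/test rule (train on every earlier block, test on the next one) is my assumption about the "standard strategy"; the issue itself only describes how samples are assigned to folds.

```python
def homogeneous_time_series_folds(n_samples, n_folds):
    """Yield (train, test) index lists for evenly spaced samples.

    Hypothetical helper: contiguous blocks of n_samples // n_folds
    samples, where each block is tested against all earlier blocks.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    fold_size = n_samples // n_folds
    for k in range(1, n_folds):
        test_start = k * fold_size
        # The last block absorbs any remainder samples.
        test_stop = (k + 1) * fold_size if k < n_folds - 1 else n_samples
        train = list(range(0, test_start))
        test = list(range(test_start, test_stop))
        yield train, test


splits = list(homogeneous_time_series_folds(n_samples=10, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Unlike plain KFold, the training indices never come after the test indices, so the model is never fit on data from the future of the block it is evaluated on.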

For the heterogeneous case, we need to get a labels array and split accordingly. If we cast that to integers, people could actually provide pandas time series, and they would be handled correctly (they would be converted to nanoseconds).
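A rough sketch of the heterogeneous case, under stated assumptions: the helper name `heterogeneous_time_series_folds` is hypothetical, NumPy `datetime64[ns]` values stand in for a pandas time series, and cutting the observed time range into equal-width intervals is just one possible way to "split accordingly". The cast to `int64` is the part taken from the description above: it yields nanoseconds since the epoch.

```python
import numpy as np


def heterogeneous_time_series_folds(timestamps, n_folds):
    """Yield (train, test) index arrays from per-sample time stamps.

    Hypothetical helper; the equal-width time intervals are an assumption.
    """
    # Casting datetime64[ns] to int64 gives nanoseconds since the epoch.
    t = np.asarray(timestamps, dtype="datetime64[ns]").astype(np.int64)
    # Cut the observed time range into n_folds equal-width intervals.
    edges = np.linspace(t.min(), t.max(), n_folds + 1)
    # Use interior edges only, so fold ids fall in 0 .. n_folds - 1.
    fold_ids = np.digitize(t, edges[1:-1])
    for k in range(1, n_folds):
        train = np.flatnonzero(fold_ids < k)
        test = np.flatnonzero(fold_ids == k)
        # Skip intervals that contain no samples on either side.
        if len(train) and len(test):
            yield train, test


# Samples do not need to be sorted by time.
stamps = ["2016-01-20", "2016-01-01", "2016-01-30",
          "2016-01-02", "2016-01-21", "2016-01-03"]
splits = list(heterogeneous_time_series_folds(stamps, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Because folds are defined by time rather than by position, they can have very different sizes when samples are unevenly spread, which is the main practical difference from the homogeneous case.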

I remember arguing against this addition, but I changed my mind ;)
