[RFC] Varying the number of outputs considered for splitting in Multi Output Decision Trees · scikit-learn/scikit-learn#27882

Labels: New Feature, help wanted

Description

Describe the workflow you want to enable

One strength of random forest regressors (RFRs) is their robustness: they provide a strong baseline for many tasks without requiring normalization or scaling of either the inputs or the outputs. For multi-output RFRs this robustness to the output space is lost, because the impurities are summed across the output dimensions. Outputs with larger variance then dominate the choice of split, so the output labels have to be standardized to ensure that undue attention isn't given to particular outputs. The documentation does not currently make this behavior obvious to the user. In the spirit of the random forest, one way to avoid the problem would be to randomly sample which output(s) to consider when determining each split. If the number of sampled outputs were set to 1, the scaling of the output space would once again not matter.
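The scale sensitivity described above can be illustrated with a minimal NumPy sketch. This is not scikit-learn's implementation: best_split_feature is a hypothetical brute-force helper that scores every candidate threshold by the squared error of the two children, summed over all output columns, and the data are synthetic. Each output depends on a different feature, so the feature chosen at the root shows which output is driving the split.

```python
import numpy as np


def best_split_feature(X, Y):
    """Return the index of the feature whose best threshold gives the lowest
    squared error in the two children, summed over all output columns.

    Hypothetical helper for illustration only (brute force, no tie handling).
    """
    n_samples, n_features = X.shape
    best_feature, best_cost = -1, np.inf
    for j in range(n_features):
        Y_sorted = Y[np.argsort(X[:, j])]
        for i in range(1, n_samples):
            left, right = Y_sorted[:i], Y_sorted[i:]
            # Impurity of each child, added up across every output dimension.
            cost = ((left - left.mean(axis=0)) ** 2).sum()
            cost += ((right - right.mean(axis=0)) ** 2).sum()
            if cost < best_cost:
                best_feature, best_cost = j, cost
    return best_feature


rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y0 = (X[:, 0] > 0.5).astype(float)  # output 0 depends only on feature 0
y1 = (X[:, 1] > 0.5).astype(float)  # output 1 depends only on feature 1

# Same targets, different units: the root split follows the larger-scale output.
root_a = best_split_feature(X, np.column_stack([y0, 100.0 * y1]))
root_b = best_split_feature(X, np.column_stack([100.0 * y0, y1]))
print(root_a, root_b)
```

Rescaling one output changes which feature is split on first, even though the relationship between inputs and outputs is unchanged.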

Describe your proposed solution

A new keyword argument max_outputs (by analogy with max_features) could let users control how many outputs are considered when selecting the optimal split. If set to the float 1.0, all outputs would be used, as is currently the case; if set to the integer 1, a single output would be used, as described above. This seems like a relevant and natural hyperparameter for the multi-output RFR.
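One possible reading of these semantics is sketched below. The names resolve_max_outputs and sample_outputs are hypothetical and do not exist in scikit-learn. The int/float handling is assumed to follow the convention max_features uses for trees (an int is an absolute count, a float is a fraction of the total), and calling the sampler once per split is likewise an assumption.

```python
import numpy as np


def resolve_max_outputs(max_outputs, n_outputs):
    """Hypothetical helper: convert max_outputs into an integer count.

    Assumed convention, by analogy with max_features: an int is an
    absolute count, a float is a fraction of n_outputs.
    """
    if isinstance(max_outputs, (int, np.integer)):
        if not 1 <= max_outputs <= n_outputs:
            raise ValueError("int max_outputs must be in [1, n_outputs]")
        return int(max_outputs)
    if isinstance(max_outputs, float):
        if not 0.0 < max_outputs <= 1.0:
            raise ValueError("float max_outputs must be in (0.0, 1.0]")
        return max(1, int(max_outputs * n_outputs))
    raise TypeError("max_outputs must be an int or a float")


def sample_outputs(max_outputs, n_outputs, rng):
    """Draw, without replacement, the output columns used to score a split."""
    k = resolve_max_outputs(max_outputs, n_outputs)
    return np.sort(rng.choice(n_outputs, size=k, replace=False))


rng = np.random.default_rng(0)
all_outputs = sample_outputs(1.0, 4, rng)  # float 1.0: every output, as today
one_output = sample_outputs(1, 4, rng)     # int 1: a single random output
print(all_outputs, one_output)
```

With a single sampled output, the impurity of a candidate split is computed from one column only, so multiplying that column by a constant rescales every candidate's score equally and leaves the chosen split unchanged.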

Describe alternatives you've considered, if relevant

No response

Additional context

I have not been able to find literature that explores this slight adjustment to the current algorithm. This is an RFC to see whether the team would, in principle, accept such a PR without meeting the stated 200+ citation requirement, provided that sufficient empirical evidence is supplied.

Contributor guide

Research direction: Investigate the implementation of multi-output decision trees behind scikit-learn's RandomForestRegressor and propose a new max_outputs parameter that randomly samples the outputs considered for each split. Provide empirical evidence comparing performance with and without output normalization.
Tech stack: Python, scikit-learn
Domain: machine learning, data
Issue type: Feature
Difficulty: 3
Estimated time: 3-5 days
Activity status: Active
Clarity: Clear
Prerequisites: Python, scikit-learn, machine learning fundamentals
Newbie friendliness: 40