[RFC] Varying the number of outputs considered for splitting in Multi Output Decision Trees
#27,882 opened on Dec 1, 2023
Description
Describe the workflow you want to enable
One strength of random forest regressors (RFRs) is their robustness: they provide a strong baseline for many tasks without any normalization or scaling of the inputs or outputs. For multi-output RFRs, this robustness to the output space is lost, because the split criterion sums impurities across the output dimensions. Outputs on larger scales then dominate split selection, so users need to standardize the output labels to keep any one output from receiving undue weight. The documentation does not currently make this behavior clear. In the spirit of the random forest, one way to avoid the problem would be to randomly sample which output(s) to consider when determining each split. If the number of sampled outputs were set to 1, the scaling of the output space would once again not matter.
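The scale sensitivity described above can be reproduced with a small pure-Python sketch. It is not scikit-learn's implementation: `sse` and `best_threshold` are hypothetical helpers that assume a squared-error impurity summed over outputs and a single input feature. Rescaling the second output moves the chosen split.

```python
def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(x, Y):
    """Threshold on one feature minimizing squared error summed over all outputs.

    x is a list of feature values, Y a list of rows (one row per sample,
    one column per output). Illustrative only; assumes summed MSE impurity.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    n_outputs = len(Y[0])
    best_cost, best_thr = float("inf"), None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # no threshold separates identical feature values
        left, right = order[:k], order[k:]
        cost = sum(
            sse([Y[i][j] for i in left]) + sse([Y[i][j] for i in right])
            for j in range(n_outputs)
        )
        if cost < best_cost:
            best_cost, best_thr = cost, (lo + hi) / 2
    return best_thr


x = [0, 1, 2, 3, 4, 5]
# Output 0 steps between x=1 and x=2; output 1 steps between x=3 and x=4.
Y = [[0, 0], [0, 0], [2, 0], [2, 0], [2, 1], [2, 1]]
Y_scaled = [[a, 10 * b] for a, b in Y]  # same data, output 1 in different units

print(best_threshold(x, Y))         # 1.5: the split follows output 0
print(best_threshold(x, Y_scaled))  # 3.5: the split now follows output 1
```

Only the units of the second output changed between the two calls, yet the selected split is different.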
Describe your proposed solution
Introducing a new keyword argument, max_outputs (by analogy to max_features), would let users control how many outputs are considered when selecting the optimal split. If set to 1.0 (a float, read as a fraction), all outputs would be used, as is currently the case; if set to 1 (an integer, read as a count), a single output would be used, as described above. This seems like a relevant and natural hyper-parameter for the multi-output RFR.
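A minimal sketch of how the proposed parameter might behave, in pure Python. Everything here is an assumption for illustration: max_outputs does not exist in scikit-learn, `resolve_max_outputs` and `sample_outputs` are hypothetical helpers, and the int/float handling simply mirrors the description above. The split search assumes a summed squared-error impurity on a single feature.

```python
import random


def resolve_max_outputs(max_outputs, n_outputs):
    """Hypothetical: an int is a count, a float is a fraction of the outputs."""
    if isinstance(max_outputs, int):
        k = max_outputs
    else:
        k = max(1, int(max_outputs * n_outputs))
    if not 1 <= k <= n_outputs:
        raise ValueError("max_outputs must select between 1 and n_outputs outputs")
    return k


def sample_outputs(max_outputs, n_outputs, rng):
    """Draw the output columns whose impurities are summed for one split."""
    k = resolve_max_outputs(max_outputs, n_outputs)
    return sorted(rng.sample(range(n_outputs), k))


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(x, Y, outputs):
    """Best threshold on one feature, scoring only the sampled output columns."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_cost, best_thr = float("inf"), None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # no threshold separates identical feature values
        left, right = order[:k], order[k:]
        cost = sum(
            sse([Y[i][j] for i in left]) + sse([Y[i][j] for i in right])
            for j in outputs
        )
        if cost < best_cost:
            best_cost, best_thr = cost, (lo + hi) / 2
    return best_thr


x = [0, 1, 2, 3, 4, 5]
Y = [[0, 0], [0, 0], [2, 0], [2, 0], [2, 1], [2, 1]]
Y_scaled = [[a, 10 * b] for a, b in Y]  # output 1 expressed in different units

rng = random.Random(0)
all_outputs = sample_outputs(1.0, 2, rng)  # float 1.0: every output, as today
one_output = sample_outputs(1, 2, rng)     # int 1: a single random output

# With one sampled output, rescaling the targets leaves the split unchanged.
unchanged = best_threshold(x, Y, one_output) == best_threshold(x, Y_scaled, one_output)
print(all_outputs, one_output, unchanged)
```

With a single sampled output, multiplying that output by a positive constant scales every candidate split's cost by the same factor, so the selected threshold does not move. Sampling per split rather than per tree follows the wording of the proposal; either choice would need empirical comparison.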
Describe alternatives you've considered, if relevant
No response
Additional context
I have not been able to find literature that explores this slight adjustment to the current algorithm. This is an RFC to see whether the team would accept such a PR in principle, without the quoted 200+ citation requirement, if sufficient empirical evidence were provided.