pyjanitor-devs/pyjanitor
[ENH] expand_columns should support drop_first
Open
#368 opened on May 19, 2019
Labels: being worked on, enhancement, good first issue, good intermediate issue
Description
Brief Description
In general, when you create dummy variables, it is a good idea to drop one of the resulting columns, because it can be written as a linear combination of the other columns. See https://datascience.stackexchange.com/questions/27957/why-do-we-need-to-discard-one-dummy-variable
Pandas has the drop_first option in get_dummies.
I would like to propose that drop_first be added as a parameter to expand_column.
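For reference, here is a small illustration of what drop_first does in pandas' own get_dummies. The series and its values are made up for this example; only pd.get_dummies and its drop_first parameter come from pandas.

```python
import pandas as pd

# Made-up single-label categorical data, for illustration only.
s = pd.Series(["a", "b", "c", "a"], name="letter")

full = pd.get_dummies(s)                      # one column per category
reduced = pd.get_dummies(s, drop_first=True)  # first category dropped

print(list(full.columns))     # ['a', 'b', 'c']
print(list(reduced.columns))  # ['b', 'c']

# With one label per row, the full set of dummy columns sums to 1 in
# every row, so any single column is determined by the remaining ones.
assert (full.sum(axis=1) == 1).all()
```

Note that pandas drops the first category in sorted order ('a' here), not the last.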
Example API
>>> X_cat2 = pd.DataFrame({'A': [1, None, 3],
...                        'names': ['Fred,George', 'George', 'John,Paul']})
>>> jn.expand_column(X_cat2, 'names', sep=',')
     A        names  Fred  George  John  Paul
0  1.0  Fred,George     1       1     0     0
1  NaN       George     0       1     0     0
2  3.0    John,Paul     0       0     1     1
With the proposed parameter, this would become:
>>> X_cat2 = pd.DataFrame({'A': [1, None, 3],
...                        'names': ['Fred,George', 'George', 'John,Paul']})
>>> jn.expand_column(X_cat2, 'names', sep=',', drop_first=True)
     A        names  Fred  George  John
0  1.0  Fred,George     1       1     0
1  NaN       George     0       1     0
2  3.0    John,Paul     0       0     1
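
One possible way to get this behaviour, as a rough sketch only: the function name expand_column_sketch is hypothetical and is not pyjanitor's implementation. It assumes the expansion can be built on Series.str.get_dummies, which has no drop_first option of its own, so the column is dropped by hand.

```python
import pandas as pd


def expand_column_sketch(df, column_name, sep="|", drop_first=False):
    """Hypothetical stand-in for expand_column with a drop_first flag."""
    # One indicator column per distinct token, in sorted order.
    dummies = df[column_name].str.get_dummies(sep=sep)
    if drop_first:
        # str.get_dummies has no drop_first parameter, so slice off the
        # first indicator column manually.
        dummies = dummies.iloc[:, 1:]
    # Keep the original columns and append the indicators.
    return df.join(dummies)


X_cat2 = pd.DataFrame({"A": [1, None, 3],
                       "names": ["Fred,George", "George", "John,Paul"]})
result = expand_column_sketch(X_cat2, "names", sep=",", drop_first=True)
print(list(result.columns))  # ['A', 'names', 'George', 'John', 'Paul']
```

This sketch follows the pandas convention of dropping the first indicator column in sorted order (Fred), whereas the output shown above omits Paul. Which column should be dropped is a design choice the proposal leaves open.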