pyjanitor-devs/pyjanitor

[ENH] expand_columns should support drop_first

Open

#368 opened on May 19, 2019

Labels: being worked on, enhancement, good first issue, good intermediate issue

Description

Brief Description

In general, when you create dummy variables for a categorical column, it is a good idea to drop one of the resulting columns, because together with a model's intercept it is a linear combination of the other columns. See https://datascience.stackexchange.com/questions/27957/why-do-we-need-to-discard-one-dummy-variable
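To make the redundancy concrete, here is a small sketch (the `colors` data is made up for illustration and is not from this issue). It builds a design matrix from an intercept plus a full set of dummy columns and compares its rank to its column count, then does the same with one dummy column removed.

```python
import numpy as np
import pandas as pd

# Illustrative data only; any single-valued categorical column behaves the same way.
colors = pd.Series(["red", "green", "blue", "green"])

# One dummy column per level: blue, green, red.
dummies = pd.get_dummies(colors).to_numpy(dtype=float)
intercept = np.ones((len(colors), 1))

# Intercept + all dummies: the dummies sum to the intercept column.
X_full = np.hstack([intercept, dummies])
# Intercept + all dummies except the first one.
X_reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(rank_full, X_full.shape[1])        # 3 4 -> rank deficient
print(rank_reduced, X_reduced.shape[1])  # 3 3 -> full column rank
```

The full matrix has four columns but rank three, which is the linear dependence described above; dropping one dummy column restores full column rank.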

Pandas already offers this through the drop_first option of get_dummies.
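For reference, a short sketch of how `drop_first` behaves in `pandas.get_dummies` (the `color` column is an invented example). Note that pandas sorts the levels and drops the first one.

```python
import pandas as pd

# Invented example data.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Default: one indicator column per level, in sorted order.
full = pd.get_dummies(df["color"])
# drop_first=True removes the first level ("blue" here).
reduced = pd.get_dummies(df["color"], drop_first=True)

print(list(full.columns))     # ['blue', 'green', 'red']
print(list(reduced.columns))  # ['green', 'red']
```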

I would like to propose that drop_first be added as a parameter to expand_column.

Example API

>>> X_cat2 = pd.DataFrame({'A': [1, None, 3],
...     'names': ['Fred,George', 'George', 'John,Paul']})
>>> jn.expand_column(X_cat2, 'names', sep=',')
     A        names  Fred  George  John  Paul
0  1.0  Fred,George     1       1     0     0
1  NaN       George     0       1     0     0
2  3.0    John,Paul     0       0     1     1

With drop_first=True, this would become:

>>> X_cat2 = pd.DataFrame({'A': [1, None, 3],
...     'names': ['Fred,George', 'George', 'John,Paul']})
>>> jn.expand_column(X_cat2, 'names', sep=',', drop_first=True)
     A        names  Fred  George  John
0  1.0  Fred,George     1       1     0
1  NaN       George     0       1     0
2  3.0    John,Paul     0       0     1
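Until such a parameter exists, the behaviour can be approximated in plain pandas. The sketch below is only an illustration: `expand_column_sketch` is a hypothetical helper, not pyjanitor's implementation, and it assumes the expansion can be done with `Series.str.get_dummies`. It follows the pandas convention of dropping the first (sorted) level, so it omits `Fred`, whereas the example output above omits `Paul`; which column to drop is a design choice the proposal would need to settle.

```python
import pandas as pd


def expand_column_sketch(df, column, sep=",", drop_first=False):
    """Hypothetical helper: expand a delimited column into dummy columns."""
    # One indicator column per distinct token, in sorted order.
    dummies = df[column].str.get_dummies(sep=sep)
    if drop_first:
        # Assumption: mirror pandas.get_dummies and drop the first level.
        dummies = dummies.iloc[:, 1:]
    # Keep the original columns and append the indicators.
    return pd.concat([df, dummies], axis=1)


X_cat2 = pd.DataFrame({"A": [1, None, 3],
                       "names": ["Fred,George", "George", "John,Paul"]})

result = expand_column_sketch(X_cat2, "names", sep=",", drop_first=True)
print(list(result.columns))  # ['A', 'names', 'George', 'John', 'Paul']
```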
