py-why/dowhy

data subset refuter bug when dataframe has categorical columns

Open

#1372 opened on Nov 28, 2025

bug, good first issue

Description

Describe the bug: The following error occurs with the distance matching estimator and the data subset refuter when the dataframe has a categorical column. It is caused by concatenating a reindexed dataframe with a dataframe that has not been reindexed, at the line here. The dataframe self._observed_common_causes is encoded in the script, and the encoding step resets the index of the encoded dataframe. The bug only appears when the data subset refuter is used: the refuter samples the original dataframe, so the sampled rows keep their original index labels and the reset index no longer matches them. Without sampling, the distance matching estimator still works, because the original and encoded dataframes have the same index values.

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match). [Trace ID: 00-b9354c2840feea7fea571bd8e74bcf5e-52edfa00b0c57816-00]

_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/databricks/python/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/dowhy/causal_refuters/data_subset_refuter.py", line 82, in _refute_once
    new_effect = new_estimator.estimate_effect(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.12/site-packages/dowhy/causal_estimators/distance_matching_estimator.py", line 178, in estimate_effect
    treated = updated_df.loc[data[self._target_estimand.treatment_variable[0]] == 1]
              ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 1413, in _getitem_axis
    return self._getbool_axis(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 1209, in _getbool_axis
    key = check_bool_indexer(labels, key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.12/site-packages/pandas/core/indexing.py", line 2662, in check_bool_indexer
    raise IndexingError(
pandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
"""

The index is reset by the encoder if the input dataframe has a categorical column.
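The mismatch can be illustrated with pandas alone. The snippet below is a minimal sketch, not DoWhy's actual code: the column names, the hand-picked row subset (standing in for the refuter's random sampling), and the use of pd.get_dummies followed by reset_index (standing in for the encoder) are all assumptions made for illustration. It shows the same IndexingError and one possible way to keep the indices aligned.

```python
import pandas as pd
from pandas.errors import IndexingError

# Toy data; the column names are hypothetical.
data = pd.DataFrame({
    "treatment": [1, 0, 1, 0, 1, 0],
    "region": pd.Categorical(["a", "b", "a", "c", "b", "c"]),
})

# Stand-in for the refuter's sampling: the subset keeps the original
# index labels (4, 2, 5, 1) instead of a fresh 0..n-1 range.
subset = data.iloc[[4, 2, 5, 1]]

# Stand-in for the encoder: one-hot encode, then reset the index to 0..3.
encoded = pd.get_dummies(subset[["region"]]).reset_index(drop=True)

# Concatenating the reset-index frame with the original-index frame
# aligns on labels, so the result has the union of both indices.
updated_df = pd.concat([encoded, subset[["treatment"]]], axis=1)

try:
    # The boolean mask carries the subset's index, which does not cover
    # every label of updated_df, so pandas refuses to align it.
    updated_df.loc[subset["treatment"] == 1]
    mismatch_raised = False
except IndexingError:
    mismatch_raised = True

# One possible fix (an assumption, not the maintainers' patch):
# give the encoded frame the subset's index before concatenating.
encoded_aligned = encoded.set_axis(subset.index, axis=0)
fixed_df = pd.concat([encoded_aligned, subset[["treatment"]]], axis=1)
treated = fixed_df.loc[subset["treatment"] == 1]
```

With the index carried over, the concatenated frame has exactly the subset's rows and the boolean mask selects the treated units as intended.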

Steps to reproduce the behavior: Use a dataframe with a categorical column, estimate the effect with the distance matching estimator, and then run the data subset refuter.

Expected behavior: The indices of the original and encoded dataframes should align, so the refuter runs without an IndexingError.

Version information:

  • DoWhy version 0.14
