XGBoostJsonParser not working well with 'binary features' · o19s/elasticsearch-learning-to-rank#152

Repository metrics

Stars: (1,456 stars)
PR merge metrics: (Avg merge 42d 4h) (2 merged PRs in 30d)

Description

The current setup of the plugin requires a feature map when creating the serialized XGBoost JSON model file (for an example of a feature map, see this).

In the feature map, each feature can be assigned one of 3 possible data types: q (quantitative), i (binary) and int (integer).
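For orientation, XGBoost's feature map is a plain-text file with one `<featureid> <featurename> <type>` entry per line. The following is a minimal sketch, not part of the plugin, that scans such lines and lists the features declared as i. The class name FeatureMapScan, the method binaryFeatures, and the sample feature names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class FeatureMapScan {
    // Returns the names of features whose declared type is "i" (binary indicator).
    // Assumes whitespace-separated "<featureid> <featurename> <type>" lines.
    static List<String> binaryFeatures(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 3 && parts[2].equals("i")) {
                names.add(parts[1]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical feature map with one feature of each type.
        List<String> featureMap = List.of(
                "0 title_score q",
                "1 click_count int",
                "2 is_promoted i");
        System.out.println(binaryFeatures(featureMap));
    }
}
```

Only the features reported by this scan produce split nodes of the second shape discussed in this issue.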

When the data type is int or q, each split node is serialized as follows:

    { "nodeid": 6, "depth": 2, "split": "f1", "split_condition": 5, "yes": 13, "no": 14, "missing": 14, "children": [
      { "nodeid": 13, "leaf": 0.000920585 },
      { "nodeid": 14, "leaf": -0.044742 }
    ]}

However, when the data type is i, each split node looks like this after serialization:

    { "nodeid": 4, "depth": 2, "split": "f2", "yes": 9, "no": 10, "children": [
      { "nodeid": 9, "leaf": 0.138548 },
      { "nodeid": 10, "leaf": -0.0143873 }
    ]}

In other words, the 'missing' and 'split_condition' fields are absent.

The current XGBoostJsonParser, however, explicitly checks that a split condition exists and therefore throws an exception when parsing binary split nodes. (The code below is copied from here:)

    boolean splitHasAllFields() {
        return nodeId != null && threshold != null && split != null && leftNodeId != null && rightNodeId != null && depth != null
                && children != null && children.size() == 2;
    }

What I suggest for the fix:

In the short term, declare all binary features as integer features in the feature map and document this limitation for users.
In the long run, revise splitHasAllFields() to account for the data type of the split node, drop the check on the split condition (threshold) altogether, or provide default values for binary split nodes.
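As a rough illustration of the second long-term option, the sketch below relaxes the check so that a node without a threshold still validates. SplitNodeSketch is a stand-in class, not the plugin's actual parser state. The field names are taken from the check quoted above, and the mapping of "yes"/"no" to leftNodeId/rightNodeId is an assumption.

```java
import java.util.List;

public class SplitNodeSketch {
    // Field names mirror the quoted check; the JSON key mapping is assumed.
    Integer nodeId;
    Integer depth;
    String split;
    Float threshold;      // "split_condition"; absent on binary split nodes
    Integer leftNodeId;   // assumed to come from "yes"
    Integer rightNodeId;  // assumed to come from "no"
    List<SplitNodeSketch> children;

    // Same conditions as the original check, minus the threshold requirement.
    boolean splitHasAllFields() {
        return nodeId != null && split != null && leftNodeId != null && rightNodeId != null
                && depth != null && children != null && children.size() == 2;
    }

    // Builds a node shaped like the binary example in this issue (no threshold).
    static SplitNodeSketch binaryExample() {
        SplitNodeSketch node = new SplitNodeSketch();
        node.nodeId = 4;
        node.depth = 2;
        node.split = "f2";
        node.leftNodeId = 9;
        node.rightNodeId = 10;
        node.children = List.of(new SplitNodeSketch(), new SplitNodeSketch());
        return node;
    }

    public static void main(String[] args) {
        System.out.println(binaryExample().splitHasAllFields());
    }
}
```

Dropping the check only gets a binary node past validation. Scoring would still need a rule for routing a document at a node that has no threshold, which this sketch does not address.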

Contributor guide

Research direction: Modify the XGBoostJsonParser's splitHasAllFields() method to tolerate absent 'split_condition' and 'missing' fields for binary features, and update the parsing logic to handle binary split nodes appropriately.
Tech stack: java
Domain: backend
Issue type: Bug
Difficulty: 3
Estimated time: Half day
Activity status: Active
Clarity: Clear
Prerequisites: Java, Elasticsearch
Newbie friendliness: 50
