XGBoostJsonParser not working well with 'binary features' · o19s/elasticsearch-learning-to-rank#152


Labels: docs, help wanted

Description

The plugin currently requires a feature map when creating the serialized XGBoost JSON model file (for an example of a feature map, see this).

In the feature map, each feature can be assigned one of 3 possible data types: q (quantitative), i (binary) and int (integer).
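For readers unfamiliar with the format, a feature map is a plain-text file with one `<featureid> <featurename> <type>` entry per line. The sketch below reads such entries and rejects unknown types. The class name `FeatureMapSketch` and the feature names in the usage note are hypothetical, and this is not code from the plugin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the plugin.
final class FeatureMapSketch {

    /** One feature map entry: "<featureid> <featurename> <type>". */
    static final class Entry {
        final int id;
        final String name;
        final String type; // "q", "i", or "int"

        Entry(int id, String name, String type) {
            this.id = id;
            this.name = name;
            this.type = type;
        }
    }

    /** Parses whitespace-separated feature map lines, skipping blank lines. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Malformed feature map line: " + line);
            }
            String type = parts[2];
            if (!type.equals("q") && !type.equals("i") && !type.equals("int")) {
                throw new IllegalArgumentException("Unknown feature type: " + type);
            }
            entries.add(new Entry(Integer.parseInt(parts[0]), parts[1], type));
        }
        return entries;
    }
}
```

For example, `FeatureMapSketch.parse("0 price q\n1 in_stock i\n2 rank int\n")` yields three entries, the second of which is a binary feature.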

When the data type is int or q, each split node will be serialized to look like below:

      { "nodeid": 6, "depth": 2, "split": "f1", "split_condition": 5, "yes": 13, "no": 14, "missing": 14, "children": [
        { "nodeid": 13, "leaf": 0.000920585 },
        { "nodeid": 14, "leaf": -0.044742 }
      ]}

However, when the data type is i, each split node looks like this after serialization:

      { "nodeid": 4, "depth": 2, "split": "f2", "yes": 9, "no": 10, "children": [
        { "nodeid": 9, "leaf": 0.138548 },
        { "nodeid": 10, "leaf": -0.0143873 }
      ]}

In other words, the 'missing' and 'split_condition' fields are absent.

The current XGBoostJsonParser, however, explicitly checks for the existence of a split condition and therefore throws an exception when parsing binary nodes. (The code below is copied from here:)

        boolean splitHasAllFields() {
            return nodeId != null && threshold != null && split != null && leftNodeId != null && rightNodeId != null && depth != null
                    && children != null && children.size() == 2;
        }

What I suggest for the fix:

In the short term, declare all binary features as integer features in the feature map, and document this limitation for users.
In the long run, revise splitHasAllFields() to account for the data type of the split nodes, or just eliminate the check on split conditions and threshold, or provide default values for binary split nodes.
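As a rough illustration of the long-run option, the check could treat the threshold as optional. The sketch below is self-contained and uses a hypothetical `SplitNodeSketch` class whose field names mirror those in `splitHasAllFields()` above. It is not the plugin's actual parser code, and it only relaxes validation. How a binary split should route a document at scoring time is left open and would need to be verified against XGBoost's behaviour.

```java
import java.util.List;

// Hypothetical stand-in for the parser's internal split-node holder.
// Field names follow splitHasAllFields() in the issue; the rest is assumed.
final class SplitNodeSketch {
    Integer nodeId;
    Integer depth;
    String split;
    Float threshold;      // "split_condition"; not serialized for binary (i) features
    Integer leftNodeId;   // "yes"
    Integer rightNodeId;  // "no"
    List<SplitNodeSketch> children;

    /** Fields that every split node carries, whatever the feature's data type. */
    boolean hasCommonSplitFields() {
        return nodeId != null && split != null && leftNodeId != null && rightNodeId != null
                && depth != null && children != null && children.size() == 2;
    }

    /** True when the node looks like a binary split, i.e. no split_condition was serialized. */
    boolean isBinarySplit() {
        return hasCommonSplitFields() && threshold == null;
    }

    /** Relaxed check: the threshold is no longer mandatory. */
    boolean splitHasAllFields() {
        return hasCommonSplitFields();
    }
}
```

With this shape, a node parsed from the binary example above (nodeid 4, no `split_condition`) passes `splitHasAllFields()`, and `isBinarySplit()` lets later code decide how to handle the missing threshold.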

Contributor guide

Tech stack: java
Domain: backend
Issue type: bug
Difficulty: 3
Estimated time: 1-3 hours
Activity status: stale
Clarity: clear
Prerequisites: Java, XGBoost model format, Elasticsearch, JSON parsing
Beginner friendliness: 40
Research direction: The issue is in XGBoostJsonParser.java, specifically the splitHasAllFields() method. The fix could involve making the threshold and missing fields optional for binary features, or providing default values. Investigate the XGBoost JSON structure for binary features and modify the parser accordingly. Check existing tests and documentation for how feature maps are handled.