[ASK] the encode of the feature in field · recommenders-team/recommenders#1114

Repository metrics

Stars: 17,706
PR merge metrics: average merge time 6d 16h; 10 merged PRs in 30d

Description

Hi,

I'm a little confused about how the class encodes the features within each field. Here is the example I looked at:

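import pandas as pd
# Import path in current releases; older releases used reco_utils.dataset.pandas_df_utils
from recommenders.datasets.pandas_df_utils import LibffmConverter

df_feature_original = pd.DataFrame({
    'rating': [1, 0, 0, 1, 1],
    'field1': ['xxx1', 'xxx2', 'xxx4', 'xxx4', 'xxx4'],
    'field2': [3, 4, 5, 6, 7],
    'field3': [1.0, 2.0, 3.0, 4.0, 5.0],
    'field4': ['1', '2', '3', '4', '5']
})
# Fit the converter on the frame, then encode each cell as field:feature:value
converter = LibffmConverter().fit(df_feature_original, col_rating='rating')
df_out = converter.transform(df_feature_original)
df_out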

  | rating | field1 | field2 | field3  | field4
0 | 1      | 1:1:1  | 2:4:3  | 3:5:1.0 | 4:6:1
1 | 0      | 1:2:1  | 2:4:4  | 3:5:2.0 | 4:7:1
2 | 0      | 1:3:1  | 2:4:5  | 3:5:3.0 | 4:8:1
3 | 1      | 1:3:1  | 2:4:6  | 3:5:4.0 | 4:9:1
4 | 1      | 1:3:1  | 2:4:7  | 3:5:5.0 | 4:10:1

I noticed that the feature indices are assigned in increasing order within one field, and only then are the features of the next field encoded. That is not the same as the example given by the FFM author, shown below:

Click  Advertiser  Publisher
=====  ==========  =========
    0        Nike        CNN
    1        ESPN        BBC
Then, you can generate FFM format data:
    0 0:0:1 1:1:1
    1 0:2:1 1:3:1

He encodes all the features of one example and then moves on to the next example, so the method in this repo is different. Does the difference affect the results?
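To make the two orderings concrete, here is a minimal sketch that numbers the features of the Advertiser/Publisher example both ways. It is not the code of LibffmConverter or libffm; the helper names (`index_field_by_field`, `index_example_by_example`, `to_ffm`) are hypothetical and exist only for this illustration, and indices start at 0 as in the libffm example.

```python
# Hypothetical helpers for illustration; not part of recommenders or libffm.

def index_field_by_field(rows, fields):
    """Assign feature indices one field (column) at a time."""
    index = {}
    for field in fields:
        for row in rows:
            key = (field, row[field])
            if key not in index:
                index[key] = len(index)
    return index


def index_example_by_example(rows, fields):
    """Assign feature indices one example (row) at a time."""
    index = {}
    for row in rows:
        for field in fields:
            key = (field, row[field])
            if key not in index:
                index[key] = len(index)
    return index


def to_ffm(rows, fields, index, label):
    """Render each row as 'label field:feature:1 ...' for categorical fields."""
    lines = []
    for row in rows:
        tokens = [
            f"{field_id}:{index[(field, row[field])]}:1"
            for field_id, field in enumerate(fields)
        ]
        lines.append(" ".join([str(row[label])] + tokens))
    return lines


rows = [
    {"Click": 0, "Advertiser": "Nike", "Publisher": "CNN"},
    {"Click": 1, "Advertiser": "ESPN", "Publisher": "BBC"},
]
fields = ["Advertiser", "Publisher"]

by_field = index_field_by_field(rows, fields)
by_example = index_example_by_example(rows, fields)

ffm_by_field = to_ffm(rows, fields, by_field, "Click")
ffm_by_example = to_ffm(rows, fields, by_example, "Click")
print(ffm_by_field)    # field-by-field numbering
print(ffm_by_example)  # example-by-example numbering

# Map each index under one numbering to its index under the other.
relabel = {by_field[key]: by_example[key] for key in by_field}
is_bijection = sorted(relabel.values()) == sorted(relabel.keys())
print(relabel, is_bijection)
```

In this sketch both numberings give every distinct (field, value) pair its own index, and the two differ only by a one-to-one relabeling of those indices.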

Other Comments

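In case my English is hard to understand, here is the same question again (translated from Chinese): The encoding rule in this repo is to encode the features within one field first, and then the features within the next field. The libffm encoding is to encode the features across all fields of one example, and then move on to the next example. Will these two feature encoding rules affect the final results?

Many thanks!

Waiting for your kind reply! Thanks.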

Contributor guide

Research direction: Compare the encoding methods: the repository encodes features field by field, while the original libffm encodes example by example. Investigate if this difference affects the model's ability to learn field interactions. Check the LibffmConverter source code and the libffm documentation to understand the intended format. Run a small experiment to compare results.
Tech stack: python
Domain: machine learning
Issue type: Research
Difficulty: 2
Estimated time: Under 1 hour
Activity status: Active
Clarity: Mostly clear
Prerequisites: Python, Machine Learning basics
Newbie friendliness: 70
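
As a starting point for the small experiment suggested in the research direction, the sketch below checks one narrow property: the FFM pairwise interaction score of an example does not change when feature indices are relabeled one-to-one and the latent vectors are moved along with them. It is an assumption-laden illustration, not libffm's implementation: `ffm_score` is a hypothetical helper, it covers only the pairwise term (no bias or linear terms), the weights are random rather than trained, and it says nothing about run-to-run differences caused by random initialization or feature hashing during training.

```python
import math
import random


def ffm_score(features, weights):
    """Pairwise FFM term: sum of <w[j1, f2], w[j2, f1]> * x1 * x2 over feature pairs.

    features: list of (field_id, feature_id, value) triples for one example.
    weights:  dict mapping (feature_id, field_id) to a latent vector (list of floats).
    """
    score = 0.0
    for a in range(len(features)):
        f1, j1, x1 = features[a]
        for b in range(a + 1, len(features)):
            f2, j2, x2 = features[b]
            dot = sum(p * q for p, q in zip(weights[(j1, f2)], weights[(j2, f1)]))
            score += dot * x1 * x2
    return score


rng = random.Random(0)
n_features, n_fields, k = 4, 2, 3

# Random latent vectors under numbering A (field-by-field).
weights_a = {
    (j, f): [rng.gauss(0.0, 1.0) for _ in range(k)]
    for j in range(n_features)
    for f in range(n_fields)
}

# The example "1 0:1:1 1:3:1" (ESPN, BBC) under numbering A.
example_a = [(0, 1, 1.0), (1, 3, 1.0)]

# One-to-one relabeling from numbering A to numbering B (example-by-example).
relabel = {0: 0, 1: 2, 2: 1, 3: 3}
example_b = [(f, relabel[j], x) for f, j, x in example_a]

# Move each latent vector to the new index of its feature.
weights_b = {(relabel[j], f): vec for (j, f), vec in weights_a.items()}

score_a = ffm_score(example_a, weights_a)
score_b = ffm_score(example_b, weights_b)
print(math.isclose(score_a, score_b))
```

A fuller experiment would train a model on both encodings of the same data and compare evaluation metrics; that step depends on the FFM library used and is left out here.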
