XGBoost Ranker is not smooth

Posted on 2025-01-25 02:26:58

I have the following toy problem centered on baseball.

In this baseball dataset, I have a lot of data about tournament results over many years, and for each result I am able to collect the following information:

The final ranking of each team that participated.
The batting average of that team over the entire tournament.

Say that in one tournament, the 2005 World Cup, 8 teams participated. I can then generate the following dataframe:

Tourny, Team, Rank, Batting Avg,
2005-WC, Jays, 1, 0.45,
2005-WC, Cards, 2, 0.25,
2005-WC, Ravens, 3, 0.85,
2005-WC, Crows, 4, 0.23,
2005-WC, Jays, 5, 0.11,
...

If I have multiple tournaments from different years, I can extend the list and get lots of data.

I can then ask the question: is batting average useful in predicting the final rank of a team?
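
As a rough, model-free sanity check on that question, one option is Spearman's rank correlation between final rank and batting average within a single tournament. The sketch below is only an illustration, not part of the XGBRanker setup: the helper names `ranks_descending` and `spearman_no_ties` are made up here, it assumes no tied values, and it uses only the five rows listed in the table (the remaining rows are elided).

```python
def ranks_descending(values):
    """Rank values so that the largest value gets rank 1 (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_no_ties(final_ranks, batting_avgs):
    """Spearman's rho between final rank and batting-average rank (no ties)."""
    n = len(final_ranks)
    batting_ranks = ranks_descending(batting_avgs)
    d_squared = sum((r - b) ** 2 for r, b in zip(final_ranks, batting_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# The five rows listed for 2005-WC; the other rows are not shown.
final_ranks = [1, 2, 3, 4, 5]
batting_avgs = [0.45, 0.25, 0.85, 0.23, 0.11]

rho = spearman_no_ties(final_ranks, batting_avgs)
print(round(rho, 3))
```

A value near 1 would suggest that higher batting averages go with better final ranks in that tournament, while a value near 0 would suggest little relationship.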

This seems like the question that XGBRanker should be able to answer.

We can then plug this data into a vanilla XGBRanker as follows:

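import xgboost as xgb

model = xgb.XGBRanker(
    max_depth=10,
    learning_rate=0.01,
    n_estimators=100,
    objective='rank:pairwise',
    booster='gbtree',
    gamma=5,
    min_child_weight=1,
    subsample=0.1,
    colsample_bytree=1,
    reg_alpha=0.5,
    reg_lambda=0.5,
    base_score=0.5,
    seed=42,
)

# df is the dataframe described above; rows of each tourney must be contiguous.
X_train = df[['Batting Avg']]  # 2-D input with a single feature column
Y_train = df['Rank']
# Number of entries per tourney, in the order the tourneys appear in df.
groups = df.groupby('Tourny', sort=False).size().to_numpy()

model.fit(X_train.values, Y_train.values, group=groups)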

After training (which is very fast), we make the following plot to see how Rank changes with the team's batting average.

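import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.001)
y = model.predict(x.reshape(-1, 1))  # predict expects a 2-D array
plt.plot(x, y)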

As a general statement, the plot (whose X values are normalized, which is why the axis isn't 0-1) shows that, roughly, a better batting average means a better rank. Exactly what we expect.

However, if you start zooming in, you start finding extremely undesirable traits.
Namely, a very small change in batting average can DRASTICALLY change your rank. While we do not expect a fully monotonic result, we do expect the curve to be way more SMOOTH, given that only 1 input feature is being used.
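
To make "not smooth" a bit more concrete, here is a minimal sketch of how the jumpiness of the predicted curve could be measured. The function name `jumpiness` and the sample scores are made up for illustration, and the two measures (largest step between neighbouring grid points, and number of direction reversals) are just one possible choice.

```python
def jumpiness(predictions):
    """Return (largest absolute step, number of direction reversals).

    `predictions` is a sequence of model outputs on an evenly spaced grid
    of batting averages; it must contain at least two values.
    """
    diffs = [b - a for a, b in zip(predictions, predictions[1:])]
    largest_jump = max(abs(d) for d in diffs)
    # Ignore flat segments so plateaus do not count as reversals.
    nonzero = [d for d in diffs if d != 0]
    reversals = sum(
        1 for a, b in zip(nonzero, nonzero[1:]) if (a > 0) != (b > 0)
    )
    return largest_jump, reversals


# Made-up scores standing in for model output on a coarse grid.
scores = [0.0, 0.1, 0.5, 0.4, 0.45, 0.9]
largest_jump, reversals = jumpiness(scores)
print(largest_jump, reversals)
```

Passing the predictions from the plot as a list would give the same two numbers for the real curve. A smoother curve would show a small largest step and few or no reversals.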

Can someone help me understand this? While there is nothing ABSOLUTELY wrong on the surface, these are not desirable traits to have. I am not asking for monotonicity, but a smoother graph would be far more intuitive.

What kind of parameters do I need to tune to make it work better?

For reference, I am using Python.
