XGBoost Ranker output is not smooth

Posted on 2025-01-25 02:26:58

I have the following toy problem centered around baseball.

In this baseball dataset, I have tournament results spanning many years, and for each result I can collect the following information:

  1. The final ranking of each team that participated.
  2. The batting average of that team over the entire tournament.

Say 8 teams participated in one tournament, the 2005 World Cup. I can then generate the following dataframe:

Tourny, Team, Rank, Batting Avg,
2005-WC, Jays, 1, 0.45,
2005-WC, Cards, 2, 0.25,
2005-WC, Ravens, 3, 0.85,
2005-WC, Crows, 4, 0.23,
2005-WC, Jays, 5, 0.11,
...

If I have multiple tournaments from different years, I can extend this list and get lots of data.
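
A minimal sketch of how such a table could be held in pandas and how the number of entries per tournament could be derived. The `2005-WC` rows are the ones shown above (including the repeated `Jays` entry); the `2006-WC` rows and their values are made up purely for illustration, and `group_sizes` is just an illustrative name.

```python
import pandas as pd

# Rows for 2005-WC are copied from the sample above.
# Rows for 2006-WC are hypothetical and only illustrate a second tournament.
rows = [
    ("2005-WC", "Jays", 1, 0.45),
    ("2005-WC", "Cards", 2, 0.25),
    ("2005-WC", "Ravens", 3, 0.85),
    ("2005-WC", "Crows", 4, 0.23),
    ("2005-WC", "Jays", 5, 0.11),
    ("2006-WC", "Cards", 1, 0.40),   # hypothetical
    ("2006-WC", "Crows", 2, 0.30),   # hypothetical
    ("2006-WC", "Ravens", 3, 0.20),  # hypothetical
]
df = pd.DataFrame(rows, columns=["Tourny", "Team", "Rank", "Batting Avg"])

# One size per tournament, in order of first appearance (sort=False),
# so the sizes line up with the row order of the frame.
group_sizes = df.groupby("Tourny", sort=False).size().tolist()
print(group_sizes)
```

The sizes only describe the data correctly if all rows of a tournament sit next to each other in the frame.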

I can then ask the question: is batting average useful in predicting the final rank of a team?

This seems like a question that XGBRanker should be able to answer.

We can then plug this data into a vanilla XGBRanker as follows:

import xgboost as xgb

model = xgb.XGBRanker(
    max_depth=10,
    learning_rate=0.01,
    n_estimators=100,
    objective='rank:pairwise',
    booster='gbtree',
    gamma=5,
    min_child_weight=1,
    subsample=0.1,
    colsample_bytree=1,
    reg_alpha=0.5,
    reg_lambda=0.5,
    base_score=0.5,
    seed=42,
)

# The feature matrix must be 2-D, so select the column with a list.
X_train = df[['Batting Avg']]
Y_train = df['Rank']
# Number of entries per tourney; rows of one tourney must be contiguous in df.
groups = df.groupby('Tourny', sort=False).size().tolist()

model.fit(X_train.values, Y_train.values, group=groups)
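
One thing worth checking in the call above: XGBoost's ranking objectives treat the label as a relevance grade, where a larger value means the item should be ranked higher, whereas `Rank` uses 1 for the best team. A minimal sketch of one possible conversion, assuming a frame with the `Tourny` and `Rank` columns from the question; the `relevance` column and the tiny frame are illustrative, not part of the original code.

```python
import pandas as pd

# Tiny illustrative frame; only the column names come from the question.
df = pd.DataFrame({
    "Tourny": ["2005-WC"] * 5,
    "Rank": [1, 2, 3, 4, 5],
})

# Within each tournament, map rank 1 to the highest relevance and last place to 0.
worst_rank = df.groupby("Tourny")["Rank"].transform("max")
df["relevance"] = worst_rank - df["Rank"]
print(df["relevance"].tolist())
```

With labels like these, a higher predicted score would correspond to a better finishing position.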

After training (which is very fast), we make the following plot to see how the predicted ranking changes with a team's batting average.

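import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 1, 0.001)
y = model.predict(x.reshape(-1, 1))  # predict expects shape (n_samples, n_features)
plt.plot(x, y)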

In general, the plot shows that a better batting average roughly means a better rank, which is exactly what we expect (the X values are normalized, which is why the axis does not run from 0 to 1):

[Image: plot of the model prediction against batting average]

However, if you start zooming in, you find some extremely undesirable traits.
Namely, a very small change in batting average can drastically change the predicted rank. While we do not expect a fully monotonic result, we do expect it to be far smoother, given that only one feature is being used.
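
Part of this is structural: a `gbtree` model is a sum of decision trees, so on a single feature its prediction is a piecewise-constant step function, and every split threshold adds a jump. A rough sketch of how the jaggedness could be quantified on a prediction grid; `count_direction_changes` is a hypothetical helper, and the two arrays are synthetic stand-ins rather than real model output.

```python
import numpy as np

def count_direction_changes(y):
    """Count how often a curve switches between rising and falling."""
    steps = np.diff(np.asarray(y, dtype=float))
    steps = steps[steps != 0]          # ignore flat segments of a step function
    signs = np.sign(steps)
    return int(np.sum(signs[1:] != signs[:-1]))

x = np.arange(0, 1, 0.001)
smooth = x.copy()                                  # monotone stand-in curve
rng = np.random.default_rng(0)
jagged = x + rng.normal(scale=0.05, size=x.size)   # noisy stand-in curve

print(count_direction_changes(smooth), count_direction_changes(jagged))
```

Applied to the output of `model.predict` on the same grid, a lower count after a parameter change would indicate a smoother curve.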

[Image: zoomed-in view of the same plot]

Can someone help me understand this? While on the surface there is nothing absolutely wrong, these are not desirable traits to have. I am not asking for monotonicity, but a smoother graph would be far more intuitive.

What parameters do I need to tune to make it work better?
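
As a starting point rather than a verified fix: with `max_depth=10` each tree can have up to 2**10 leaves, and with `subsample=0.1` each tree is fit on a random 10% of the rows, both of which allow many small steps. The sketch below compares the settings from the question with an illustrative alternative. The alternative values are assumptions and were not tuned on any data, and `max_leaves_total` is a hypothetical helper.

```python
# Settings taken from the question.
original = {"max_depth": 10, "n_estimators": 100, "subsample": 0.1, "min_child_weight": 1}

# Illustrative alternative values (assumptions, not tuned on any data):
# shallower trees, more rows per tree, larger minimum leaf weight.
candidate = {"max_depth": 3, "n_estimators": 100, "subsample": 0.8, "min_child_weight": 10}

def max_leaves_total(params):
    """Upper bound on leaves across the ensemble: n_estimators * 2**max_depth."""
    return params["n_estimators"] * 2 ** params["max_depth"]

print(max_leaves_total(original))   # 102400
print(max_leaves_total(candidate))  # 800
```

The candidate dictionary could be tried with `xgb.XGBRanker(objective='rank:pairwise', **candidate)` and the resulting plot compared against the original one.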

For reference, I am using Python.
