Are large outliers a problem for LightGBM?
I have a classification task at hand. I'm using lightgbm for that.
One particular feature has the following histogram:
Basically, all values are nicely concentrated on the left, with a few values on the right.
LightGBM uses approximate, histogram-based splits rather than exact ones, so it has to bin each feature and find bin edges.
Does anyone know exactly how the bin ranges are defined? I'm asking because if too few bins are built, the whole variable could become useless, since too many values would be lumped together in the lowest bins. Ultimately, the question is: what are the consequences of having a few very high values in a column?
Also, this seems to be a relevant piece of code, but I'm not good enough with C++ to read and understand it relatively quickly.
UPD: Just to clarify, the visualization here is poor. The largest number is 5.02e03, not 6265.02e03.
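To make the concern concrete, here is a minimal sketch in plain Python (not LightGBM's actual bin-finding code) that compares equal-width bin edges with equal-frequency bin edges on a column shaped like the one described: most values small, a few at 5.02e03. The helper names, the 16-bin setting, and the synthetic data are all assumptions made for illustration. My understanding is that LightGBM derives bin boundaries from the distribution of sampled values rather than by cutting the range into equal-width intervals, which would put it closer to the second variant, but that should be verified against the source.

```python
import random
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Interior edges that cut [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Interior edges placed so each bin holds roughly the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, n_bins):
        edge = ordered[i * n // n_bins]
        # Skip repeated edges so the edge list stays strictly increasing.
        if not edges or edge > edges[-1]:
            edges.append(edge)
    return edges


def bin_counts(values, edges):
    """Count how many values fall into each bin defined by the interior edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return counts


# Synthetic column (assumption): 995 small values plus 5 outliers at 5.02e03.
random.seed(0)
values = [random.uniform(0.0, 10.0) for _ in range(995)]
values += [5020.0] * 5

width_counts = bin_counts(values, equal_width_edges(values, 16))
freq_counts = bin_counts(values, equal_frequency_edges(values, 16))

print("equal-width bin counts:    ", width_counts)
print("equal-frequency bin counts:", freq_counts)
```

With equal-width edges, all 995 small values land in the first bin and the feature loses its resolution, which is exactly the worry in the question. With count-based edges, the outliers simply share the top bin with the largest ordinary values and the rest of the column keeps its resolution. If bin resolution is still a concern, max_bin and min_data_in_bin are the LightGBM parameters that control it.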