Vertex AI tabular AutoML rejects rows containing nulls

Posted 2025-02-11 21:56:15 · 1067 words · 2 views · 0 comments

I am trying to build a binary classifier based on a tabular dataset that is rather sparse, but training is failing with the following message:

Training pipeline failed with error message: Too few input rows passed validation. Of 1169548 inputs, 194 were valid. At least 50% of rows must pass validation.

My understanding was that tabular AutoML should be able to handle null values, so I'm not sure what's happening here, and I would appreciate any suggestions. The documentation explicitly mentions reviewing each column's nullability, but I don't see any way to set or check a column's nullability on the dataset tab (perhaps the documentation is out of date?). The documentation also explicitly states that missing values are treated as null, which is how I've set up my CSV. The documentation for numeric columns, however, does not explicitly list support for missing values, just NaN and inf.

The dataset is about 1 million rows and 34 columns, and only 189 rows are null-free. My sparsest column has data in 5,000 rows, and the next sparsest have data in 72k and 274k rows respectively. Columns are a mix of categorical and numeric, and only a handful of columns have no nulls.
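The number of null-free rows (189) is close to the number of rows that passed validation (194), which suggests that rows with any empty cell are being rejected. A minimal sketch for checking this locally before training is below. It assumes nulls are written as empty cells in the CSV; the function name `validation_summary` and the sample data are hypothetical, and the 50% threshold is taken from the error message above.

```python
import csv
import io


def validation_summary(csv_text, null_tokens=("",)):
    """Count rows with no missing cells and missing cells per column.

    Assumes the first CSV row is a header and that a cell is "missing"
    when its stripped text is one of null_tokens (empty string by default).
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    total = 0
    complete = 0
    missing_per_column = dict.fromkeys(header, 0)

    for row in reader:
        total += 1
        # Pad short rows so trailing absent cells count as missing.
        cells = row + [""] * (len(header) - len(row))
        row_complete = True
        for name, cell in zip(header, cells):
            if cell.strip() in null_tokens:
                missing_per_column[name] += 1
                row_complete = False
        if row_complete:
            complete += 1

    fraction = complete / total if total else 0.0
    return {
        "total_rows": total,
        "complete_rows": complete,
        "complete_fraction": fraction,
        # The error message requires at least 50% of rows to pass validation.
        "meets_50_percent": fraction >= 0.5,
        "missing_per_column": missing_per_column,
    }


# Hypothetical sample: two of the four data rows contain an empty cell.
sample = "a,b,label\n1,,0\n2,x,1\n,,0\n3,y,1\n"
summary = validation_summary(sample)
print(summary)
```

If the complete-row count from a check like this matches the "valid" count in the error, the rejection is most likely driven by missing values rather than by malformed data.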

The data is stored as a CSV, and the dataset import seems to run without issue. Generate statistics ran on the dataset, but for some reason the Missing % column failed to populate. What might be the best way to address this? I'm not sure whether I need to change my null representation in the CSV, change some dataset or training setting, or whether it's an AutoML bug (less likely). Thanks!

Image of missing % column being blank

Comments (2)

过潦 2025-02-18 21:56:15

To allow invalid and null values during training and prediction, you have to explicitly set the allow invalid values flag to Yes during training, as shown in the image below. You can find this setting under the model training settings on the dataset page. The flag has to be set column by column.

[Image: the allow invalid values setting]
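For jobs defined in code rather than in the console, the same per-column choice can be sketched as a list of column transformations. This is an illustration under assumptions, not a confirmed recipe: the `invalid_values_allowed` key is assumed to match the numeric transformation field in the AutoML Tables training schema, the categorical transformation is assumed to have no such field, and the helper and column names are hypothetical. Check the current Vertex AI reference before relying on it.

```python
def build_transformations(numeric_columns, categorical_columns):
    """Build a per-column transformation list for a tabular training job.

    Assumption: numeric transformations accept an `invalid_values_allowed`
    flag; verify the exact field name against the Vertex AI documentation.
    """
    transformations = []
    for name in numeric_columns:
        transformations.append(
            {"numeric": {"column_name": name, "invalid_values_allowed": True}}
        )
    for name in categorical_columns:
        # Assumption: no invalid-values flag exists for categorical columns.
        transformations.append({"categorical": {"column_name": name}})
    return transformations


# Hypothetical column names, for illustration only.
transformations = build_transformations(
    numeric_columns=["age", "balance"],
    categorical_columns=["region"],
)
print(transformations)
```

A list shaped like this is what the `column_transformations` argument of `AutoMLTabularTrainingJob` in the `google-cloud-aiplatform` SDK takes, although whether the flag is honored there is something to confirm against the SDK documentation.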

说好的呢 2025-02-18 21:56:15

I tried @Kabilan Mohanraj's suggestion and it resolved my issue. What I had to do was click the dropdown to allow invalid values into training. After making this change, all rows passed validation and my model was able to train without issue. I'd initially assumed that missing values would not count as invalid, which was incorrect.
