I am trying to build a binary classifier based on a tabular dataset that is rather sparse, but training is failing with the following message:
Training pipeline failed with error message: Too few input rows passed validation. Of 1169548 inputs, 194 were valid. At least 50% of rows must pass validation.
My understanding was that tabular AutoML should be able to handle null values, so I'm not sure what's happening here, and I would appreciate any suggestions. The documentation explicitly mentions reviewing each column's nullability, but I don't see any way to set or check a column's nullability on the dataset tab (perhaps the documentation is out of date?). Additionally, the documentation explicitly mentions that missing values are treated as null, which is how I've set up my CSV. The documentation for numeric columns, however, does not explicitly list support for missing values, just NaN and inf.
The dataset is 1 million rows, 34 columns, and only 189 rows are null-free. My most sparse column has data in 5,000 unique rows, with the next rarest having data in 72k and 274k rows respectively. Columns are a mix of categorical and numeric, with only a handful of columns without nulls.
The data is stored as a CSV, and the Dataset import seems to run without issue. Generate statistics ran on the dataset, but for some reason the missing % column failed to populate. What might be the best way to address this? I'm not sure if this is a case where I need to change my null representation in the CSV, change some dataset/training setting, or if it's an AutoML bug (less likely). Thanks!
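Since the missing % column did not populate, one way to get the same numbers locally is to profile the CSV directly. The snippet below is only a rough sketch using Python's standard `csv` module. It assumes that an empty field is the null representation, and the function name `profile_missing` and the sample column names are made up for illustration. It reports the total row count, the number of null-free rows, and the missing percentage per column.

```python
import csv
import io


def profile_missing(csv_file, null_tokens=("",)):
    """Return (total_rows, null_free_rows, missing_pct_by_column) for a CSV stream.

    Assumption: a field counts as missing when it is absent or, after
    stripping whitespace, equals one of `null_tokens` (empty string by default).
    """
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing = {name: 0 for name in fieldnames}
    total = 0
    null_free = 0
    for row in reader:
        total += 1
        row_is_complete = True
        for name in fieldnames:
            value = row[name]  # None when the row is shorter than the header
            if value is None or value.strip() in null_tokens:
                missing[name] += 1
                row_is_complete = False
        if row_is_complete:
            null_free += 1
    pct = {
        name: (100.0 * count / total if total else 0.0)
        for name, count in missing.items()
    }
    return total, null_free, pct


# Tiny in-memory sample standing in for the real file (hypothetical columns).
sample = io.StringIO("age,color,score\n31,red,0.5\n,blue,\n45,,0.9\n")
total, null_free, pct = profile_missing(sample)
print(f"rows: {total}, null-free rows: {null_free}")
for name, value in pct.items():
    print(f"{name}: {value:.1f}% missing")
```

For the real dataset, the same function could be called on `open("data.csv", newline="")`, where the file name is a placeholder. If the null-free count it reports is close to the number of valid rows in the error message, that would suggest missing values are what is failing validation.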

Comments (2)
To allow invalid and null values during training and prediction, we have to explicitly set the "Allow invalid values" flag to Yes during training, as shown in the image below. You can find this setting under the model training settings on the dataset page. The flag has to be set on a column-by-column basis.

I tried @Kabilan Mohanraj's suggestion and it resolved my issue. What I had to do was click the dropdown to allow invalid values into training. After making this change, all rows passed validation and my model was able to train without issue. I'd initially assumed that missing values would not count as invalid, which was incorrect.
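For anyone trying to reason about the error message, here is a toy model of the behaviour described in this thread. It is not the actual AutoML validation code, only a sketch under these assumptions: a missing value makes a row invalid unless that column has been set to allow invalid values, and training needs at least 50% of rows to pass. The names `passes_validation`, `valid_fraction`, and the sample columns are hypothetical.

```python
MIN_VALID_FRACTION = 0.5  # threshold quoted in the error message


def passes_validation(row, allow_invalid):
    """A row passes if every missing value sits in a column that allows invalid values."""
    return all(
        value is not None or allow_invalid.get(column, False)
        for column, value in row.items()
    )


def valid_fraction(rows, allow_invalid):
    """Fraction of rows that pass validation under the given per-column flags."""
    if not rows:
        return 0.0
    return sum(passes_validation(row, allow_invalid) for row in rows) / len(rows)


# Small made-up dataset: only the first row is null-free.
rows = [
    {"age": 31, "color": "red", "score": 0.5},
    {"age": None, "color": "blue", "score": None},
    {"age": 45, "color": None, "score": 0.9},
    {"age": None, "color": None, "score": 0.1},
]

strict = valid_fraction(rows, {})  # no column allows invalid values
relaxed = valid_fraction(rows, {"age": True, "color": True, "score": True})

print(f"strict:  {strict:.2%} valid, trainable: {strict >= MIN_VALID_FRACTION}")
print(f"relaxed: {relaxed:.2%} valid, trainable: {relaxed >= MIN_VALID_FRACTION}")

# The pass rate reported in the original error, for comparison with the threshold.
observed = 194 / 1169548
print(f"observed: {observed:.4%} valid, trainable: {observed >= MIN_VALID_FRACTION}")
```

In this model the flag only helps for the columns where it is set, which matches the note above that it has to be set column by column: a row with a missing value in any column still left at the default fails validation.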