Best practices for representing machine learning input data
I am developing a C++ machine learning library for some of my own work, and I am curious about best practices for representing input data. Right now, I am thinking about using a DataManager class that handles the I/O operations for reading the data in from a file, from a stream, etc. In developing this, I realized that it was also necessary to create classes to manage feature labels (to associate with the input data) and class labels (in the case of training data).
Therefore, my implementation has a class that reads data from a file (I'm using the UCI machine learning repository) into a boost::variant object. The DataManager class overloads operator>> so that I can read each comma-delimited feature value from the line supplied; if the feature value is '?', it inputs struct t_missing {}.
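For reference, the parsing step described above might look roughly like the sketch below. It is only an illustration under a few assumptions: it uses C++17 std::variant as a stand-in for boost::variant, the helper names parse_field and parse_line are hypothetical, and it uses free functions instead of the overloaded operator>> to stay short. It does not handle quoted fields or embedded commas.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Tag type for a missing value ('?' in the input), as in the question.
struct t_missing {};

// One feature value: numeric, categorical (kept as text), or missing.
// std::variant is used here in place of boost::variant (an assumption).
using FeatureValue = std::variant<double, std::string, t_missing>;

// Hypothetical helper: convert a single comma-separated field.
FeatureValue parse_field(const std::string& field) {
    if (field == "?") return t_missing{};
    try {
        std::size_t consumed = 0;
        double number = std::stod(field, &consumed);
        if (consumed == field.size()) return number;  // whole field was numeric
    } catch (const std::exception&) {
        // Not a number: fall through and keep the text as-is.
    }
    return field;
}

// Hypothetical helper: split one line on commas and convert each field.
std::vector<FeatureValue> parse_line(const std::string& line) {
    std::vector<FeatureValue> values;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        values.push_back(parse_field(field));
    }
    return values;
}
```

Calling parse_line("5.1,?,Iris-setosa") would yield a double, a t_missing, and a std::string, which can then be inspected with std::holds_alternative or std::visit.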
For the class/feature managers, I am thinking that maintaining a linked list of feature/class names and the number of instances falling within each would be appropriate.
Anyway, this was just my initial thought on such a class, and I would love to hear some other thoughts/suggestions on the implementation. Showing code is not necessary; I'm mostly just interested in hearing about other things I should perhaps consider.
Thanks!
Comments (1)
Some learning problems involve sparse data, that is, data with a large number of possible features, most of which are zero. In this case it is much more efficient to store only the non-zero features.
This is usually the case with SVM libraries such as LibSVM, which stores vectors as lists of (feature_index, feature_value) pairs. For example, the format they would use for a vector:
would be: (indexes start from 1)
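As a rough sketch of the sparse representation described here, the code below converts a dense vector into (feature_index, feature_value) pairs with 1-based indices and prints them as index:value tokens. The names SparseVector, to_sparse, and to_text are hypothetical, the class label that normally starts a line in a LibSVM data file is left out, and the vector in the usage note is made up for illustration.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// A sparse vector as (feature_index, feature_value) pairs, indices from 1.
using SparseVector = std::vector<std::pair<std::size_t, double>>;

// Keep only the non-zero entries of a dense vector.
SparseVector to_sparse(const std::vector<double>& dense) {
    SparseVector sparse;
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0) {
            sparse.emplace_back(i + 1, dense[i]);  // shift to 1-based index
        }
    }
    return sparse;
}

// Render the pairs as space-separated "index:value" tokens (label omitted).
std::string to_text(const SparseVector& sparse) {
    std::ostringstream out;
    for (std::size_t k = 0; k < sparse.size(); ++k) {
        if (k > 0) out << ' ';
        out << sparse[k].first << ':' << sparse[k].second;
    }
    return out.str();
}
```

With the made-up dense vector {1, 0, 0, 3, 0, 0, 0, 5}, to_sparse keeps three pairs and to_text returns "1:1 4:3 8:5". Storage then grows with the number of non-zero features, not with the total number of possible features.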