Good data set for preprocessing
I am enrolled in an undergraduate course in data mining, and I have an assignment to code a data mining preprocessor. I am free to choose the programming language and the data set. Could anybody suggest a good data set to use? I have been going through the UCI Repository and have found many more such resources, but as a beginner I am not sure which data set would be a good choice. The preprocessor should handle the following:
- Data cleaning
- Missing Values
- Errors
- Outliers
- Normalization
- De-duplication
- Data Reduction
- Sampling Techniques
- Dimensionality Reduction
What kind of properties should I consider when choosing the data set? Any specific data set you would suggest?
Comments (1)
You answered your own question. The UCI repository categorizes its data sets, so pick a few that have the properties you listed. You can choose any one of them to start experimenting with.
To start with, if I were you, I would proceed stepwise: get a feel for what each of these issues looks like and how it affects classifier performance. Choose some of the popular data sets, since they serve as benchmarks in most research papers. Many of the items you listed are separate machine learning problems in their own right, each with a lot of ongoing research.
I would start with something like this:
- Missing values: Iris, Voting, Heart Disease
- Duplicates: the 921,810-song data set (not from UCI, I think)
- Normalization: any continuous-valued data set whose features have different ranges
- Sampling techniques: Pima
- Dimensionality reduction: Swiss Roll
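To make a few of these steps concrete, here is a minimal sketch in plain Python of de-duplication, mean imputation, and min-max normalization. The records, column names, and helper functions are all made up for illustration; they do not come from any of the data sets above, and a real preprocessor would also need to handle categorical columns, outliers, and so on.

```python
from statistics import mean

def drop_duplicates(rows):
    """Keep the first occurrence of each exact-duplicate row."""
    seen = set()
    unique = []
    for r in rows:
        key = tuple(sorted(r.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def impute_mean(rows, column):
    """Replace None in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def min_max_normalize(rows, column):
    """Rescale `column` to [0, 1]; a constant column maps to 0.0."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [{**r, column: (r[column] - lo) / span if span else 0.0} for r in rows]

# Hypothetical toy records, not taken from any real data set.
raw = [
    {"age": 20, "income": 1000},
    {"age": None, "income": 3000},  # missing value
    {"age": 40, "income": 5000},
    {"age": 20, "income": 1000},    # exact duplicate of the first row
]

clean = drop_duplicates(raw)
clean = impute_mean(clean, "age")
for col in ("age", "income"):
    clean = min_max_normalize(clean, col)

for row in clean:
    print(row)
```

Duplicates are dropped before imputing so that repeated rows do not bias the mean used to fill the gaps. That ordering is a design choice for this sketch, not a rule.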
Another good way to find data sets is to look at the relevant publications. For dimensionality reduction, look at the PCA and ISOMAP papers; for sampling, see the SMOTE paper. See what kind of data they use in their experiments and proceed accordingly.
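As a starting point for the sampling part, here is a rough sketch of plain random oversampling on an imbalanced two-class toy set. Note that this is not SMOTE: SMOTE synthesizes new minority points by interpolating between neighbors, while this baseline only repeats existing rows. The function name, the `label` key, and the data are assumptions made up for the example.

```python
import random
from collections import Counter

def random_oversample(rows, label_key, seed=0):
    """Repeat randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement to make up the shortfall for this class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical imbalanced data: 8 rows of class 0, 2 rows of class 1.
data = [{"x": i, "label": 0} for i in range(8)]
data += [{"x": 100 + i, "label": 1} for i in range(2)]

balanced = random_oversample(data, "label")
counts = Counter(r["label"] for r in balanced)
print(counts)
```

Comparing a classifier trained on `data` with one trained on `balanced` is one simple way to see the effect of sampling on performance.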