How to classify text when predefined categories are not available
I have a problem and am not sure which algorithm to apply.
I am thinking of applying clustering in case two, but have no idea about case one:
I have 500,000 credit card activity documents. Each document is well defined and contains one transaction per line: the date, the amount, the retailer name, and a short 5-20 word description of the retailer.
Sample:
2004-11-47,$500,Amazon,An online retailer providing goods and services including books, hardware, music, etc.
Questions:
1. How would you classify each entry given no predefined categories?
2. How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?
Comments (1)
1) How would you classify each entry given no predefined categories?
You wouldn't. Instead, you'd use some dimensionality reduction algorithm on the data's features to plot them in 2-d, make a guess at the number of "natural" clusters, then run a clustering algorithm.
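As a rough illustration of the clustering step, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example rather than something from the question: the four sample rows, the tiny stopword list, TF-IDF as the feature representation, and the choice of k = 2. The 2-d projection used to eyeball the number of clusters is left out, and on 500,000 documents you would likely reach for a library rather than hand-rolled k-means.

```python
import math
import re
from collections import Counter

# Hypothetical rows in the question's format: date, amount, retailer, description.
ROWS = [
    "2004-11-03,$500,Amazon,An online retailer providing books and music",
    "2004-11-05,$42,Barnes Books,A retailer selling books and music online",
    "2004-11-09,$18,Luigi's,An Italian restaurant serving pizza and pasta",
    "2004-11-12,$27,Pasta Palace,A family restaurant serving pasta and pizza",
]
STOPWORDS = {"a", "an", "and", "the", "of", "in", "etc"}  # assumed, deliberately tiny


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop a few stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return L2-normalised dense TF-IDF vectors, one per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * (math.log(n / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Split on the first three commas only, since descriptions may contain commas.
descriptions = [row.split(",", 3)[3] for row in ROWS]
labels = kmeans(tfidf_vectors(descriptions), k=2)
for label, row in zip(labels, ROWS):
    print(label, row.split(",", 3)[2])
```

The cluster ids carry no meaning by themselves; you would still have to read a few members of each cluster to decide what to call it.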
2) How would you do this if you were given predefined categories such as "restaurant", "entertainment", etc.?
You'd manually label a bunch of them, then train a classifier on those labels and see how well it works using the usual machinery of accuracy/F1, cross-validation, etc. Or you'd check whether a clustering algorithm picks up these categories well, but then you still need some labeled data.
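The label-then-train loop could look roughly like the sketch below. It is an illustration under stated assumptions, not a recommended setup: the six hand-labeled descriptions are invented, multinomial Naive Bayes is just one possible classifier, and leave-one-out accuracy stands in for the fuller accuracy/F1 and cross-validation machinery. It uses only the Python standard library.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-labeled descriptions (invented for this example).
LABELED = [
    ("An Italian restaurant serving pizza and pasta", "restaurant"),
    ("A diner serving burgers and fries", "restaurant"),
    ("A sushi restaurant and bar", "restaurant"),
    ("A cinema chain showing movies", "entertainment"),
    ("A concert venue hosting live music", "entertainment"),
    ("A theme park with rides and live shows", "entertainment"),
]


def tokenize(text):
    """Lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Fit multinomial Naive Bayes: class doc counts, per-class word counts, vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(examples):
    """Train on all but one example, test on the held-out one, repeat for each."""
    hits = 0
    for i, (text, label) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        hits += predict(model, text) == label
    return hits / len(examples)


model = train(LABELED)
print(predict(model, "A pizza restaurant"))
print(predict(model, "A venue hosting live music shows"))
accuracy = leave_one_out_accuracy(LABELED)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

With only a handful of labels the held-out score is very noisy, so treat it as a sanity check; a few hundred labeled rows per category gives a far more trustworthy estimate.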