Creating a dataset: extracting features (TF-IDF) from text documents
I have to create a dataset from some text files, writing each one as a feature vector.
Something like this:
doc1: 1,0.45 6,0.001 94,0.1 ...
doc2: 3,0.5 98,0.2 ...
...
Each position of the vector represents a word, and the score is given by something like TF-IDF.
Do you know of a library or tool for this? (Java preferred)
Comments (3)
After a few days I found the "perfect tool" for this: Word Vector Tool.
http://sourceforge.net/projects/wvtool/
MALLET. It includes TF-IDF, POS, and classification.
Sure, there are many, e.g. Lucene: http://en.wikipedia.org/wiki/Lucene
However, I recommend that you write a basic IR system from scratch. Looking under the hood is always a great learning experience.
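For anyone trying the from-scratch route, here is a minimal sketch in plain Java of what that could look like. It is not taken from any of the tools mentioned above. The class name `TfIdfSketch`, the naive tokenizer, the 1-based word indices, and the particular weighting (term frequency divided by document length, times `ln(N / df)`) are all assumptions made for illustration; real libraries often use smoothed or normalized variants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class TfIdfSketch {

    // Naive tokenizer (an assumption): lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one sparse vector per document: word index -> TF-IDF score.
    static List<Map<Integer, Double>> vectorize(List<String> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>(); // word -> 1-based index, by first appearance
        Map<Integer, Integer> docFreq = new TreeMap<>();    // word index -> number of documents containing it
        List<Map<Integer, Integer>> counts = new ArrayList<>();

        // First pass: build the vocabulary, raw term counts, and document frequencies.
        for (String doc : docs) {
            Map<Integer, Integer> tf = new TreeMap<>();
            for (String token : tokenize(doc)) {
                Integer idx = vocab.get(token);
                if (idx == null) {
                    idx = vocab.size() + 1;
                    vocab.put(token, idx);
                }
                tf.merge(idx, 1, Integer::sum);
            }
            for (Integer idx : tf.keySet()) {
                docFreq.merge(idx, 1, Integer::sum);
            }
            counts.add(tf);
        }

        // Second pass: score = (count / document length) * ln(N / document frequency).
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (Map<Integer, Integer> tf : counts) {
            int length = 0;
            for (int c : tf.values()) length += c;
            Map<Integer, Double> vec = new TreeMap<>();
            for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
                double termFreq = (double) e.getValue() / length;
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                double score = termFreq * idf;
                // Words that occur in every document score 0 and are left out of the sparse vector.
                if (score > 0.0) vec.put(e.getKey(), score);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Formats a vector in the "doc1: index,score index,score ..." layout from the question.
    static String format(String name, Map<Integer, Double> vec) {
        StringBuilder sb = new StringBuilder(name).append(":");
        for (Map.Entry<Integer, Double> e : vec.entrySet()) {
            sb.append(String.format(Locale.ROOT, " %d,%.4f", e.getKey(), e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat on the mat", "the dog sat", "a dog barked");
        List<Map<Integer, Double>> vectors = vectorize(docs);
        for (int i = 0; i < vectors.size(); i++) {
            System.out.println(format("doc" + (i + 1), vectors.get(i)));
        }
    }
}
```

In practice you would read each text file into a string first and probably add stop-word removal or stemming before counting. The two-pass structure is needed because the IDF part depends on the whole collection, so no document can be scored until all of them have been seen.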