I'm fairly new to machine learning and text mining in general. I recently came across a Ruby library called Liblinear: https://github.com/tomz/liblinear-ruby-swig.
For now, what I want to do is train the software to identify whether or not a text mentions anything related to bicycles.
Can someone please outline the steps I should follow (e.g. whether and how to preprocess the text), share resources, and ideally share a simple example to get me going?
Any help will do, thanks!
Comments (1)
The classical approach is:
Now, to classify a document, vectorize it as in step 4 and feed it to the classifier to get a related/unrelated label for it. Compare this with the actual label to see whether the prediction was right. You should be able to get at least 80% accuracy or so with this simple method.
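To make the vectorize/classify/compare loop concrete, here is a minimal sketch in plain Ruby. Everything in it is an assumption made for illustration: the tiny labelled data set is made up, the helper names (`tokenize`, `build_vocabulary`, `vectorize`, `train_perceptron`, `predict`, `accuracy`) are hypothetical, and the perceptron is only a stand-in for the linear classifier you would get from Liblinear, whose Ruby bindings are not used here.

```ruby
# Toy labelled data, made up for illustration: 1 = bicycle-related, 0 = unrelated.
TRAIN = [
  ["I rode my bicycle to work today", 1],
  ["new bike helmet and pedals", 1],
  ["the stock market fell sharply", 0],
  ["recipe for chocolate cake", 0]
]
TEST = [
  ["my bike has new pedals", 1],
  ["chocolate cake recipe", 0]
]

# Lowercase the text and keep runs of letters and digits as terms.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Map every distinct training term to a column index.
def build_vocabulary(texts)
  vocab = {}
  texts.each do |text|
    tokenize(text).each { |term| vocab[term] ||= vocab.size }
  end
  vocab
end

# Boolean vector: 1.0 where the term occurs in the document, else 0.0.
# Terms that were not seen during training are ignored.
def vectorize(text, vocab)
  vec = Array.new(vocab.size, 0.0)
  tokenize(text).each do |term|
    idx = vocab[term]
    vec[idx] = 1.0 if idx
  end
  vec
end

# Linear decision rule: label 1 if the weighted sum is positive, else 0.
def predict(vec, weights, bias)
  score = bias
  vec.each_with_index { |x, j| score += weights[j] * x }
  score > 0 ? 1 : 0
end

# Stand-in classifier (a plain perceptron, NOT Liblinear).
def train_perceptron(vectors, labels, epochs = 10)
  weights = Array.new(vectors.first.size, 0.0)
  bias = 0.0
  epochs.times do
    vectors.each_with_index do |vec, i|
      error = labels[i] - predict(vec, weights, bias)
      next if error.zero?
      vec.each_with_index { |x, j| weights[j] += error * x }
      bias += error
    end
  end
  [weights, bias]
end

# Fraction of predictions that match the actual labels.
def accuracy(predictions, labels)
  predictions.zip(labels).count { |p, l| p == l }.to_f / labels.size
end

vocab = build_vocabulary(TRAIN.map(&:first))
weights, bias = train_perceptron(TRAIN.map { |text, _| vectorize(text, vocab) },
                                 TRAIN.map { |_, label| label })
predictions = TEST.map { |text, _| predict(vectorize(text, vocab), weights, bias) }
puts "accuracy: #{accuracy(predictions, TEST.map { |_, label| label })}"
```

With real data you would want far more documents than this, and the test documents must stay out of training, otherwise the accuracy figure tells you nothing.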
To improve on this method, replace the booleans with term counts normalized by document length or, even better, with tf-idf scores.
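As a rough sketch of the tf-idf weighting, assuming the common definition tf = count / document length and idf = log(N / df) (other variants exist, e.g. with smoothing), the scores could be computed in plain Ruby like this. The helper names and the three example sentences are made up for illustration.

```ruby
# Lowercase the text and keep runs of letters and digits as terms.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

# Term frequency: occurrences of each term divided by the document length.
def term_frequencies(tokens)
  counts = Hash.new(0)
  tokens.each { |term| counts[term] += 1 }
  counts.transform_values { |count| count.to_f / tokens.size }
end

# Inverse document frequency: log(N / df), where df is the number of
# documents that contain the term at least once.
def inverse_document_frequencies(tokenized_docs)
  df = Hash.new(0)
  tokenized_docs.each { |tokens| tokens.uniq.each { |term| df[term] += 1 } }
  n = tokenized_docs.size.to_f
  df.transform_values { |count| Math.log(n / count) }
end

# Sparse tf-idf vector (term => score). Terms without an idf value,
# i.e. terms never seen in the training corpus, are skipped.
def tfidf_vector(tokens, idf)
  term_frequencies(tokens).each_with_object({}) do |(term, tf), vec|
    vec[term] = tf * idf[term] if idf.key?(term)
  end
end

docs = [
  "the bike is fast",
  "the car is fast",
  "the bike has pedals"
].map { |text| tokenize(text) }

idf = inverse_document_frequencies(docs)
docs.each { |tokens| p tfidf_vector(tokens, idf) }
```

A word such as "the" that appears in every document gets an idf of zero, so it stops contributing to the classifier, while rarer words such as "pedals" are weighted up.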