How to automatically tag content: algorithms and suggestions
I am working with some really large databases of newspaper articles. I have them in a MySQL database, and I can query them all.
I am now searching for ways to help me tag these articles with somewhat descriptive tags.
All these articles are accessible via a URL that looks like this:
http://web.site/CATEGORY/this-is-the-title-slug
So at least I can use the category to figure out what type of content we are working with. However, I also want to tag based on the article text.
My initial approach was doing this:
- Get all articles
- Get all words, remove all punctuation, split by space, and count them by occurrence
- Analyze them, and filter common non-descriptive words out like "them", "I", "this", "these", "their" etc.
- Once all the common words are filtered out, what remains are the tag-worthy words.
But this turned out to be a rather manual task, and not a very pretty or helpful approach.
This also suffered from the problem of words or names that are split by spaces. For example, if 1,000 articles contain the name "John Doe" and 1,000 articles contain the name "John Hanson", I would only get the word "John" out of it, not the full names.
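The pipeline described above can be sketched in a few lines of Python (the stop-word list here is a tiny illustrative subset, not a complete one):

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "i", "this", "these", "their", "them", "and", "of"}

def naive_tags(articles, top_n=10):
    """Count word occurrences across articles and drop common stop words."""
    counts = Counter()
    for text in articles:
        # Strip punctuation, lowercase, split on whitespace.
        words = re.sub(r"[^\w\s]", "", text.lower()).split()
        counts.update(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

articles = [
    "John Doe wins the election.",
    "John Hanson opens a new stadium.",
]
print(naive_tags(articles))  # "john" surfaces alone, split from both surnames
```

Running this on the two sample articles shows exactly the problem described: "john" is counted twice while "doe" and "hanson" appear once each, so the names fall apart.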
Automatically tagging articles is really a research problem and you can spend a lot of time re-inventing the wheel when others have already done much of the work. I'd advise using one of the existing natural language processing toolkits like NLTK.
To get started, I would suggest looking at implementing a proper Tokeniser (much better than splitting by whitespace), and then take a look at Chunking and Stemming algorithms.
You might also want to count frequencies for n-grams, i.e. sequences of words, instead of individual words. This would take care of "words split by a space". Toolkits like NLTK have functions built in for this.
Finally, as you iteratively improve your algorithm, you might want to train on a random subset of the database and then try how the algorithm tags the remaining set of articles to see how well it works.
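The n-gram suggestion can be sketched without any toolkit; NLTK's `word_tokenize` and `nltk.ngrams` would do the same thing more robustly (the regex tokenizer below is a deliberate simplification):

```python
import re
from collections import Counter

def ngram_counts(texts, n=2):
    """Count n-gram (word-sequence) frequencies across a list of texts.

    A proper tokenizer (e.g. NLTK's word_tokenize) beats this simple
    regex; the point here is only to show how n-grams keep multi-word
    names together.
    """
    counts = Counter()
    for text in texts:
        words = re.findall(r"[A-Za-z']+", text)
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts

articles = [
    "John Doe visited Dallas.",
    "John Doe spoke to reporters.",
    "John Hanson stayed home.",
]
bigrams = ngram_counts(articles, n=2)
print(bigrams[("John", "Doe")])  # 2 - "John Doe" survives as a single unit
```

Because "John Doe" is counted as one bigram, it is no longer conflated with "John Hanson", which was the exact failure mode of the space-splitting approach.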
You should use a metric such as tf-idf to get the tags out:
Various implementations of tf-idf are available; for Java and .NET there's Lucene, and for Python there's scikits.learn.
If you want to do better than this, use language models. That requires some knowledge of probability theory.
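Here is a from-scratch sketch of the tf-idf idea (in practice you would use a library implementation such as Lucene or scikit-learn's `TfidfVectorizer`, which handle smoothing and normalisation for you):

```python
import math
import re
from collections import Counter

def tfidf_tags(docs, top_n=3):
    """Score each word by tf-idf and return the top-scoring words per doc.

    tf-idf = (term frequency in this doc) * log(N / docs containing term),
    so words common to every document score zero and drop out.
    """
    tokenised = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    n_docs = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter()
    for words in tokenised:
        df.update(set(words))
    tags = []
    for words in tokenised:
        tf = Counter(words)
        scores = {w: (tf[w] / len(words)) * math.log(n_docs / df[w]) for w in tf}
        tags.append([w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]])
    return tags

docs = [
    "the mavericks won the game last night",
    "the election race tightened last night",
]
print(tfidf_tags(docs))
```

Note how "the", "last", and "night" appear in both documents, get an idf of zero, and never surface as tags, while "mavericks" and "election" do; this is the same stop-word filtering the question does by hand, but derived automatically from the corpus.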
Take a look at Kea. It's an open source tool for extracting keyphrases from text documents.
Your problem has also been discussed many times at http://metaoptimize.com/qa.
If I understand your question correctly, you'd like to group the articles into similarity classes. For example, you might assign article 1 to 'Sports', article 2 to 'Politics', and so on. Or if your classes are much finer-grained, the same articles might be assigned to 'Dallas Mavericks' and 'GOP Presidential Race'.
This falls under the general category of 'clustering' algorithms. There are many possible choices of such algorithms, but this is an active area of research (meaning it is not a solved problem, and thus none of the algorithms are likely to perform quite as well as you'd like).
I'd recommend you look at Latent Dirichlet Allocation (http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or 'LDA'. I don't have personal experience with any of the LDA implementations available, so I can't recommend a specific system (perhaps others more knowledgeable than I might be able to recommend a user-friendly implementation).
You might also consider the agglomerative clustering implementations available in LingPipe (see http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html), although I suspect an LDA implementation might prove somewhat more reliable.
A couple questions to consider while you're looking at clustering systems:
Do you want to allow fractional class membership - e.g. consider an article discussing the economic outlook and its potential effect on the presidential race; can that document belong partly to the 'economy' cluster and partly to the 'election' cluster? Some clustering algorithms allow partial class assignment and some do not.
Do you want to create a set of classes manually (i.e., list out 'economy', 'sports', ...), or do you prefer to learn the set of classes from the data? Manual class labels may require more supervision (manual intervention), but if you choose to learn from the data, the 'labels' will likely not be meaningful to a human (e.g., class 1, class 2, etc.), and even the contents of the classes may not be terribly informative. That is, the learning algorithm will find similarities and cluster documents it considers similar, but the resulting clusters may not match your idea of what a 'good' class should contain.
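The agglomerative idea mentioned above can be illustrated with a toy single-linkage clusterer over word-overlap similarity (the Jaccard representation and the threshold are illustrative choices; LingPipe's clusterers or an LDA model would use far richer document representations):

```python
import re

def jaccard(a, b):
    """Word-set overlap similarity between two documents (0..1)."""
    return len(a & b) / len(a | b)

def agglomerative_cluster(docs, threshold=0.2):
    """Toy single-linkage agglomerative clustering on word overlap.

    Start with one cluster per document, then repeatedly merge any two
    clusters containing a sufficiently similar pair of documents.
    """
    word_sets = [set(re.findall(r"[a-z]+", d.lower())) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # one cluster per doc
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: merge if ANY cross-cluster pair is similar.
                if any(jaccard(word_sets[a], word_sets[b]) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

docs = [
    "the mavericks beat the lakers in overtime",
    "mavericks guard scores thirty in overtime win",
    "senate passes the budget bill",
    "president signs budget bill into law",
]
print(agglomerative_cluster(docs))  # sports docs and politics docs merge separately
```

Even this toy version shows the second caveat from the answer: the algorithm happily groups documents 0-1 and 2-3, but nothing in the output labels those groups 'sports' or 'politics'; naming the clusters is still up to you.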
Your approach seems sensible and there are two ways you can improve the tagging.
If the content is an image or video, please check out the following blog article:
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo sites and source code.
If the content is a large text document, please check out this blog article:
Best Key Phrase Extraction APIs in the Market
http://scottge.net/2015/06/13/best-key-phrase-extraction-apis-in-the-market/
Thanks, Scott
Assuming you have a pre-defined set of tags, you can use the Elasticsearch Percolator API like this answer suggests:
Elasticsearch - use a "tags" index to discover all tags in a given string
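As a sketch of the percolator idea (the field names and tag terms below are illustrative assumptions, and the dicts are only the request bodies - no cluster is contacted): you register one stored query per tag, then "percolate" each article to find which tag queries match it.

```python
import json

# Index mapping: a "query" field of type percolator plus the article body.
# The field names ("query", "body") are illustrative, not prescribed.
mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "body": {"type": "text"},
        }
    }
}

def tag_query(tag, terms):
    """One stored query per tag: the tag applies if any of its terms match."""
    return {"tag": tag, "query": {"match": {"body": " ".join(terms)}}}

def percolate_request(article_text):
    """Search body asking: which stored tag-queries match this article?"""
    return {"query": {"percolate": {"field": "query",
                                    "document": {"body": article_text}}}}

print(json.dumps(tag_query("sports", ["mavericks", "stadium"]), indent=2))
print(json.dumps(percolate_request("John Doe opens a new stadium"), indent=2))
```

The appeal of this approach is that it inverts the usual search: the tag definitions are indexed once, and each new article is matched against all of them in a single request.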
Are you talking about named-entity recognition? If so, Anupam Jain is right: it is a research problem, typically tackled with deep learning and CRFs. As of 2017, work on named-entity recognition has focused on semi-supervised learning techniques.
The link below is a related paper on semi-supervised NER:
http://ai2-website.s3.amazonaws.com/publications/semi-supervised-sequence.pdf
Also, the link below covers key-phrase extraction on Twitter:
http://jkx.fudan.edu.cn/~qzhang/paper/keyphrase.emnlp2016.pdf