I'm new to Natural Language Processing and I'm confused about the terms used.
What is tokenization? POS tagging? Entity Identify?
Is tokenization only splitting the text into parts that can have a meaning, or does it also assign a meaning to those parts? And what is it called when I determine that something is a noun, verb, or adjective? And what if I want to divide the text into dates, names, currency?
I need a simple explanation of the areas/terms used in NLP.
Let's use an example like "My cat's name is Pat. He likes to sit on the mat."

Tokenization is taking these sentences and splitting them into what we call tokens, which are basically the words. The tokens for this sentence are my, cat's, name, is, pat, he, likes, to, sit, on, the, mat. (Sometimes you may see cat's treated as two tokens; this depends on personal preference and intention.)
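To see tokenization in practice, here is a minimal sketch using NLTK (my choice for illustration; the answer doesn't name a particular tokenizer). It assumes nltk is installed and its punkt tokenizer data has been downloaded:

```python
# Minimal tokenization sketch with NLTK.
# Assumes: pip install nltk, then nltk.download("punkt").
import nltk

sentence = "My cat's name is Pat. He likes to sit on the mat."
print(nltk.word_tokenize(sentence))
# ['My', 'cat', "'s", 'name', 'is', 'Pat', '.', 'He', 'likes',
#  'to', 'sit', 'on', 'the', 'mat', '.']
# Note: this particular tokenizer is one of those that splits
# "cat's" into two tokens, "cat" and "'s".
```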
POS stands for Part-Of-Speech, so to tag these sentences for parts-of-speech is to run them through a program called a POS tagger, which will label each token in the sentence with its part-of-speech. The output from the tagger written by a group at Stanford would look something like: My_PRP$ cat_NN 's_POS name_NN is_VBZ Pat_NNP ._. He_PRP likes_VBZ to_TO sit_VB on_IN the_DT mat_NN ._. (Here is a good example of cat's
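If you don't want to set up the Stanford tagger (a separate Java tool), NLTK ships a tagger that produces the same Penn Treebank tag set; a minimal sketch:

```python
# Minimal POS-tagging sketch with NLTK's built-in tagger.
# Assumes: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

tokens = nltk.word_tokenize("My cat's name is Pat. He likes to sit on the mat.")
print(nltk.pos_tag(tokens))
# Roughly (exact tags may vary slightly by model version):
# [('My', 'PRP$'), ('cat', 'NN'), ("'s", 'POS'), ('name', 'NN'),
#  ('is', 'VBZ'), ('Pat', 'NNP'), ('.', '.'), ('He', 'PRP'),
#  ('likes', 'VBZ'), ('to', 'TO'), ('sit', 'VB'), ('on', 'IN'),
#  ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
```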
being treated as two tokens.)

Entity Identify is more often called Named Entity Recognition (NER). It is the process of taking a text like ours and identifying things that are mostly proper nouns, but can also include dates or anything else that you teach the recognizer to, well, recognize. For our example, a Named Entity Recognition system would insert a tag such as <PERSON>Pat</PERSON> for our cat's name. If there was another sentence mentioning Pat along with two other names, the recognizer would label three distinct entities (four labels in total, since Pat would be tagged twice).

Now how all of these tools actually work is a whole other story. :)
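NLTK also ships a simple named-entity chunker, so you can try NER on the same sentence (again my choice of tool; the answer doesn't prescribe one):

```python
# Minimal NER sketch with NLTK's chunker.
# Assumes: nltk.download("maxent_ne_chunker") and nltk.download("words"),
# on top of the tokenizer/tagger data from the previous snippets.
import nltk

tokens = nltk.word_tokenize("My cat's name is Pat.")
tagged = nltk.pos_tag(tokens)
print(nltk.ne_chunk(tagged))
# The recognizer wraps names in labeled subtrees, e.g.:
# (S My/PRP$ cat/NN 's/POS name/NN is/VBZ (PERSON Pat/NNP) ./.)
```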
To add to dmn's explanation:
In general, there are two themes you should care about in NLP:
Statistical vs Rule-Based Analysis
Lightweight vs Heavyweight Analysis
Statistical Analysis uses statistical machine learning techniques to classify text, and in general has good precision and good recall. Rule-Based Analysis techniques basically use hand-built rules and have very good precision but terrible recall (basically they identify the cases covered by your rules, but nothing else).
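To make the precision/recall trade-off concrete, here is a toy of my own (not from the answer): a rule-based recognizer built from two hand-written regexes is precise on exactly the patterns it knows and blind to everything else:

```python
# Toy rule-based entity recognizer: one hand-built regex per type.
# High precision (it rarely mislabels), terrible recall (anything
# outside the rules -- "March 5th", "five dollars" -- is missed).
import re

RULES = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def rule_based_entities(text):
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            yield (label, match.group())

text = "Pat paid $9.99 on 2010-11-03 and five dollars on March 5th."
print(list(rule_based_entities(text)))
# [('DATE', '2010-11-03'), ('MONEY', '$9.99')] -- the rest slips through
```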
Lightweight vs Heavyweight Analysis are the two approaches you'll see in the field. In general, academic work is heavyweight, featuring parsers, fancy classifiers, and lots of very high-tech NLP stuff. In industry, by and large, the focus is on data: a lot of the academic stuff scales poorly, and going beyond standard statistical or machine learning techniques doesn't buy you much. For example, parsing is largely useless (and slow), and as such keyword and n-gram analysis is actually pretty useful, especially when you have a lot of data. For example, Google Translate apparently isn't that fancy behind the scenes; they just have so much data that they can crush everybody else, no matter how refined the competition's translation software is.
The upshot of this is that in industry there's a lot of machine learning and math, but the NLP stuff that gets used is not very sophisticated, because the sophisticated stuff really doesn't work well. Far preferred is using user data, like clicks on related subjects, and Mechanical Turk... and this works very well, because people are far better at understanding natural language than computers are.
Parsing is breaking a sentence down into phrases, say verb phrase, noun phrase, prepositional phrase, etc., and getting a grammatical tree. You can use the online version of the Stanford Parser to play with examples and get a feel for what a parser does. For example, you take a sentence, do POS tagging on it, and then, using the POS tags and a trained statistical parser, you get a parse tree. You can also do a slightly different type of parse, called a dependency parse, which links each word to the word it modifies instead of building a phrase tree; a sketch of this follows.
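The Stanford Parser is a Java tool; for a quick feel in Python, spaCy (my substitution, not named in the answer) does POS tagging and dependency parsing in a couple of lines:

```python
# Minimal POS + dependency-parse sketch with spaCy.
# Assumes: pip install spacy, python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("My cat likes to sit on the mat.")

for token in doc:
    # word, part of speech, dependency relation, and the head it attaches to
    print(f"{token.text:<6} {token.pos_:<6} {token.dep_:<6} <- {token.head.text}")
# e.g. "cat" comes out as nsubj (subject) of "likes", and "mat" as
# the object of the preposition "on" -- the dependency tree in table form.
```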
N-Grams are basically sequences of n adjacent words. You can look at n-grams in Google's publicly released n-gram data. You can also do character n-grams, which are used heavily for spelling correction.
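Extracting n-grams needs no library at all; a minimal sketch:

```python
# Tiny n-gram extractor in plain Python.
def ngrams(seq, n):
    """All length-n windows of adjacent items in seq."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

print(ngrams("the cat sat on the mat".split(), 2))   # word bigrams
# [['the', 'cat'], ['cat', 'sat'], ['sat', 'on'], ['on', 'the'], ['the', 'mat']]

print(ngrams("spelling", 3))   # character trigrams, handy for spell-checking
# ['spe', 'pel', 'ell', 'lli', 'lin', 'ing']
```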
Sentiment Analysis is analyzing text to extract how people feel about something or in what light things (such as brands) are mentioned. This involves a lot of looking at words that denote emotion.
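At its crudest, that word-level approach can be sketched as a toy lexicon scorer (the word lists here are made up for illustration; real systems use large curated lexicons and handle negation, sarcasm, intensity, etc.):

```python
# Toy lexicon-based sentiment scorer (illustrative only).
POSITIVE = {"love", "great", "comfy", "likes"}   # hypothetical mini-lexicon
NEGATIVE = {"hate", "awful", "scratchy"}

def sentiment_score(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this mat it is so comfy"))   # 2
print(sentiment_score("I hate this scratchy mat"))         # -2
```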
Semantic Analysis is analyzing the meaning of text. Often this takes the form of taxonomies and ontologies where you group concepts together (dog and cat both belong to animal and pet), but it is a very undeveloped field. Resources like WordNet and FrameNet are useful here.
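WordNet is easy to poke at from NLTK; a minimal sketch of looking up a concept and walking its "is-a" hierarchy:

```python
# Minimal WordNet lookup via NLTK. Assumes: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]      # first sense: the domestic dog
print(dog.definition())

# Walk one hypernym ("is-a") path up the taxonomy:
# dog -> canine -> carnivore -> ... -> animal -> ... -> entity
print(" -> ".join(s.name() for s in dog.hypernym_paths()[0]))
```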
To answer the more specific part of your question: tokenization is breaking the text into parts (usually words), without caring too much about their meaning. POS tagging is disambiguating between the possible parts of speech (noun, verb, etc.); it takes place after tokenization. Recognizing dates, names, etc. is called named entity recognition (NER).