N-grams: Explanation + 2 applications
I want to implement some applications with n-grams (preferably in PHP).
Which type of n-gram is more suitable for most purposes: word-level or character-level n-grams? How could you implement an n-gram tokenizer in PHP?
First, I would like to know what n-grams exactly are. Is the following correct? This is how I understand n-grams:
Sentence: "I live in NY."
word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"
character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"
When you have this array of n-gram parts, you drop the duplicates and add a counter for each part, giving the frequency:
word-level bigrams: [1, 1, 1, 1, 1]
character-level bigrams: [2, 1, 1, ...]
Is this correct?
Furthermore, I would like to learn more about what you can do with n-grams:
- How can I identify the language of a text using n-grams?
- Is it possible to do machine translation using n-grams even if you don't have a bilingual corpus?
- How can I build a spam filter (spam, ham)? Combine n-grams with a Bayesian filter?
- How can I do topic spotting? For example: is a text about basketball or about dogs? My approach (do the following with the Wikipedia articles for "dogs" and "basketball"): build the n-gram vectors for both documents, normalize them, and calculate the Manhattan/Euclidean distance; the closer the result is to 1, the higher the similarity.
What do you think about my application approaches, especially the last one?
I hope you can help me. Thanks in advance!
Comments (2)
Word n-grams will generally be more useful for most of the text analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create an n-gram vector for a corpus of text in each language you are interested in detecting and then compare the frequencies of trigrams in each corpus to the trigrams in the document you are classifying. For example, the trigram "the" probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis: Bayesian filters, N-nearest-neighbor, support vector machines, etc.

Of the applications you mention, machine translation is probably the most far-fetched, as n-grams alone will not bring you very far down that path. Converting an input file to an n-gram representation is just a way to put the data into a format for further feature analysis, but as you lose a lot of contextual information, it may not be useful for translation.
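A rough sketch of that comparison in PHP, assuming a simple overlap score between normalized trigram frequency profiles (the corpus strings and the `profile_overlap` scoring choice are illustrative assumptions, not something prescribed above):

```php
<?php
// Build a normalized character-trigram frequency profile for a string
// (ASCII text assumed; use the mb_* functions for multi-byte input).
function trigram_profile(string $text): array {
    $text = strtolower($text);
    $counts = [];
    for ($i = 0; $i + 3 <= strlen($text); $i++) {
        $tri = substr($text, $i, 3);
        $counts[$tri] = ($counts[$tri] ?? 0) + 1;
    }
    $total = max(1, array_sum($counts));
    foreach ($counts as $tri => $c) {
        $counts[$tri] = $c / $total;   // relative frequency
    }
    return $counts;
}

// Score a document profile against a language profile by summing the
// overlap of shared trigram frequencies; higher means more similar.
function profile_overlap(array $docProfile, array $langProfile): float {
    $score = 0.0;
    foreach ($docProfile as $tri => $freq) {
        $score += min($freq, $langProfile[$tri] ?? 0.0);
    }
    return $score;
}

// Placeholder corpora -- in practice you would load large text samples per language.
$profiles = [
    'en' => trigram_profile('the quick brown fox jumps over the lazy dog'),
    'de' => trigram_profile('der schnelle braune fuchs springt ueber den faulen hund'),
];

$doc = trigram_profile('I live in NY.');
foreach ($profiles as $lang => $profile) {
    echo $lang, ': ', profile_overlap($doc, $profile), PHP_EOL;
}
```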
One thing to watch out for is that it isn't enough to create a vector [1,1,1,2,1] for one document and a vector [2,1,2,4] for another document if the dimensions don't match. That is, the first entry in the vector cannot be "the" in one document and "is" in another, or the algorithms won't work. You will wind up with vectors like [0,0,0,0,1,1,0,0,2,0,0,1], since most documents will not contain most of the n-grams you are interested in. This 'lining up' of features is essential, and it requires you to decide 'in advance' which n-grams you will include in your analysis. Often this is implemented as a two-pass algorithm: first the statistical significance of the various n-grams is determined, to decide which ones to keep. Google 'feature selection' for more information.
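For illustration, a small PHP sketch of that 'lining up' step against a vocabulary fixed in advance (the vocabulary and the two documents here are made-up examples):

```php
<?php
// Turn a document's n-gram counts into a vector aligned to a fixed vocabulary.
// N-grams not in the vocabulary are ignored; missing ones become 0.
function align_counts(array $counts, array $vocabulary): array {
    $vector = [];
    foreach ($vocabulary as $ngram) {
        $vector[] = $counts[$ngram] ?? 0;
    }
    return $vector;
}

// Example vocabulary chosen 'in advance' (assumed for illustration).
$vocabulary = ['I live', 'live in', 'in NY', 'dog runs', 'basketball game'];

$docA = ['I live' => 1, 'live in' => 1, 'in NY' => 1];
$docB = ['dog runs' => 2, 'in NY' => 1];

print_r(align_counts($docA, $vocabulary)); // [1, 1, 1, 0, 0]
print_r(align_counts($docB, $vocabulary)); // [0, 0, 1, 2, 0]
```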
Word-based n-grams plus support vector machines are an excellent way to perform topic spotting, but you need a large corpus of text pre-classified into 'on topic' and 'off topic' to train the classifier. You will find a large number of research papers explaining various approaches to this problem on a site like CiteSeerX. I would not recommend the Euclidean distance approach for this problem, as it does not weight individual n-grams based on statistical significance, so two documents that both include "the", "a", "is", and "of" would be considered a better match than two documents that both include "Bayesian". Removing stop words from your n-grams of interest would improve this somewhat.
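The weighting alluded to above is not spelled out; one common choice (my assumption, not something the answer prescribes) is TF-IDF-style weights followed by cosine similarity, which also gives the "closer to 1 means more similar" behavior the question mentions. A rough PHP sketch under that assumption:

```php
<?php
// Weight aligned count vectors by inverse document frequency so that rare,
// informative n-grams count more than stop-word-like ones. $docs is an
// array of count vectors that are already aligned to the same vocabulary.
function idf_weights(array $docs): array {
    $n = count($docs);
    $dims = count($docs[0]);
    $weights = [];
    for ($i = 0; $i < $dims; $i++) {
        $df = 0;                         // number of documents containing n-gram $i
        foreach ($docs as $doc) {
            if ($doc[$i] > 0) $df++;
        }
        $weights[$i] = log(($n + 1) / ($df + 1)) + 1;  // smoothed IDF
    }
    return $weights;
}

// Cosine similarity between two vectors: 1 means identical direction.
function cosine(array $a, array $b): float {
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return ($na && $nb) ? $dot / (sqrt($na) * sqrt($nb)) : 0.0;
}

// Example with two already-aligned count vectors (values are made up).
$docs = [[3, 0, 1, 2], [2, 1, 0, 2]];
$w = idf_weights($docs);
$weighted = [];
foreach ($docs as $doc) {
    $weighted[] = array_map(fn($v, $wi) => $v * $wi, $doc, $w);
}
echo cosine($weighted[0], $weighted[1]), PHP_EOL; // closer to 1 = more similar
```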
You are correct about the definition of n-grams.
You can use word-level n-grams for search-type applications. Character-level n-grams are more useful for analysis of the text itself. For example, to identify the language of a text, I would compare the letter frequencies in the text to the established letter frequencies of each candidate language; the text should roughly match the frequency of occurrence of letters in that language.
An n-gram tokenizer for words in PHP can be done using strtok:
https://www.php.net/manual/en/function.strtok.php
For characters, use str_split:
https://www.php.net/manual/en/function.str-split.php
Then you can combine the resulting array into n-grams of any length you like, as sketched below.
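A small sketch of both tokenizers built on the two functions linked above (the helper names `word_ngrams` and `char_ngrams` are mine, and ASCII input is assumed):

```php
<?php
// Word-level n-grams using strtok to split on whitespace and punctuation.
function word_ngrams(string $text, int $n): array {
    $words = [];
    $token = strtok($text, " \t\n.,!?");
    while ($token !== false) {
        $words[] = $token;
        $token = strtok(" \t\n.,!?");   // continue tokenizing the same string
    }
    $ngrams = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $ngrams[] = implode(' ', array_slice($words, $i, $n));
    }
    return $ngrams;
}

// Character-level n-grams using str_split (one byte per character assumed).
function char_ngrams(string $text, int $n): array {
    $chars = str_split($text);
    $ngrams = [];
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $ngrams[] = implode('', array_slice($chars, $i, $n));
    }
    return $ngrams;
}

print_r(word_ngrams('I live in NY.', 2)); // ["I live", "live in", "in NY"]
print_r(char_ngrams('I live', 2));        // ["I ", " l", "li", "iv", "ve"]
```

Calling array_count_values() on the result then gives the per-n-gram frequency counts described in the question.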
Bayesian filters need to be trained before they can be used as spam filters, and they can be combined with n-grams. However, you need to give the filter plenty of input for it to learn.
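As a toy illustration of that combination (the training data, smoothing, and function names are all made up for the example), a naive Bayes scorer over word bigrams might look like this:

```php
<?php
// Toy naive Bayes over n-grams: count how often each n-gram appears in spam
// vs. ham training examples, then score a new message by summing log-probabilities.
function train(array $examples): array {
    $counts = ['spam' => [], 'ham' => []];
    $totals = ['spam' => 0, 'ham' => 0];
    foreach ($examples as [$label, $ngrams]) {
        foreach ($ngrams as $g) {
            $counts[$label][$g] = ($counts[$label][$g] ?? 0) + 1;
            $totals[$label]++;
        }
    }
    return [$counts, $totals];
}

function score(array $ngrams, array $counts, array $totals, string $label): float {
    $logp = 0.0;
    foreach ($ngrams as $g) {
        // Crude add-one smoothing so unseen n-grams don't zero out the probability.
        $logp += log((($counts[$label][$g] ?? 0) + 1) / ($totals[$label] + 1));
    }
    return $logp;
}

// Hypothetical training data: [label, list of word bigrams].
$examples = [
    ['spam', ['buy now', 'now cheap', 'cheap pills']],
    ['ham',  ['see you', 'you tomorrow']],
];
[$counts, $totals] = train($examples);

$msg = ['buy now', 'now pills'];
echo score($msg, $counts, $totals, 'spam') > score($msg, $counts, $totals, 'ham')
    ? "spam\n" : "ham\n";
```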
Your last approach sounds decent as far as learning the context of a page goes. It is still fairly difficult to do, but n-grams sound like a good starting point.