我可以使用哪种算法来查找常见的相邻单词/模式识别?
我的数据库中有一个大表,其中包含按文本顺序排列的各种文本中的许多单词。我想找到某组单词一起出现的次数/频率。
示例:假设我在很多文本中都有这 4 个单词:United |州 |的|美国。
。我将得到结果:
美国:50
美国:45
美利坚合众国:40
(这只是4个词的例子,但可以有少于和多于4个的情况)。
有一些算法可以做到这一点或类似吗?
编辑:欢迎使用一些显示操作方法的 R 或 SQL 代码。我需要一个实际的例子来说明我需要做什么。
表结构
我有两个表:Token
,其中包含id
和text
。文本是UNIQUE
,并且该表中的每个条目代表一个不同的单词。
TextBlockHasToken
是保持文本顺序的表。每行代表文本中的一个单词。
它具有 textblockid
,这是令牌所属文本的块。 sentence
是标记的句子,position
是标记在句子中的位置,tokenid
是标记表引用。
I have a big table in my database with a lot of words from various texts in the text order. I want to find the number of times/frequency that some set of words appears together.
Example: Supposing I have this 4 words in many texts: United | States | of | America
. I will get as result:
United States: 50
United States of: 45
United States of America: 40
(This is only an example with 4 words, but can there are with less and more than 4).
There is some algorithm that can do this or similar to this?
Edit: Some R or SQL code showing how to do is welcome. I need a practical example of what I need to do.
Table Structure
I have two tables: Token
which haves id
and text
. The text is is UNIQUE
and each entrance in this table represents a different word.
TextBlockHasToken
is the table which keeps the text order. Each row represents a word in a text.
It haves textblockid
that is the block of the text the token belongs. sentence
that is the sentence of the token, position
that is the token position inside the sentence and tokenid
that is the token table reference.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
它被称为 N-gram;在你的情况下是4克。它确实可以作为马尔可夫链的副产品获得,但您也可以使用滑动窗口(大小 4)来遍历(线性)文本,同时更新 4 维“直方图”。
2011 年 11 月 22 日更新:
马尔可夫链是一种在给定当前状态的情况下对切换到新状态的概率进行建模的方法。这是“状态机”的随机等价物。在自然语言的情况下,“状态”是由“前 N 个单词”形成的,这意味着您将先验概率(前 N 个单词之前)视为 equal_to_one。在 NLP 案例中,计算机人员很可能会使用树来实现马尔可夫链。 “状态”只是从根到当前节点的路径,words_to_follow 的概率是当前节点的后代的概率。但是每次我们选择一个新的子节点时,我们实际上会在树中向下移动,并“忘记”根节点,窗口只有 N 个字宽,这意味着树的深度为 N 层。
你可以很容易地看到,如果你像这样走马尔可夫链/树,任何时候第一个单词之前的概率都是1,第一个单词之后的概率是P(w1),第二个单词之后的概率= P(w2) || w1 等。因此,在处理语料库时,您构建一个马尔可夫树(:= 更新节点中的频率),在旅程结束时,您可以通过 freq(word) / SUM 估计给定单词选择的概率(频率(兄弟姐妹))。对于树中 5 层深度的单词,这是在给定前 4 个单词的情况下该单词的概率。如果您需要 N-gram 概率,则需要从根到最后一个单词的路径中所有概率的乘积。
It is called an N-gram; in your case a 4-gram. It can indeed be obtained as the by-product of a Markov-chain, but you could also use a sliding window (size 4) to walk through the (linear) text while updating a 4-dimensionsal "histogram".
UPDATE 2011-11-22:
A markov chain is a way to model the probability of switching to a new state, given the current state. This is the stochastic equivalent of a "state machine". In the natural language case, the "state" is formed by the "previous N words", which implies that you consider the prior probability (before the previous N words) as equal_to_one. Computer people will most likely use a tree for implementing Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words_to_follow are the probabilities of the current node's offspring. But every time that we choose a new child node, we actually shift down the tree, and "forget" the root node, out window is only N words wide, which translates to N levels deep into the tree.
You can easily see that if you are walking a Markov chain/tree like this, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word = P(w2) || w1, etc. So, when processing the corpus you build a Markov tree ( := update the frequencies in the nodes), at the end of the ride you can estimate the probability of a given choice of word by freq(word) / SUM(freq(siblings)). For a word 5-deep into the tree this is the probability of the word, given the previous 4 words. If you want the N-gram probabilities, you want the product of all the probabilities in the path from the root to the last word.
这是马尔可夫链的典型用例。从文本库中估计马尔可夫模型,并在转换表中找到高概率。由于这些表示一个单词跟随另一个单词的概率,因此短语将显示为高转换概率。
通过计算短语开头词在文本中出现的次数,您还可以得出绝对数字。
This is a typical use case for Markov chains. Estimate the Markov model from your textbase and find high probabilites in the transition table. Since these indicate probabilities that one word will follow another, phrases will show up as high transition probabilites.
By counting the number of times the phrase-start word showed up in the texts, you can also derive absolute numbers.
这是一个小片段,用于计算给定单词集的文本的所有组合/ngram。为了适用于更大的数据集,它使用哈希库,尽管它可能仍然相当慢......
因此以下输入......
将产生以下哈希:
请注意,此函数不区分大小写并注册任何目标词的排列,例如:
...结果:
Here is a small snippet that calculates all combinations/ngrams of a text for a given set of words. In order to work for larger datasets it uses the hash library, though it is probably still pretty slow...
So the following input ...
... would result in the following hash:
Notice that this function is case insensitive and registers any permutation of the target words, e.g:
...results in:
我不确定它是否对你有帮助,但这是我大约一年前写的一个小 python 程序,它计算 N 元语法(好吧,只有单元、双元和三元语法)。 (它还计算每个 N-gram 的熵)。我用它来计算大文本中的 N 元语法。
链接
I'm not sure if its of a help to you, but here is a little python program I wrote about a year ago that counts N-grams (well, only mono-, bi-, and trigrams). (It also calculates the entropy of each N-gram). I used it to count those N-grams in a large text.
Link