Which algorithm can I use to find common adjacent words / do this kind of pattern recognition?

Posted 2024-12-14 07:32:43

I have a big table in my database containing many words from various texts, stored in text order. I want to find the number of times / frequency that some set of words appears together.

Example: Suppose I have these 4 words in many texts: United | States | of | America. I would get as a result:

United States: 50
United States of: 45
United States of America: 40

(This is only an example with 4 words, but there can be fewer or more than 4.)

Is there some algorithm that can do this, or something similar to it?

Edit: Some R or SQL code showing how to do this is welcome. I need a practical example of what I need to do.

Table Structure

I have two tables: Token, which has id and text. The text is UNIQUE, and each entry in this table represents a different word.

TextBlockHasToken is the table that keeps the text order. Each row represents a word in a text.

It has textblockid, which is the block of text the token belongs to; sentence, which is the sentence of the token; position, which is the token's position inside the sentence; and tokenid, which is the reference to the Token table.
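To make the table structure concrete, here is a minimal sketch of the kind of query I have in mind, assuming `con` is an open DBI connection to this database (the connection and result handling are only illustrative; the table and column names are the ones described above). It counts how often each adjacent word pair appears by joining TextBlockHasToken to itself on position + 1; each additional word in the phrase would need one more join of the same kind.

library(DBI)

bigram_sql <- "
SELECT t1.text AS word1, t2.text AS word2, COUNT(*) AS freq
FROM TextBlockHasToken a
JOIN TextBlockHasToken b
  ON b.textblockid = a.textblockid
 AND b.sentence    = a.sentence
 AND b.position    = a.position + 1
JOIN Token t1 ON t1.id = a.tokenid
JOIN Token t2 ON t2.id = b.tokenid
GROUP BY t1.text, t2.text
ORDER BY freq DESC"

bigrams <- dbGetQuery(con, bigram_sql)  # one row per adjacent word pair, with its count
head(bigrams)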

Comments (4)

素罗衫 2024-12-21 07:32:43

It is called an N-gram; in your case a 4-gram. It can indeed be obtained as a by-product of a Markov chain, but you could also use a sliding window (size 4) to walk through the (linear) text while updating a 4-dimensional "histogram".
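As a rough R illustration of that sliding-window idea (my own sketch, not part of the original answer; the function name and sample sentence are made up), you can walk a token vector once per window size and tally every contiguous n-gram in a named table:

count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  # Slide a window of size n over the token vector and paste each window into one key.
  keys <- vapply(
    seq_len(length(tokens) - n + 1),
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  sort(table(keys), decreasing = TRUE)
}

tokens <- strsplit(tolower("the united states of america and the united states"), "\\W+")[[1]]
count_ngrams(tokens, 2)   # "united states" occurs twice in this toy text
count_ngrams(tokens, 4)   # "united states of america" occurs once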

UPDATE 2011-11-22:
A Markov chain is a way to model the probability of switching to a new state, given the current state. It is the stochastic equivalent of a "state machine". In the natural-language case, the "state" is formed by the "previous N words", which implies that you treat the prior probability (before the previous N words) as equal to one. Computer people will most likely use a tree for implementing Markov chains in the NLP case. The "state" is simply the path taken from the root to the current node, and the probabilities of the words to follow are the probabilities of the current node's children. But every time we choose a new child node, we actually shift down the tree and "forget" the root node; our window is only N words wide, which translates to N levels deep into the tree.

You can easily see that if you walk a Markov chain/tree like this, at any time the probability before the first word is 1, the probability after the first word is P(w1), after the second word it is P(w2 | w1), etc. So, when processing the corpus you build a Markov tree (i.e. update the frequencies in the nodes), and at the end of the ride you can estimate the probability of a given choice of word by freq(word) / SUM(freq(siblings)). For a word 5 levels deep in the tree, this is the probability of that word given the previous 4 words. If you want the N-gram probability itself, take the product of all the probabilities on the path from the root to the last word.
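A quick sketch of that estimate in R (again my own illustration, reusing count_ngrams and tokens from the snippet above): the probability of a word given the previous words is approximated by count(prefix followed by word) / count(prefix), which plays the role of the freq(word) / SUM(freq(siblings)) ratio described here.

# Conditional probability P(word | prefix) estimated from raw n-gram counts.
cond_prob <- function(tokens, prefix, word) {
  n <- length(strsplit(prefix, " ")[[1]])
  prefix_counts <- count_ngrams(tokens, n)
  ngram_counts  <- count_ngrams(tokens, n + 1)
  full <- paste(prefix, word)
  num <- if (full %in% names(ngram_counts)) ngram_counts[[full]] else 0
  den <- if (prefix %in% names(prefix_counts)) prefix_counts[[prefix]] else 0
  if (den == 0) NA_real_ else num / den
}

cond_prob(tokens, "united states", "of")          # P(of | united states)
cond_prob(tokens, "united states of", "america")  # P(america | united states of)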

给妤﹃绝世温柔 2024-12-21 07:32:43

This is a typical use case for Markov chains. Estimate the Markov model from your text base and find high probabilities in the transition table. Since these indicate the probability that one word will follow another, phrases will show up as high transition probabilities.

By counting the number of times the phrase-start word showed up in the texts, you can also derive absolute numbers.
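A rough R sketch of that idea (my own illustration, not the answer's code; the sample text is made up): build a word-to-next-word contingency table from the token stream, normalize each row into transition probabilities, and read the absolute pair counts from the raw table.

tokens <- strsplit(tolower("the united states of america and the united states"), "\\W+")[[1]]

pairs  <- data.frame(from = head(tokens, -1), to = tail(tokens, -1))
counts <- table(pairs$from, pairs$to)       # raw transition counts
trans  <- prop.table(counts, margin = 1)    # row-normalized: P(next word | current word)

trans["united", "states"]   # a high value flags "united states" as a likely phrase
counts["united", "states"]  # the absolute number of times "united states" occurred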

幼儿园老大 2024-12-21 07:32:43

Here is a small snippet that calculates all combinations/ngrams of a text for a given set of words. In order to work for larger datasets it uses the hash library, though it is probably still pretty slow...

require(hash)

get.ngrams <- function(text, target.words) {
  text <- tolower(text)
  # Split on runs of non-word characters to get a flat vector of tokens.
  split.text <- strsplit(text, "\\W+")[[1]]
  ngrams <- hash()
  current.ngram <- ""
  for(i in seq_along(split.text)) {
    word <- split.text[i]
    word_i <- i
    # Starting at position i, keep extending the current n-gram for as long
    # as the next token is one of the target words.
    while(word %in% target.words) {
      if(current.ngram == "") {
        current.ngram <- word
      } else {
        current.ngram <- paste(current.ngram, word)
      }
      # Increment this n-gram's count in the hash, initialising it to 1 if new.
      if(has.key(current.ngram, ngrams)) {
        ngrams[[current.ngram]] <- ngrams[[current.ngram]] + 1
      } else {
        ngrams[[current.ngram]] <- 1
      }
      word_i <- word_i + 1
      word <- split.text[word_i]  # NA past the end of the text, which stops the loop
    }
    current.ngram <- ""
  }
  ngrams
}

So the following input ...

some.text <- "He states that he loves the United States of America,
 and I agree it is nice in the United States."
some.target.words <- c("united", "states", "of", "america")

usa.ngrams <- get.ngrams(some.text, some.target.words)

... would result in the following hash:

>usa.ngrams
<hash> containing 10 key-value pair(s).
  america : 1
  of : 1
  of america : 1
  states : 3
  states of : 1
  states of america : 1
  united : 2
  united states : 2
  united states of : 1
  united states of america : 1

Notice that this function is case-insensitive and registers any permutation of the target words, e.g.:

some.text <- "States of united America are states"
some.target.words <- c("united", "states", "of", "america")
usa.ngrams <- get.ngrams(some.text, some.target.words)

...results in:

>usa.ngrams
<hash> containing 10 key-value pair(s).
  america : 1
  of : 1
  of united : 1
  of united america : 1
  states : 2
  states of : 1
  states of united : 1
  states of united america : 1
  united : 1
  united america : 1

您的好友蓝忘机已上羡 2024-12-21 07:32:43

I'm not sure if it's of any help to you, but here is a little Python program I wrote about a year ago that counts N-grams (well, only mono-, bi-, and trigrams). It also calculates the entropy of each N-gram. I used it to count those N-grams in a large text.
Link
