Fast n-gram computation

Posted on 2024-12-07 03:06:29

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
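
For reference, the NLTK usage in question is presumably something along the lines of the plain ngrams generator over a pre-tokenized list (an assumption, since the question doesn't show code):

from nltk import ngrams

tokens = "this is a small example corpus".split()
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])  # [('this', 'is'), ('is', 'a'), ('a', 'small')]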

Comments (4)

長街聽風 2024-12-14 03:06:29

Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

def ngrams(tokens, MIN_N, MAX_N):
    # Yield every contiguous slice of MIN_N to MAX_N tokens (word n-grams).
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
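
A minimal sketch of that replacement, counting into a defaultdict (the space-joined string key is just one convenient representation, not something the answer prescribes):

from collections import defaultdict

def ngram_counts(tokens, MIN_N, MAX_N):
    # Same double loop as above, with the yield replaced by a counter update.
    counts = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            counts[" ".join(tokens[i:j])] += 1
    return counts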

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

# Cython version: put this in a .pyx file and compile it.
from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    # Typed loop indices let Cython generate plain C loops.
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)

    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1

    return count
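
One way to build and call the compiled version, assuming the code above is saved as ngrams.pyx and Cython is installed (the file name is an assumption for illustration):

import pyximport
pyximport.install()  # compiles .pyx modules on import

from ngrams import ngrams  # hypothetical module name for the .pyx file above

counts = ngrams("the quick brown fox the quick".split(), 1, 2)
print(counts["the quick"])  # 2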
美胚控场 2024-12-14 03:06:29

You might find a pythonic, elegant and fast n-gram generation function using zip and the splat (*) operator here:

def find_ngrams(input_list, n):
    # zip over n progressively shifted copies of the list; each tuple is one n-gram.
    return zip(*[input_list[i:] for i in range(n)])
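
A quick usage check (note that zip returns an iterator in Python 3, so wrap it in list() to materialize the n-grams):

tokens = "to be or not to be".split()
print(list(find_ngrams(tokens, 2)))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]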
七禾 2024-12-14 03:06:29

For character-level n-grams you could use the following function

def ngrams(text, n):
    # For each position i, take the n characters ending at text[i];
    # the first n-1 windows are incomplete, so the final slice drops them.
    n -= 1
    return [text[i-n:i+1] for i, char in enumerate(text)][n:]
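
For example, character trigrams of a short string:

print(ngrams("ngram", 3))
# ['ngr', 'gra', 'ram']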
忱杏 2024-12-14 03:06:29

def generate_ngrams(words, ngram=2):
    # All contiguous windows of `ngram` consecutive words.
    return [words[i:i+ngram] for i in range(len(words)-ngram+1)]



sentence = "I really like python, it's pretty awesome."
words = sentence.split()
words

['I', 'really', 'like', 'python,', "it's", 'pretty', 'awesome.']


res = generate_ngrams(words, ngram=2)
res

[['I', 'really'],
 ['really', 'like'],
 ['like', 'python,'],
 ['python,', "it's"],
 ["it's", 'pretty'],
 ['pretty', 'awesome.']]


res = generate_ngrams(words, ngram=3)
res

[['I', 'really', 'like'],
 ['really', 'like', 'python,'],
 ['like', 'python,', "it's"],
 ['python,', "it's", 'pretty'],
 ["it's", 'pretty', 'awesome.']]


res = generate_ngrams(words, ngram=4)
res

[['I', 'really', 'like', 'python,'],
 ['really', 'like', 'python,', "it's"],
 ['like', 'python,', "it's", 'pretty'],
 ['python,', "it's", 'pretty', 'awesome.']]