Repeated phrases in text (Python)
I have a problem and I have no idea how to solve it. Please give me a piece of advice.

I have a text. A big, big text. The task is to find all the repeated phrases of length 3 (consisting of three words) in the text.
You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are "the black dog" and "The black, dog?" the same phrase? A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.

A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
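The example code did not survive the page scrape; a minimal sketch matching the description above (whitespace-delimited words, non-alpha characters stripped, lower-cased) might look like:

```python
import re

def words(text):
    """Yield normalized words from text one at a time:
    split on whitespace, strip non-alphabetic characters,
    and convert to lower case."""
    for token in text.split():
        word = re.sub(r"[^a-z]", "", token.lower())
        if word:  # skip tokens that were pure punctuation
            yield word
```

Because this is a generator, it never materializes the whole word list; each word is produced on demand as the consumer asks for it.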
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
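The grouping code was also lost; a sketch of such a phrase-grouping generator (a sliding three-word window over a stream of words; the names are illustrative, not the original code):

```python
from itertools import islice

def phrases(words, n=3):
    """Slide an n-word window over a stream of words,
    yielding each overlapping n-word phrase as a tuple."""
    it = iter(words)
    window = tuple(islice(it, n))  # fill the initial window
    if len(window) == n:
        yield window
    for w in it:
        window = window[1:] + (w,)  # advance the window by one word
        yield window
```

Like the word generator, this holds only n words in memory at any moment, so it composes cleanly with a streaming input.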
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
Significantly, chaining the generators together only traverses the input once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase. This makes a single pass over the text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
The crudest way would be to read the text into a string, do a string.split(), and get the individual words in a list. You could then slice the list three words at a time, and use collections.defaultdict(int) to keep the count:

d = collections.defaultdict(int)
d[phrase] += 1

As I said, it's very crude, but it should certainly get you started.
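Fleshed out, that crude approach might look like the following (the sample text is just for illustration):

```python
import collections

text = "one two three one two three one two"
words = text.split()

# Tally every overlapping three-word slice of the word list.
d = collections.defaultdict(int)
for i in range(len(words) - 2):
    phrase = tuple(words[i:i + 3])
    d[phrase] += 1

# Phrases counted more than once are the repeats.
repeats = [p for p, count in d.items() if count > 1]
```

Unlike the generator approach above, this builds the whole word list in memory, which is the inefficiency the first answer warns about, but for modest inputs it works fine.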
I would suggest looking at the NLTK toolkit. This is open source and intended for natural-language teaching. As well as higher-level NLP functions, it has a lot of tokenizing-type functions and collections.
Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl, which was designed for text processing, or C++ for pure performance.
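The solution itself was lost in the page scrape; a Counter-based sketch in the same O(n) spirit (an assumption, not the original answer's code) could be:

```python
from collections import Counter

def repeated_phrases(text, n=3):
    """Roughly O(n) in the number of words: lowercase, strip
    surrounding punctuation, count every n-word window, and
    return the phrases seen more than once."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    # zip the word list against n shifted copies of itself
    counts = Counter(zip(*(words[i:] for i in range(n))))
    return {" ".join(p): c for p, c in counts.items() if c > 1}
```

For example, `repeated_phrases("I have a text. Big, big text. I have a text.")` reports "i have a" and "have a text" as repeats.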