Repeated phrases in text (Python)
I have a problem and I have no idea how to solve it. Please give me a piece of advice.

I have a text. A big, big text. The task is to find all the repeated phrases of length 3 (consisting of three words) in the text.
You have, it seems to me, two problems.
The first is coming up with an efficient way of normalizing the input. You say you want to find all of the three-word phrases in the input, but what constitutes a phrase? For instance, are "the black dog" and "The black, dog?" the same phrase? A way of doing this, as marcog suggests, is by using something like re.findall. But this is pretty inefficient: it traverses your entire input and copies the words into a list, and then you have to process that list. If your input text is very long, that's going to be wasteful of both time and space.

A better approach would be to treat the input as a stream, and build a generator that pulls off one word at a time. Here's an example, which uses spaces as the delimiter between words, then strips non-alpha characters out of the words and converts them to lower case:
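The example code did not survive the page scrape; a minimal sketch matching the description above (whitespace-delimited words, non-alpha characters stripped, lower-cased) might look like:

```python
import re

def words(text):
    """Yield normalized words from text one at a time:
    split on whitespace, strip non-alphabetic characters,
    and convert to lower case."""
    for token in text.split():
        word = re.sub(r"[^a-z]", "", token.lower())
        if word:  # skip tokens that were pure punctuation
            yield word
```

Because this is a generator, it never materializes the whole word list; each word is produced on demand as the consumer asks for it.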
The second problem is grouping the normalized words into three-word phrases. Again, here is a place where a generator will perform efficiently:
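The grouping code was also lost; a sketch of such a phrase-grouping generator (a sliding three-word window over a stream of words; the names are illustrative, not the original code):

```python
from itertools import islice

def phrases(words, n=3):
    """Slide an n-word window over a stream of words,
    yielding each overlapping n-word phrase as a tuple."""
    it = iter(words)
    window = tuple(islice(it, n))  # fill the initial window
    if len(window) == n:
        yield window
    for w in it:
        window = window[1:] + (w,)  # advance the window by one word
        yield window
```

Like the word generator, this holds only n words in memory at any moment, so it composes cleanly with a streaming input.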
There's almost certainly a simpler version of that function possible, but this one's efficient, and it's not hard to understand.
Significantly, chaining the generators together only traverses the input once, and it doesn't build any large temporary data structures in memory. You can use the result to build a defaultdict keyed by phrase. This makes a single pass over the text as it counts the phrases. When it's done, find every entry in the dictionary whose value is greater than one.
The crudest way would be to read the text into a string, do a string.split(), and get the individual words in a list. You could then slice the list three words at a time, and use collections.defaultdict(int) to keep the count:

d = collections.defaultdict(int)
d[phrase] += 1

As I said, it's very crude, but it should certainly get you started.
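Fleshed out, that crude approach might look like the following (the sample text is just for illustration):

```python
import collections

text = "one two three one two three one two"
words = text.split()

# Tally every overlapping three-word slice of the word list.
d = collections.defaultdict(int)
for i in range(len(words) - 2):
    phrase = tuple(words[i:i + 3])
    d[phrase] += 1

# Phrases counted more than once are the repeats.
repeats = [p for p, count in d.items() if count > 1]
```

Unlike the generator approach above, this builds the whole word list in memory, which is the inefficiency the first answer warns about, but for modest inputs it works fine.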
I would suggest looking at the NLTK toolkit. This is open source and intended for natural-language teaching. As well as higher-level NLP functions, it has a lot of tokenizing-type functions and collections.
Here's a roughly O(n) solution, which should work on pretty large input texts. If it's too slow, you probably want to look into using Perl, which was designed for text processing, or C++ for pure performance.
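The solution itself was lost in the page scrape; a Counter-based sketch in the same O(n) spirit (an assumption, not the original answer's code) could be:

```python
from collections import Counter

def repeated_phrases(text, n=3):
    """Roughly O(n) in the number of words: lowercase, strip
    surrounding punctuation, count every n-word window, and
    return the phrases seen more than once."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    # zip the word list against n shifted copies of itself
    counts = Counter(zip(*(words[i:] for i in range(n))))
    return {" ".join(p): c for p, c in counts.items() if c > 1}
```

For example, `repeated_phrases("I have a text. Big, big text. I have a text.")` reports "i have a" and "have a text" as repeats.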