Count co-occurrences of words from a specified vocabulary within a specified radius?

Posted on 2025-01-27 13:53:52

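I have a vocabulary V = ["anarchism", "originated", "term", "abuse"] and a list of words test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism'].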

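I'd like to count the number of co-occurrences, within a radius R in the word list test, between each pair of words in the vocabulary V. If R = 5, say, then for each occurrence of a given vocabulary word we look 5 words to its left and 5 words to its right, and count how many times each word in V occurs within that radius of 5.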
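As a minimal sketch of the window described above (the helper name `window` is hypothetical, not part of the question), the neighbours of position `i` can be taken with two slices. The left bound is clamped at 0, because a negative slice start in Python would count from the end of the list instead of stopping at the beginning.

```python
def window(words, i, radius):
    """Return the words within `radius` positions of index i, excluding words[i]."""
    left = words[max(0, i - radius):i]    # clamp so the slice never wraps around
    right = words[i + 1:i + 1 + radius]   # slicing past the end is safe in Python
    return left + right


test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']

print(window(test, 0, 5))   # ['originated', 'as', 'a', 'term', 'of']
```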

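For example, take the first word in V, "anarchism". It occurs first and last in test. For the first occurrence, we look 5 words to the left (i.e. nothing) and 5 words to the right ('originated', 'as', 'a', 'term', 'of'). Are any of these "anarchism"? No. For the last occurrence of "anarchism", we look 5 words to the left ('diggers', 'abuse', 'the', 'english', 'term') and 5 words to the right (again, nothing). Hence "anarchism" never occurs within a radius of 5 words of itself, so the (0, 0) entry of the output matrix, corresponding to ("anarchism", "anarchism"), is 0. However, the word "originated" occurs once within 5 words of "anarchism", so the (0, 1) entry (i.e. the ("anarchism", "originated") cell of the output matrix) is 1. Similarly, the word "term" occurs once within radius 5 of the first occurrence of "anarchism" and once within radius 5 of the second occurrence, so the (0, 2) entry of the output is 2. We continue in this way for each word in the vocabulary V.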
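The walk-through for "anarchism" can be reproduced with a short sketch that computes only that one row. The names `target`, `counts` and `row` are illustrative choices, not from the question, and `collections.Counter` is just one convenient way to tally the neighbours.

```python
from collections import Counter

V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']
R = 5

target = "anarchism"
counts = Counter()
for i, word in enumerate(test):
    if word != target:
        continue
    # Up to R words on each side of this occurrence, excluding the word itself.
    neighbours = test[max(0, i - R):i] + test[i + 1:i + 1 + R]
    counts.update(w for w in neighbours if w in V)

# Order the tallies like V to get the row of the matrix for `target`.
row = [counts[w] for w in V]
print(row)   # [0, 1, 2, 1]
```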

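The resulting output is therefore a 4x4 matrix (since there are 4 words in V), and it is symmetric, since for example the co-occurrence count for ("anarchism", "originated") is the same as for ("originated", "anarchism").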

For this example, the output (e.g. a NumPy array) looks like:


0	1	2	1
1	0	1	1
2	1	0	2
1	1	2	0

Each row and column corresponds to the respective entry of V. How can I achieve this in Python?
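One possible approach, offered as a minimal sketch and not a definitive answer: walk through test once, and for every vocabulary word found, count the vocabulary words inside its window. The function name `cooccurrence_matrix` and its parameters are hypothetical. It returns plain nested lists so that it runs without extra dependencies.

```python
def cooccurrence_matrix(words, vocab, radius):
    """Count, for each pair of vocabulary words, co-occurrences within `radius`."""
    index = {w: k for k, w in enumerate(vocab)}          # word -> row/column
    counts = [[0] * len(vocab) for _ in vocab]
    for i, word in enumerate(words):
        if word not in index:
            continue
        # Neighbours on both sides; the left bound is clamped at 0.
        neighbours = words[max(0, i - radius):i] + words[i + 1:i + 1 + radius]
        for other in neighbours:
            if other in index:
                counts[index[word]][index[other]] += 1
    return counts


V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals', 'including',
        'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']

matrix = cooccurrence_matrix(test, V, 5)
for matrix_row in matrix:
    print(matrix_row)
```

If a NumPy array is wanted, the nested list can be passed to `numpy.array`. The result is symmetric because "within R positions" is a symmetric relation: whenever one occurrence sees another in its window, the reverse also holds.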

Original text

I have a vocabulary V = ["anarchism", "originated", "term", "abuse"], and a list of words
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism'].

I'd like to count the number of co-occurrences within a radius R in the list of words test between each word in the vocabulary V. If R=5, say, then we look 5 words to the left of a given vocabulary word in V and 5 words to the right. We then count the number of times each word in V occurs within that radius of 5.

For example, let's take the first word in V, "anarchism". The word "anarchism" occurs first and last in test. For the first occurrence, we look 5 words to the left (i.e. nothing) and 5 words to the right ('originated', 'as', 'a', 'term', 'of'). Are any of these "anarchism"? No. For the last occurrence of "anarchism", we look 5 words to the left ('diggers', 'abuse', 'the', 'english', 'term') and 5 words to the right (again, nothing). Hence "anarchism" does not occur within a radius of 5 words of itself, so the (0, 0) entry of the output matrix, corresponding to ("anarchism", "anarchism"), is 0. However, the word "originated" occurs once within 5 words of "anarchism", so the (0, 1) entry (i.e. the ("anarchism", "originated") cell of the output matrix) is 1. Similarly, the word "term" occurs once within radius 5 of the first occurrence of "anarchism" and once within radius 5 of the second occurrence of "anarchism", so the (0, 2) entry of the output is 2. We continue in this way for each word in the vocabulary V.

The resulting output is therefore a 4x4 matrix (since there are 4 words in V), and it is symmetric, since for example the counts of co-occurrences for ("anarchism", "originated") are the same as ("originated", "anarchism").

For this example, the output (e.g. numpy array) looks like: