I have a vocabulary `V = ["anarchism", "originated", "term", "abuse"]` and a list of words `test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term', 'anarchism']`.
I'd like to count the number of co-occurrences, within a radius `R` in the list of words `test`, between each pair of words in the vocabulary `V`. If `R = 5`, say, then for every occurrence of a given vocabulary word in `test` we look 5 words to the left and 5 words to the right, and count how many times each word in `V` occurs within that radius of 5.
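To make the "radius" concrete, here is a minimal sketch of the neighbourhood lookup. The helper name `window` is hypothetical (it is not part of any library), and it assumes the word at the centre position is itself excluded and that the window is simply cut short at either end of the list.

```python
def window(words, i, r):
    """Return the words within r positions of index i, excluding words[i]."""
    left = words[max(0, i - r):i]    # clipped at the start of the list
    right = words[i + 1:i + 1 + r]   # slicing clips at the end automatically
    return left + right


test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals',
        'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term',
        'anarchism']

print(window(test, 0, 5))   # ['originated', 'as', 'a', 'term', 'of']
print(window(test, 21, 5))  # ['diggers', 'abuse', 'the', 'english', 'term']
```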
For example, let's take the first word in `V`, "anarchism". The word "anarchism" occurs first and last in `test`. For the first occurrence, we look 5 words to the left (i.e. nothing) and 5 words to the right (`'originated', 'as', 'a', 'term', 'of'`). Are any of these "anarchism"? No. For the last occurrence of "anarchism", we look 5 words to the left (`'diggers', 'abuse', 'the', 'english', 'term'`) and 5 words to the right (again, nothing). Hence "anarchism" never occurs within a radius of 5 words of itself, so the (0, 0) entry of the output matrix, corresponding to ("anarchism", "anarchism"), is 0. However, the word "originated" occurs once within 5 words of "anarchism", so the (0, 1) entry (i.e. the ("anarchism", "originated") cell) of the output matrix is 1. Similarly, the word "term" occurs once within radius 5 of the first occurrence of "anarchism" and once within radius 5 of the second occurrence, so the (0, 2) entry of the output is 2. We continue in this way for each word in the vocabulary `V`.
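The walkthrough for "anarchism" can be reproduced with a short sketch that computes a single row of the matrix. The function name `cooccurrence_row` is hypothetical, and the code assumes the counting rule described above: every occurrence of the target word contributes its own window, and the target's own position is not counted.

```python
from collections import Counter

V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals',
        'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term',
        'anarchism']
R = 5


def cooccurrence_row(target, vocab, words, r):
    """Count how often each vocab word appears within r words of target."""
    counts = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        # Neighbours on both sides, excluding position i itself.
        neighbours = words[max(0, i - r):i] + words[i + 1:i + 1 + r]
        counts.update(n for n in neighbours if n in vocab)
    # A Counter returns 0 for vocabulary words that were never seen.
    return [counts[v] for v in vocab]


print(cooccurrence_row("anarchism", V, test, R))  # [0, 1, 2, 1]
```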
The resulting output is therefore a 4x4 matrix (since there are 4 words in `V`), and it is symmetric, since for example the co-occurrence count for ("anarchism", "originated") is the same as the count for ("originated", "anarchism").
For this example, the output (e.g. numpy array) looks like:
| anarchism | originated | term | abuse |
|---|---|---|---|
| 0 | 1 | 2 | 1 |
| 1 | 0 | 1 | 1 |
| 2 | 1 | 0 | 2 |
| 1 | 1 | 2 | 0 |
Each row and column corresponds to the respective entries of `V`. How can I implement this in Python?
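One possible approach, offered as a minimal sketch rather than a definitive answer: walk through `test` once and, for every vocabulary word found, increment the cell for each vocabulary word inside its window. The function name `cooccurrence_matrix` is hypothetical, and the sketch uses plain nested lists so it has no dependencies.

```python
def cooccurrence_matrix(vocab, words, r):
    """Build a len(vocab) x len(vocab) matrix of co-occurrence counts."""
    index = {word: k for k, word in enumerate(vocab)}  # word -> row/column
    matrix = [[0] * len(vocab) for _ in vocab]
    for i, word in enumerate(words):
        if word not in index:
            continue
        # Window of radius r around position i, clipped to the list bounds.
        lo, hi = max(0, i - r), min(len(words), i + r + 1)
        for j in range(lo, hi):
            if j != i and words[j] in index:
                matrix[index[word]][index[words[j]]] += 1
    return matrix


V = ["anarchism", "originated", "term", "abuse"]
test = ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first',
        'used', 'against', 'early', 'working', 'class', 'radicals',
        'including', 'the', 'diggers', 'abuse', 'the', 'english', 'term',
        'anarchism']

result = cooccurrence_matrix(V, test, 5)
for row in result:
    print(row)
```

For the example data this reproduces the matrix shown above. If a NumPy array is wanted, the nested list can be passed to `numpy.array(result)`. The loop does work proportional to `len(test) * R`, which should be acceptable for moderate radii.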