Match trigrams, bigrams, and unigrams against a text; skip a unigram or bigram that is a substring of an already matched trigram (Python)
main_text is a list of lists containing sentences that've been part-of-speech tagged:
main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes', 'VB'),
              ('tea', 'NN'), ('and', 'CC'), ('hats', 'NN')],
             [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'), ('hates', 'VB'), ('alice', 'NN')]]
ngrams_to_match is a list of lists containing part-of-speech tagged trigrams:
ngrams_to_match = [[('likes','VB'),('tea','NN'), ('and','CC')],
[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
[('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC') ],
[('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]
(a) For each sentence in main_text, first check whether a complete trigram in ngrams_to_match matches. If the trigram matches, return the matched trigram and the sentence.
(b) Then, check whether the first tuple (a unigram) or the first two tuples (a bigram) of each of the trigrams match in main_text.
(c) If the unigram or bigram forms a substring of an already matched trigram, don't return anything. Otherwise, return the bigram or unigram match and the sentence.
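One possible reading of conditions (a) to (c) can be sketched as follows. This is only a sketch under stated assumptions: the helper names (words_of, find_at, match_ngrams) are made up for illustration, matching is done on words only (tags are ignored), "substring" is taken to mean a contiguous run of words inside any matched trigram, and the longest matching prefix of a trigram wins over a shorter one.

```python
main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes', 'VB'),
              ('tea', 'NN'), ('and', 'CC'), ('hats', 'NN')],
             [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'), ('hates', 'VB'),
              ('alice', 'NN')]]

ngrams_to_match = [[('likes', 'VB'), ('tea', 'NN'), ('and', 'CC')],
                   [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
                   [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')],
                   [('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]


def words_of(tagged):
    """Drop the POS tags and keep only the words, as a tuple."""
    return tuple(word for word, _tag in tagged)


def find_at(needle, haystack):
    """Return every index where the word tuple `needle` occurs contiguously."""
    n = len(needle)
    return [i for i in range(len(haystack) - n + 1) if haystack[i:i + n] == needle]


def match_ngrams(main_text, ngrams_to_match):
    """Return (kind, words, sentence_index) triples (hypothetical output shape)."""
    sentences = [words_of(sentence) for sentence in main_text]
    trigrams = [words_of(ngram) for ngram in ngrams_to_match]
    results = []

    # (a) Full trigram matches come first.
    matched_trigrams = []
    for tri in trigrams:
        for idx, sent in enumerate(sentences):
            if find_at(tri, sent):
                results.append(('trigram', tri, idx))
                if tri not in matched_trigrams:
                    matched_trigrams.append(tri)

    # (b) Try the bigram prefix, then the unigram prefix, of each trigram.
    for tri in trigrams:
        for size, kind in ((2, 'bigram'), (1, 'unigram')):
            prefix = tri[:size]
            # (c) Skip a prefix that sits inside an already matched trigram.
            if any(find_at(prefix, matched) for matched in matched_trigrams):
                continue
            hits = [idx for idx, sent in enumerate(sentences) if find_at(prefix, sent)]
            if hits:
                results.extend((kind, prefix, idx) for idx in hits)
                break  # assumption: a matched bigram suppresses its own unigram
    return results


for kind, words, idx in match_ngrams(main_text, ngrams_to_match):
    print(kind, words, 'sentence[%d]' % idx)
```

On the sample data this reports the two trigram matches in sentence[0] and the ('hates', 'alice') bigram in sentence[1]. The unigram ('and',) is dropped because it occurs inside the matched trigram ('likes', 'tea', 'and').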
Here is what the output should be:
trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
bigram_match = [('hates', 'DT'), ('alice', 'JJ')], sentence[1]
Condition (b) gives us the bigram_match.
The WRONG output would be:
trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
bigram_match = [('the', 'DT'), ('mad', 'JJ')] #*bad by condition c
unigram_match = [('the', 'DT')] #*bad by condition c
trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
bigram_match = [('likes','VB'),('tea','NN')] #*bad by condition c
unigram_match = [('likes', 'VB')] #*bad by condition c
and so on.
The following, very ugly code works okay for this toy example. But I was wondering if anyone had a more streamlined approach.
matched_words = set()  # words of every trigram or bigram matched so far
for ngram in ngrams_to_match:
    for sentence in main_text:
        for unigram_index, tup in enumerate(sentence):
            # we can't be sure that our part-of-speech tagger will
            # tag an ngram word and a main_text word the same way, so
            # we match the word in the tuple, not the whole tuple
            if ngram[0][0] != tup[0]:  # the first ngram word must match
                continue
            unigram = tup[0]  # save it as a unigram
            # slicing never raises IndexError near the end of a sentence
            window = tuple(word for word, _ in sentence[unigram_index:unigram_index + 3])
            if window == tuple(word for word, _ in ngram):  # match a trigram
                trigram = window  # save the match
                matched_words.update(trigram)
                print('heres the trigram-->', sentence, '\n', 'trigram--->', trigram)
            elif window[:2] == (ngram[0][0], ngram[1][0]):  # get bigram match
                bigram = window[:2]  # save the match
                if set(bigram) <= matched_words:  # no substring matches
                    pass
                else:
                    matched_words.update(bigram)
                    print('heres a sentence-->', sentence, '\n', 'bigram--->', bigram)
            elif unigram in matched_words:  # no substring matches
                pass
            else:
                print(unigram)
Comments (1)
I've had a stab at implementing this using a generator. I found some gaps in your spec, so I've made assumptions.
"If the unigram or bigram forms a substring of an already matched trigram, don't return anything." This is a bit ambiguous about whether the gram in question is a search element or a matched element. It makes me start to hate the N-gram words (which I'd never heard of before last week).
Play with what gets added to the found set in order to modify which search elements are excluded. Yields: