Match trigrams, bigrams, and unigrams against a text; if a unigram or bigram is a substring of an already matched trigram, pass; Python

Posted on 2024-12-18 14:34:28


main_text is a list of lists containing sentences that've been part-of-speech tagged:

 main_text = [[('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'), ('likes', 'VB'),
               ('tea', 'NN'), ('and', 'CC'), ('hats', 'NN')],
              [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'), ('hates', 'VB'),
               ('alice', 'NN')]]

ngrams_to_match is a list of lists containing part-of-speech tagged trigrams:

 ngrams_to_match = [[('likes','VB'),('tea','NN'), ('and','CC')],
                    [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')],
                    [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC') ],
                    [('and', 'CC'), ('the', 'DT'), ('rabbit', 'NN')]]

(a) For each sentence in main_text, first check to see if a complete trigram in ngrams_to_match matches. If the trigram matches, return the matched trigram and the sentence.

(b) Then, check to see if the first tuple (a unigram) or the first two tuples (a bigram) of each of the trigrams match in main_text.

(c) If the unigram or bigram forms a substring of an already matched trigram, don't return anything. Otherwise, return the bigram or unigram match and the sentence.
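
Conditions (a) and (b) can be read as window lookups: take every run of n consecutive words in a sentence and ask whether a prefix of a trigram is among them. The following is only a minimal sketch of that reading, not the asker's code. The helper names `word_windows` and `word_prefixes` are made up for illustration, and the sketch assumes that only the words are compared, not the tags.

```python
def word_windows(sentence, n):
    """Return every run of n consecutive words in a tagged sentence (tags dropped)."""
    words = [word for word, _tag in sentence]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def word_prefixes(trigram):
    """Return the unigram, bigram and trigram word prefixes of a tagged trigram."""
    words = tuple(word for word, _tag in trigram)
    return [words[:n] for n in (1, 2, 3)]


sentence = [('the', 'DT'), ('red', 'JJ'), ('queen', 'NN'),
            ('hates', 'VB'), ('alice', 'NN')]
trigram = [('hates', 'DT'), ('alice', 'JJ'), ('but', 'CC')]

# For each prefix, check whether it occurs as a window of the same length.
hits = [p for p in word_prefixes(trigram) if p in word_windows(sentence, len(p))]
print(hits)  # [('hates',), ('hates', 'alice')]
```

Both the unigram and the bigram prefix are found here; the expected output in the question reports only the longer of the two.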

Here is what the output should be:

 trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
 trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
 bigram_match = [('hates', 'DT'), ('alice', 'JJ')], sentence[1]

Condition (b) gives us the bigram_match.

The WRONG output would be:

 trigram_match = [('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN')], sentence[0]
 bigram_match = [('the', 'DT'), ('mad', 'JJ')] #*bad by condition c
 unigram_match = [('the', 'DT')] #*bad by condition c
 trigram_match = [('likes','VB'),('tea','NN'), ('and','CC')], sentence[0]
 bigram_match = [('likes','VB'),('tea','NN')] #*bad by condition c
 unigram_match = [('likes', 'VB')] #*bad by condition c

and so on.
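
One way to picture condition (c) is as a filter that takes the "wrong" list and removes every match that is a prefix of a longer match. This is a rough sketch under two assumptions: matches are compared as word tuples only, and the function name `drop_prefix_matches` is hypothetical.

```python
def drop_prefix_matches(matches):
    """Keep only matches that are not a proper prefix of another match."""
    return [m for m in matches
            if not any(len(o) > len(m) and o[:len(m)] == m for o in matches)]


candidates = [
    ('the', 'mad', 'hatter'), ('the', 'mad'), ('the',),
    ('likes', 'tea', 'and'), ('likes', 'tea'), ('likes',),
    ('hates', 'alice'), ('hates',),
]
kept = drop_prefix_matches(candidates)
print(kept)
# [('the', 'mad', 'hatter'), ('likes', 'tea', 'and'), ('hates', 'alice')]
```

Checking prefixes is enough in this setting because condition (b) only ever produces the first one or two tuples of a trigram.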

The following, very ugly code works okay for this toy example. But I was wondering if anyone had a more streamlined approach.

 def contains(words, sub):
     """True if the words in `sub` appear contiguously in `words`."""
     return any(words[i:i + len(sub)] == sub
                for i in range(len(words) - len(sub) + 1))

 matched = []  # word tuples of every n-gram reported so far

 for size in (3, 2, 1):  # trigrams first, then bigrams, then unigrams
     for ngram in ngrams_to_match:

         # we can't be sure that our part-of-speech tagger will
         # tag an ngram word and a main_text word the same way, so
         # we match the word in the tuple, not the whole tuple
         words = tuple(tup[0] for tup in ngram[:size])

         # no substring matches: skip anything inside an earlier match
         if any(contains(m, words) for m in matched):
             continue

         for sent_index, sentence in enumerate(main_text):
             sentence_words = tuple(tup[0] for tup in sentence)
             if contains(sentence_words, words):
                 matched.append(words)  # save the match
                 print('%d-gram match --> %r in sentence[%d]'
                       % (size, ngram[:size], sent_index))


Comments (1)

开始看清了 2024-12-25 14:34:28

I've had a stab at implementing this using a generator. I found some gaps in your spec, so I've made assumptions.

The condition "If the unigram or bigram forms a substring of an already matched trigram, don't return anything" is a bit ambiguous about whether each "gram" refers to the search elements or to the matched elements. It makes me start to hate the N-gram terminology (which I'd never heard of before last week).

Play with what gets added to the found set in order to modify excluded search elements.

# assumptions:
# - [('hates','DT'),('alice','JJ'),('but','CC')] is typoed and should be:
#   [('hates','VB'),('alice','NN'),('but','CC')]
# - matches can't overlap, matched elements are excluded from further checking
# - bigrams precede unigrams

main_text = [
  [('the','DT'),('mad','JJ'),('hatter','NN'),('likes','VB'),('tea','NN'),('and','CC'),('hats','NN')],
  [('the','DT'),('red','JJ'),('queen','NN'),('hates','VB'),('alice','NN')]
]
ngrams_to_match = [
  [('likes','VB'),('tea','NN'),('and','CC')],
  [('the','DT'),('mad','JJ'),('hatter','NN')],
  [('hates','VB'),('alice','NN'),('but','CC')],
  [('and','CC'),('the','DT'),('rabbit','NN')]
]

def slice_generator(sentence,size=3):
  """
  Generate slices through the sentence in decreasing sized windows. If True is sent to the
  generator, the elements from the previous window will be excluded from future slices.
  """
  sent = list(sentence)
  for c in range(size,0,-1):
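    for i in range(len(sent)):
      window = tuple(sent[i:i+c])  # renamed so the built-in slice is not shadowed
      if all(x is not None for x in window) and len(window) == c:
        used = yield window
        if used:
          # blank out exactly the c elements just matched; slicing to i+size
          # here could change the length of sent when c < size
          sent[i:i+c] = [None] * c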

def gram_search(text,matches):
  tri_bi_uni = set(tuple(x) for x in matches) | set(tuple(x[:2]) for x in matches) | set(tuple(x[:1]) for x in matches)
  found = set()
  for i, sentence in enumerate(text):
    gen = slice_generator(sentence)
    send = None
    try:
      while True:
        row = gen.send(send)
        if row in tri_bi_uni - found:
          send = True
          found |= set(tuple(row[:x]) for x in range(1,len(row)))
          print("%s_gram_match, sentence[%s] = %r" % (len(row),i,row))
        else:
          send = False
    except StopIteration:
      pass

gram_search(main_text,ngrams_to_match)

Yields:

3_gram_match, sentence[0] = (('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'))
3_gram_match, sentence[0] = (('likes', 'VB'), ('tea', 'NN'), ('and', 'CC'))
2_gram_match, sentence[1] = (('hates', 'VB'), ('alice', 'NN'))
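
As an illustration of "playing with what gets added to the found set", the sketch below compares two candidate rules. `proper_prefixes` reproduces the expression used in `gram_search` above. `all_subgrams` is a hypothetical, broader alternative that is not part of the answer: it would exclude every contiguous sub-gram of a match, including the match itself.

```python
def proper_prefixes(row):
    """The shorter prefixes of a matched row, as gram_search adds to `found`."""
    return set(tuple(row[:x]) for x in range(1, len(row)))


def all_subgrams(row):
    """Every contiguous sub-gram of a matched row, including the row itself."""
    return set(tuple(row[i:j])
               for i in range(len(row))
               for j in range(i + 1, len(row) + 1))


row = (('the', 'DT'), ('mad', 'JJ'), ('hatter', 'NN'))
print(sorted(proper_prefixes(row), key=len))
print(len(all_subgrams(row)))  # 6
```

With the broader rule the matched trigram itself lands in `found`, so it would presumably not be reported again if it recurred in a later sentence. Whether that is desirable depends on how the spec is read.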