Best way to match a large list against a string in Python

Posted on 2024-11-08 22:50:01

I have a Python list that contains about 700 terms that I would like to use as metadata for some database entries in Django. I would like to match the terms in the list against the entry descriptions to see if any of the terms match, but there are a couple of issues. My first issue is that there are some multiword terms within the list that contain words from other list entries. An example is:

Intrusion
Intrusion Detection

I have not gotten very far with re.findall as it will match both Intrusion and Intrusion Detection in the above example. I would only want to match Intrusion Detection and not Intrusion.

Is there a better way to do this type of matching? I thought about trying NLTK, but it didn't look like it could help with this type of matching.

Edit:

So to add a little more clarity, I have a list of 700 terms such as firewall or intrusion detection. I would like to try to match these words in the list against descriptions that I have stored in a database to see if any match, and I will use those terms in metadata. So if I have the following string:

There are many types of intrusion detection devices in production today.

and if I have a list with the following terms:

Intrusion
Intrusion Detection

I would like to match 'intrusion detection', but not 'intrusion'. Really I would like to also be able to match singular/plural instances too, but I may be getting ahead of myself. The idea behind all of this is to take all of the matches and put them in a list, and then process them.
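
The behaviour asked for here (prefer the longer term when a shorter one is contained in it) can be sketched with plain re. This is only an illustration under some assumptions: the three-item terms list, the match_terms helper, and the crude "optional trailing s" plural rule are all made up for the example and are not part of the original post.

```python
import re

# Example terms (hypothetical subset of the 700-term list)
terms = ['Intrusion', 'Intrusion Detection', 'Firewall']

# Longest terms first: regex alternation tries alternatives left to right,
# so 'Intrusion Detection' is tried before 'Intrusion' at the same position.
ordered = sorted(terms, key=len, reverse=True)
alternation = '|'.join(re.escape(t) for t in ordered)

# \b keeps matches on word boundaries; the optional 's' is a crude plural rule
pattern = re.compile(r'\b(' + alternation + r')s?\b', re.IGNORECASE)

# Map the lower-cased matched text back to the term as written in the list
canonical = {t.lower(): t for t in terms}

def match_terms(description):
    """Return the list terms found in description, longest match winning."""
    return [canonical[m.group(1).lower()] for m in pattern.finditer(description)]

print(match_terms('There are many types of intrusion detection devices in production today.'))
```

Because finditer consumes the text it has matched, the shorter term 'Intrusion' is not reported again for the same words once 'Intrusion Detection' has matched them.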

你的他你的她 2024-11-15 22:50:01

If you need more flexibility to match entry descriptions, you can combine nltk and re:

from nltk.stem import PorterStemmer
import re

Let's say you have different descriptions of the same event, e.g. a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, etc.

master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]

terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]

stemmer = PorterStemmer()

# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)

# add 'match anything after it' expression to each of the stemmed words
# join result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]
print(regex_patterns)
print('')

for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)

Output:

['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']

['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']

EDIT:

To see which of the terms caused the match:

for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))

Output:

TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
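
One thing to keep in mind with the patterns above: '.*' matches any amount of text, so 'Intrus.*Detect.*' also matches when the two words are far apart, and the match runs to the end of the sentence. A tighter variant might look like the sketch below. It is an assumption-laden illustration, not the answer's method: the stems are hard-coded so the snippet runs without nltk, and build_pattern and find_terms are hypothetical helper names.

```python
import re

# Stems hard-coded for illustration; in practice they would come from a stemmer
stemmed_terms = {
    'Intrusion Detection': ['intrus', 'detect'],
    'Firewall': ['firewal'],
}

def build_pattern(stems):
    # \b anchors at a word start, \w* absorbs the word's suffix,
    # \W+ between stems requires the words to be adjacent
    body = r'\W+'.join(re.escape(stem) + r'\w*' for stem in stems)
    return re.compile(r'\b' + body, re.IGNORECASE)

patterns = {term: build_pattern(stems) for term, stems in stemmed_terms.items()}

def find_terms(sentence):
    """Return (term, matched text) pairs for every term found in sentence."""
    found = []
    for term, pattern in patterns.items():
        m = pattern.search(sentence)
        if m:
            found.append((term, m.group(0)))
    return found

print(find_terms('There are many types of intrusion detection devices in production today.'))
```

The trade-off is that this version no longer matches 'approve a complete rewrite', where other words sit between the stems; whether that is desirable depends on how loosely the terms should match.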

感性 2024-11-15 22:50:01

This question is unclear, but from what I understand you have a master list of terms, say one term per line. Next you have a list of test data, where some of the test data will be in the master list and some won't. You want to see if the test data is in the master list and, if it is, perform a task.

Assuming your master list looks like this:

Intrusion Detection
Firewall
FooBar

and your test data looks like this:

Intrusion
Intrusion Detection
foo
bar

this simple script should lead you in the right direction:

#!/usr/bin/env python3

import sys


def main():
    '''usage: tester.py masterList testList'''

    # open both files; the with block closes them automatically
    with open(sys.argv[1], 'r') as masterListFile, \
         open(sys.argv[2], 'r') as testListFile:

        # build master list
        # .strip() off '\n' new line
        # set to lower case so that 'Intrusion' and 'intrusion' compare equal
        masterList = [line.strip().lower() for line in masterListFile]

        # run test
        for line in testListFile:
            term = line.strip().lower()
            if term in masterList:
                print(term, "in master list!")
                # perhaps grab your metadata using a like %%
            else:
                print("OH NO!", term, "not found!")


if __name__ == '__main__':
    main()

SAMPLE OUTPUT

OH NO! intrusion not found!
intrusion detection in master list!
OH NO! foo not found!
OH NO! bar not found!

There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really isn't that large), consider using a dict; I feel they are quicker, especially if you plan to query a database. Perhaps a dictionary structure could look like {term: information about term}.
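
A minimal sketch of that dictionary idea, assuming made-up metadata values (the 'category' field and the lookup helper are hypothetical and only there to show the shape of the structure):

```python
# Hypothetical metadata keyed by lower-cased term
term_info = {
    'intrusion detection': {'category': 'security'},
    'firewall': {'category': 'network'},
}

def lookup(term):
    """Return the metadata for term, or None if it is not in the master dict."""
    # Normalise the same way as the script above: strip whitespace, lower-case.
    # A dict lookup hashes the key instead of scanning every entry like a list.
    return term_info.get(term.strip().lower())

print(lookup('Intrusion Detection\n'))
print(lookup('intrusion'))
```

If only membership matters and there is no per-term information to store, a set of lower-cased terms gives the same lookup behaviour.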
