Best way to match a large list against strings in Python
I have a python list that contains about 700 terms that I would like to use as metadata for some database entries in Django. I would like to match the terms in the list against the entry descriptions to see if any of the terms match but there are a couple of issues. My first issue is that there are some multiword terms within the list that contain words from other list entries. An example is:
Intrusion
Intrusion Detection
I have not gotten very far with re.findall as it will match both Intrusion and Intrusion Detection in the above example. I would only want to match Intrusion Detection and not Intrusion.
Is there a better way to do this type of matching? I thought about trying NLTK, but it didn't look like it could help with this type of matching.
Edit:
So to add a little more clarity, I have a list of 700 terms such as firewall or intrusion detection. I would like to try to match these words in the list against descriptions that I have stored in a database to see if any match, and I will use those terms in metadata. So if I have the following string:
There are many types of intrusion detection devices in production today.
and if I have a list with the following terms:
Intrusion
Intrusion Detection
I would like to match 'intrusion detection', but not 'intrusion'. Really I would like to also be able to match singular/plural instances too, but I may be getting ahead of myself. The idea behind all of this is to take all of the matches and put them in a list, and then process them.
Comments (2)
If you need more flexibility when matching entry descriptions, you can combine nltk and re.
Let's say you have different descriptions of the same event, for example a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, etc.
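The answer's original code did not survive on this page, so the following is only a minimal sketch of the idea, assuming NLTK is installed and using its `PorterStemmer`. The helper names `stem_tokens` and `description_mentions` are hypothetical, introduced here for illustration. Both the term and the description are reduced to stemmed tokens, and a term counts as present when its stems appear as a contiguous run in the description.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    """Lowercase the text, split it into alphabetic tokens, and stem each one."""
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

def description_mentions(term, description):
    """Return True if the stemmed term occurs as a contiguous run of stemmed tokens."""
    needle = stem_tokens(term)
    if not needle:
        return False
    haystack = stem_tokens(description)
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))
```

Because matching happens on stems, "rewrite" should also be found in a description that only says "rewriting" or "rewrites". A stemmer can over-merge unrelated words, so it is worth spot-checking the results against the real term list.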
EDIT:
To see which of the terms caused the match:
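The code for this edit is also missing here. As a stand-in, and not the answerer's original approach, here is a plain `re` sketch that reports which terms matched while preferring the longest term, which addresses the Intrusion versus Intrusion Detection problem from the question. The function names are hypothetical, and the optional trailing "s" is only a crude assumption about plurals.

```python
import re

def build_pattern(terms):
    """Compile one alternation, longest terms first, with an optional plural 's'."""
    # Alternation is tried left to right, so longer terms must come first.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:%s)s?\b" % alternation, re.IGNORECASE)

def matched_terms(terms, description):
    """Return the list terms found in the description, in order of first appearance."""
    canonical = {t.lower(): t for t in terms}
    found = []
    for m in build_pattern(terms).finditer(description):
        text = m.group(0).lower()
        # Try the exact text first, then the text without a plural 's'.
        term = canonical.get(text) or canonical.get(text[:-1])
        if term and term not in found:
            found.append(term)
    return found
```

With the terms from the question, the description "There are many types of intrusion detection devices in production today." yields only "Intrusion Detection", because the longer alternative consumes the text before the shorter one is tried.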
This question is unclear, but from what I understand you have a master list of terms, say one term per line. Next you have a list of test data, where some of the test data will be in the master list and some won't. You want to see if each piece of test data is in the master list and, if it is, perform a task.
Assuming your master list looks like this
and your test data looks like this
this simple script should lead you in the right direction
SAMPLE OUTPUT
There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really isn't that large), consider using a dict; I feel they are quicker, especially if you plan to query a database. Perhaps a dictionary structure could look like {term: information about term}.
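The `{term: information about term}` idea can be sketched as follows. The sample terms come from the question, but the `category` values and the `lookup` helper are made up for illustration. Keys are stored in lowercase so that lookups are case-insensitive.

```python
# Hypothetical master data: lowercase term -> metadata about that term.
master = {
    "firewall": {"category": "network"},
    "intrusion detection": {"category": "monitoring"},
}

def lookup(test_data, master):
    """Return {term: info} for every test item that is a key in the master dict."""
    hits = {}
    for item in test_data:
        key = item.strip().lower()
        if key in master:  # average O(1) membership test on a dict
            hits[key] = master[key]
    return hits
```

This only handles exact matches of whole items. Finding terms inside longer free-text descriptions still needs a search step such as a regular expression.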