Fuzzy matching strings in a large body of text (url) in Python
I have a list of company names, and I have a list of URLs that mention company names.
The end goal is to look at each URL and find out how many of the companies mentioned there are in my list.
Example URL: http://www.dmx.com/about/our-clients
Each URL will be structured differently, so I don't have a good way to do a regex search and create individual strings for each company name.
I'd like to build a for loop that searches the entire contents of the URL for each company in my list. But it seems like Levenshtein is better suited to comparing two short strings than a short string against a large body of text.
Where should this beginner be looking?
Comments (2)
It doesn't sound to me like you need any "fuzzy" matching. And I'm assuming that when you say "url" you mean "webpage at the address pointed to by the url." Just use Python's built-in substring search functionality, the "in" operator.
If you are worried about capitalization mismatches, just convert both the page text and the company names to uppercase before comparing.
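A minimal sketch of that idea, assuming the page has already been downloaded and reduced to plain text. The names find_companies, webpage_text and company_names, and the sample data, are made up for illustration:

```python
def find_companies(webpage_text, company_names):
    """Return the company names that occur verbatim in the page text."""
    # Uppercase both sides so "ACME Corp" also matches "Acme Corp".
    haystack = webpage_text.upper()
    return [name for name in company_names if name.upper() in haystack]


# Hypothetical sample data standing in for a fetched page and a real list.
webpage_text = "Our clients include Acme Corp, Globex and Initech."
company_names = ["ACME Corp", "Initech", "Umbrella"]

found = find_companies(webpage_text, company_names)
print(found, len(found))
```

The length of the returned list is the count the question asks for. Note that a plain substring test also matches inside longer words, so very short company names may produce false positives.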
I would add to senderle's answer that it may make sense to normalize your names somehow (e.g., remove all special characters), and then apply the same normalization to both webpage_text and your list of strings.
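One way that normalization might look, as a rough sketch. The normalize helper and the sample strings are hypothetical, and treating everything outside ASCII letters, digits and whitespace as a "special character" is an assumption that you may need to relax for non-English names:

```python
import re


def normalize(text):
    """Lowercase, drop special characters, and collapse runs of whitespace."""
    # Assumption: anything that is not an ASCII letter, digit or whitespace
    # counts as a "special character" and is removed.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


# Hypothetical sample data.
webpage_text = "We work with AT&T, Inc. and Ben & Jerrys!"
company_names = ["AT&T Inc.", "Ben & Jerry's", "Acme"]

page = normalize(webpage_text)
matches = [name for name in company_names if normalize(name) in page]
print(matches)
```

Because the same function is applied to both sides, differences in punctuation such as "AT&T, Inc." versus "AT&T Inc." no longer block a match.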
If this isn't good enough, you can turn to difflib for approximate matching.
The cutoff I chose, a (Ratcliff/Obershelp) ratio of 0.8, is arbitrary and may be too lenient or too strict; play with it a bit.
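The original snippet is not reproduced here, so the following is only a sketch of one way to apply difflib to this problem, not necessarily the approach the answer had in mind. It slides a window of as many words as the company name across the page text and scores each window with difflib.SequenceMatcher. The fuzzy_find helper and the sample text are hypothetical:

```python
import difflib


def fuzzy_find(name, text, cutoff=0.8):
    """Return the best-matching word window in text, or None if below cutoff."""
    words = text.split()
    n = len(name.split())  # compare against windows of the same word count
    best_score, best_window = 0.0, None
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        # ratio() is the Ratcliff/Obershelp similarity, between 0.0 and 1.0.
        score = difflib.SequenceMatcher(None, name.lower(), window.lower()).ratio()
        if score > best_score:
            best_score, best_window = score, window
    return best_window if best_score >= cutoff else None


# Hypothetical page text containing a misspelled company name.
text = "Clients include Acme Corporaton and Globex Inc among others"
print(fuzzy_find("Acme Corporation", text))
print(fuzzy_find("Initech", text))
```

Comparing the name only against windows of similar size sidesteps the concern from the question about scoring a short string against a whole page. It does run one comparison per window per company, so it can be slow on long pages with long lists; running the exact substring check first and falling back to this only for names that were not found keeps the cost down.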