Fuzzy matching a string (URL contents) against a large body of text in Python

Posted 2024-11-10 03:27:58

I have a list of company names, and I have a list of URLs mentioning company names.

The end goal is to look at each URL and find out how many of the companies mentioned on it are in my list.

Example URL: http://www.dmx.com/about/our-clients

Each URL will be structured differently, so I don't have a good way to do a regex search and create individual strings for each company name.

I'd like to build a for loop to search for each company from the list in the entire contents of the URL. But it seems like Levenshtein is better suited to two smaller strings than to a short string and a large body of text.

Where should this beginner be looking?


Comments (2)

提笔书几行 2024-11-17 03:27:58

It doesn't sound to me like you need any "fuzzy" matching. And I'm assuming that when you say "url" you mean "webpage at the address pointed to by the url." Just use Python's built-in substring search functionality:

>>> import urllib.request
>>> webpage = urllib.request.urlopen('http://www.dmx.com/about/our-clients')
>>> webpage_text = webpage.read().decode('utf-8', errors='replace')
>>> webpage.close()
>>> for name in ['Caribou Coffee', 'Express', 'Sears']:
...     if name in webpage_text:
...         print(name, "found!")
... 
Caribou Coffee found!
Express found!
>>> 

If you are worried about string capitalization mismatches, just convert it all to uppercase.

>>> webpage_text = webpage_text.upper()
>>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
...     if name in webpage_text:
...         print(name, 'found!')
... 
CARIBOU COFFEE found!
EXPRESS found!

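Putting this together with the original goal of counting how many companies from the list appear on a given page, a minimal Python 3 sketch might look like the following (the count_companies helper and the companies list are illustrative additions, not part of the original answer):

import urllib.request

def count_companies(url, company_names):
    # Fetch the page and return the names that appear in its text (case-insensitive).
    with urllib.request.urlopen(url) as response:
        page_text = response.read().decode('utf-8', errors='replace').upper()
    return [name for name in company_names if name.upper() in page_text]

companies = ['Caribou Coffee', 'Express', 'Sears']
found = count_companies('http://www.dmx.com/about/our-clients', companies)
print(len(found), 'of', len(companies), 'companies found:', found)
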
末骤雨初歇 2024-11-17 03:27:58

I would add to senderle's answer that it may make sense to normalize your names somehow (e.g., remove all special characters), and then apply the same normalization to webpage_text and to your list of strings.

def normalize_str(some_str):
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c,"")
    return some_str
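
For example, a small sketch of that idea (assuming webpage_text is the fetched page text from the first answer and client_names is a hypothetical list of the company names to check):

normalized_text = normalize_str(webpage_text)
for client_name in client_names:
    if normalize_str(client_name) in normalized_text:
        print(client_name, 'found!')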

If this isn't good enough you can go to difflib and do something like:

import difflib

for client_name in normalized_client_names:
    # get_close_matches expects a list of candidate strings, not one big string
    closest_client = difflib.get_close_matches(client_name, webpage_text.split(), 1, 0.8)
    if closest_client:
        print(client_name, "found as", closest_client[0])

The arbitrary Ratcliff/Obershelp ratio cutoff of 0.8 that I chose may be too lenient or too strict; play with it a bit.
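
As a more complete sketch of that approach (assuming normalized_client_names and a normalized webpage_text as above; the word_ngrams helper is an illustrative addition, included because get_close_matches compares whole strings, so multi-word names match better against word windows of the same length than against single words):

import difflib

def word_ngrams(text, n):
    # All runs of n consecutive words in the text, joined back into strings.
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

for client_name in normalized_client_names:
    candidates = word_ngrams(webpage_text, len(client_name.split()))
    closest = difflib.get_close_matches(client_name, candidates, n=1, cutoff=0.8)
    if closest:
        print(client_name, 'found as', closest[0])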
