如何绘制最“相似”的地图python 中从一个列表到另一个列表的字符串？

发布于 2024-12-20 00:25:53 字数 2319 浏览 2 评论 0原文

给出了两个包含字符串的列表。

- 不仅用英语书写，而且始终使用拉丁字母。
另一个列表包含其中可能出现第一个列表中的字符串（组织）的大部分完整地址。

一个例子：

addresses = [
             "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
             "Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
             "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
             "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",    
             "Computer Science Department, University of California, Santa Barbara, USA 93106",
             "Fraunhofer IAIS, Sankt Augustin, Germany",
             "Department of Computer Science, Cornell University, Ithaca, NY",
             "University of Wisconsin-Madison"
            ]

organisations = [
                 "Catholic University of Leuven"
                 "Fraunhofer IAIS"
                 "Cornell University of Ithaca"
                 "Tübingener Max Plank Institut"
                ]

正如你所看到的，所需的映射是：

"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of  Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS 
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --

我的想法是使用某种“距离算法”来计算字符串的相似度。因为我不能仅仅通过执行 if address in Organization 来查找地址中的组织，因为它在不同地方的写法可能略有不同。所以我的第一个猜测是使用 difflib 模块。特别是 difflib.get_close_matches() 函数，用于从组织列表中为每个地址选择最接近的字符串。但我不太相信结果会足够准确。尽管我不知道应该将比率设置为多高，但该比率似乎是相似性度量。

在花费太多时间尝试 difflib 模块之前，我想询问这里更有经验的人，这是否是正确的方法，或者是否有更合适的工具/方法来解决我的问题。谢谢！

PS：我不需要最佳解决方案。

原文

Given are two lists containing strings.

One contains the name of organisations (mostly universitys) all around the world - not only written in english but always using latin alphabet.
The other list contains mostly full addresses in which strings (organisations) from the first list may occur.

An Example:

addresses = [
             "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
             "Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
             "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
             "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",    
             "Computer Science Department, University of California, Santa Barbara, USA 93106",
             "Fraunhofer IAIS, Sankt Augustin, Germany",
             "Department of Computer Science, Cornell University, Ithaca, NY",
             "University of Wisconsin-Madison"
            ]

organisations = [
                 "Catholic University of Leuven"
                 "Fraunhofer IAIS"
                 "Cornell University of Ithaca"
                 "Tübingener Max Plank Institut"
                ]

As you can see the desired mapping would be:

"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of  Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes     Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS 
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --

My thinking was to use some kind of "disctance- algorithm" to calculate the similarity of the strings. Since I cannot just look for an organisation in an address just by doing if address in organisation because it could be written slightly differently at in different places. So my first guess was using the difflib module. Especially the difflib.get_close_matches() function for selecting for every address the closest string from the organisations list. But I am not quite confident, that the results will be accurate enough. Although I don't know how high I should set the ratio which seams to be a similarity measure.

Before spending too much time in trying the difflib module I thought of asking the more experienced people here, if this is the right approach or if there is a more suited tool / way to solve my problem. Thanks!

PS: I don't need an optimal solution.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

<逆流佳人身旁 2024-12-27 00:25:53

使用以下作为字符串距离函数（而不是普通的 levenshtein 距离）：

def strdist(s1, s2):
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)

    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words

然后使用 Munkres 分配算法此处显示，因为似乎有一个组织和地址之间 1:1 映射。

Use the following as your string distance function (instead of plain levenshtein distance):

def strdist(s1, s2):
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)

    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words

Then use the Munkres assignment algorithm shown here since there appears to be a 1:1 mapping between organisations and adresses.

回复收藏 0 原文