How to club similar words under one key in a Python dictionary

Posted 2025-02-13 11:29:31


I am learning text processing and am stuck. I have a dataset from a survey about which websites users spend their money on while shopping.

I have data of the form: amazon, amzn, amazon prime, amazon.com, amzn prim, etc.

Now I want to create a dictionary which clubs the similar values under one key, like

dict1 = {"AMAZON":["amazon","amzn","amazon prime","amazon.com", "amzn prim"], 
         "Coursera" : ["coursera","corsera","coursera.org","coursera.com"]} 

The main goal of the dictionary is to create another column in the dataframe with the canonical key for each website name.

I have tried fuzzywuzzy, but I am unable to understand how to club the similar values under one key.

Thanks :)


Comments (1)

终难愈 2025-02-20 11:29:31


Your task is to associate a response str with the correct key from a list of pre-defined keys. Therefore, you need to compare a given response (e.g. "amazon.com") with each of your keys (["AMAZON", "Coursera"]) and pick the key that displays the highest similarity with respect to some metric.

1. Manual choice of keys

Choosing a suitable metric on strings is the tricky part, as such metrics merely treat them as arrays of characters. No consideration is given to the semantics of the words, and no domain knowledge is involved. In turn, I'd suggest manual matching if the number of keys is low. Python's built-in string class str provides lower() to make the comparison case-invariant, and the in operator checks for membership of a substring. This is a good starting point.

def getKey(website: str) -> str:
    # case-insensitive comparison
    website = website.lower()

    # 1. handcrafted key-pattern matching
    refDict = dict()
    refDict['AMAZON']   = ["amzn", "amazon"]
    refDict['COURSERA'] = ["coursera", "corsera"]

    for k, v in refDict.items():
        if any(pattern in website for pattern in v):
            return k

    # if no match was found
    return ""

For a Pandas frame this yields

import pandas as pd

df = pd.DataFrame({'website': ['amazon', 'amzn.com', 'coursera', 'corsera', 'cosera', 'save-the-amazon-forest.org']})
df['key'] = [getKey(website) for website in df['website']]
df

(image: the resulting dataframe — 'cosera' receives no key, while 'save-the-amazon-forest.org' is wrongly matched to AMAZON)

As you can see, this string comparison is inherently brittle, too. In addition, the order of the keys in the dictionary can matter whenever the pattern lists overlap: the first key whose pattern matches wins. Note that dictionaries maintain insertion order by default only since Python 3.6; if you use an earlier version, use collections.OrderedDict to keep control of the order.

If you can force users to write the proper URL, you might want to consider extracting the domain from the string via a regular expression and using it directly as the key. This would save you the work of listing keys and matching patterns manually in getKey() altogether. A sketch of this idea follows.

2. Automatic keys via unsupervised learning

Since the additional requirement was raised that the algorithm needs to find the keys in an unsupervised fashion, the following code invokes the Edit (Levenshtein) distance and clustering to do exactly that.

import pandas as pd
import numpy as np
from sklearn.cluster import AffinityPropagation
from nltk.metrics.distance import edit_distance


# example input
websiteList = ["amazon", "apple.com", "amzn", "amazon prime" , "amazon.com", "cosera", "apple inc.", "amzn prim", "coursera", 
               "coursera", "coursera.org", "coursera.com", "StackOverFlow.com", "stackoverflow", "stack-overflow.com", 
               "corsing", "apple", "AAPL"]
websiteListRaw = list(websiteList) # copy for later

df = pd.DataFrame({'website' : websiteList})


def minEditDistance(s1, s2):
    '''Minimum edit distance across all pairwise input (sub-)strings'''
    ptrList_1 = s1.split(' ') + [s1]
    ptrList_2 = s2.split(' ') + [s2]
    
    return min([edit_distance(x_i, x_j) for x_i in ptrList_1 for x_j in ptrList_2])

# lowercase
websiteList = [site.lower() for site in websiteList]
N = len(websiteList)

# delete common suffixes (str.removesuffix requires Python 3.9+)
suffixList = ['.com', '.org', 'co.uk', '.eu']
for i in range(N):
    for suffix in suffixList:
        websiteList[i] = websiteList[i].removesuffix(suffix)

# replace special characters
specialSymbolList = ['/', '-', '*']
for i in range(N):
    for symbol in specialSymbolList:
        websiteList[i] = websiteList[i].replace(symbol, ' ')
    
# similarity = -1 * distance
responses = np.array(websiteList) 
minEditSimilarity = (-1.0)*np.array([[minEditDistance(w1,w2) for w1 in responses] for w2 in responses])

# clustering
affprop = AffinityPropagation(affinity="precomputed", damping=0.54, random_state=77)
affprop.fit(minEditSimilarity)

# map each raw response to its cluster exemplar key
matchDict = dict()
for cluster_id in np.unique(affprop.labels_):
    exemplar = responses[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(responses[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
    # assign
    for resp in cluster:
        match_indices = [i for i, name in enumerate(websiteList) if name==resp]
        for resp_index in match_indices:
            matchDict[websiteListRaw[resp_index]] = exemplar.split(' ')[0].upper()
        
    print('exemplar: ', exemplar)
        
# add learned keys
df['key'] = df['website'].replace(matchDict)
df

(image: the resulting dataframe with the learned key column)
