How to club similar words under one key in a Python dictionary
I am learning text processing and am stuck.
I have a survey dataset about which website a user spends money on while shopping.
The data looks like this: amazon, amzn, amazon prime, amazon.com, amzn prim, etc.
Now I want to create a dictionary that clubs the similar values under one key, like
dict1 = {"AMAZON": ["amazon", "amzn", "amazon prime", "amazon.com", "amzn prim"],
         "Coursera": ["coursera", "corsera", "coursera.org", "coursera.com"]}
The main goal of the dictionary is to create another column in the dataframe that holds the key for each website name.
I have tried fuzzywuzzy but am unable to understand how to club the similar values under one key.
Thanks :)
Comments (1)
Your task is to associate a response str with the correct key str from a list of pre-defined keys. Therefore, you need to compare a given str response (e.g. "amazon.com") with each of your keys ["AMAZON", "Coursera"] and pick the key that displays the highest similarity with respect to some metric.

1. Manual choice of keys
Choosing a suitable metric on strings is the tricky part, as such metrics merely treat strings as arrays of characters. No consideration is given to the semantics of the words, and no domain knowledge is involved. In turn, I'd suggest manual matching if the number of keys is low. Python's built-in string class str provides lower() to make the comparison case-invariant, and the in operator checks for membership of a substring. This is a good starting point, and it applies directly to a column of a pandas frame.
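A minimal sketch of such a manual matcher, combining lower() and the in operator. The function name getKey() comes from this answer, but its body, the PATTERNS table, and the sample responses are assumptions made up for illustration, not the original code.

```python
# Hypothetical pattern table: each key maps to lowercase substrings
# that identify it. Extend it by hand as new spellings show up.
PATTERNS = {
    "AMAZON": ["amazon", "amzn"],
    "Coursera": ["coursera", "corsera"],
}


def getKey(response, patterns=PATTERNS, default=None):
    """Return the first key one of whose substrings occurs in the response."""
    text = response.strip().lower()  # case-invariant comparison
    for key, substrings in patterns.items():  # dictionary order decides ties
        if any(s in text for s in substrings):
            return key
    return default  # nothing matched


responses = ["amazon", "Amzn ", "Amazon Prime", "amazon.com",
             "amzn prim", "corsera", "ebay"]
keys = [getKey(r) for r in responses]
print(keys)
```

For a pandas frame, the same function could be applied to a column with Series.map, for example df["key"] = df["website"].map(getKey), where both column names are placeholders.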
Note that this kind of string comparison is inherently brittle, too. In addition, the order of the keys in the dictionary matters. Dictionaries are only guaranteed to maintain insertion order since Python 3.7 (CPython 3.6 already did so as an implementation detail). If you use an earlier version, use collections.OrderedDict to keep control of the order.
If you can enforce that users write the proper URL, you might want to consider extracting it from the string via a regular expression and using it directly as the key. This would save you the time to list keys and matching patterns manually in getKey() altogether. It is presented here.

2. Automatic keys via unsupervised learning
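One simple way keys could be derived without listing them by hand is to group responses greedily by string similarity. The sketch below assumes the standard library's difflib.SequenceMatcher as the similarity measure; the helper names, the 0.6 threshold, and the choice of the first member as the key are all illustrative assumptions, not something prescribed by this answer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def group_similar(responses, threshold=0.6):
    """Greedily group responses; the first member of a group becomes its key.

    A response joins the first group that contains at least one member
    whose similarity to it reaches the threshold (assumed value: 0.6).
    """
    groups = {}
    for raw in responses:
        text = raw.strip().lower()
        for members in groups.values():
            if any(similarity(text, m) >= threshold for m in members):
                if text not in members:
                    members.append(text)
                break
        else:  # no existing group was similar enough
            groups[text] = [text]
    return groups


responses = ["amazon", "amzn ", "amazon prime", "amazon.com ", "amzn prim",
             "coursera", "corsera", "coursera.org", "coursera.com"]
groups = group_similar(responses)
for key, members in groups.items():
    print(key, members)
```

The result depends on both the threshold and the order of the input, so the groups should be checked by hand before they are used to fill a new column.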