How to add the leftover (unmatched) strings alongside the matched strings
Below is my example code:
from fuzzywuzzy import fuzz
import json
from itertools import zip_longest

with open("synonyms.json", "r") as f:
    synonyms = json.load(f)

vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11th Generation)",
               "hard drive", "ddr 8gb", "something1", "something2",
               "something3", "HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []
for item, value in synonyms.items():
    for k, k2 in zip_longest(vendor_data, buyer_data):
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
            else:
                # didn't get only the "something" strings here!
                pass
            if fuzz.token_set_ratio(k2, v) > 70:
                if item in k2:
                    buyer.append(k2)
                else:
                    buyer.append(item + " " + k2)
vendor = list(set(vendor))
buyer = list(set(buyer))
vendor, buyer
Note: the "something" strings can be anything, such as "battery" or "display".
synonyms.json:
{
    "processor": ["corei5", "core", "corei7", "i5", "i7", "ryzen5", "i5 processor", "i7 processor", "processor i5", "processor i7", "core generation", "core gen"],
    "ram": ["DDR4", "memory", "DDR3", "DDR", "DDR 8gb", "DDR 8 gb", "DDR 16gb", "DDR 16 gb", "DDR 32gb", "DDR 32 gb", "DDR4-"],
    "ssd": ["solid state drive", "solid drive"],
    "hdd": ["Hard Drive"]
}
What do I need?
I want to add all the "something" strings to the vendor list dynamically.
NOTE: the "something" strings can be anything in the future.
I want the vendor list to also contain the strings that did not reach a fuzz score above 70. Basically, I want to keep the left-out data as well.
For example:
Current output:
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state']
Expected output:
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state',
'something1',
'something2',
'something3'] # the "something" strings need to be added to the vendor list dynamically
What silly mistake am I making? Thank you.
Comments (3)
Here's my attempt:
For process_data(vendor_data) we get:
And for process_data(buyer_data):
I had to lower the cut-off score to 60 to also get results for ddr 8gb. The process_data function returns two lists: one with the items that match words from the synonyms dict, and one with the items that have no match. If you want exactly the output listed in your question, just concatenate the two lists like this:
I have tried to come up with a decent answer (certainly not the cleanest one).
I hope you can achieve what you want with this code. Check the docs if you don't know what a for/else loop does. TL;DR: the else clause executes when the loop terminates normally (not with a break). Note that I put the synonyms loop inside the data loop. This is because we can't know for certain which synonym group an entry belongs to, and sometimes the vendor entry is a processor while the buyer entry is memory. Also note that I have assumed an item can't match more than once. If it can, you would need a more advanced check (for example, keep a counter and break when it reaches 2).
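The code this answer refers to is missing from the page. The for/else pattern it describes can be sketched roughly as follows. The names label_items and is_match are hypothetical, and the default matcher is a plain case-insensitive substring test so the example runs without fuzzywuzzy; a fuzzy comparison such as fuzz.token_set_ratio(item, word) > 70 could be supplied instead.

```python
def substring_match(item, word):
    """Placeholder matcher (an assumption): case-insensitive substring test."""
    return word.lower() in item.lower()


def label_items(items, synonyms, is_match=substring_match):
    """Hypothetical sketch of the data-loop-outside, synonyms-loop-inside layout."""
    labelled = []
    for item in items:                      # outer loop: the data
        for category, words in synonyms.items():   # inner loop: the synonym groups
            if any(is_match(item, w) for w in words):
                labelled.append(f"{category} {item}")
                break                       # assume an item matches at most one group
        else:
            # Runs only when the inner loop finished without a break,
            # i.e. no synonym group matched: keep the item unchanged.
            labelled.append(item)
    return labelled


print(label_items(["ddr 8gb", "something1"], {"ram": ["DDR"]}))
```

The else branch belongs to the inner for loop, not to an if, which is why unmatched items are added exactly once per item rather than once per failed comparison.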
EDIT:
I took another look at the question and came up with what may be a better answer:
This code matches each string to its type, for example:
{'i7 processor': 'processor', 'solid state': 'ssd', 'Corei5 :1135G7 (11thGeneration)': 'processor', 'hard drive': 'ssd', 'ddr 8gb': 'ram', 'something1': 'Something new', 'something2': 'Something new', 'something3': 'Something new', 'HT (100W) DDR4-2400': 'ram'}
You can change "Something new" to whatever you like. You could also set v_dict[spec] = 0 (on a match) and v_dict[spec] = 1 (on no match). You could then sort the dict, which would give the wanted result (more or less): all the recognised items come first, followed by all the unrecognised items. You could do some more advanced sorting on this dict if you want. I think this code gives you enough flexibility to reach your goal.
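The 0/1 flag idea can be illustrated with a short sketch. This is not the answer's code, which is missing here: flag_specs and the substring matcher are assumptions made for the example, and only the v_dict and spec names come from the text above.

```python
def substring_match(spec, word):
    """Placeholder matcher (an assumption); swap in a fuzzy comparison if needed."""
    return word.lower() in spec.lower()


def flag_specs(specs, synonyms, is_match=substring_match):
    """Hypothetical sketch: map each spec to 0 (recognised) or 1 (unrecognised)."""
    v_dict = {}
    for spec in specs:
        recognised = any(
            is_match(spec, w) for words in synonyms.values() for w in words
        )
        v_dict[spec] = 0 if recognised else 1
    return v_dict


synonyms = {"processor": ["i7"], "ram": ["ddr"]}
v_dict = flag_specs(["something1", "ddr 8gb", "something2", "i7 processor"], synonyms)

# sorted() is stable, so items keep their input order within each group.
ordered = sorted(v_dict, key=v_dict.get)
print(ordered)
```

Because Python's sort is stable, sorting on the flag alone is enough to move recognised items to the front without otherwise reordering them.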
If I understand correctly, what you are trying to do is match keywords specified by a customer and/or vendor against a predefined database of keywords you have.
First, I would highly recommend using a reversed mapping of the synonyms, so lookups are faster, especially as the dataset grows.
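A reversed mapping could be built along these lines. The function name reverse_synonyms and the choice to lowercase the keys are assumptions for this sketch, not something stated in the answer.

```python
def reverse_synonyms(synonyms):
    """Turn {category: [words]} into {word: category} for direct lookup.

    Keys are lowercased (an assumption) so lookups can be case-insensitive.
    If a word appears under two categories, the later one wins.
    """
    return {
        word.lower(): category
        for category, words in synonyms.items()
        for word in words
    }


reverse_map = reverse_synonyms({
    "ssd": ["solid state drive", "solid drive"],
    "hdd": ["Hard Drive"],
})
print(reverse_map["hard drive"])
```

With this layout, finding the category of a known synonym is a single dict lookup instead of a scan over every list in the original mapping.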
Second, considering the fuzzywuzzy API, it looks like you simply want the best match, so extractOne is a solid choice for that. Now, extractOne returns the best match and a score:
I would split the algorithm into two:
Here is the final code, which I think is very simple and easy to understand and expand:
When running with the specified inputs, the output is:
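The final code and its output are not preserved on this page, so the following is only a sketch of the approach described, under stated assumptions. The extract_one function below is a standard-library stand-in that mimics the return shape of fuzzywuzzy's process.extractOne when given a list of choices: a (choice, score) tuple, or None when nothing reaches score_cutoff. Its scores will differ from fuzzywuzzy's. The categorise function and the two-step split (find the best synonym, then look up its category) are hypothetical.

```python
from difflib import SequenceMatcher


def extract_one(query, choices, score_cutoff=70):
    """Stdlib stand-in, NOT fuzzywuzzy: best (choice, score) or None."""
    best = max(
        ((c, round(100 * SequenceMatcher(None, query.lower(), c.lower()).ratio()))
         for c in choices),
        key=lambda pair: pair[1],
        default=None,
    )
    return best if best and best[1] >= score_cutoff else None


def categorise(items, reverse_map, cutoff=70):
    """Hypothetical sketch: prefix matched items, keep unmatched ones as-is."""
    matched, leftover = [], []
    for item in items:
        # Step 1: find the single best synonym for this item.
        best = extract_one(item, list(reverse_map), score_cutoff=cutoff)
        if best is None:
            leftover.append(item)           # nothing close enough: keep unchanged
            continue
        # Step 2: look up the category of that synonym in the reversed mapping.
        category = reverse_map[best[0]]
        matched.append(item if category in item.lower() else f"{category} {item}")
    return matched + leftover


reverse_map = {"hard drive": "hdd", "solid state drive": "ssd"}
print(categorise(["hard drive", "something1"], reverse_map))
```

With fuzzywuzzy available, the stand-in could be replaced by a call such as process.extractOne(item, list(reverse_map), scorer=fuzz.token_set_ratio, score_cutoff=cutoff), which also returns None when no choice reaches the cut-off.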