Want to add the leftover strings alongside the matched strings

Posted on 2025-02-03 21:43:41

Below is my example code:

from fuzzywuzzy import fuzz
import json
from itertools import zip_longest

synonyms = open("synonyms.json","r")
synonyms = json.loads(synonyms.read())

vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11th Generation)",
               "hard drive", "ddr 8gb", "something1", "something2",
               "something3", "HT (100W) DDR4-2400"]

buyer_data = ["i7 processor 12 generation","corei7:latest technology"]
vendor = []
buyer = []
for item,value in synonyms.items():
    for k,k2 in zip_longest(vendor_data,buyer_data):
        for v in value:
            if fuzz.token_set_ratio(k,v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item+" "+k)
            else:
                # expected to catch only the "something" strings here, but that is not what happens
                pass

            if fuzz.token_set_ratio(k2,v) > 70:
                if item in k2:
                    buyer.append(k2)
                else:
                    buyer.append(item+" "+k2)

vendor = list(set(vendor))
buyer = list(set(buyer))
vendor,buyer

Note: a "something" string can be anything, such as "battery" or "display".

synonyms.json:

{
"processor": ["corei5", "core", "corei7", "i5", "i7", "ryzen5", "i5 processor", "i7 processor",
              "processor i5", "processor i7", "core generation", "core gen"],

"ram": ["DDR4", "memory", "DDR3", "DDR", "DDR 8gb", "DDR 8 gb", "DDR 16gb", "DDR 16 gb",
        "DDR 32gb", "DDR 32 gb", "DDR4-"],

"ssd": ["solid state drive", "solid drive"],

"hdd": ["Hard Drive"]
}

What do I need?

I want to add all the "something" strings to the vendor list dynamically.

Note: a "something" string could be anything in the future.

I want the vendor list to also contain every string that does not reach a fuzz score above 70 against any synonym. In other words, I want to keep the left-out data as well.

For example:

Current output:

['processor Corei5 :1135G7 (11th Generation)',
 'i7 processor',
 'ram HT (100W) DDR4-2400',
 'ram ddr 8gb',
 'hdd hard drive',
 'ssd solid state']

Expected output:

['processor Corei5 :1135G7 (11th Generation)',
 'i7 processor',
 'ram HT (100W) DDR4-2400',
 'ram ddr 8gb',
 'hdd hard drive',
 'ssd solid state',
 'something1',
 'something2',
 'something3']  # the "something" strings need to be added to the vendor list dynamically

What silly mistake am I making? Thank you.

玻璃人 2025-02-10 21:43:41

Here's my attempt:

from fuzzywuzzy import process, fuzz

synonyms = {'processor': ['corei5', 'core', 'corei7', 'i5', 'i7', 'ryzen5', 'i5 processor', 'i7 processor', 'processor i5', 'processor i7', 'core generation', 'core gen'], 'ram': ['DDR4', 'memory', 'DDR3', 'DDR', 'DDR 8gb', 'DDR 8 gb', 'DDR 16gb', 'DDR 16 gb', 'DDR 32gb', 'DDR 32 gb', 'DDR4-'], 'ssd': ['solid state drive', 'solid drive'], 'hdd': ['Hard Drive']}
vendor_data = ['i7 processor', 'solid state', 'Corei5 :1135G7 (11th Generation)', 'hard drive', 'ddr 8gb', 'something1', 'something2', 'something3', 'HT (100W) DDR4-2400']
buyer_data = ['i7 processor 12 generation', 'corei7:latest technology']

def find_synonym(s: str, min_score: int = 60):
    results = process.extractBests(s, choices=synonyms, score_cutoff=min_score)
    if not results:
        return None
    return results[0][-1]

def process_data(l: list, min_score: int = 60):
    matches = []
    no_matches = []
    for item in l:
        syn = find_synonym(item, min_score=min_score)
        if syn is not None:
            new_item = f'{syn} {item}' if syn not in item else item
            matches.append(new_item)
        elif any(fuzz.partial_ratio(s, item) >= min_score for s in synonyms.keys()):
            # one of the category names (the keys of synonyms) already appears in the item string
            matches.append(item)
        else:
            no_matches.append(item)
    return matches, no_matches

For process_data(vendor_data) we get:

(['i7 processor',
  'ssd solid state',
  'processor Corei5 :1135G7 (11th Generation)',
  'hdd hard drive',
  'ram ddr 8gb',
  'ram HT (100W) DDR4-2400'],
 ['something1', 'something2', 'something3'])

And for process_data(buyer_data):

(['i7 processor 12 generation', 'processor corei7:latest technology'], [])

I had to lower the cut-off score to 60 to also get a result for ddr 8gb. The process_data function returns two lists: one with the items that matched a word from the synonyms dict, and one with the items that had no match. If you want exactly the output listed in your question, just concatenate the two lists like this:

matches, no_matches = process_data(vendor_data)
matches + no_matches  # ['i7 processor', 'ssd solid state', 'processor Corei5 :1135G7 (11th Generation)', 'hdd hard drive', 'ram ddr 8gb', 'ram HT (100W) DDR4-2400', 'something1', 'something2', 'something3']
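
The matched/unmatched split can also be illustrated without fuzzywuzzy. The following is only a minimal sketch: it swaps in difflib.SequenceMatcher from the standard library as the scorer, so its scores are not the same as fuzz ratios. The names similarity, best_category and split_matches, the shortened synonym table, and the 0.6 cutoff are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher

# Shortened, hypothetical synonym table (not the full synonyms.json)
SYNONYMS = {
    "processor": ["i7", "i7 processor", "corei5"],
    "ram": ["DDR 8gb", "DDR4"],
}

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for a fuzz ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_category(item, synonyms, cutoff=0.6):
    """Return the category with the highest-scoring synonym, or None below cutoff."""
    best_key, best_score = None, 0.0
    for key, words in synonyms.items():
        score = max(similarity(item, w) for w in words)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= cutoff else None

def split_matches(items, synonyms, cutoff=0.6):
    """Split items into (tagged matches, leftovers that matched nothing)."""
    matches, no_matches = [], []
    for item in items:
        key = best_category(item, synonyms, cutoff)
        if key is None:
            no_matches.append(item)
        else:
            # Prefix the category unless the item already contains it
            matches.append(item if key in item else f"{key} {item}")
    return matches, no_matches

matches, no_matches = split_matches(["i7 processor", "ddr 8gb", "something1"], SYNONYMS)
print(matches + no_matches)
```

Because every item is scored exactly once and ends up in exactly one of the two lists, nothing is silently dropped, which is the property the question asks for.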

深海蓝天 2025-02-10 21:43:41

I have tried to come up with a decent answer (certainly not the cleanest one):

import json
from itertools import zip_longest

from fuzzywuzzy import fuzz

synonyms = open("synonyms.json", "r")
synonyms = json.loads(synonyms.read())

vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11thGeneration)", "hard drive", "ddr 8gb", "something1",
               "something2",
               "something3", "HT (100W) DDR4-2400"]

buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []

for k, k2 in zip_longest(vendor_data, buyer_data):
    has_matched = False
    for item, value in synonyms.items():
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
                if has_matched or k2 is None:
                    break
                else:
                    has_matched = True

            if fuzz.token_set_ratio(k2, v) > 70:
                if item in k2:
                    buyer.append(k2)
                else:
                    buyer.append(item + " " + k2)
                if has_matched or k is None:
                    break
                else:
                    has_matched = True
        else:
            continue  # match not found
        break  # match is found
    else:  # only evaluates on normal loop end
        # Only something strings
        # do something with the new input values
        continue  


vendor = list(set(vendor))
buyer = list(set(buyer))

I hope you can achieve what you want with this code. Check the docs if you don't know what a for/else loop does. TL;DR: the else clause executes when the loop terminates normally (not with a break). Note that I put the synonyms loop inside the data loop. This is because we cannot know in advance which synonym group a piece of data belongs to, and sometimes the vendor entry is a processor while the buyer entry is memory. Also note that I have assumed an item cannot match more than once. If it can, you would need a more advanced check (for example, keep a counter and break when it reaches 2).
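
As a small illustration of the for/else behaviour described above, here is a minimal sketch. It uses plain substring matching instead of fuzzy scoring to keep the example short, and the tiny synonyms dict and the tagged/leftover names are assumptions made for the example only.

```python
# Hypothetical, shortened data for the illustration
synonyms = {"ram": ["ddr"], "hdd": ["hard drive"]}
vendor_data = ["ddr 8gb", "hard drive", "something1"]

tagged, leftover = [], []
for item in vendor_data:
    for category, words in synonyms.items():
        if any(word in item.lower() for word in words):
            tagged.append(f"{category} {item}")
            break              # match found: the else clause below is skipped
    else:
        leftover.append(item)  # loop finished without a break: nothing matched

print(tagged)    # ['ram ddr 8gb', 'hdd hard drive']
print(leftover)  # ['something1']
```

The else branch belongs to the inner for loop, not to an if, so it runs once per item and only when no category produced a break.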

EDIT:
I took another look at the question and came up with maybe a better answer:

from fuzzywuzzy import process

v_dict = dict()
for spec in vendor_data:
    for item, choices in synonyms.items():
        # extractOne returns a (best_choice, score) tuple
        if process.extractOne(spec, choices)[1] > 70:
            v_dict[spec] = item
            break
    else:
        # no category scored above 70
        v_dict[spec] = "Something new"

This code maps each string to a category, for example {'i7 processor': 'processor', 'solid state': 'ssd', 'Corei5 :1135G7 (11thGeneration)': 'processor', 'hard drive': 'ssd', 'ddr 8gb': 'ram', 'something1': 'Something new', 'something2': 'Something new', 'something3': 'Something new', 'HT (100W) DDR4-2400': 'ram'}. Note that 'hard drive' ends up as 'ssd' in this output, presumably because the loop stops at the first category that scores above 70 and the synonym 'solid drive' is close enough. You can replace "Something new" with whatever you like. You could also set v_dict[spec] = 0 on a match and v_dict[spec] = 1 on no match, and then sort the keys by those values:

it = iter(v_dict.values())
print(sorted(v_dict.keys(), key=lambda x: next(it)))

With the 0/1 values, this gives the wanted result (more or less): all the recognised items come first, followed by all the unrecognised items. You could do some more advanced sorting on this dict if you want. I think this code gives you enough flexibility to reach your goal.
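
The iterator-based key function relies on sorted calling the key once per element, in iteration order. A lookup by key does not depend on that. The sketch below is one possible variant, using a small hand-written v_dict as an assumption; the UNKNOWN constant is a hypothetical name for the "Something new" marker.

```python
UNKNOWN = "Something new"

# Hypothetical mapping, shaped like the v_dict built above
v_dict = {
    "i7 processor": "processor",
    "something1": UNKNOWN,
    "ddr 8gb": "ram",
    "something2": UNKNOWN,
}

# False sorts before True, and sorted() is stable, so recognised items
# keep their original relative order and unrecognised items move to the end.
ordered = sorted(v_dict, key=lambda spec: v_dict[spec] == UNKNOWN)
print(ordered)  # ['i7 processor', 'ddr 8gb', 'something1', 'something2']
```

Sorting on a boolean also avoids comparing the category strings themselves, where the capital "S" of "Something new" would sort before lowercase category names.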

眼藏柔 2025-02-10 21:43:41

If I understand correctly, what you are trying to do is match keywords specified by a customer and/or vendor against a predefined database of keywords you have.

First, I would highly recommend using a reversed mapping of the synonyms (synonym to category), so that lookups are simpler and faster, especially as the dataset grows.

Second, considering the fuzzywuzzy API, it looks like you simply want the best match, so extractOne is a solid choice for that.

Now, extractOne returns the best match and a score:

>>> process.extractOne("cowboys", choices)
    ("Dallas Cowboys", 90)

I would split the algorithm into two parts:

A generic part that simply gets the best match, which should always exist (even if it's not a great one).
A filter, where you can adjust the sensitivity of the algorithm based on the criteria of your application. This sensitivity threshold sets the minimal match quality. If the best match falls below the threshold, just use a default category such as "untagged" (the code below uses 'general').
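
The two parts listed above can be sketched in isolation. This is only an illustration under stated assumptions: difflib.SequenceMatcher from the standard library stands in for fuzzywuzzy, so scores are in the range 0 to 1 rather than 0 to 100, and REVERSED, best_match, tag_for and the threshold values are hypothetical names and numbers chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical reversed mapping: synonym -> category
REVERSED = {"hard drive": "hdd", "solid state drive": "ssd", "ddr 8gb": "ram"}

def best_match(keyword, choices):
    """Generic part: always return the closest choice and its score in [0, 1]."""
    scored = (
        (choice, SequenceMatcher(None, keyword.lower(), choice.lower()).ratio())
        for choice in choices
    )
    return max(scored, key=lambda pair: pair[1])

def tag_for(keyword, reversed_synonyms, threshold=0.8, default="untagged"):
    """Filter part: accept the best match only if it clears the threshold."""
    choice, score = best_match(keyword, reversed_synonyms)
    return reversed_synonyms[choice] if score >= threshold else default

print(tag_for("Hard Drive", REVERSED))                  # hdd
print(tag_for("something1", REVERSED))                  # untagged
print(tag_for("solid state", REVERSED, threshold=0.7))  # ssd
```

Keeping the threshold out of best_match means the sensitivity can be tuned per call. With this stand-in scorer, "solid state" scores about 0.79 against "solid state drive", so it is tagged at a threshold of 0.7 but falls back to the default at 0.8.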

Here is the final code, which I think is very simple and easy to understand and expand:

import json
from fuzzywuzzy import process

def load_synonyms():
    with open('synonyms.json') as fin:
        synonyms = json.load(fin)

    # Reversing the map makes it much easier to look up
    reversed_synonyms = {}
    for key, values in synonyms.items():
        for value in values:
            reversed_synonyms[value] = key
    return reversed_synonyms

def load_vendor_data():
    return [
        "i7 processor",
        "solid state",
        "Corei5 :1135G7 (11thGeneration)",
        "hard drive",
        "ddr 8gb",
        "something1",
        "something2",
        "something3",
        "HT (100W) DDR4-2400"
    ]

def load_customer_data():
    return [
        "i7 processor 12 generation",
        "corei7:latest technology"
    ]

def get_tag(keyword, synonyms):
    THRESHOLD = 80
    DEFAULT = 'general'

    tag, score = process.extractOne(keyword, synonyms.keys())
    return synonyms[tag] if score > THRESHOLD else DEFAULT

def main():
    synonyms = load_synonyms()

    customer_data = load_customer_data()
    vendor_data = load_vendor_data()
    data = customer_data + vendor_data

    tags_dict = { keyword: get_tag(keyword, synonyms) for keyword in data }
    print(json.dumps(tags_dict, indent=4))

if __name__ == '__main__':
    main()

When running with the specified inputs, the output is:

{
    "i7 processor 12 generation": "processor",
    "corei7:latest technology": "processor",
    "i7 processor": "processor",
    "solid state": "ssd",
    "Corei5 :1135G7 (11thGeneration)": "processor",
    "hard drive": "hdd",
    "ddr 8gb": "ram",
    "something1": "general",
    "something2": "general",
    "something3": "general",
    "HT (100W) DDR4-2400": "ram"
}
