Fuzzy matching of product names
I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.
For example "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS"
should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance plus some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but unfortunately not well enough.
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which are the relevant keywords. Consider for example three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, Soundex might help to distinguish the T and R in this case, but the names might as well be 400T and 400R); the first and the third are quite far from each other as strings, but are the same product.
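For concreteness, a plain dynamic-programming Levenshtein (the baseline, before any of my heuristics) makes the problem visible on these strings:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Lenovo T400", "Lenovo R400"))                   # 1  -- different products
print(levenshtein("Lenovo T400", "New Lenovo T-400, Core 2 Duo"))  # 17 -- same product
```

The edit distance ranks the wrong pair as the near-match.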
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
Any ideas or references are much appreciated.
Comments (11)
That is exactly the problem I'm working on in my spare time. What I came up with is:
based on keywords narrow down the scope of search:
in this case you could have some hierarchy:
type --> company --> model
so that you'd match "Digital Camera" for the type and "Canon" for the company, leaving you with a much narrower scope to search.
You could work this down even further by introducing product lines etc.
But the main point is, this probably has to be done iteratively.
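A minimal sketch of that iterative narrowing in Python, with a purely hypothetical hierarchy table:

```python
# hypothetical hierarchy: type -> company -> canonical models
HIERARCHY = {
    "digital camera": {
        "canon": ["Canon PowerShot A20 IS", "Canon PowerShot A10"],
    },
    "laptop": {
        "lenovo": ["Lenovo T400", "Lenovo R400"],
    },
}

def narrow(raw_name):
    """Return candidate models whose type/company keywords appear in raw_name."""
    text = raw_name.lower()
    out = []
    for ptype, companies in HIERARCHY.items():
        # if any type keyword appears in the name, skip all other types
        if any(t in text for t in HIERARCHY) and ptype not in text:
            continue
        for company, models in companies.items():
            if company in text:
                out.extend(models)
    return out

print(narrow("Digital Camera Canon PS A20IS"))  # only the Canon camera models remain
```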
We can use the Datadecision service for matching products.
It will allow you to automatically match your product data using statistical algorithms. This is done after defining a confidence threshold score.
All data that cannot be automatically matched will have to be manually reviewed through a dedicated user interface.
The online service uses lookup tables to store synonyms as well as your manual matching history. This allows you to improve the data matching automation next time you import new data.
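Without committing to any particular service, the same workflow -- a synonym lookup table plus a confidence threshold, with everything below the threshold routed to manual review -- can be sketched like this (the synonym table here is hypothetical):

```python
import difflib

# hypothetical lookup table, grown over time from manual matching history
SYNONYMS = {"ps": "powershot", "cam": "camera"}

def normalize(name):
    """Expand known synonyms before scoring."""
    return " ".join(SYNONYMS.get(tok, tok) for tok in name.lower().split())

def route(name, canonical, threshold=0.8):
    """Auto-match above the confidence threshold; queue the rest for manual review."""
    score = difflib.SequenceMatcher(None, normalize(name), normalize(canonical)).ratio()
    return ("auto", score) if score >= threshold else ("review", score)

print(route("Canon PS A20", "canon powershot a20"))  # synonym expansion makes this an auto match
```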
I worked on the exact same thing in the past. What I did was use an NLP method: a TF-IDF vectorizer to assign weights to each word. For example, in your case:
Canon PowerShot a20IS
This will tell your model which words to care about and which to ignore. I got quite good matches thanks to TF-IDF.
But note this: a20IS cannot be recognized as a20 IS; you may consider using some kind of regex to filter out such cases.
After that, you can use a numeric calculation like cosine similarity.
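A self-contained sketch of that approach, with a hand-rolled TF-IDF in place of a library vectorizer, followed by cosine similarity:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse TF-IDF vector (dict) per document."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    # +1 smoothing so words shared by all docs keep a small, nonzero weight
    idf = {tok: math.log(n / df[tok]) + 1.0 for tok in df}
    return [{tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

names = ["canon powershot a20 is",
         "new powershot a20 is from canon",
         "lenovo t400"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # high: same product
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared tokens
```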
I think this will boil down to distinguishing key words such as Lenovo from chaff such as New.
I would run some analysis over the database of names to identify key words. You could use code similar to that used to generate a word cloud.
Then I would hand-edit the list to remove anything that is obviously chaff; New, say, may be common, but it is not key.
Then you will have a list of key words that can be used to help identify similarities. You would associate the "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarities (literally, percentage of shared keywords).
Not a perfect solution by any stretch, but I don't think you are expecting one?
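A sketch of that comparison, using a hypothetical hand-edited keyword list and the shared-keyword percentage (Jaccard) described above:

```python
# hypothetical keyword list, hand-edited after a word-frequency pass over the database
KEYWORDS = {"canon", "powershot", "lenovo", "t400", "r400", "a20"}

def keyword_similarity(a, b):
    """Percentage of shared keywords: Jaccard over each name's keyword set."""
    ka = {w for w in a.lower().split() if w in KEYWORDS}
    kb = {w for w in b.lower().split() if w in KEYWORDS}
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

print(keyword_similarity("New Lenovo T400", "Lenovo T400 laptop"))  # 1.0: chaff ignored
```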
The key understanding here is that you do have a proper distance metric. That is in fact not your problem at all. Your problem is in classification.
Let me give you an example. Say you have 20 entries for the Foo X1 and 20 for the Foo Y1. You can safely assume they are two groups. On the other hand, if you have 39 entries for the Bar X1 and 1 for the Bar Y1, you should treat them as a single group.
Now, the distance X1 <-> Y1 is the same in both examples, so why is there a difference in the classification? That is because Bar Y1 is an outlier, whereas Foo Y1 isn't.
The funny part is that you do not actually need to do a whole lot of work to determine these groups up front. You simply do a recursive classification: you start out with one node per entry, and then add a supernode for the two closest nodes. In the supernode, store the best assumption, the size of its subtree, and the variation within it. Since many of your strings will be identical, you'll soon get large subtrees with identical entries. The recursion ends with a single supernode at the root of the tree.
Now map the canonical names against this tree. You'll quickly see that each will match an entire subtree. Now, use the distances between these trees to pick the distance cutoff for that entry. If you have both Foo X1 and Foo Y1 products in the database, the cut-off distance will need to be lower to reflect that.
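A naive sketch of that bottom-up grouping: single-linkage agglomerative merging with a distance cutoff, using difflib's ratio as a stand-in for whatever metric you prefer:

```python
import difflib

def distance(a, b):
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def agglomerate(items, cutoff):
    """Repeatedly merge the two closest clusters until the next merge exceeds cutoff."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

print(agglomerate(["Foo X1", "Foo X1", "Foo Y1", "Bar Z9"], cutoff=0.3))
```

The identical "Foo X1" entries merge first (distance 0), then "Foo Y1" joins them, while "Bar Z9" stays an outlier.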
You might be able to make use of a trigram search for this. I must admit I've never seen an algorithm for implementing the index, but I have seen it working in pharmaceutical applications, where it copes very well indeed with badly misspelt drug names. You might be able to apply the same kind of logic to this problem.
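The trigram similarity itself is simple to sketch (the end-padding mimics what e.g. PostgreSQL's pg_trgm extension does; the indexing is the harder part this answer skips):

```python
def trigrams(s):
    s = "  " + s.lower() + " "  # pad the ends, as pg_trgm does
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity over the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

print(trigram_similarity("PowerShot", "Powersot"))  # high despite the misspelling
print(trigram_similarity("PowerShot", "Lenovo"))    # 0.0: no trigrams in common
```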
edg's answer is in the right direction, I think - you need to distinguish key words from fluff.
Context matters. To take your example, Core 2 Duo is fluff when looking at two instances of a T400, but not when looking at a CPU OEM package.
If you can mark in your database which parts of the canonical form of a product name are more important and must appear in one form or another to identify a product, you should do that. Maybe through the use of some sort of semantic markup? Can you afford to have a human mark up the database?
You can try to define equivalency classes for things like "T-400", "T400", "T 400" etc. Maybe a set of rules that say "numbers bind more strongly than letters attached to those numbers."
Breaking down into cases based on manufacturer, model number, etc. might be a good approach. I would recommend that you look at techniques for term spotting to try and accomplish that: http://www.worldcat.org/isbn/9780262100854
Designing everything in a flexible framework that's mostly rule driven, where the rules can be modified based on your needs and emerging bad patterns (read: things that break your algorithm) would be a good idea, as well. This way you'd be able to improve the system's performance based on real world data.
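The equivalence-class idea for "T-400" / "T400" / "T 400" can be sketched as one hypothetical normalization rule:

```python
import re

def normalize_model_tokens(name):
    # collapse letter(s) + optional dash/space + digits into one token:
    # "T-400", "T 400" and "T400" all become "T400"
    # (side effect: "Core 2" also becomes "Core2" -- harmless for matching)
    return re.sub(r"\b([A-Za-z]+)[-\s]?(\d+)\b", r"\1\2", name)

print(normalize_model_tokens("New Lenovo T-400, Core 2 Duo"))
```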
This is a problem of record linkage. The dedupe python library provides a complete implementation, but even if you don't use python, the documentation has a good overview of how to approach this problem.
Briefly, within the standard paradigm, this task is broken into three stages: indexing (blocking) to limit the candidate pairs, field-by-field comparison of each candidate pair, and classification of the scored pairs into matches and non-matches.
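A toy sketch of the first stage, blocking, under the naive assumption that the first token is the brand:

```python
from itertools import combinations

def block_by_first_token(names):
    """Stage 1, indexing/blocking: bucket names by a cheap key (here the first token)."""
    blocks = {}
    for name in names:
        blocks.setdefault(name.lower().split()[0], []).append(name)
    return blocks

def candidate_pairs(names):
    """Only pairs within the same block go on to the comparison stage."""
    for block in block_by_first_token(names).values():
        yield from combinations(block, 2)

print(list(candidate_pairs(["Canon A20", "Canon A10", "Lenovo T400"])))
```

Blocking turns an O(n^2) all-pairs comparison into something tractable, at the cost of missing pairs that land in different blocks.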
You might want to create logic that ignores the letter/number combination of model numbers (since they're nigh always extremely similar).
Not having any experience with this type of problem, but I think a very naive implementation would be to tokenize the search term, and search for matches that happen to contain any of the tokens.
"Canon PowerShot A20 IS", for example, tokenizes into: Canon, PowerShot, A20, IS,
which would match each of the other items you want to show up in the results. Of course, this strategy will likely produce a whole lot of false matches as well.
Another strategy would be to store "keywords" with each item, such as "camera", "canon", "digital camera", and searching based on items that have matching keywords. In addition, if you stored other attributes such as Maker, Brand, etc., you could search on each of these.
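The token-matching idea can be sketched as a small inverted index; ranking by the number of shared tokens keeps the false matches at the bottom:

```python
def build_index(catalog):
    """Inverted index: token -> set of canonical names containing it."""
    index = {}
    for name in catalog:
        for tok in name.lower().split():
            index.setdefault(tok, set()).add(name)
    return index

def search(index, query):
    """Rank canonical names by how many query tokens they share."""
    hits = {}
    for tok in query.lower().split():
        for name in index.get(tok, ()):
            hits[name] = hits.get(name, 0) + 1
    return sorted(hits, key=hits.get, reverse=True)

index = build_index(["Canon PowerShot A20 IS", "Lenovo T400"])
print(search(index, "NEW powershot A20 IS from Canon"))  # Canon item ranks first
```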
Spell checking algorithms come to mind.
Although I could not find a good sample implementation, I believe you can modify a basic spell checking algorithm to come up with satisfactory results, i.e., working with words as the unit instead of characters.
The bits and pieces left in my memory:
It might not solve your problems directly... but you say you were looking for ideas, right?
:-)
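One of those bits, sketched: treat the tokens of the canonical names as the dictionary and snap each word to its closest entry via difflib.get_close_matches (the vocabulary here is hypothetical):

```python
import difflib

# hypothetical vocabulary, built from the tokens of the canonical names
VOCAB = ["canon", "powershot", "a20", "is", "lenovo", "t400"]

def correct_tokens(name, cutoff=0.6):
    """Word-level spell checking: replace each token with its closest vocabulary word."""
    out = []
    for tok in name.lower().split():
        match = difflib.get_close_matches(tok, VOCAB, n=1, cutoff=cutoff)
        out.append(match[0] if match else tok)
    return " ".join(out)

print(correct_tokens("Cannon Powersot A20"))  # misspellings snap to canonical tokens
```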