用于查找存在拼写错误或其他信息变化的客户之间的重复交易的技术?
这不是 SQL Server 特定的问题;但这里可能有 tSQL 特定选项。
我有很多客户详细信息;他们中的许多人取消并辞职了。他们获得了一个全新的帐户;我们的数据验证充其量是粗略的;因此他们经常会输错电子邮件地址或其他数据。
问题分为两部分:
第一;我有名字和姓氏、电子邮件、信用卡最后 4 位、邮政编码、电话号码等信息。是否有一种算法/过程可以让我查看我的数据集并寻找常见的重复池,以便我可以确定数据的一些手动特征,这些特征往往是回头客的“陷阱”项目——即 80% 的时间电子邮件“相似”并且邮政编码相同,它是回头客(基于我的人类匹配技能)?
第二;我如何表达数据集之间的相似性——即,如果 5 个字段中有 3 个匹配,则有一个称为匹配的项目?所有不同数据点之间的某种相似性指数?我知道我可以在某种程度上在名称上使用 soundex...但在电子邮件地址上不太确定。
所以,我对快速和肮脏的解决方案都感兴趣(今晚我正在整理分析;但我也对解决这个问题的“正确”方法非常感兴趣。)这两个答案都会赢得我的喜爱和尊重。 =)
This isn't a SQL Server specific question; but there might be tSQL specific options here.
I've got a bunch of customer details; many of them cancel and resign up for their service. They get an entirely new account; and our datavalidation is sketchy at best; so they often mistype email addresses or other data.
The question is in two parts:
First; I've got info such as first and last name, email, last 4 of credit card, postal code, phone number. Is there an algorithm/process I can look at my dataset with and look for common pools of repetition so I can determine some manual characteristics of the data that tend to be 'gotcha' items for repeat customers -- i.e. 80% of the time the emails were 'similar' and the zip code was the same, it was a repeat customer (based upon my human matching skills)?
Second; How might I go about expressing similarity between sets of data --- i.e. have an item called a match if 3 of the 5 fields match? Some sort of similiarity index between all the different data points? I know I can use soundex to some extent on the names... not so sure on email addresses.
So, I'm interested in both quick and dirty solutions (I'm whipping together an analysis tonight; but I'm also very interested in the 'right' ways of going to tackle this problem.) Both answers will earn my love and respect. =)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
不是T-SQL,但SSIS有一些答案:模糊查找转换, 模糊分组转换。这篇 MSDN 文章展示了如何使用它:模糊查找和分组提供强大的数据清理功能< /a>,这是有关该主题的视频:MSDN TV:使用 SQL Server 集成服务进行数据清理应用程序 。
也有第 3 方解决方案,例如 Red-Gate 的 Data Cleanser。尽管如此,他们还是使用 SSIS,这是一个更适合这项工作的工具。
您还应该阅读Levenshtein distance,已经有许多公共现成的T-SQL实现以及 CLR 实现,可在 SQL Server 中使用。只需Google 即可。
Not T-SQL, but SSIS has some answers: Fuzzy Lookup Transformation, Fuzzy Grouping Transformation. This MSDN article shows how to use it: Fuzzy Lookups and Groupings Provide Powerful Data Cleansing Capabilities, and here is a video on the topic: MSDN TV: Data Cleansing Applications with SQL Server Integration Services.
There are 3rd party solutions too, like Red-Gate's Data Cleanser. Still, they use SSIS, as is a much more appropriate tool for the job.
You should also read about Levenshtein distance, there are already many public off-the-shelf T-SQL implementations, as well as CLR ones, usable in SQL Server. Just Google for them.
我还建议查看 Jaro-Winkler 模糊匹配算法进行查询,其 CLR 实现可以在 这里
这似乎对于较小数量的文本字符串匹配效果很好
I would also recommend looking at the Jaro-Winkler fuzzy matching algorithm for querying with, and its CLR implementation can be found here
This seems to work quite well with smaller volumes of text string matching