数据对比

发布于 2024-07-05 20:16:52 字数 459 浏览 9 评论 0原文

我们有一个包含公司名称、地址和联系人姓名(等等)的 SQL Server 表。

我们定期从外部来源接收数据文件,要求我们与此表进行匹配。 不幸的是,数据略有不同,因为它来自完全不同的系统。 例如,我们有“123 E. Main St.” 我们收到“123 East Main Street”。 另一个例子,我们有“Acme, LLC”,文件包含“Acme Inc.”。 另一个是,我们有“Ed Smith”,他们有“Edward Smith”。

我们有一个遗留系统,它利用一些相当复杂和 CPU 密集型的方法来处理这些比赛。 有些涉及纯 SQL,有些则涉及 Access 数据库中的 VBA 代码。 目前的系统不错,但并不完美,而且麻烦且难以维护。

这里的管理层希望扩大其使用范围。 将继承系统支持的开发人员希望用需要更少维护的更敏捷的解决方案来替换它。

是否有一种普遍接受的方法来处理这种数据匹配?

We have a SQL Server table containing Company Name, Address, and Contact name (among others).

We regularly receive data files from outside sources that require us to match up against this table. Unfortunately, the data is slightly different since it is coming from a completely different system. For example, we have "123 E. Main St." and we receive "123 East Main Street". Another example, we have "Acme, LLC" and the file contains "Acme Inc.". Another is, we have "Ed Smith" and they have "Edward Smith"

We have a legacy system that utilizes some rather intricate and CPU intensive methods for handling these matches. Some involve pure SQL and others involve VBA code in an Access database. The current system is good but not perfect and is cumbersome and difficult to maintain

The management here wants to expand its use. The developers who will inherit the support of the system want to replace it with a more agile solution that requires less maintenance.

Is there a commonly accepted way for dealing with this kind of data matching?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

ゝ杯具 2024-07-12 20:16:53

Access 确实没有这方面的工具。 在理想的情况下,我会使用 SSIS 解决方案并使用模糊查找。 但是,如果您当前正在使用 Access,那么您的办公室购买 SQL Server Enterprise 版本的可能性对我来说似乎很低。 如果您被当前的环境困住,您可以尝试暴力方法。

从地址的标准化清理开始。 选择 Street、raod 等的标准缩写,并编写代码以将所有正常变体更改为这些标准地址。 将任何两个空格替换为一个空格,修剪所有数据并删除所有非字母数字字符。 正如您所看到的,这是一项艰巨的任务。

至于公司名称,也许您可​​以尝试将名称的前 5 个字符与地址或电话进行匹配。 您还可以创建一个已知变体表以及它们在数据库中的相关内容,以用于清理未来的文件。 因此,如果您记录的 id 100 是 Acme, Inc.,您可能会得到一个如下表:

idfield Name

100 Acme, Inc.

100 Acme, Inc

100 Acme, Incorporated

100 Acme, LLC

100 Acme

这将开始很小,但会随着时间的推移而建立,如果每次找到并修复重复项(使其成为重复数据删除过程的一部分)时,以及每次能够将名称和地址的第一部分与现有公司相匹配时,您都会进行输入。

我还会查看 Torial 发布的该功能,看看它是否有帮助。

所有这些都将是痛苦且耗时的,但随着时间的推移,当您发现新的变体并将它们添加到代码或列表中时,情况会变得更好。 如果您决定对地址数据进行标准化,请确保首先清理生产数据,然后对工作表进行任何导入并清理它,然后尝试匹配生产数据并插入新记录。

Access doesn't really have the tools for this. In an ideal world I would go with the SSIS solution and use fuzzy lookup. But if you are currently using Access, the chances of your office buying SQL Server Enterprise edition seem low to me. If you are stuck with the current environment, you could try a brute force approach.

Start with standardized cleansing of addresses. PIck standard abbreviations for Street, raod, etc. and write code to change all the normal variations to those standard addesses. Replace any instances of two spaces with one space, trim all the data and remove any non-alphanumeric characters. As you can see this is quite a task.

As for company names, maybe you can try matching on first 5 characters of the name and the address or phone. You could also create a table of known variations and what they will relate to in your database to use for cleanising future files. So if you record with id 100 is Acme, Inc. you could have a table like this:

idfield Name

100 Acme, Inc.

100 Acme, Inc

100 Acme, Incorporated

100 Acme, LLC

100 Acme

This will start small but build over time if you make an entry every time you find and fix a duplicate (make it part of you de-dupping process) and if you make an entry every time you are able to match the first part of the name and address to an existing company.

I'd also look at that function Torial posted and see if it helps.

All of this would be painful and timeconsuming, but would get better over time as you find new variations and add them to the code or list. If you do decide to stardardize your addressdata, make sure to clean production data first, then do any imports to a work table and clean it, then try to match to production data and insert new records.

刘备忘录 2024-07-12 20:16:53

有许多供应商提供进行此类模式匹配的产品。 我会做一些研究并找到一个好的、信誉良好的产品并废弃本土系统

正如您所说,您的产品只是好的,这是企业的普遍需求,我确信那里有不止一种优秀的产品。 即使许可证花费几千美元,它仍然比付钱给一群开发人员在内部开发东西要便宜。

此外,“复杂”、“CPU 密集型”、“VBA 代码”和“Access 数据库”等短语同时出现在系统描述中,这一事实也是寻找优秀第三方工具的另一个原因。

编辑:.NET 也可能有一个内置组件可以执行此类操作,在这种情况下您无需为此付费。 我仍然偶尔会对 .NET 提供的工具感到惊讶。

There are many vendors out there that offer products to do this kind of pattern matching. I would do some research and find a good, well-reputed product and scrap the home-grown system.

As you say, your product is only good, and this is a common-enough need for businesses that I'm sure there's more than one excellent product out there. Even if it costs a few thousand bucks for a license, it will still be cheaper than paying a bunch of developers to work on something in-house.

Also, the fact that the phrases "intricate", "CPU intensive", "VBA code" and "Access database" appear together in your system's description is another reason to find a good third-party tool.

EDIT: it's also possible that .NET has a built-in component that does this kind of thing, in which case you wouldn't have to pay for it. I still get surprised once in a while by the tools that .NET offers.

喜你已久 2024-07-12 20:16:53

这是我为几乎相同的堆栈编写的内容(我们需要标准化硬件的制造商名称,并且存在各种变化)。 不过,这是客户端(确切地说是 VB.Net)——并使用 Levenshtein 距离算法(经过修改以获得更好的结果):

    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
        Dim bestMatch As String = ""
        Dim bestDistance As Integer = 1000 'Almost anything should be better than that!

        For Each matchCandidate As String In stringList
            Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
            If candidateDistance < bestDistance Then
                bestMatch = matchCandidate
                bestDistance = candidateDistance
            End If
        Next

        Return bestMatch
    End Function

    'This will be used to determine how similar strings are.  Modified from the link below...
    'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
    Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim sLength As Integer = s.Length ' length of s
        Dim tLength As Integer = t.Length ' length of t
        Dim lvCost As Integer ' cost
        Dim lvDistance As Integer = 0
        Dim zeroCostCount As Integer = 0

        Try
            ' Step 1
            If tLength = 0 Then
                Return sLength
            ElseIf sLength = 0 Then
                Return tLength
            End If

            Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
            Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}

            ' fill first row
            For lvIndex As Integer = 0 To sLength
                poBuffer(lvIndex) = lvIndex
            Next

            'fill first column
            For lvIndex As Integer = 1 To tLength
                poBuffer(lvIndex * (sLength + 1)) = lvIndex
            Next

            For lvRowIndex As Integer = 0 To sLength - 1
                Dim s_i As Char = s(lvRowIndex)
                For lvColIndex As Integer = 0 To tLength - 1
                    If s_i = t(lvColIndex) Then
                        lvCost = 0
                        zeroCostCount += 1
                    Else
                        lvCost = 1
                    End If
                    ' Step 6
                    Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                    Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                    Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                    Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                    lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                    poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                Next
            Next
        Catch ex As ThreadAbortException
            Err.Clear()
        Catch ex As Exception
            WriteDebugMessage(Application.StartupPath , [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
        End Try

        Return lvDistance - zeroCostCount
    End Function

Here's something I wrote for a nearly identical stack (we needed to standardize the manufacturer names for hardware and there were all sorts of variations). This is client side though (VB.Net to be exact) -- and use the Levenshtein distance algorithm (modified for better results):

    Public Shared Function FindMostSimilarString(ByVal toFind As String, ByVal ParamArray stringList() As String) As String
        Dim bestMatch As String = ""
        Dim bestDistance As Integer = 1000 'Almost anything should be better than that!

        For Each matchCandidate As String In stringList
            Dim candidateDistance As Integer = LevenshteinDistance(toFind, matchCandidate)
            If candidateDistance < bestDistance Then
                bestMatch = matchCandidate
                bestDistance = candidateDistance
            End If
        Next

        Return bestMatch
    End Function

    'This will be used to determine how similar strings are.  Modified from the link below...
    'Fxn from: http://ca0v.terapad.com/index.cfm?fa=contentNews.newsDetails&newsID=37030&from=list
    Public Shared Function LevenshteinDistance(ByVal s As String, ByVal t As String) As Integer
        Dim sLength As Integer = s.Length ' length of s
        Dim tLength As Integer = t.Length ' length of t
        Dim lvCost As Integer ' cost
        Dim lvDistance As Integer = 0
        Dim zeroCostCount As Integer = 0

        Try
            ' Step 1
            If tLength = 0 Then
                Return sLength
            ElseIf sLength = 0 Then
                Return tLength
            End If

            Dim lvMatrixSize As Integer = (1 + sLength) * (1 + tLength)
            Dim poBuffer() As Integer = New Integer(0 To lvMatrixSize - 1) {}

            ' fill first row
            For lvIndex As Integer = 0 To sLength
                poBuffer(lvIndex) = lvIndex
            Next

            'fill first column
            For lvIndex As Integer = 1 To tLength
                poBuffer(lvIndex * (sLength + 1)) = lvIndex
            Next

            For lvRowIndex As Integer = 0 To sLength - 1
                Dim s_i As Char = s(lvRowIndex)
                For lvColIndex As Integer = 0 To tLength - 1
                    If s_i = t(lvColIndex) Then
                        lvCost = 0
                        zeroCostCount += 1
                    Else
                        lvCost = 1
                    End If
                    ' Step 6
                    Dim lvTopLeftIndex As Integer = lvColIndex * (sLength + 1) + lvRowIndex
                    Dim lvTopLeft As Integer = poBuffer(lvTopLeftIndex)
                    Dim lvTop As Integer = poBuffer(lvTopLeftIndex + 1)
                    Dim lvLeft As Integer = poBuffer(lvTopLeftIndex + (sLength + 1))
                    lvDistance = Math.Min(lvTopLeft + lvCost, Math.Min(lvLeft, lvTop) + 1)
                    poBuffer(lvTopLeftIndex + sLength + 2) = lvDistance
                Next
            Next
        Catch ex As ThreadAbortException
            Err.Clear()
        Catch ex As Exception
            WriteDebugMessage(Application.StartupPath , [Assembly].GetExecutingAssembly().GetName.Name.ToString, MethodBase.GetCurrentMethod.Name, Err)
        End Try

        Return lvDistance - zeroCostCount
    End Function
不再让梦枯萎 2024-07-12 20:16:53

SSIS(在Sql 2005+ Enterprise中)具有模糊查找,其设计对于这样的数据清理问题。

除此之外,我只知道特定于域的解决方案 - 例如 地址清理 ,或一般字符串匹配技术

SSIS (in Sql 2005+ Enterprise) has Fuzzy Lookup which is designed for just such data cleansing issues.

Other than that, I only know of domain specific solutions - such as address cleaning, or general string matching techniques.

唱一曲作罢 2024-07-12 20:16:53

有很多方法可以解决这个问题,但这些方法可能并不明显。 最好的是找到唯一的标识符,您可以使用它来匹配字段之外的拼写错误等。

一些想法

  1. 显而易见,社会安全号码、驾驶执照等
  2. 电子邮件地址
  3. 清理后的电话号码(删除标点符号等)

就供应商而言我刚刚回答了类似的问题并粘贴在下面。

每个主要提供商都有自己的解决方案。 Oracle、IBM、SAS Dataflux 等公司都声称自己最擅长解决此类问题。

独立验证评估:

澳大利亚科廷大学数据链接中心进行了一项研究,模拟了 440 万条记录的匹配。 确定提供商的准确性(找到的匹配数与可用的匹配数。错误匹配数)

DataMatch Enterprise,< /a> 准确度最高 (>95%)、速度非常快、成本低

IBM Quality Stage,高精度 (>90%)、非常快、高成本 (>100K 美元)

SAS Data Flux、中等精度 (>85%)、快速、高成本 (>100K) )
这是我们能找到的最好的独立评估,非常彻底。

There's quite a few ways to tackle this that may not be obvious. The best is finding unique identifiers that you can use for matching outside of the fields with mis spellings, etc.

Some thoughts

  1. The obvious, Social security number, drivers license, etc
  2. Email address
  3. Cleansed phone number (Rremove punctuation, etc)

As far as vendors go I just answered a similar question and am pasting below.

Each major provider does have their own solution. Oracle, IBM, SAS Dataflux, etc and each claim to be the best at this kind of problem.

Independent verified evaluation:

There was a study done at Curtin University Centre for Data Linkage in Australia that simulated the matching of 4.4 Million records. Identified what providers had in terms of accuracy (Number of matches found vs available. Number of false matches)

DataMatch Enterprise, Highest Accuracy (>95%), Very Fast, Low Cost

IBM Quality Stage , high accuracy (>90%), Very Fast, High Cost (>$100K)

SAS Data Flux, Medium Accuracy (>85%), Fast, High Cost (>100K)
That was the best independent evaluation we could find, was very thorough.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文