如何选择填充最多的记录?

发布于 2024-09-08 07:52:54 字数 674 浏览 1 评论 0原文

我很不幸不得不处理包含特定记录重复项的数据库,我正在寻找一种快速方法来表示“获取填充最多的记录并更新重复项以匹配它”。

从那里我可以选择不同的记录并获取一组有用的记录。

有什么想法吗?

主要是名称和地址(如果有帮助的话)...

好吧,这里提出了很多问题,所以我会添加更多内容:

首先,我想提取最“填充”而不是最“流行”的行,这意味着具有最多值的行不为空。

一旦我有了集合(这很容易,因为在我的情况下 id 匹配),我就可以填充其他行中的缺失值。

我不想破坏数据,我只想根据准确的匹配(例如通过id)更新数据。

我目前的问题是找出一组行中填充最多的字段,自从发布这个问题以来,我找到了一种不同的方法来解决我更大的问题,即发送到远程服务器的内容,但是我'我仍然有兴趣知道这个问题的解决方案是什么。

示例数据可能看起来像这样...

id   name     addr1            addr2       ect
1    fred     1 the street     Some town   ...
1    fred     null             null        null

给定一个充满这样的匹配对的表,我想找到这些对,然后抓取其中包含信息的对,并将这些值插入到另一行中存在空值的位置。

I have the unfortunate luck of having to deal with a db that contains duplicates of particular records, I am looking for a quick way to say "get the most populated record and update the duplicates to match it".

From there I can select distinct records and get a useful set of records.

Any ideas?

It's mainly names and addresses if that helps...

Ok lots of questions asked here so i'll add little bit more:

Firstly I want to pull the most "populated" not most "popular", this means the row with the most values that are not null.

Once I have the set (which is easy because in my case the id's match) I can then populate the missing values in the other rows.

I don't want to destroy data and i only intend to update data based on an accurate match (eg by id).

My problem at the moment is figuring out which of a set of rows has the most populated fields, having said that since posting this question I have found a different way to solve my bigger problem which is what to send to a remote server however I'm still interested to know what the solution to this might be.

Sample data might look something like this ...

id   name     addr1            addr2       ect
1    fred     1 the street     Some town   ...
1    fred     null             null        null

Given a table full of matching pairs like this I want to find the pairs then grab the one with the info in it and insert those values where there is a null in the other row.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

她如夕阳 2024-09-15 07:52:54

请记住,您可能会破坏此处的数据。仅仅因为一行填充的列较少并不意味着它在填充的列中不太准确。

我假设重复项是由名为“名称”的列确定的。您需要根据重复项的定义进行调整。另外,由于您没有给出任何关于如何处理“人口最多”关系的规则,所以我只是选择了 id 最低的行。

UPDATE
    T1
SET
    col_1 = T2.col_1,
    col_2 = T2.col_2,
    ....
FROM
    My_Table T1
INNER JOIN My_Table T2 ON
    T2.name = T1.name AND
    T2.id =
    (
        SELECT TOP 1
            T3.id
        FROM
            My_Table T3
        WHERE
            T3.name = T1.name
        ORDER BY
            CASE WHEN col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            id ASC
    )

编辑:我刚刚重读了你的问题,你提到,“从那里我可以选择不同的记录并获得一组有用的记录。”如果这就是您真正想要的,那么不必费心更新其他行,只需首先选择您想要的行并保持其他所有内容不变:

SELECT
    T1.id,
    T1.name,
    T1.col_1,
    T1.col_2,
    ...
FROM
    My_Table T1
WHERE
    T1.id =
    (
        SELECT TOP 1
            T2.id
        FROM
            My_Table T2
        WHERE
            T2.name = T1.name
        ORDER BY
            CASE WHEN T2.col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN T2.col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            T2.id ASC
    )

Keep in mind that you will be potentially destroying data here. Just because a row has fewer columns filled doesn't mean that it's less accurate in the columns that are filled.

I've assumed that duplicates are determined by a column called "name". You'll need to adjust based on your definition of duplicates. Also, since you didn't give any rules on how to deal with ties for "most populated" I just chose the row with the lowest id.

UPDATE
    T1
SET
    col_1 = T2.col_1,
    col_2 = T2.col_2,
    ....
FROM
    My_Table T1
INNER JOIN My_Table T2 ON
    T2.name = T1.name AND
    T2.id =
    (
        SELECT TOP 1
            T3.id
        FROM
            My_Table T3
        WHERE
            T3.name = T1.name
        ORDER BY
            CASE WHEN col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            id ASC
    )

EDIT: I just reread your question and you mention, "From there I can select distinct records and get a useful set of records." If that's what you really want, then don't bother updating the other rows, just select the ones that you want in the first place and leave everything else intact:

SELECT
    T1.id,
    T1.name,
    T1.col_1,
    T1.col_2,
    ...
FROM
    My_Table T1
WHERE
    T1.id =
    (
        SELECT TOP 1
            T2.id
        FROM
            My_Table T2
        WHERE
            T2.name = T1.name
        ORDER BY
            CASE WHEN T2.col_1 IS NOT NULL THEN 1 ELSE 0 END +
            CASE WHEN T2.col_2 IS NOT NULL THEN 1 ELSE 0 END +
            ... DESC,
            T2.id ASC
    )
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文