How do I remove duplicates from data in Snowflake?

Asked 2025-01-27 22:30:22

I have this data where I have to remove the duplicate data highlighted in yellow. I tried to use distinct and group by but it does not seem to work.

Even though the COMMENT and VALUE are different, any row whose VALUE is CRIMSSTANDARD is considered a child of the parent row and hence a duplicate, because the first six fields are the same. How do I remove the duplicate records in this case?

CLIENT;CLIENT_ID;QTY_TYPE;QUANTITY;AMOUNT;TRANTYPE;COMMENT;VALUE;ORDER_ID;DEV_COMM;AMT_COMM;ID
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-AGNIndPlusFd;Q;2898;;Buy;INS;CRIMSSTANDARD;10902701;;;
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-DepVz;A;;82000;Sell;GEN,TransferOrder;10902568;Y;0;
TEST_IMPL8;P-DepVz;A;;82000;Sell;INS;CRIMSSTANDARD;10902568;;;
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;GEN;OHPEN_14613296BD001571;10902668;;;14613296BD001571
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;INS;CRIMSSTANDARD;10902668;;;
TEST_IMPL8;P-Dyn4Vz;A;;13000;Buy;INS;CRIMSSTANDARD;10877286;;;
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;GEN;OHPEN_14613296BD001565;10677375;;;14613296BD001565
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;INS;CRIMSSTANDARD;10677375;;;
TEST_IMPL8;P-VNAMijBel;Q;10236;;Sell;INS;CRIMSSTANDARD;11402183;;;

2 Answers

吝吻 2025-02-03 22:30:22

First pass: you'll need to get rid of rows that are complete duplicates of each other (this fixes the row 1 and row 3 issue in your data, where the rows are exactly the same). Use the approach outlined here: How to delete Duplicate records in snowflake database table.
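
For reference, a minimal sketch of that first pass, assuming the data lives in a table named my_table as in the query below: rebuilding the table from a SELECT DISTINCT keeps exactly one copy of each fully identical row.

-- First pass (sketch): remove rows that are exact duplicates across every column.
-- INSERT OVERWRITE truncates my_table and reloads it in one statement,
-- so only a single copy of each fully identical row survives.
insert overwrite into my_table
select distinct * from my_table;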

Next, try this to eliminate duplicates across the first six columns, deleting the child record designated by CRIMSSTANDARD (which appears in the VALUE column of the sample data):

begin work;
delete from
    my_table using (
        select
            CLIENT,
            CLIENT_ID,
            QTY_TYPE,
            QUANTITY,
            AMOUNT,
            TRANTYPE
        from
            my_table
        GROUP BY
            CLIENT,
            CLIENT_ID,
            QTY_TYPE,
            QUANTITY,
            AMOUNT,
            TRANTYPE
        HAVING
            COUNT(*) > 1
    ) as my_table_dupes
where
    my_table.CLIENT = my_table_dupes.CLIENT
    AND my_table.CLIENT_ID = my_table_dupes.CLIENT_ID
    AND my_table.QTY_TYPE = my_table_dupes.QTY_TYPE
    -- EQUAL_NULL treats two NULLs as equal; QUANTITY and AMOUNT are empty on many rows,
    -- so a plain equality join would miss those duplicate groups
    AND EQUAL_NULL(my_table.QUANTITY, my_table_dupes.QUANTITY)
    AND EQUAL_NULL(my_table.AMOUNT, my_table_dupes.AMOUNT)
    AND my_table.TRANTYPE = my_table_dupes.TRANTYPE
    -- CRIMSSTANDARD marks the child row; it sits in the VALUE column of the sample data
    AND my_table.VALUE = 'CRIMSSTANDARD';
commit work;

围归者 2025-02-03 22:30:22

There is no DISTINCT ON in Snowflake, but you can get a similar result using QUALIFY:

SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value 
  ORDER BY client, client_id, qty_type, quantity, amount, trantype, value
) = 1;

See here for more details.

EDIT:

To preferentially keep the rows that have DEV_COMM, AMT_COMM and ID non-NULL, you can modify the ORDER BY condition:

SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value 
  ORDER BY 
    IFF(dev_comm IS NOT NULL, 0, 1), 
    IFF(amt_comm IS NOT NULL, 0, 1), 
    IFF(ID IS NOT NULL, 0, 1)
) = 1;

You can also set different priorities by reordering the fields in the ORDER BY. For example, if you have multiple rows with DEV_COMM populated but you want to prioritize the row that has an ID, you need to put the IFF(ID ... expression first in the ORDER BY. It orders by the first element; if several rows share the same first element, it orders by the second, and so on.
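
For instance, a sketch of that reordering against the same my_table, where a non-NULL ID takes precedence over DEV_COMM, which in turn takes precedence over AMT_COMM:

SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value
  ORDER BY
    IFF(ID IS NOT NULL, 0, 1),       -- prefer rows with an ID first
    IFF(dev_comm IS NOT NULL, 0, 1), -- then rows with DEV_COMM
    IFF(amt_comm IS NOT NULL, 0, 1)  -- then rows with AMT_COMM
) = 1;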
