找出导致错误的行

发布于 2024-09-01 07:15:33 字数 785 浏览 9 评论 0原文

我有一个动态编写的大型查询来集成一些数据。基本上它的作用是查询一些表,连接其他一些表,处理一些数据,然后将其插入最终表中。

问题是数据太多,我们不能真正信任来源,因为可能存在一些错误或不一致的数据。

例如,在使用客户的数据库进行开发时,我花了将近一个小时寻找错误,因为在我的大型查询中间的某个地方,将某些 varchar 转换为日期时间时出现了错误。事实证明,他们有一些日期为“2009-02-29”的销售,这是一个超出范围的日期。 是的,我知道。为什么它存储为 varchar?嗯,源数据库有 3 列日期,“月”、“日”和“年”。我不知道为什么会这样,但仍然如此。

但如果消息来源不可信,我该如何对待呢?

我无法处理异常,我确实需要它与原始消息达到另一个级别,但我想提供更多信息,以便用户至少可以在致电我们之前尝试解决它。

因此,我考虑向用户显示行号或一些 ID,至少让他知道他必须更正哪些记录。这也是一项艰巨的工作,因为集成有时会运行多达 80000 条记录。 在 80000 条记录的集成中,一条虚拟错误消息:“将 varchar 数据类型转换为日期时间数据类型导致日期时间值超出范围” 毫无意义。

所以任何想法将不胜感激。

哦,我正在使用带有 Service Pack 3 的 SQL Server 2005。


编辑:

好的,对于我所读到的答案,最好的办法是检查可能对引发错误至关重要的每一列,如果他们确实参加了该条件,我自己应该提出一个错误,并使用我认为更具描述性的消息,并添加一些可以存储在单独的表或一些变量中的信息,例如行的 ID 或一些其他根信息。

I have a big fat query that's written dynamically to integrate some data. Basically what it does is query some tables, join some other ones, treat some data, and then insert it into a final table.

The problem is that there's too much data, and we can't really trust the sources, because there could be some errored or inconsistent data.

For example, I've spent almost an hour looking for an error while developing using a customer's database because somewhere in the middle of my big fat query there was an error converting some varchar to datetime. It turned out to be that they had some sales dating '2009-02-29', an out-of-range date.
And yes, I know. Why was that stored as varchar? Well, the source database has 3 columns for dates, 'Month', 'Day' and 'Year'. I have no idea why it's like that, but still, it is.

But how the hell would I treat that, if the source is not trustable?

I can't HANDLE exceptions, I really need that it comes up to another level with the original message, but I wanted to provide some more info, so that the user could at least try to solve it before calling us.

So I thought about displaying to the user the row number, or some ID that would at least give him some idea of what record he'd have to correct. That's also a hard job because there will be times when the integration will run up to 80000 records.
And in an 80000 records integration, a single dummy error message: 'The conversion of a varchar data type to a datetime data type resulted in an out-of-range datetime value' means nothing at all.

So any idea would be appreciated.

Oh I'm using SQL Server 2005 with Service Pack 3.


EDIT:

Ok, so for what I've read as answers, best thing to do is check each column that could be critical to raising errors, and if they do attend the condition, I should myself raise an error, with the message I find more descriptive, and add some info that could have been stored in a separate table or some variables, for example the ID of the row, or some other root information.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

久伴你 2024-09-08 07:15:33

对于日期,您可以使用 isdate 函数,

select ISDATE('20090229'),ISDATE('20090227')

我通常插入到临时表中,进行检查,然后插入到实际表中

for dates you can use the isdate function

select ISDATE('20090229'),ISDATE('20090227')

I usually insert into a staging table, do my checks and then insert into the real tables

咆哮 2024-09-08 07:15:33

我的建议是预先验证传入的数据,当遇到错误时,将记录放在一边。例如,检查无效日期。假设您在一组 80K 中找到了 20 个。将这 20 个数据提取到一个单独的表中,并将错误消息附加到记录中。运行其他验证,然后最终将剩余(所有有效)记录导入到所需的目标表中。

这可能会对性能产生太大影响,但可以让您轻松指出错误并允许纠正它们,然后将其插入第二遍。

My suggestion would be to pre-validate the incoming data, and as you encounter errors, set aside the record. For example, check for invalid dates. Say you find 20 in a set of 80K. Pull those 20 out into a separate table, with the error message attached to the record. Run your other validation, then finally import the remaining (all valid) records into the desired target table(s).

This might have too much impact on performance, but would allow you to easily point out the errors and allow them to be corrected and then inserted in a second pass.

溺ぐ爱和你が 2024-09-08 07:15:33

这听起来像是一个标准的 ETL 问题:提取、转换和加载。 (除非您必须针对同一组数据一遍又一遍地运行此查询,在这种情况下,您几乎会做同样的事情,只是一遍又一遍。那么性能有多重要?)什么

样的错误处理和/或“报告不良数据”您可以提供吗?如果你把所有的东西都当作“一个大的胖查询”,你的选择就会变得非常有限——查询要么有效,要么无效,如果没有,我猜你最多会收到一条 RAISERROR 消息来告诉调用者什么是什么。

在这种情况下,我尝试建立的总体框架是:

  • 从源表开始
  • 生成一组临时表(SQLMenace 的临时表),您知道这些表是一致的并且正确形成的(有效数据、键等)
  • 针对这些表编写“不太大和胖的查询”

这样做,您应该始终能够返回(或存储)有效的数据集......即使它是空的。诀窍在于确定例程何时失败——何时数据太损坏而无法处理并产生所需的结果,因此您返回一条措辞正确的错误消息?

This sounds like a standard ETL issue: Extract, Transform, and Load. (Unless you have to run this query over and over again against the same set of data, in which case you'd pretty much do the same thing, only over and over again. So how critical is performance?)

What kind of error handling and/or "reporting of bad data" are you allowed to provide? If you have everything as "one big fat query", your options become very limited -- either the query works or it doesn't, and if it doesn't I'm guessing you get at best one RAISERROR message to tell the caller what's what.

In a situation like this, the general framework I'd try to set up is:

  • Starting with the source table(s)
  • Produce an interim set of tables (SQLMenace's staging tables) that you know are consistant and properly formed (valid data, keys, etc.)
  • Write the "not quite so big and fat query" against those tables

Done this way, you should always be able to return (or store) a valid data set... even if it is empty. The trick will be in determining when the routine fails -- when is the data too corrupt to process and produce the desired results, so you return a properly worded error message instead?

梦里°也失望 2024-09-08 07:15:33

尝试这样的方法来查找行:

...big fat query here...
WHERE ISDATE(YourBadVarcharColumn)!=1

将数据加载到临时表中,其中大多数列都是 varchar 并允许 NULL,其中有一个状态列。

运行 UPDATE 命令,例如

UPDATE Staging
    SET Status='X'
    WHERE ISDATE(CONVERT(YourCharYear+YourCharMonth+YourCharDat+)!=1 
        OR OtherColumn<4...

然后只需从暂存表中插入 Status!='X'

INSERT INTO RealTable
        (col1, col2...)
    SELECT
        col1, col2, ...
        where Status!='X'

try something like this to find the rows:

...big fat query here...
WHERE ISDATE(YourBadVarcharColumn)!=1

Load the Data into a staging table, where most columns are varchar and allow NULLs, where you have a status column.

Run an UPDATE command like

UPDATE Staging
    SET Status='X'
    WHERE ISDATE(CONVERT(YourCharYear+YourCharMonth+YourCharDat+)!=1 
        OR OtherColumn<4...

Then just insert from your staging table where Status!='X'

INSERT INTO RealTable
        (col1, col2...)
    SELECT
        col1, col2, ...
        where Status!='X'
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文