Redshift throws an "Invalid digit" error only when a second condition is added

Posted on 2025-01-25 00:22:48

I'm using Python to query a Redshift server that has improperly formatted numbers. I am able to filter out non-numbers and query based on the column's numeric value like so:

import pyodbc
import pandas as pd
conn = pyodbc.connect(...)

q = '''
select *
from
(
    select 
    cast(oldcol as numeric) as newcol
    from tablename
    where oldcol similar to '[0-9]+(.[0-9]+)?' 
)
where newcol > 10
'''
pd.read_sql(q, conn)

However, when I change the WHERE statement to:

where newcol > 10 and newcol < 150

or

where newcol between 10 and 150

I get the following error:

ERROR:  Invalid digit, Value '-', Pos 2, Type: Decimal \nDETAIL:  \n  -----------------------------------------------\n  error:  Invalid digit, Value '-', Pos 2, Type: Decimal \n  code:      1207\n  context:   85-59\n  query:     82218981\n  location:  :0\n  process:   query1_125_82218981 [pid=0]\n  -----------------------------------------------\n\n (30) (SQLExecDirectW)")

But there shouldn't be any rows with a '-' in newcol (otherwise the first query wouldn't have worked), and no rows are returned if I change the WHERE clause to

where newcol similar to '%-%'

I get the same error if I use another numeric column, convert to decimal instead of numeric, reverse the order of the comparisons, use different comparison values, or execute the code with cursor = conn.cursor(); cursor.execute(q).


Comments (1)

岁月如刀 2025-02-01 00:22:48


If this is the optimizer advancing a where condition inappropriately you could try applying a redundant where clause to suss this out. Say ...

where newcol > 10 and newcol < 150 and newcol::text similar to '[0-9]+(.[0-9]+)?'

This is just a guess as to what might help "unbreak" the issue. You may need to play with this to convince the optimizer to do the right thing.

That said, I suspect it is much more likely that the issue is in the data. The reported '-' is likely just Redshift not knowing what value to report, rather than an actual character in the data. My best guess as to a likely cause is that the data being fed to CAST() does not fit in the default numeric data type, which is numeric(18,0). If your string data values break this type definition, I can see this error message occurring.
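To check that theory client-side, here is a small, hypothetical Python helper (the function name and sample values are my own): it flags strings that pass the question's SIMILAR TO pattern but would still overflow numeric(18,0).

```python
import re

# Matches the question's SIMILAR TO filter: digits, optional fractional part.
NUMERIC_PATTERN = re.compile(r'[0-9]+(\.[0-9]+)?')

def fits_numeric_18_0(s: str) -> bool:
    """Hypothetical pre-check: would this string survive the inner WHERE
    *and* fit in Redshift's default numeric(18,0)?"""
    if not NUMERIC_PATTERN.fullmatch(s):
        return False  # the inner WHERE would have filtered it out anyway
    # numeric(18,0) holds at most 18 digits before the decimal point
    integer_digits = s.split('.')[0].lstrip('0') or '0'
    return len(integer_digits) <= 18

print(fits_numeric_18_0('123.45'))   # fits
print(fits_numeric_18_0('9' * 19))   # 19 integer digits: overflows numeric(18,0)
print(fits_numeric_18_0('85-59'))    # fails the pattern entirely
```

Pulling the raw oldcol values down and running them through a check like this would tell you whether any "valid-looking" strings break the default precision.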

The reason you may not be seeing the error in the original query is that it produces a lot of output that is buffered (or LIMITed) and the query produces results before it gets to the problem value. Try sending the output of the original query (without a TOP or LIMIT) to a temp table and see if it completes without error as a test. You could also just remove the where clause and see if all of the inner query's data is processable.
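A sketch of that temp-table test, reusing the connection from the question (the table name cast_check is my own invention):

```python
# Diagnostic sketch: materializing the inner query into a temp table forces
# every row to be cast, so any hidden bad value must surface. If this errors,
# the data (not the optimizer) is the problem.
q_check = '''
create temp table cast_check as
select cast(oldcol as numeric) as newcol
from tablename
where oldcol similar to '[0-9]+(.[0-9]+)?'
'''
# With the question's pyodbc connection:
# cursor = conn.cursor()
# cursor.execute(q_check)
```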

This is just a guess, and it may well be that the problem you are seeing is a data expectation issue and not an RS bug. I have seen RS bugs in the past (often involving the optimizer), so I'll never rule that possibility out.

================

After some back and forth in the comments it looks like Redshift may be advancing a WHERE clause inappropriately. Here's how that could work.

Redshift gains a huge performance boost by applying WHERE clauses against table metadata. The optimizer will almost certainly flatten this set of 2 SELECTs to gain performance. It may try to take the outer WHERE clause, "where newcol > 10 and newcol < 150", and apply it to the source data table's metadata. Reducing the rows read from disk could be a great help. It would do this by (effectively) changing the WHERE to "where cast(oldcol as numeric) > 10 and cast(oldcol as numeric) < 150". Perfectly valid, right? But no: in the original query the cast applies only to rows that pass the inner filter, while the flattened predicate applies it to every row.
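A toy, pure-Python illustration of that ordering hazard (the sample rows are made up, except '85-59', which appears in the error's context field):

```python
import re
from decimal import Decimal, InvalidOperation

# Made-up sample rows; '85-59' mirrors "context: 85-59" in the error text.
rows = ['12', '85-59', '42.5']
pattern = re.compile(r'[0-9]+(\.[0-9]+)?')

# Filter-then-cast, like the nested query as written:
# the bad row never reaches the cast.
safe = [Decimal(r) for r in rows if pattern.fullmatch(r)]

# Cast-then-compare, like the flattened predicate:
# the bad row is cast first and blows up.
try:
    flattened = [Decimal(r) for r in rows if 10 < Decimal(r) < 150]
except InvalidOperation:
    print("cast failed on a row the inner filter would have removed")
```

This is only an analogy for the proposed optimizer behavior, not how Redshift executes the plan internally.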

So how do you fix it / prove it?

Your query structure is basically a CASE statement, but implemented with a WHERE clause. Your WHERE clause is "changing" how the data should be processed. The Redshift optimizer likely isn't thinking of it this way and is getting crossed up (if this theory holds). If we add the data processing information into the data processing part of the query, this should help Redshift see that it cannot flatten the queries the way it wants to.

Here's the experiment I'd try:

select *
from
(
    select 
    decode(oldcol similar to '[0-9]+(.[0-9]+)?', true, cast(oldcol as numeric)) as newcol
    from tablename
    where oldcol similar to '[0-9]+(.[0-9]+)?' 
)
where newcol > 10 and newcol < 150

(decode() is just a shortened/simpler CASE syntax)

Now the optimizer will need to bring the decode() into the flattened WHERE clause and "protect" the query from the data error. This structure will make the inner WHERE clause unneeded but let's not change too many things at once.

IF THIS WORKS then we likely have a good handle on what is going on. There's no guarantee that this theory is correct in this case, as changing the query may have just "moved" the issue, but it would be "likely correct". This info will also help in creating a simple test case to submit to AWS.
