Find non-ASCII characters in varchar columns using SQL Server
How can rows with non-ASCII characters be returned using SQL Server?
It would be great if you could show how to do it for one column.
I am doing something like this now, but it is not working:
select *
from Staging.APARMRE1 as ar
where ar.Line like '%[^!-~ ]%'
For extra credit, if it can span all varchar columns in a table, that would be outstanding! In this solution, it would be nice to return three columns:
- The identity field for that record. (This will allow the whole record to be reviewed with another query.)
- The column name
- The text with the invalid character
Id | FieldName | InvalidText |
----+-----------+-------------------+
25 | LastName | Solís |
56 | FirstName | François |
100 | Address1 | 123 Ümlaut street |
Invalid characters would be any outside the range of SPACE (decimal 32) through ~ (decimal 126).
Answers (9)
Here is a solution for the single column search using PATINDEX.
It also displays the StartPosition, InvalidCharacter and ASCII code.
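The query itself is missing from this copy of the answer. A minimal sketch of what a PATINDEX-based single-column search could look like, using the table and column from the question; the column aliases and the CROSS APPLY layout are assumptions, not necessarily the original code. The binary collation matters: under a non-binary collation a range such as ` -~` is evaluated by sort order rather than by character code, so accented letters can fall inside it, which is one likely reason the LIKE in the question finds nothing.

```sql
-- Sketch only: Staging.APARMRE1 and Line come from the question;
-- the aliases StartPosition / InvalidCharacter / AsciiCode are assumed names.
SELECT  ar.Line,
        p.StartPosition,
        SUBSTRING(ar.Line, p.StartPosition, 1)        AS InvalidCharacter,
        ASCII(SUBSTRING(ar.Line, p.StartPosition, 1)) AS AsciiCode
FROM    Staging.APARMRE1 AS ar
CROSS APPLY (
        -- Position of the first character outside SPACE..~ (codes 32..126).
        -- The binary collation makes the range compare by character code.
        SELECT PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, ar.Line) AS StartPosition
) AS p
WHERE   p.StartPosition > 0;   -- keep only rows that contain such a character
```

Note that this also flags control characters such as tabs and line breaks, since they sit below code 32.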
I've been running this bit of code with success, and it works well for known columns.
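The code for this step is not preserved here. One common way to test a known nvarchar column is to compare the value with itself after a round trip through varchar; the sketch below assumes that technique, and the table variable and sample rows are made up for illustration. Be aware that it flags characters that do not survive the database's code page, which is not the same thing as non-ASCII: on a code page 1252 database, for example, é converts cleanly and is not reported.

```sql
-- Hypothetical sample data; replace with your own table and column.
DECLARE @UnicodeData TABLE
(
    Id   int IDENTITY(1,1) PRIMARY KEY,
    Data nvarchar(500)
);

INSERT INTO @UnicodeData (Data)
VALUES (N'Horse'), (N'日本語'), (N'Plain text 123');

-- Characters with no varchar equivalent are replaced during the cast,
-- so the converted value no longer equals the original.
SELECT Id, Data
FROM   @UnicodeData
WHERE  Data <> CAST(Data AS varchar(500));
```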
For extra credit, I wrote this quick script to search all nvarchar columns in a given table for Unicode characters.
I'm not a fan of dynamic SQL but it does have its uses for exploratory queries like this.
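The script is missing from this copy. A rough sketch of how such an exploratory dynamic query could be generated follows; it is not the original script. The schema, table and key column names are hypothetical, STRING_AGG assumes SQL Server 2017 or later, and the per-column test is the nvarchar-to-varchar comparison, which depends on the database code page.

```sql
-- Hypothetical names: adjust schema, table and key column to your database.
DECLARE @schema sysname = N'dbo',
        @table  sysname = N'Customer',
        @key    sysname = N'Id';
DECLARE @sql nvarchar(max);

-- Build one SELECT per nvarchar column and glue them together with UNION ALL.
SELECT @sql = STRING_AGG(
           CAST(N'SELECT ' + QUOTENAME(@key) + N' AS Id, '
              + N'N' + QUOTENAME(c.name, '''') + N' AS FieldName, '
              + N'CAST(' + QUOTENAME(c.name) + N' AS nvarchar(max)) AS InvalidText '
              + N'FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N' '
              + N'WHERE ' + QUOTENAME(c.name)
              + N' <> CAST(' + QUOTENAME(c.name) + N' AS varchar(max))'
              AS nvarchar(max)),
           N' UNION ALL ')
FROM   sys.columns AS c
WHERE  c.object_id = OBJECT_ID(QUOTENAME(@schema) + N'.' + QUOTENAME(@table))
  AND  TYPE_NAME(c.user_type_id) = N'nvarchar';

PRINT @sql;                        -- inspect the generated statement first

IF @sql IS NOT NULL                -- NULL means no nvarchar columns were found
    EXEC sys.sp_executesql @sql;
```

The result has the Id, FieldName and InvalidText columns asked for in the question, one row per offending column value.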
Try something like this:
OUTPUT:
This script searches for non-ASCII characters in one column. It generates a string of all valid characters, here code points 32 to 127. Then it searches for rows that contain a character not in that list:
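The script did not survive extraction. A sketch of the approach as described could look like the following; the `|` escape character, the variable names and the explicit binary collation are assumptions. Every character is escaped because the list contains `]`, `-` and `^`, which are special inside a LIKE bracket set.

```sql
DECLARE @valid varchar(300) = '',
        @i     int          = 32;

-- Build the list of allowed characters (codes 32..127 as in the description),
-- each prefixed with the escape character so it is treated literally.
WHILE @i <= 127
BEGIN
    SET @valid = @valid + '|' + CHAR(@i);
    SET @i = @i + 1;
END;

-- Rows with at least one character that is NOT in the allowed list.
-- The binary collation keeps accent-insensitive collations from treating
-- an accented letter as equal to its unaccented ASCII counterpart.
SELECT ar.*
FROM   Staging.APARMRE1 AS ar
WHERE  ar.Line COLLATE Latin1_General_BIN LIKE '%[^' + @valid + ']%' ESCAPE '|';
```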
I ran the various solutions on some real-world data: 12M rows, varchar length ~30, around 9k dodgy rows, no full-text index in play. The PATINDEX solution is the fastest, and it also selects the most rows.
(Pre-ran km. to set the cache to a known state, ran the 3 processes, and finally ran km. again; the last 2 runs of km. gave times within 2 seconds of each other.)
PATINDEX solution by Gerhard Weiss -- runtime 0:38, returned 9144 rows
substring-numbers solution by MT. -- runtime 1:16, returned 8996 rows
UDF solution by Deon Robertson -- runtime 3:47, returned 7316 rows
There is a user-defined function available on the web called 'Parse Alphanumeric'. Google "UDF parse alphanumeric" and you should find the code for it. This user-defined function removes all characters that don't fall within 0-9, a-z, and A-Z.
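The function and the query that used it are not included in this copy. As a stand-in, here is a small sketch of a function with the behaviour described; `dbo.ufn_KeepAlphanumeric` is a hypothetical name and this is not the 'Parse Alphanumeric' code from the web. The `last_name` column is the one the answer mentions and is assumed to exist.

```sql
CREATE FUNCTION dbo.ufn_KeepAlphanumeric (@input varchar(8000))
RETURNS varchar(8000)
AS
BEGIN
    -- Position of the first character that is not 0-9, A-Z or a-z.
    DECLARE @pos int = PATINDEX('%[^0-9A-Za-z]%' COLLATE Latin1_General_BIN, @input);

    -- Remove offending characters one at a time until none are left.
    WHILE @pos > 0
    BEGIN
        SET @input = STUFF(@input, @pos, 1, '');
        SET @pos   = PATINDEX('%[^0-9A-Za-z]%' COLLATE Latin1_General_BIN, @input);
    END;

    RETURN @input;   -- NULL in, NULL out
END;
GO

-- Rows whose value changes when everything but letters and digits is removed.
SELECT ar.*
FROM   Staging.APARMRE1 AS ar
WHERE  dbo.ufn_KeepAlphanumeric(ar.last_name) <> ar.last_name;
```

Because spaces and punctuation are stripped too, values such as "123 Main St." are reported as well, so this is stricter than the SPACE-to-~ range in the question.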
That should bring back any records that have a last_name with invalid chars for you. Your bonus points question is a bit more of a challenge, but I think a case statement could handle it. This is a bit of pseudocode; I'm not entirely sure it would work.
I wrote this in the forum post box, so I'm not quite sure it will function as is, but it should be close. I'm also not quite sure how it will behave if a single record has two fields with invalid chars.
As an alternative, you should be able to change the from clause away from a single table and into a subquery that looks something like:
The benefit here is that for every column you only need to extend the union statement, whereas in the case statement version of this script you need to repeat the comparison three times for every column.
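The subquery is not shown in this copy. A sketch of the union idea, assuming a hypothetical `dbo.Customer` table with an `Id` key and the three columns from the question's sample output. To keep the example self-contained, the filter is a binary-collation LIKE rather than the UDF comparison the answer describes.

```sql
-- Each branch turns one column into (Id, FieldName, InvalidText) rows;
-- supporting another column means adding one more UNION ALL branch.
SELECT  src.Id, src.FieldName, src.InvalidText
FROM (
        SELECT Id, 'FirstName' AS FieldName, FirstName AS InvalidText FROM dbo.Customer
        UNION ALL
        SELECT Id, 'LastName',  LastName  FROM dbo.Customer
        UNION ALL
        SELECT Id, 'Address1',  Address1  FROM dbo.Customer
     ) AS src
-- A single filter covers every column: any character outside SPACE..~.
WHERE   src.InvalidText LIKE '%[^ -~]%' COLLATE Latin1_General_BIN;
```

A record with invalid characters in two fields simply produces two rows, one per field.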
To find which field has invalid characters:
You can test it with this query:
The result will be:
It is very useful when you write XML files and get invalid character errors when validating them.
Here is a UDF I built to detect columns with extended ASCII characters. It is quick, and you can extend the character set you want to check. The second parameter lets you switch between flagging anything outside the standard character set and allowing an extended set:
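The function body is missing here. A sketch of a two-parameter check along the lines described; the name `dbo.ufn_HasUnexpectedChars`, the parameter names and the particular extended characters allowed are all assumptions chosen for illustration, not the author's code.

```sql
CREATE FUNCTION dbo.ufn_HasUnexpectedChars
(
    @input         nvarchar(4000),
    @allowExtended bit            -- 0 = standard set only, 1 = also allow the extra letters
)
RETURNS bit
AS
BEGIN
    -- Standard set: SPACE..~ (codes 32..126).
    -- Extended set: the same range plus a hand-picked list of accented letters;
    -- add further characters inside the brackets as needed.
    DECLARE @pattern nvarchar(100) =
        CASE WHEN @allowExtended = 1
             THEN N'%[^ -~áéíóúñüç]%'
             ELSE N'%[^ -~]%'
        END;

    -- 1 when at least one character falls outside the chosen set, otherwise 0
    -- (a NULL input also yields 0).
    RETURN CASE
               WHEN PATINDEX(@pattern COLLATE Latin1_General_BIN, @input) > 0 THEN 1
               ELSE 0
           END;
END;
GO

-- Example call comparing the strict and the extended check on one value.
SELECT dbo.ufn_HasUnexpectedChars(N'François', 0) AS StrictCheck,
       dbo.ufn_HasUnexpectedChars(N'François', 1) AS ExtendedCheck;
```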
USAGE:
I took Gerhard Weiss's script and made it my own as my business scenario was pretty simple and only needed to update 10K rows.
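The adapted script is not preserved. Assuming the goal was to strip the offending characters from the column, a PATINDEX-driven update could be sketched like this; the table and column come from the question, and the loop structure is an assumption.

```sql
-- Each pass removes the first invalid character from every row that still
-- has one; the loop ends when a pass updates no rows.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    UPDATE ar
    SET    Line = STUFF(Line,
                        PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, Line),
                        1,
                        '')
    FROM   Staging.APARMRE1 AS ar
    WHERE  PATINDEX('%[^ -~]%' COLLATE Latin1_General_BIN, Line) > 0;

    SET @rows = @@ROWCOUNT;
END;
```

Tabs and line breaks fall outside the SPACE-to-~ range as well, so they are removed too. Run the matching SELECT first to review what would change.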