How to strip all non-alphabetic characters from a string in SQL Server?
How could you remove all characters that are not alphabetic from a string?
What about non-alphanumeric?
Does this have to be a custom function or are there also more generalizable solutions?
Try this function:
Call it like this:
Once you understand the code, you should see that it is relatively simple to change it to remove other characters, too. You could even make it dynamic enough to pass in your search pattern.
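A minimal sketch of such a function, assuming the PATINDEX-in-a-loop technique and the name RemoveNonAlphaCharacters that later answers in this thread refer to. The parameter size of 1000 and the sample input are assumptions for illustration only.

```sql
-- Sketch only: repeatedly find the first character that is not a letter
-- and cut it out with STUFF until none remain.
CREATE FUNCTION dbo.RemoveNonAlphaCharacters (@Temp varchar(1000))
RETURNS varchar(1000)
AS
BEGIN
    -- [^a-z] matches any single character outside the range a-z.
    -- Under a case-insensitive collation the range also covers A-Z.
    DECLARE @KeepValues varchar(50) = '%[^a-z]%';

    WHILE PATINDEX(@KeepValues, @Temp) > 0
        SET @Temp = STUFF(@Temp, PATINDEX(@KeepValues, @Temp), 1, '');

    RETURN @Temp;
END
GO

-- Hypothetical sample input: the digits are removed, leaving abcdefghijkl.
SELECT dbo.RemoveNonAlphaCharacters('abc1234def5678ghi90jkl') AS letters_only;
```

Changing the character class inside @KeepValues (for example to '%[^a-z0-9]%') is the only edit needed to keep a different set of characters.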
Parameterized version of George Mastros' awesome answer:
Alphabetic only: pass '^a-z' as the match expression.
Numeric only: pass '^0-9' as the match expression.
Alphanumeric only: pass '^a-z0-9' as the match expression.
Non-alphanumeric: pass 'a-z0-9' as the match expression.
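A sketch of what a parameterized version could look like, assuming the same PATINDEX/STUFF loop with the character class passed in by the caller. The function name dbo.StripCharacters, the parameter names and sizes, and the sample input are hypothetical; only the four match expressions come from the answer above.

```sql
-- Sketch only: the caller passes just the inside of the character class,
-- e.g. '^a-z' (remove everything except letters) or 'a-z0-9' (remove letters and digits).
CREATE FUNCTION dbo.StripCharacters
(
    @String          nvarchar(max),
    @MatchExpression varchar(255)
)
RETURNS nvarchar(max)
AS
BEGIN
    -- Wrap the class so PATINDEX can find one matching character anywhere.
    SET @MatchExpression = '%[' + @MatchExpression + ']%';

    WHILE PATINDEX(@MatchExpression, @String) > 0
        SET @String = STUFF(@String, PATINDEX(@MatchExpression, @String), 1, '');

    RETURN @String;
END
GO

SELECT dbo.StripCharacters(N'a1!s2@d3#f4$', '^a-z')    AS alphabetic_only,    -- asdf
       dbo.StripCharacters(N'a1!s2@d3#f4$', '^0-9')    AS numeric_only,       -- 1234
       dbo.StripCharacters(N'a1!s2@d3#f4$', '^a-z0-9') AS alphanumeric_only,  -- a1s2d3f4
       dbo.StripCharacters(N'a1!s2@d3#f4$', 'a-z0-9')  AS non_alphanumeric;   -- !@#$
```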
Believe it or not, on my system this ugly function performs better than G Mastros' elegant one.
I knew that SQL was bad at string manipulation, but I didn't think it would be this difficult. Here's a simple function to strip out all the numbers from a string. There are probably better ways to do this, but it's a start.
Output
Round 2 - Data-Driven Blacklist
Output
My challenge to readers: Can you make this more efficient? What about using recursion?
Original answer
Another possible option for SQL Server 2017+, without loops and/or recursion, is a string-based approach using TRANSLATE() and REPLACE().
T-SQL statement:
or as a function:
Update
Thanks to @ttugates's comment, I've found a small issue in this approach. The actual edge case is 'V 0 S' as an input. The LEN() function ignores trailing spaces, which results in the error message "The second and third arguments of the TRANSLATE built-in function must contain an equal number of characters". The solution is a small change in the length calculation, but the approach remains the same:
Each valid character is translated to 'a' and after that replaced with ''.
Each remaining (invalid) character is translated to '0' and after that replaced with ''.
The length is calculated using LEN(REPLACE(@text, ' ', '.')).
Updated function:
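The steps described above can be sketched as a short script. This is an illustration under assumptions, not the answer's original code: the variable names and sizes are hypothetical, and the sample input is the 'V 0 S' edge case mentioned in the update. It requires SQL Server 2017 or later for TRANSLATE().

```sql
-- Sketch only (SQL Server 2017+). Variable names are illustrative.
DECLARE @text  varchar(100) = 'V 0 S';
DECLARE @valid varchar(52)  = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

-- Step 1: turn every valid character into 'a', then delete the 'a's.
-- What remains is the list of invalid characters found in @text (here ' 0 ').
DECLARE @invalid varchar(100) =
    REPLACE(TRANSLATE(@text, @valid, REPLICATE('a', LEN(@valid))), 'a', '');

-- Step 2: turn every invalid character into '0', then delete the '0's.
-- LEN(REPLACE(@invalid, ' ', '.')) counts trailing spaces, which plain LEN() would skip,
-- so the second and third TRANSLATE arguments always have the same length.
SELECT REPLACE(
           TRANSLATE(@text, @invalid, REPLICATE('0', LEN(REPLACE(@invalid, ' ', '.')))),
           '0', '') AS letters_only;   -- VS
```

Because the digit '0' is itself outside the valid set, using it as the throwaway character does not remove anything that should have been kept.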
Here's a solution that doesn't require creating a function or listing all instances of characters to replace. It uses a recursive WITH statement in combination with PATINDEX to find unwanted characters. It will replace all unwanted characters in a column, up to 100 unique bad characters in any given string (e.g. "ABC123DEF234" contains 4 bad characters: 1, 2, 3 and 4). The limit of 100 is the default maximum recursion depth of a WITH statement (it can be raised with OPTION (MAXRECURSION n)). It does not limit the number of rows to process, which is only limited by the memory available.
If you don't want DISTINCT results, you can remove the two DISTINCT options from the code.
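A minimal sketch of the recursive WITH plus PATINDEX idea, under stated assumptions: the sample rows, the column name txt, the variable name and the varchar(250) size are all hypothetical, and DISTINCT is left out for brevity.

```sql
-- Sketch only: each recursion level removes every occurrence of one bad character.
DECLARE @BadCharacterPattern varchar(50) = '%[^A-Za-z]%';

WITH src AS (
    SELECT txt
    FROM (VALUES ('ABC DEF,K.l(p)'), ('123H,J,234'), ('ABCD EFG')) AS t(txt)
),
cleaned AS (
    -- Anchor: every source string plus the position of its first bad character.
    SELECT CAST(txt AS varchar(250)) AS txt,
           PATINDEX(@BadCharacterPattern, txt) AS bad_pos
    FROM src
    UNION ALL
    -- Recursive step: drop that character everywhere in the string, then look again.
    SELECT CAST(REPLACE(txt, SUBSTRING(txt, bad_pos, 1), '') AS varchar(250)),
           PATINDEX(@BadCharacterPattern, REPLACE(txt, SUBSTRING(txt, bad_pos, 1), ''))
    FROM cleaned
    WHERE bad_pos > 0
)
SELECT txt
FROM cleaned
WHERE bad_pos = 0;   -- keep only the fully cleaned strings
```

For the three sample rows this yields ABCDEFKlp, HJ and ABCDEFG (row order is not guaranteed). A string with more distinct bad characters than the recursion limit would need OPTION (MAXRECURSION n) appended to the final SELECT.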
Having looked at all the given solutions, I thought there had to be a pure SQL method that does not require a function or a CTE / XML query, and doesn't involve difficult-to-maintain nested REPLACE statements. Here is my solution:
The advantage of doing it this way is that the valid characters are contained in a single string in the subquery, making it easy to reconfigure for a different set of characters.
The downside is that you have to add a row of SQL for each character, up to the size of your column. To make that task easier I used the PowerShell script below; this example is for a VARCHAR(64):
If you are like me and can't just add functions to your production database but still want to perform this kind of filtering, here's a pure SQL solution that uses a PIVOT table to put the filtered pieces back together again.
N.B. I hardcoded the table up to 40 characters; you'll have to add more if you have longer strings to filter.
Here is another way to remove non-alphabetic characters, using an iTVF (inline table-valued function). First, you need a pattern-based string splitter. Here is one taken from Dwain Camp's article:
Now that you have a pattern-based splitter, you need to split the strings that match the pattern:
and then concatenate them back to get the desired result:
Sample:
Result:
This solution, inspired by Mr. Allen's solution, requires a Numbers table of integers (which you should have on hand if you want to do serious query operations with good performance). It does not require a CTE. You can change the NOT IN (...) expression to exclude specific characters, or change it to an IN (...) or LIKE expression to retain only certain characters.
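A sketch of the Numbers-table idea, with several assumptions labeled: a table variable @Numbers stands in for a permanent Numbers table, the sample input is hypothetical, and STRING_AGG (SQL Server 2017+) is used here to glue the kept characters back together, which is not necessarily how the original answer did it.

```sql
-- Sketch only: @Numbers is a stand-in for a permanent Numbers table.
DECLARE @Numbers TABLE (n int PRIMARY KEY);

INSERT INTO @Numbers (n)
SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;   -- any row source with at least 100 rows would do

DECLARE @s varchar(100) = 'abc1234def5678ghi90jkl';

-- One row per character position; keep the characters that are not in the
-- exclusion list and concatenate them in their original order.
SELECT STRING_AGG(SUBSTRING(@s, num.n, 1), '') WITHIN GROUP (ORDER BY num.n) AS cleaned
FROM @Numbers AS num
WHERE num.n <= LEN(@s)
  AND SUBSTRING(@s, num.n, 1) NOT IN ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9');
```

For the sample input this returns abcdefghijkl. Swapping the NOT IN (...) list for a LIKE '[a-z]' test turns the exclusion list into a keep list.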
From a performance perspective, I'd use an inline function:
Here's another recursive CTE solution, based on @Gerhard Weiss's answer here. You should be able to copy and paste the whole code block into SSMS and play with it there. The results include a few extra columns to help us understand what's going on. It took me a while to understand everything that's going on with both PATINDEX (and its regex-like patterns) and the recursive CTE.
For SQL Server >= 2017...
I put this in both places where PatIndex is called in the custom function RemoveNonAlphaCharacters above, and renamed the function RemoveNonAlphaNumericCharacters.
First, create a function:
Now call the function like this:
The result looks like this:
Using a CTE-generated numbers table to examine each character, then FOR XML to concatenate the kept values into a string, you can...
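One way this could look, as a sketch rather than the answer's original code. The CTE names, the 1 to 1000 range, the sample input and the '[a-z]' keep rule are assumptions for illustration.

```sql
-- Sketch only: build 1..1000 in a CTE, test each character, and concatenate
-- the kept characters with FOR XML PATH('').
DECLARE @s varchar(100) = 'abc1234def5678ghi90jkl';

WITH digits AS (
    SELECT d FROM (VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)) AS v(d)
),
numbers AS (
    -- Three crossed copies of 0..9 give the integers 1..1000.
    SELECT a.d + 10 * b.d + 100 * c.d + 1 AS n
    FROM digits AS a
    CROSS JOIN digits AS b
    CROSS JOIN digits AS c
)
SELECT (
    SELECT SUBSTRING(@s, n, 1)
    FROM numbers
    WHERE n <= LEN(@s)
      AND SUBSTRING(@s, n, 1) LIKE '[a-z]'
    ORDER BY n
    FOR XML PATH(''), TYPE
).value('.', 'varchar(100)') AS cleaned;
```

For the sample input this returns abcdefghijkl. The TYPE directive plus .value() undoes XML entity escaping, which matters if the keep rule ever lets characters such as & or < through.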
This way didn't work for me because I was trying to keep the Arabic letters. I tried changing the regular expression, but that didn't work either. I wrote another method that works at the ASCII level, as it was my only choice, and it worked.
Though this post is a bit old, I would like to add the following.
The issue I had with the above solution is that it does not filter out characters like ç, ë, ï, etc. I adapted the function as follows (I only used an 80-character varchar string to save memory):
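The accented letters survive because a range such as a-z is evaluated in the collation's sort order, where ç, ë and ï fall between a and z. One way to exclude them, sketched here as an alternative and not necessarily what the adapted function did, is to force a binary collation so the range is compared by code point. The sample text and variable names are hypothetical.

```sql
-- Sketch only: under a binary collation, [a-zA-Z] covers just the 52 unaccented letters.
DECLARE @s       nvarchar(80) = N'Zoë & François 2024';
DECLARE @pattern nvarchar(20) = N'%[^a-zA-Z]%';

WHILE PATINDEX(@pattern, @s COLLATE Latin1_General_BIN) > 0
    SET @s = STUFF(@s, PATINDEX(@pattern, @s COLLATE Latin1_General_BIN), 1, N'');

SELECT @s AS unaccented_letters_only;   -- ZoFranois
```

Note that with a binary collation the pattern is case-sensitive, so both a-z and A-Z have to be listed.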
I just found this built into Oracle 10g, if that is what you're using. I had to strip all the special characters out for a phone number comparison.
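The built-in is not named above; Oracle 10g introduced REGEXP_REPLACE, which may be the one meant. A minimal sketch under that assumption, in Oracle SQL rather than T-SQL, with a made-up phone number:

```sql
-- Oracle sketch only: remove every character that is not a digit.
SELECT REGEXP_REPLACE('(555) 123-4567', '[^0-9]', '') AS digits_only   -- 5551234567
FROM dual;
```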