Masking an SSN with a regex in a script (bash / perl / python)
I'm trying to write a small script (preferably in bash, but python or perl would also work) to mask the first 5 digits of an SSN (in either format, 123456789 or 123-45-6789, so it will output XXXXX6789 or XXX-XX-6789 respectively). The input is in a text file.
I know I should be able to do this with sed, but I'm having trouble with creating the right regex (and then I have to do the substitution). It should properly handle all these use cases:
123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
So the SSN can occur at the beginning of a line, in the middle somewhere, or at the end.
The output (for the first two lines, for example) should have the first 5 digits masked, say with Xs:
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
I've managed to get a grep regex that correctly matches only the expressions I want:
grep '\b[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}\b' testfile
I think I should be able to use grouping in sed or awk to get the results I want, but none of the things I've tried have worked.
Using `sed`:
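A minimal sketch of what a `sed` substitution along these lines might look like, built directly on the grep regex from the question. It assumes GNU sed (for `\b`), and the function name `mask_ssn_bre` is made up for illustration. Only the optional dashes and the last four digits are captured, since those are the only parts that reappear in the output.

```shell
# mask_ssn_bre: hypothetical helper; reads text on stdin, writes masked text.
# Assumes GNU sed, which understands \b (word boundary) in basic regexes.
mask_ssn_bre() {
    # \1 and \2 are the optional dashes, \3 is the last four digits
    sed 's/\b[0-9]\{3\}\(-\{0,1\}\)[0-9]\{2\}\(-\{0,1\}\)\([0-9]\{4\}\)\b/XXX\1XX\2\3/g'
}

printf '%s\n' 'Mask this 123-45-6789 SSN please' 'But not 1234567890' | mask_ssn_bre
# Mask this XXX-XX-6789 SSN please
# But not 1234567890
```

To run it against the file from the question, something like `mask_ssn_bre < testfile` should work.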
With GNU awk for the 3rd arg to `match()`, and `gensub()`, and `\<` `\>` word boundaries:
You only need to capture the things that will end up in the output: the (possible) dashes and the last four digits. And, Perl's regex syntax eliminates unnecessary backslashes, which is nice.
(Specifically, in perl regex, "magic" functions are always attached to punctuation without backslashes, or alphanumerics with backslashes; backslashing punctuation will always make it non-special.)
Assuming the first 8 lines should have a mask applied (leaving the last 3 lines untouched), and modifying the input file to include dual matching SSN patterns in the first 2 lines, one `sed` idea uses a modified version of OP's regex, where:
`-r` - enable extended regex support (eliminates need to escape parens and braces)
`([0-9]{3})` - match 3 digits (1st capture group)
`(-{0,1})` - match optional `-` (2nd capture group)
`([0-9]{2})` - match 2 digits (3rd capture group)
`(-{0,1}[0-9]{4})` - match optional `-` + 4 digits (4th capture group)
`XXX\2XX\4` - replace 1st capture group with `XXX`, print 2nd capture group as is, replace 3rd capture group with `XX`, print 4th capture group as is
`g` - apply to all matches in a line
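Put together, the pieces listed above might assemble into something like the following sketch. The surrounding `\b` anchors are an assumption carried over from OP's grep regex (without them, the 10-digit lines would be masked too); GNU sed is assumed, and `mask_ssn_ere` and the sample line are made up for illustration.

```shell
# mask_ssn_ere: hypothetical helper; assumes GNU sed (-r and \b).
mask_ssn_ere() {
    # groups: 1 = first 3 digits, 2 = optional dash, 3 = next 2 digits,
    #         4 = optional dash + last 4 digits
    sed -r 's/\b([0-9]{3})(-{0,1})([0-9]{2})(-{0,1}[0-9]{4})\b/XXX\2XX\4/g'
}

# One line carrying two SSNs, to exercise the g flag
printf '%s\n' '123456789 and 123-45-6789 on one line' | mask_ssn_ere
# XXXXX6789 and XXX-XX-6789 on one line
```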
Grep invert match regex (fixed):
Grep options:
`-v`: Inverts match (prints everything without a match).
`-E`: Uses the Extended regex grammar for the pattern.
Regex detail:
`([^0-9]|^)`: Matches a non-digit or beginning of line.
`[0-9]{3}-?`: Matches 3 digits optionally followed by a dash.
`[0-9]{2}-?`: Matches 2 digits optionally followed by a dash.
`[0-9]{4}`: Matches 4 digits.
`([^0-9]|$)`: Matches a non-digit or end of line.
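Joining the regex pieces described above in order gives a command along the lines of this sketch. The function name `non_ssn_lines` and the sample input are assumptions for illustration. Because of `-v`, it prints only the lines that do not contain a 9-digit SSN; it does not mask anything.

```shell
# non_ssn_lines: hypothetical helper; prints the lines that contain no SSN.
non_ssn_lines() {
    grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)'
}

printf '%s\n' '123456789 should match.' 'And 123-45-6789' 'But not 1234567890' | non_ssn_lines
# But not 1234567890
```

Using `([^0-9]|^)` and `([^0-9]|$)` instead of `\b` keeps the pattern within plain POSIX extended regex syntax.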