Masking SSNs with a regex in a script (bash / perl / python)

Posted on 2025-02-08 08:34:46

I'm trying to write a small script (preferably in bash, but python or perl would also work) to mask the first 5 digits of an SSN (in either of the formats 123456789 or 123-45-6789, so it will output XXXXX6789 or XXX-XX-6789 respectively). The input is in a text file.

I know I should be able to do this with sed, but I'm having trouble with creating the right regex (and then I have to do the substitution). It should properly handle all these use cases:

123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

So the SSN can occur at the beginning of a line, in the middle somewhere, or at the end.

The output (for the first two lines, for example) should have the first 5 digits masked, say with Xs:

XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.

I've managed to get a grep regex that correctly matches only the expressions I want:

grep '\b[0-9]\{3\}-\{0,1\}[0-9]\{2\}-\{0,1\}[0-9]\{4\}\b' testfile

I think I should be able to use grouping in sed or awk to get the results I want, but none of the things I've tried have worked.

南薇 2025-02-15 08:34:46

Using sed

$ sed '/\<[0-9]\{9\}\>\|\<[0-9-]\{11\}\>/{s/[0-9]\{5\}/XXXXX/;s/[0-9]\{3\}-[0-9]\{2\}/XXX-XX/g}' input_file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
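
For readers who would rather use Python (which the question allows), the select-then-substitute structure of the sed command above can be sketched roughly as follows. This is only an approximation under stated assumptions: `\b` stands in for GNU sed's `\<` and `\>`, and the names `HAS_SSN`, `FIVE_DIGITS`, `DASHED_PREFIX` and `mask_line` are made up for the illustration.

```python
import re

# Rough Python counterparts of the sed address and the two s commands.
# Assumption: \b approximates GNU sed's \< and \> word-boundary operators.
HAS_SSN = re.compile(r"\b[0-9]{9}\b|\b[0-9-]{11}\b")
FIVE_DIGITS = re.compile(r"[0-9]{5}")
DASHED_PREFIX = re.compile(r"[0-9]{3}-[0-9]{2}")


def mask_line(line):
    """Mask one line the way the sed command does: select, then substitute."""
    if not HAS_SSN.search(line):
        return line  # the address did not match, so the line passes through
    line = FIVE_DIGITS.sub("XXXXX", line, count=1)  # s/.../XXXXX/ (no g flag)
    return DASHED_PREFIX.sub("XXX-XX", line)        # s/.../XXX-XX/g


if __name__ == "__main__":
    for sample in (
        "123456789 needs to be matched.",
        "Mask this 123-45-6789 SSN please",
        "But not 1234567890",
        "1234567890 and 123456789",
    ):
        print(mask_line(sample))
```

One thing the sketch makes visible: the substitutions run over the whole selected line rather than over the text the address matched. On the last sample line the 10-digit number comes first, so it is the one that gets its first five digits replaced, while the real SSN after it is left alone. The sample lines in the question never mix the two on one line, which is why the output above looks correct.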

脱离于你 2025-02-15 08:34:46

With GNU awk for the 3rd arg to match() and gensub() and \< \> word boundaries:

$ awk '
    match($0,/(.*)(\<[0-9]{3}-?[0-9]{2})(-?[0-9]{4}\>.*)/,a) {
        $0 = a[1] gensub(/[0-9]/,"X","g",a[2]) a[3]
    }
1' file
XXXXX6789 needs to be matched.
XXX-XX-6789 does, too.
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
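
The same idea (capture the part to hide, then turn each of its digits into an X) might look like this in Python. It is a sketch rather than a translation of the awk script: `\b` is assumed to be close enough to gawk's `\<` and `\>`, and `SSN` and `mask_ssns` are names invented here.

```python
import re

# Group 1: the five leading digits, with an optional dash after the third.
# Group 2: the optional dash plus the last four digits, kept unchanged.
SSN = re.compile(r"\b([0-9]{3}-?[0-9]{2})(-?[0-9]{4})\b")


def mask_ssns(line):
    """Replace every digit in group 1 with X; dashes and group 2 are kept."""
    def _mask(match):
        # Same role as gensub(/[0-9]/, "X", "g", a[2]) in the awk script.
        return re.sub(r"[0-9]", "X", match.group(1)) + match.group(2)

    return SSN.sub(_mask, line)


if __name__ == "__main__":
    for sample in (
        "123456789 needs to be matched.",
        "And 123-45-6789",
        "And 1234567890 is right out.",
    ):
        print(mask_ssns(sample))
```

Because `re.sub` visits every non-overlapping match, this variant masks all SSNs on a line, whereas a single `match()` call in awk reports one match per line.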

梅倚清风 2025-02-15 08:34:46

perl -lpe 's/\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b/XXX${1}XX$2$3/g'

You only need to capture the things that will end up in the output: the (possible) dashes and the last four digits. And, Perl's regex syntax eliminates unnecessary backslashes, which is nice.

(Specifically, in Perl regexes, "magic" functions are always attached to punctuation without backslashes or to alphanumerics with backslashes; backslashing punctuation always makes it non-special.)
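
Since the question also accepts Python, here is a minimal sketch of the same substitution with the `re` module. The pattern is copied from the Perl one-liner; the names `SSN`, `mask_ssns` and `mask_lines` are assumptions made for the example, and `\g<1>` is simply Python's unambiguous way of writing a group reference.

```python
import re

# Same pattern as the Perl one-liner: only the optional dashes and the
# last four digits are captured, because only they appear in the output.
SSN = re.compile(r"\b[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})\b")


def mask_ssns(text):
    """Mask the first five digits of every SSN found in text."""
    return SSN.sub(r"XXX\g<1>XX\g<2>\g<3>", text)


def mask_lines(lines):
    """Yield each line with its SSNs masked (roughly what perl -p does)."""
    for line in lines:
        yield mask_ssns(line)


if __name__ == "__main__":
    samples = [
        "Don't miss 123456789 either.",
        "And 123-45-6789",
        "1234567890 should also not match.",
    ]
    for masked in mask_lines(samples):
        print(masked)
```

`mask_lines` accepts any iterable of strings, so an open file object can be passed to it directly.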

甜点 2025-02-15 08:34:46

Assuming the first 8 lines should have a mask applied (leaving the last 3 lines untouched):

Modifying the input file to include dual matching SSN patterns in the first 2 lines:

$ cat testfile
123456789 needs to be matched (and again 123-45-6789)
123-45-6789 does, too (and again 123456789)
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.

One sed idea using a modified version of OP's regex:

sed -r 's/\b([0-9]{3})(-{0,1})([0-9]{2})(-{0,1}[0-9]{4})\b/XXX\2XX\4/g' testfile

Where:

-r - enable extended regex support (eliminates need to escape parens and braces)
([0-9]{3}) - match 3 digits (1st capture group)
(-{0,1}) - match optional - (2nd capture group)
([0-9]{2}) - match 2 digits (3rd capture group)
(-{0,1}[0-9]{4}) - match optional - + 4 digits (4th capture group)
XXX\2XX\4 - replace 1st capture group with XXX, print 2nd capture group as is, replace 3rd capture group with XX, print 4th capture group as is
g - apply to all matches in a line

This generates:

XXXXX6789 needs to be matched (and again XXX-XX-6789)
XXX-XX-6789 does, too (and again XXXXX6789)
Mask this XXX-XX-6789 SSN please
Don't miss XXXXX6789 either.
XXXXX6789 should match.
XXX-XX-6789 should also match.
As should XXXXX6789
And XXX-XX-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
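
The four capture groups described above can also be written out as an annotated pattern. The following Python sketch uses `re.VERBOSE` so that each group sits on its own commented line; it assumes Python's `\b` behaves like the `\b` of GNU sed for this input, and `SSN` and `mask_ssns` are illustrative names only.

```python
import re

SSN = re.compile(
    r"""
    \b
    ([0-9]{3})          # group 1: first three digits, replaced by XXX
    (-{0,1})            # group 2: optional dash, kept as-is
    ([0-9]{2})          # group 3: next two digits, replaced by XX
    (-{0,1}[0-9]{4})    # group 4: optional dash + last four digits, kept
    \b
    """,
    re.VERBOSE,
)


def mask_ssns(line):
    """Apply the XXX\\2XX\\4 replacement to every match in the line."""
    return SSN.sub(r"XXX\2XX\4", line)


if __name__ == "__main__":
    for sample in (
        "123456789 needs to be matched (and again 123-45-6789)",
        "123-45-6789 does, too (and again 123456789)",
        "But not 1234567890",
    ):
        print(mask_ssns(sample))
```

As in the sed command, groups 1 and 3 are captured but never referenced in the replacement; they are kept here only so that the group numbers line up with the explanation.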

深海不蓝 2025-02-15 08:34:46

Grep invert match regex (fixed):

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' input-file.txt

Grep options:

-v: Inverts match (prints everything without a match).
-E: Uses the Extended regex grammar for the pattern.

Regex detail:

([^0-9]|^): Matches a non-digit or beginning of line.
[0-9]{3}-?: Matches 3 digits optionally followed by a dash.
[0-9]{2}-?: Matches 2 digits optionally followed by a dash.
[0-9]{4}: Matches 4 digits.
([^0-9]|$): Matches a non-digit or end of line.

Testing

grep -vE '([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)' <<'EOF'
123456789 needs to be matched.
123-45-6789 does, too.
Mask this 123-45-6789 SSN please
Don't miss 123456789 either.
123456789 should match.
123-45-6789 should also match.
As should 123456789
And 123-45-6789
But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
EOF

Output of test:

But not 1234567890
1234567890 should also not match.
And 1234567890 is right out.
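
The grep command filters lines rather than masking them. As a sketch of how the same digit-boundary idea could be carried over to Python, the block below first reproduces the inverted filter and then, as an assumption about what a masking version might look like, expresses the boundaries as lookarounds so that the neighbouring characters are not consumed by the match. The names `SSN_LINE`, `SSN`, `lines_without_ssn` and `mask_ssns` are hypothetical.

```python
import re

# Same pattern as the grep -E command: a non-digit (or line start/end)
# must surround the nine digits.
SSN_LINE = re.compile(r"([^0-9]|^)[0-9]{3}-?[0-9]{2}-?[0-9]{4}([^0-9]|$)")

# Masking variant: the boundaries become lookarounds, so only the SSN
# itself is part of the match and can be rewritten.
SSN = re.compile(r"(?<![0-9])[0-9]{3}(-?)[0-9]{2}(-?)([0-9]{4})(?![0-9])")


def lines_without_ssn(lines):
    """Counterpart of grep -v: keep only lines that contain no SSN."""
    return [line for line in lines if not SSN_LINE.search(line)]


def mask_ssns(line):
    """Mask the first five digits of each SSN bounded by non-digits."""
    return SSN.sub(r"XXX\g<1>XX\g<2>\g<3>", line)


if __name__ == "__main__":
    samples = [
        "Mask this 123-45-6789 SSN please",
        "As should 123456789",
        "But not 1234567890",
    ]
    print(lines_without_ssn(samples))
    for sample in samples:
        print(mask_ssns(sample))
```

Note that a digit boundary is not the same as the `\b` word boundary used in the other answers: a string such as `ID123456789` counts as an SSN here because `D` is a non-digit, whereas `\b` would not match between `D` and `1`. Which behaviour is wanted depends on the data.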
