awk, sed, or regex to insert a substring and change its case

Posted on 2024-11-03 21:16:01

I am doing some transformations on a tab-separated file wherein one column contains a hierarchical identifier like this:

VI.d5.5
VII.b2.1
VII.b2.2
VII.b2.3
VII.c1

I need to transform it to look like the following, inserting an up-cased letter from the second dot group between the first and second:

VI.D.d5.5
VII.B.b2.1
VII.B.b2.2
VII.B.b2.3
VII.C.c1

I know about the \U flag in sed, but I don't know how to apply it only once. For example, the following up-cases both the inserted letter and the original lower-case letter, which is not what I want:

echo 'VII.b1.1' | sed -e 's/\([a-h]\)/\U\1.\1/'
VII.B.B1.1

I would welcome any shell (sed, awk, perl, whatever) or vim solution that would allow me to modify this column in place in the tab-separated file.

Comments (6)

夕色琉璃 2024-11-10 21:16:01

Have you tried \u instead of \U? According to the sed info page (info sed):

`\U'
     Turn the replacement to uppercase until a `\L' or `\E' is found,

`\u'
     Turn the next character to uppercase,

向地狱狂奔 2024-11-10 21:16:01

sed -e 's/\.[a-z]/\U&\E&/'

Perl works well too:

perl -pe 's/\.[a-z]/uc($&) . $&/e'

夜清冷一曲。 2024-11-10 21:16:01

You can’t do that in standard sed(1), because there is no such thing as \u or \U there. Indeed, on all my systems (but one) it fails — and silently, too, alas! I tried the sed version both on my Mac laptop and my Mac desktop, and then I tried it on our Solaris server and on our OpenBSD server. I tried it on the lone AIX box too, and of course it didn’t work there. :(

However, you should be able to do it portably this way, which works on those systems I tested:

% cat sample
VI.d5.5
VII.b2.1
VII.b2.2
VII.b2.3
VII.c1

% perl -wpe 's/([^.]+)\.(.)/$1.\u$2.$2/' /tmp/sample 
VI.D.d5.5
VII.B.b2.1
VII.B.b2.2
VII.B.b2.3
VII.C.c1

Not only is that more portable, it’s a lot easier, too.

That should work on any version of Perl released in the last 20 years, including perl4. However, if you’re living on the bleeding edge and so have at least 5.10 installed, then you can do it in this way instead:

% perl -M5.10.0 -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample
VI.D.d5.5
VII.B.b2.1
VII.B.b2.2
VII.B.b2.3
VII.C.c1

That -M5.10.0 is just to make sure you really have the 5.10 feature-set available and loaded.

What about Unicode?

Now suppose that your sample data had Unicode in it:

% cat /tmp/sample.utf8
Ⅵ.ð5.5
Ⅷ.ß2.3
Ⅺ.ç1

% uniquote /tmp/sample.utf8 
\N{U+2165}.\N{U+F0}5.5
\N{U+2167}.\N{U+DF}2.3
\N{U+216A}.\N{U+E7}1

% uniquote -v /tmp/sample.utf8
\N{ROMAN NUMERAL SIX}.\N{LATIN SMALL LETTER ETH}5.5
\N{ROMAN NUMERAL EIGHT}.\N{LATIN SMALL LETTER SHARP S}2.3
\N{ROMAN NUMERAL ELEVEN}.\N{LATIN SMALL LETTER C WITH CEDILLA}1

I can guarantee you that you aren’t going to find a version of sed that does the right thing on that data. It will mess up. I went to our sacrificial Linux box, and although the ɢɴᴜsed they use there works on your sample data, it refused to casemap one of those characters in my fancier Unicode dataset, even when I had the locale all set up right. But the perl version still did the right thing.

But with perl, just add the -CSD command-line options to tell perl that the datafiles and std{in,out,err} are all in UTF-8, then run the same commands and you will see something that’s really Qᴜɪᴛᴇ Iɴᴛᴇʀᴇsᴛɪɴɢ:

% perl -CSD -wpe 's/([^.]+)\.(.)/$1.\u$2.$2/' /tmp/sample.utf8
Ⅵ.Ð.ð5.5
Ⅷ.Ss.ß2.3
Ⅺ.Ç.ç1

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample.utf8
Ⅵ.Ð.ð5.5
Ⅷ.Ss.ß2.3
Ⅺ.Ç.ç1

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\U$1./' /tmp/sample.utf8
Ⅵ.Ð.ð5.5
Ⅷ.SS.ß2.3
Ⅺ.Ç.ç1

As you see, there is a difference between the titlecasing that \u does and the uppercasing that \U does. That’s because the lowercase letter “ß” is “Ss” in titlecase but “SS” in uppercase. Bizarre but true! This sort of thing admittedly happens a lot more with the Greek letters than it does with the Latin ones like we use, but you still want to do it right.

Here that is all uniquoted so you can see just which code points we’re talking about:

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample.utf8 | uniquote
\N{U+2165}.\N{U+D0}.\N{U+F0}5.5
\N{U+2167}.Ss.\N{U+DF}2.3
\N{U+216A}.\N{U+C7}.\N{U+E7}1

% perl -CSD -wpe 's/[^.]+\.\K(?=(.))/\u$1./' /tmp/sample.utf8 | uniquote -v
\N{ROMAN NUMERAL SIX}.\N{LATIN CAPITAL LETTER ETH}.\N{LATIN SMALL LETTER ETH}5.5
\N{ROMAN NUMERAL EIGHT}.Ss.\N{LATIN SMALL LETTER SHARP S}2.3
\N{ROMAN NUMERAL ELEVEN}.\N{LATIN CAPITAL LETTER C WITH CEDILLA}.\N{LATIN SMALL LETTER C WITH CEDILLA}1

Isn’t that way cool?

少女七分熟 2024-11-10 21:16:01

Try using \u instead of \U; \u upper-cases only the next character. If you do want to use \U, you have to stop the upper-casing with \E or \L, like this:

's/\([a-h]\)/\U\1\E.\1/'
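
To see the two variants side by side, here is a small sketch that runs both on the sample identifier from the question. It assumes GNU sed, since \u, \U and \E are GNU extensions to the replacement syntax; the variable names are just for illustration.

```shell
# Assumes GNU sed: \u, \U and \E are not part of POSIX sed.
id='VII.b1.1'

# \u upper-cases only the next character of the replacement.
with_u=$(printf '%s\n' "$id" | sed -e 's/\([a-h]\)/\u\1.\1/')

# \U upper-cases everything up to \E, so \E must come before the second \1.
with_U_E=$(printf '%s\n' "$id" | sed -e 's/\([a-h]\)/\U\1\E.\1/')

printf '%s\n%s\n' "$with_u" "$with_U_E"
```

Both commands should print VII.B.b1.1 with GNU sed.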

终止放荡 2024-11-10 21:16:01

sed -e 's/\([^.]\+\)\.\(.\)/\1.\u\2\.\2/'

Like this:

$ sed -e 's/\([^.]\+\)\.\(.\)/\1.\u\2\.\2/' <<<'VI.d5.5'
VI.D.d5.5

酒绊 2024-11-10 21:16:01

Here's an awk solution. No messy regular expressions needed. The basic idea: split on the dot, take the first character of the 2nd field, upper-case it with the toupper() function, and prepend it (plus a dot) to the 2nd field.

awk -F"." '{
    ch = toupper(substr($2,1,1))
    $2=ch"."$2
}1' OFS="." file
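
The command above treats the whole line as the identifier. If the identifier is only one column of a tab-separated file, as in the question, the same idea can be applied to that column alone. This is a rough sketch, assuming the identifier is in column 2; the col variable, the sample rows, and the temporary file are made up for illustration.

```shell
# Hypothetical sample data: identifier assumed to be in the 2nd tab-separated column.
tsv=$(mktemp)
printf 'row1\tVI.d5.5\tfoo\nrow2\tVII.c1\tbar\n' > "$tsv"

# col is the assumed 1-based index of the identifier column.
result=$(awk -F'\t' -v OFS='\t' -v col=2 '{
    n = split($col, part, ".")            # break only that column on dots
    if (n >= 2) {
        out = part[1] "." toupper(substr(part[2], 1, 1))
        for (i = 2; i <= n; i++) out = out "." part[i]
        $col = out                        # other columns are left as they were
    }
    print
}' "$tsv")

printf '%s\n' "$result"
rm -f "$tsv"
```

awk writes to standard output, so to change the file itself you would redirect the output to a temporary file and then move it over the original.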