sed: remove underscores and promote the following character to uppercase
I am trying to migrate some code from an old naming scheme to a new one. The old naming scheme is:
int some_var_name;
The new one is:
int someVarName_;
So what I would like is some form of sed / regexy goodness to ease the process. Fundamentally, what needs to happen is:
Find each lowercase word containing an _, replace each underscore with nothing, and promote the character to the right of each _ to uppercase. Then append an _ to the end of the match.
Is it possible to do this with Sed and/or Awk and regex? If not, why not?
Any example scripts would be appreciated.
Thanks very much for any assistance.
EDIT:
For a bit of clarity: the renaming is for a number of files that were written with the wrong naming convention and need to be brought into line with the rest of the codebase. The script is not expected to do a perfect replace that leaves everything in a compilable state. Rather, the script will be run and the result looked over by hand for any anomalies. The replace script is purely to ease the burden of correcting everything by hand, which I'm sure you would agree is considerably tedious.
Comments (3)
sed -re 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g'
Explanation:
This is a sed command with 2 expressions (each in quotes after a -e). Note that -r (extended regular expressions) and the \u escape used below are GNU sed features.
s,,,g is a global substitution. You usually see it with slashes instead of commas, but I think this is easier to read when you're using backslashes in the patterns (and no commas). The trailing g (for "global") means to apply this substitution to all matches on each line, rather than just the first.
The first expression appends an underscore to every token made up of a lowercase word ([a-z]+) followed by one or more further lowercase words, each preceded by an underscore ((_[a-z]+)+). We replace this with &_, where & means "everything that matched" and _ is just a literal underscore. So in total, this expression says to add an underscore to the end of every underscore_separated_lowercase_token.
The second expression matches the pattern _([a-z]), where everything between ( and ) is a capturing group. This means we can refer back to it later as \1 (because it's the first capturing group; if there were more, they would be \2, \3, and so on). So we're saying to match a lowercase letter following an underscore, and remember the letter.
We replace it with \u\1, which is the letter we just remembered, made uppercase by the \u.
This code doesn't do anything clever to avoid munging #include lines or the like; it will replace every instance of a lowercase letter following an underscore with its uppercase equivalent.
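To see both the intended effect and the #include caveat, here is a small sketch that wraps the command above in a shell function and feeds it two sample lines. The function name to_camel and the sample inputs are made up for illustration, and GNU sed is assumed (for -r and \u).

```shell
#!/bin/sh
# to_camel is a hypothetical wrapper name; the sed command is the one from
# the answer above. Assumes GNU sed (-r and \u are GNU extensions).
to_camel() {
    sed -re 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g'
}

# Intended case: a snake_case variable declaration.
printf '%s\n' 'int some_var_name;' | to_camel
# prints: int someVarName_;

# Unintended case: the file name inside an #include line is rewritten too.
printf '%s\n' '#include "my_header.h"' | to_camel
# prints: #include "myHeader_.h"
```

The second line of output is the kind of anomaly that has to be caught in the manual review the question mentions.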
A few years ago I successfully converted a legacy 300,000 LOC, 23-year-old code base to camelCase. It took only two days. But there were a few lingering effects that took a couple of months to sort out. And it is a very good way to annoy your fellow coders.
I believe that a simple, dumb, sed-like approach has advantages. IDE-based tools, and the like, cannot, as far as I know:
And the legacy code had to be maintained on several different compiler/OS platforms (= lots of #ifdefs).
The main disadvantage of a dumb, sed-like approach is that strings (such as keywords) can inadvertently be changed. And I've only done this for C; C++ might be another kettle of fish.
There are about five stages:
For step 1, to generate a list of tokens that you wish to change, the command:
will produce in list1:
In this sample, you really don't want to change these two tokens, so manually edit the list to delete them. But you'll probably miss some, so for the sake of this example, suppose you keep these.
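The original command for step 1 is not preserved in this copy, so the following is only a sketch of one way to build such a list, not the author's command. The helper name list_tokens and the sample file are assumptions; grep -o and -h are used as supported by GNU and BSD grep.

```shell
#!/bin/sh
# Hypothetical step-1 helper: list the snake_case tokens in the given files.
list_tokens() {
    # First split the text into C-style identifiers, then keep only the
    # all-lowercase ones that contain at least one underscore.
    grep -ohE '[A-Za-z_][A-Za-z0-9_]*' "$@" |
        grep -E '^[a-z]+(_[a-z]+)+$' |
        LC_ALL=C sort -u
}

# Made-up sample source file for the demonstration.
workdir=$(mktemp -d)
cat > "$workdir/sample.c" <<'EOF'
#include <time.h>
int some_var_name;
time_t start_time;
size_t buf_len;
EOF

list_tokens "$workdir/sample.c" > "$workdir/list1"
cat "$workdir/list1"
# prints:
# buf_len
# size_t
# some_var_name
# start_time
# time_t
```

In this made-up sample, size_t and time_t are the kind of entries you would delete from the list by hand before going further.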
The next step, 2, is to generate a script to do the changes. For example, the command:
will change _a, _b, _c, and _t to A, B, C, and T, to produce:
You just have to extend it to cover d, e, f, ..., x, y, z.
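The generator command itself is missing from this copy. As a rough sketch of the same idea, the loop below turns a token list into a sed script that maps each token to an xxx_-prefixed camelCase name. It differs from the description above in two ways that are assumptions of this sketch: it emits sed substitutions rather than glob_sub calls, and it relies on GNU sed's \u to handle all letters a to z at once instead of one rule per letter. The name make_rename_script is hypothetical.

```shell
#!/bin/sh
# Hypothetical step-2 helper: read tokens on stdin, write one sed
# substitution per token on stdout. Assumes GNU sed (-r, \u, \b).
make_rename_script() {
    while IFS= read -r tok; do
        # some_var_name -> someVarName
        camel=$(printf '%s\n' "$tok" | sed -r 's,_([a-z]),\u\1,g')
        # \b limits the match to the whole token; xxx_ marks the change.
        printf 's/\\b%s\\b/xxx_%s/g\n' "$tok" "$camel"
    done
}

printf '%s\n' some_var_name time_t | make_rename_script
# prints:
# s/\bsome_var_name\b/xxx_someVarName/g
# s/\btime_t\b/xxx_timeT/g
```

The output can be saved to a file and applied with sed -f. Add a trailing underscore in the printf format if you want the someVarName_ style from the question.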
I'm presuming you have already written something like 'glob_sub' for your development environment. (If not, give up now.) My version (csh, Cygwin) looks like:
(Some of my sed's don't support the --in-place option, so I have to use a mv.)
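The author's glob_sub script is not reproduced in this copy. The following is only a guess at its general shape, written for sh rather than csh: the argument order (old name, new name, then files) and the whole-word matching with \b (a GNU sed extension) are assumptions. It writes to a temporary file and then uses mv, as the remark above describes.

```shell
#!/bin/sh
# Hypothetical glob_sub OLD NEW FILE...: replace whole-word OLD with NEW
# in each file, without relying on sed --in-place.
glob_sub() {
    old=$1
    new=$2
    shift 2
    for f in "$@"; do
        # Write the edited copy next to the original, then move it into place.
        sed "s/\\b$old\\b/$new/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# Made-up demonstration file.
workdir=$(mktemp -d)
printf '%s\n' 'int some_var_name;' 'int some_var_name_2;' > "$workdir/a.c"
glob_sub some_var_name xxx_someVarName "$workdir/a.c"
cat "$workdir/a.c"
# prints:
# int xxx_someVarName;
# int some_var_name_2;
```

Because of the \b anchors, some_var_name_2 is left alone; it would need its own entry in the token list.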
The third step is to apply this script in list2 to your code base. For example, in csh use source list2.
The fourth step is to compile. The compiler will (hopefully!) object to xxxx_timeT. Indeed, it should likely object to just timeT, but the extra xxx_ adds insurance. So for time_t you've made a mistake. Undo it with e.g.
The fifth and final step is to do a manual inspection of your changes using your favorite diff utility, and then clean up by removing all the unwanted xxx_ prefixes. Grepping for "xxx_ will also help check for tokens in strings. (Indeed, adding a _xxx suffix is probably a good idea.)
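The undo command for step four is missing from this copy. As a hedged sketch of steps four and five, the helpers below undo one mistaken rename, find prefixed tokens that ended up inside string literals, and finally strip the xxx_ prefix. All three function names are hypothetical, the prefix is assumed to be xxx_, and \b requires GNU sed.

```shell
#!/bin/sh
# undo_rename ORIGINAL CAMEL: turn xxx_CAMEL back into ORIGINAL (stdin to stdout).
undo_rename() {
    sed "s/\\bxxx_$2\\b/$1/g"
}

# find_in_strings [FILE...]: show lines where xxx_ appears after a double quote.
find_in_strings() {
    grep -n '"[^"]*xxx_' "$@"
}

# strip_prefix: final cleanup, remove the xxx_ marker from every token.
strip_prefix() {
    sed 's/\bxxx_//g'
}

printf '%s\n' 'xxx_timeT now; int xxx_someVarName;' |
    undo_rename time_t timeT |
    strip_prefix
# prints: time_t now; int someVarName;
```

Run find_in_strings before strip_prefix, since the marker is what makes the accidental edits inside strings easy to spot.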
Consider what happens when you use sed to search and replace all text like this. Without a C++ tokenizer to recognize identifiers (and specifically your identifiers, not those in the standard library, for example), you are screwed. push_back gets renamed to pushBack_, map::insert to map::insert_, map to map_, basic_string to basicString_, printf to printf_ (if you use C libraries), and so on. You're going to be in a world of hurt if you do it indiscriminately.
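A quick way to see this for yourself is to run a line of ordinary C++ through a blanket substitution. The sketch below uses the two-expression sed command from the first answer (GNU sed assumed); the wrapper name blanket_rename and the sample line are made up. With that particular command only names that already contain an underscore are touched, so printf survives here; a broader pattern that also appends _ to single-word names would hit it as well.

```shell
#!/bin/sh
# blanket_rename is a hypothetical wrapper around the first answer's command.
# Assumes GNU sed (-r and \u are GNU extensions).
blanket_rename() {
    sed -re 's,[a-z]+(_[a-z]+)+,&_,g' -e 's,_([a-z]),\u\1,g'
}

printf '%s\n' 'v.push_back(x); std::basic_string<char> s; printf("%d", n);' |
    blanket_rename
# prints: v.pushBack_(x); std::basicString_<char> s; printf("%d", n);
```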
I don't know of any existing tool to automagically rename some_var_name to someVarName_ without the problems described above. People probably voted this post down because they didn't understand what I meant here. I'm not saying sed can't do it; I'm saying it won't give you what you want if you just use it as is. The parser needs contextual information to do this right, otherwise it will replace many things it shouldn't.
It would be possible to write a parser that does this (e.g., using sed) if it could recognize which tokens were identifiers (specifically your identifiers), but I doubt there's a tool for exactly what you want that works right off the bat without some manual elbow grease (though I could be wrong). Doing a simple search and replace on all text this way is inherently problematic.
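One cheap approximation of "recognize your identifiers" is to subtract a list of known library names from the candidate tokens before renaming anything. This is only a sketch under that assumption, not a real tokenizer: the function name own_tokens, the file names, and the contents of the keep list are all made up, and the keep list would have to be assembled by hand or from library headers.

```shell
#!/bin/sh
# own_tokens KEEP_LIST CANDIDATES: print candidates that are not in the keep list.
# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
own_tokens() {
    grep -vxF -f "$1" "$2"
}

workdir=$(mktemp -d)
# Made-up candidate tokens found in the code base.
printf '%s\n' buf_len push_back size_t some_var_name > "$workdir/candidates"
# Made-up list of library names that must not be renamed.
printf '%s\n' basic_string push_back size_t > "$workdir/keep_list"

own_tokens "$workdir/keep_list" "$workdir/candidates"
# prints:
# buf_len
# some_var_name
```

This still cannot tell a local variable from a library name that happens to be missing from the keep list, which is the contextual information the paragraph above is talking about.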
However, Visual AssistX (which can optionally replace instances in documentation) or any other refactoring tool capable of smartly renaming identifiers for every instance in which they occur at least eases the burden of refactoring code this way quite considerably. If you have a symbol named some_var_name and it's referenced in a thousand different places in your system, with VAssistX you can just use one rename function to rename all references smartly (this is not a mere text search and replace). Check out the refactoring features of Visual Assist X.
It might take 15 minutes to half an hour to refactor a hundred variables this way with VAX (faster if you use the hotkeys), but it certainly beats using a text search and replace with sed, as described in the other answer, and having all kinds of code replaced that shouldn't be.
[subjective]BTW: underscores still don't belong in camel case if you ask me. A lowerCamelCase naming convention should use lowerCamelCase. There are plenty of interesting papers on this, but at least your convention is consistent. If it's consistent, then that's a huge plus as opposed to something like fooBar_Baz which some goofy coders write who think it somehow makes things easier to make special exceptions to the rule.[/subjective]