How do I remove all non-word characters except for newlines?
I have a file like this:
my line - some words & text
oh lóok i've got some characters
I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:
mylinesomewordstext
ohlóokivegotsomecharacters
I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.
I tried this:
cat file | perl -pe 's/\W//g'
But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in the \W? Or is there some other way?
Comments (5)
This removes characters that don't match \w or \n:
@sth's solution uses Perl, which is (at least on my system) not Unicode compatible, so it loses the accented o character.
On the other hand, sed is Unicode compatible (according to the lists on this page) and gives a correct result:
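The sed command is quoted later in this thread as `sed -e 's/\W//g' file`. A minimal sketch of running it on a hypothetical sample file follows. It assumes GNU sed, since `\W` is a GNU extension, and uses ASCII-only input because whether accented letters count as word characters depends on the active locale.

```shell
# Hypothetical sample file (ASCII only, so the result does not depend on the locale).
file=$(mktemp)
printf "my line - some words & text\noh look i've got some characters\n" > "$file"

# sed works line by line and the trailing newline is not part of the text
# it matches against, so \W can never delete it.
result=$(sed -e 's/\W//g' "$file")
printf '%s\n' "$result"

rm -f "$file"
```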
In Perl, I'd just add the -l switch, which chomps each input line and re-adds the newline by appending it to the end of every print():
Notice that you don't need the cat.
The previous response isn't echoing the "ó" character. At least in my case.
Best practices for shell scripting dictate that you should use the tr program for replacing single characters instead of sed, because it's faster and more efficient. Obviously use sed if replacing longer strings.
When run with time I get:
When I run the sed answer (sed -e 's/\W//g' file) with time I get:
While not a "huge" difference, you'll notice the difference when running against larger data sets. Also please notice how I didn't pipe cat's output into tr, instead using I/O redirection (one less process to spawn).
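The tr command from this answer is not shown here. A minimal sketch of the approach, assuming the set of characters to keep is alphanumerics, underscore and newline (roughly what \w plus \n covers for ASCII text), could look like this. The sample file and variable names are hypothetical.

```shell
# Hypothetical sample file standing in for the asker's input.
file=$(mktemp)
printf 'my line - some words & text\nsecond_line: a, b!\n' > "$file"

# -c takes the complement of the listed set and -d deletes it, so everything
# except alphanumerics, underscore and newline is removed.
# Input comes from a redirection rather than a pipe from cat.
result=$(tr -cd '[:alnum:]_\n' < "$file")
printf '%s\n' "$result"

rm -f "$file"
```

One caveat: GNU tr processes bytes rather than characters, so a multi-byte UTF-8 character such as "ó" is likely to be dropped by this command.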