How do I remove all non-word characters except newlines?

Posted 2024-08-04 10:30:38

I have a file like this:

my line - some words & text
oh lóok i've got some characters

I want to 'normalize' it and remove all the non-word characters. I want to end up with something like this:

mylinesomewordstext
ohlóokivegotsomecharacters

I'm using Linux on the command line at the moment, and I'm hoping there's some one-liner I can use.

I tried this:

cat file | perl -pe 's/\W//g'

But that removed all the newlines and put everything on one line. Is there some way I can tell Perl not to include newlines in the \W? Or is there some other way?

指尖上的星空 2024-08-11 10:30:38

This removes characters that don't match \w or \n:

cat file | perl -C -pe 's/[^\w\n]//g'

天涯离梦残月幽梦 2024-08-11 10:30:38

@sth's solution uses Perl, which is (at least on my system) not Unicode compatible, thus it loses the accented o character.

On the other hand, sed is Unicode compatible (according to the lists on this page), and gives a correct result:

$ sed 's/\W//g' a.txt
mylinesomewordstext
ohlóokivegotsomecharacters

猫七 2024-08-11 10:30:38

In Perl, I'd just add the -l switch, which re-adds the newline by appending it to the end of every print():

 perl -ple 's/\W//g' file

Notice that you don't need the cat.

-小熊_ 2024-08-11 10:30:38

The previous response isn't echoing the "ó" character. At least in my case.

sed 's/\W//g' file

两人的回忆 2024-08-11 10:30:38

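Best practice for shell scripting says you should use the tr program rather than sed to replace single characters, because it is faster and more efficient. For longer strings, use sed.

tr -d '[:blank:][:punct:]' < file

Running it under time, I get:

real 0m0.003s
user 0m0.000s
sys 0m0.004s

When I run the sed answer (sed -e 's/\W//g' file) under time, I get:

real 0m0.003s
user 0m0.004s
sys 0m0.004s

While not a "huge" difference, you'll notice it when running against larger data sets. Also note that I didn't pipe cat's output into tr but used I/O redirection instead (one less process to spawn).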
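One caveat with the tr command above: [:punct:] includes the underscore, which \w treats as a word character, and control characters other than blanks are left in place. If you want something closer to the \W behaviour, a possible variant is to delete the complement of the set you want to keep. This is only a sketch: keep_word_chars is a made-up name, and it pins LC_ALL=C on the assumption that the input is plain ASCII (as far as I know GNU tr works byte by byte, so it would drop the bytes of a multibyte character like "ó").

```shell
#!/bin/sh
# keep_word_chars is a hypothetical helper name for this sketch.
# -c: use the complement of the listed set; -d: delete those characters.
# Everything except ASCII letters, digits, underscore and newline is removed.
keep_word_chars() {
  LC_ALL=C tr -cd '[:alnum:]_\n'
}

# Input is redirected from the pipe; no cat needed.
printf 'my line - some_words & text\noh look\n' | keep_word_chars
```

With a file, the call would be keep_word_chars < file, keeping the I/O redirection from the answer.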
