How to find Windows end-of-line (EOL) characters
I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if there are Windows EOL characters in the files. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.
So my question is: in Cygwin, how can I figure out whether these files have Windows EOL (CRLF) characters?
I've tried creating some test data and running
sed -r 's/\r\n//' testdata.txt
But that appears to match regardless of whether dos2unix has been run or not.
Thanks.
Uncomment "#dos2unix "$i"" when you are ready to convert them.
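A loop of the kind this answer describes might look like the following sketch. It assumes a POSIX shell with find and a grep that supports -I (GNU or BSD); the function name check_crlf is made up for illustration and is not from the original answer.

```shell
# Sketch only: report files that contain CRLF line endings and, once the
# commented-out line is enabled, convert them with dos2unix.
check_crlf() {
  cr=$(printf '\r')                 # a literal carriage return
  # Note: file names containing newlines are not handled by this loop.
  find "${1:-.}" -type f | while IFS= read -r i; do
    # grep -q exits 0 at the first line that ends in CR; -I skips binaries.
    if grep -Iq "${cr}\$" "$i"; then
      echo "$i"
      #dos2unix "$i"
    fi
  done
}
```

Calling check_crlf /path/to/data prints one affected file per line and changes nothing until the dos2unix line is uncommented.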
You can find out using file. In its output, CRLF is the significant value.
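As a rough illustration, file typically describes a DOS-style text file with a phrase such as "ASCII text, with CRLF line terminators". The sketch below builds two throwaway test files and wraps the check in a helper; the file names and the helper name has_crlf are assumptions made for this example.

```shell
# Create two small test files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmpdir/dos.txt"    # CRLF line endings
printf 'one\ntwo\n'     > "$tmpdir/unix.txt"   # LF line endings

# Succeeds (exit status 0) when file's description mentions CRLF.
# -b omits the file name, so a name containing "CRLF" cannot confuse grep.
has_crlf() {
  file -b "$1" | grep -q 'CRLF'
}

for f in "$tmpdir/dos.txt" "$tmpdir/unix.txt"; do
  if has_crlf "$f"; then
    echo "$f: CRLF line endings"
  else
    echo "$f: no CRLF reported"
  fi
done
```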
If you expect the exit code from sed to be different, it won't be. It will perform a substitution or not depending on the match. The exit code will be true unless there's an error.
You can get a usable exit code from grep, however.
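A minimal sketch of using grep's exit status for this check is below. The helper name has_crlf_grep and the file name data.txt are hypothetical; the pattern is a carriage return followed by the end-of-line anchor.

```shell
# Exit status 0 if any line of the file ends in CR LF, 1 if none does.
has_crlf_grep() {
  cr=$(printf '\r')            # a literal carriage return
  grep -q "${cr}\$" "$1"       # -q: print nothing, stop at the first match
}

# Usage sketch: only run the conversion when it is needed.
# if has_crlf_grep data.txt; then dos2unix data.txt; fi
```

Because -q stops at the first match, a file that does have CRLF endings is usually identified after reading very little of it; a clean file still has to be read to the end.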
Use grep recursively with a file pattern filter.
It outputs the file name, the line number, and the line itself.
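The original command is not preserved here, so the following is only a sketch of a recursive grep with those properties. It assumes GNU grep (for --include) and uses *.txt as an example file pattern; the function name find_crlf_lines is made up.

```shell
# Recursively list lines ending in CR LF in *.txt files under a directory.
#   -r         recurse into subdirectories
#   -n         prefix each matching line with its line number
#   --include  only search files whose name matches the glob
# Output has the form  file:line-number:line
find_crlf_lines() {
  cr=$(printf '\r')            # a literal carriage return
  grep -rn --include='*.txt' "${cr}\$" "${1:-.}"
}
```

Each matched line still ends in a carriage return, which can make terminal output look odd; adding -l instead of -n prints only the file names.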
You can use dos2unix's -i option to get information about DOS, Unix, and Mac line breaks (in that order), BOMs, and text/binary without converting the file.
With the "c" flag, dos2unix reports only the files that would be converted, in other words the files that have DOS line breaks. You can use this to report all txt files with DOS line breaks, to convert only those files, or to do either recursively over directories.
See also the man page of dos2unix.
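The three uses mentioned in this answer could be sketched as follows. This assumes a dos2unix version recent enough to have the -i option, and the function names are hypothetical wrappers added for illustration. File names containing spaces would need extra care in the middle function, since plain xargs splits on whitespace.

```shell
# Report the .txt files in a directory that have DOS line breaks.
report_dos_txt() {
  dos2unix -ic "$1"/*.txt
}

# Convert only those files; files without DOS line breaks are not touched.
convert_dos_txt() {
  dos2unix -ic "$1"/*.txt | xargs dos2unix
}

# Report recursively by letting find supply the file names.
report_dos_txt_recursive() {
  find "$1" -name '*.txt' -print0 | xargs -0 dos2unix -ic
}
```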
As stated above the 'file' solution works. Maybe the following code snippet may help.
Thanks for the tip about the file(1) command; however, it needs a bit more refinement. In my case, not only plain text files but also some ".sh" scripts had the wrong EOL, and "file" reports them as follows regardless of EOL:
So the "file -e soft" option was needed (at least for Linux):
This finds all the files with DOS EOL in directory xxx and subdirs.
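A sketch of such a search is below. The -e soft option tells file to skip its magic-file tests, so scripts are described by the built-in text tests, which include the line terminator. The function name find_dos_eol is made up, and xxx stands for whatever directory you want to scan.

```shell
# List every regular file under a directory whose description mentions CRLF.
#   -e soft  exclude the magic-file ("soft") tests, keeping the text tests
#   {} +     pass many file names to each file invocation
find_dos_eol() {
  find "${1:-.}" -type f -exec file -e soft {} + | grep 'CRLF'
}

# Usage sketch:
# find_dos_eol xxx
```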
The file(1) utility knows the difference between the two kinds of line terminator.
file(1) has been optimized to try to read as little of a file as possible, so you may be lucky and drastically reduce the amount of disk I/O you need to perform when finding and fixing the CRLF terminators.
Note that some cases of CRLF should stay in place: captures of SMTP will use CRLF. But that's up to you. :)