How to find Windows end-of-line (EOL) characters

Posted on 2024-10-24 20:00:08

I have several hundred GB of data that I need to paste together using the Unix paste utility in Cygwin, but it won't work properly if there are Windows EOL characters in the files. The data may or may not have Windows EOL characters, and I don't want to spend the time running dos2unix if I don't have to.

So my question is, in Cygwin, how can I figure out whether these files have Windows (CRLF) EOL characters?

I've tried creating some test data and running

sed -r 's/\r\n//' testdata.txt

But that appears to match regardless of whether dos2unix has been run or not.

Thanks.

Comments (8)

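#!/bin/bash
# Report every regular file that file(1) identifies as CRLF-terminated.
# NUL-delimited names keep paths with spaces or newlines intact.
find . -type f -print0 | while IFS= read -r -d '' i; do
        if file "$i" | grep -q CRLF; then
                file "$i"
                #dos2unix "$i"
        fi
done

Uncomment the dos2unix "$i" line when you are ready to convert them.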

且行且努力 2024-10-31 20:00:09

You can find out using file:

file /mnt/c/BOOT.INI 
/mnt/c/BOOT.INI: ASCII text, with CRLF line terminators

CRLF is the significant value here.

夜唯美灬不弃 2024-10-31 20:00:09

If you expect sed's exit code to tell you whether the pattern matched, it won't. sed performs the substitution or not depending on the match, but its exit code is true (0) either way unless there's an error.

You can get a usable exit code from grep, however.

#!/bin/bash
for f in *
do
    if head -n 10 "$f" | grep -qs $'\r'
    then
        dos2unix "$f"
    fi
done
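
The difference in exit codes can be seen directly with a small experiment. This is only a sketch: compare_status is a hypothetical helper name, and the two test files are created on the fly with printf rather than taken from the question.

```shell
#!/bin/bash
# compare_status FILE (hypothetical helper): print the exit status of a
# sed substitution and of a grep search for a CR byte on the same file.
compare_status() {
    local f=$1 sed_status=0 grep_status=0
    sed 's/\r$//' "$f" > /dev/null || sed_status=$?
    grep -q $'\r' "$f"             || grep_status=$?
    echo "sed=$sed_status grep=$grep_status"
}

tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmpdir/dos.txt"    # CRLF line endings
printf 'one\ntwo\n'     > "$tmpdir/unix.txt"   # LF line endings
compare_status "$tmpdir/dos.txt"    # prints: sed=0 grep=0
compare_status "$tmpdir/unix.txt"   # prints: sed=0 grep=1
rm -rf "$tmpdir"
```

Only grep's status changes between the two files, which is why it is the one to use in an if test.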
他是夢罘是命 2024-10-31 20:00:09

grep recursive, with file pattern filter

grep -Pnr --include='*file.sh' '\r$' .

output file name, line number and line itself

./test/file.sh:2:here is windows line break
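
If grep was built without -P support, the same search may still be possible by letting bash supply the CR byte through ANSI-C quoting. A minimal sketch, assuming bash and a grep that supports -r and --include (GNU grep does); list_crlf_files is a hypothetical helper name:

```shell
#!/bin/bash
# list_crlf_files DIR GLOB (hypothetical helper): print the names of files
# under DIR whose name matches GLOB and that contain at least one line
# ending in CR LF.
list_crlf_files() {
    # $'\r$' is a literal CR byte followed by the regex end-of-line anchor,
    # so no Perl-style escape (and no -P) is needed.
    grep -rl --include="$2" $'\r$' -- "$1"
}

# Example call: list_crlf_files . '*.sh'
```

The -l flag prints each matching file name once instead of every matching line, which keeps the output manageable on large trees.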

兔姬 2024-10-31 20:00:09

You can use dos2unix's -i option to get information about DOS Unix Mac line breaks (in that order), BOMs, and text/binary without converting the file.

$ dos2unix -i *.txt
    6       0       0  no_bom    text    dos.txt
    0       6       0  no_bom    text    unix.txt
    0       0       6  no_bom    text    mac.txt
    6       6       6  no_bom    text    mixed.txt
   50       0       0  UTF-16LE  text    utf16le.txt
    0      50       0  no_bom    text    utf8unix.txt
   50       0       0  UTF-8     text    utf8dos.txt

With the "c" flag, dos2unix reports only the files that would be converted, i.e. files that have DOS line breaks. To report all txt files with DOS line breaks you could do this:

$ dos2unix -ic *.txt
dos.txt
mixed.txt
utf16le.txt
utf8dos.txt

To convert only these files you simply do:

dos2unix -ic *.txt | xargs dos2unix

If you need to go recursive over directories you do:

find -name '*.txt' | xargs dos2unix -ic | xargs dos2unix

See also the man page of dos2unix.

伪心 2024-10-31 20:00:09

As stated above, the 'file' solution works. The following code snippet may help.

#!/bin/ksh
EOL_UNKNOWN="Unknown"       # Unknown EOL
EOL_MAC="Mac"               # File EOL Classic Apple Mac  (CR)
EOL_UNIX="Unix"             # File EOL UNIX               (LF)
EOL_WINDOWS="Windows"       # File EOL Windows            (CRLF)
SVN_PROPFILE="name-of-file" # Filename to check.
...

# Finds the EOL used in the requested File
# $1 Name of the file (requested filename)
# $r EOL_FILE set to enumerated EOL-values.
getEolFile() {
    EOL_FILE=$EOL_UNKNOWN

    # Check for Windows EOL (CRLF)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CRLF line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_WINDOWS
       return
    fi

    # Check for Classic Mac EOL (CR)
    EOL_CHECK=`file "$1" | grep "ASCII text, with CR line terminators"`
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_MAC
       return
    fi

    # Any other ASCII text is treated as Unix EOL (LF)
    EOL_CHECK=`file "$1" | grep "ASCII text"`
    if [[ -n $EOL_CHECK ]] ; then
       EOL_FILE=$EOL_UNIX
       return
    fi

    return
} # getEolFile
...

# Using this snippet
getEolFile "$SVN_PROPFILE"
echo "Found EOL: $EOL_FILE"
exit -1
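
The same classification can be sketched without parsing the wording of file's output, by counting CR and LF bytes instead. This is an alternative under stated assumptions, not the snippet author's method: classify_eol is a hypothetical name, it reads the whole file (so it is slower than file on large inputs), and the extra "Mixed" result is an addition.

```shell
#!/bin/bash
# classify_eol FILE (hypothetical helper): print Windows, Unix, Mac, Mixed
# or Unknown, based only on byte counts.
classify_eol() {
    local f=$1 cr lf crlf
    cr=$(tr -cd '\r' < "$f" | wc -c)          # number of CR bytes
    lf=$(tr -cd '\n' < "$f" | wc -c)          # number of LF bytes
    crlf=$(grep -c $'\r$' < "$f" || true)     # lines ending in CR LF

    if   (( cr == 0 && lf == 0 ));     then echo Unknown   # no line breaks
    elif (( lf == 0 ));                then echo Mac       # CR only
    elif (( cr == 0 ));                then echo Unix      # LF only
    elif (( crlf == lf && cr == lf )); then echo Windows   # every LF follows a CR
    else                                    echo Mixed
    fi
}

# Example call: classify_eol name-of-file
```

Because it ignores the "ASCII text" label, this version also gives an answer for files that file describes differently, such as UTF-8 text or shell scripts.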
明媚如初 2024-10-31 20:00:09

Thanks for the tip to use the file(1) command; however, it needs a bit more refinement. In my case not only plain text files but also some ".sh" scripts had the wrong EOL, and "file" reports them as follows regardless of EOL:

xxx/y/z.sh: application/x-shellscript

So the "file -e soft" option was needed (at least for Linux):

bash$ find xxx -exec file -e soft {} \; | grep CRLF

This finds all the files with DOS EOL in directory xxx and its subdirectories.

浅笑依然 2024-10-31 20:00:08

The file(1) utility knows the difference:

$ file * | grep ASCII
2:                                       ASCII text
3:                                       ASCII English text
a:                                       ASCII C program text
blah:                                    ASCII Java program text
foo.js:                                  ASCII C++ program text
openssh_5.5p1-4ubuntu5.dsc:              ASCII text, with very long lines
windows:                                 ASCII text, with CRLF line terminators

file(1) has been optimized to try to read as little of a file as possible, so you may be lucky and drastically reduce the amount of disk IO you need to perform when finding and fixing the CRLF terminators.

Note that some cases of CRLF should stay in place: captures of SMTP will use CRLF. But that's up to you. :)
