How do I detect line endings in text files from different operating systems?
In C, I usually read text files one character at a time (e.g. in the loop of an FSM, tokenizing and parsing at the same time). Unfortunately, operating systems use different conventions to mark the end of a line, e.g. Unix ("\n"), classic Mac OS ("\r") and DOS/Windows ("\r\n").
Hence my question: how do I properly detect line endings across text files from different operating systems?
My current approach is to treat '\r' as '\n' and ignore empty lines. Unfortunately, this approach only works as long as empty lines don't change the semantics of the underlying text.
I don't want to "detect" the line ending style for each file, and I certainly don't want solutions based on #ifdef or other kinds of conditional compilation. Are there any valid solutions left?
Comments (3)
I normally don't recommend reading a file one char at a time, but for your case I would suggest you "peek" ahead one char, using the following logic...
You can't really trust that all files follow one convention, or even that a single file follows the same convention throughout, so you should code for all cases. In this case, if you see \r you might see a \n next; if you do, consume that character and move on.
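A minimal sketch of that peek logic, assuming the file was opened in binary mode ("rb") so the C library does not translate line endings itself. The name read_normalized is hypothetical and not taken from the answer; it reports every "\r\n" pair, lone '\r', and lone '\n' as a single '\n', so empty lines survive.

```c
#include <stdio.h>

/* Hypothetical helper (not from the original answer): returns the next
 * character from fp, reporting "\r\n", a lone '\r', or a lone '\n' as a
 * single '\n'. Returns EOF at end of input. */
int read_normalized(FILE *fp)
{
    int c = getc(fp);
    if (c == '\r') {
        int next = getc(fp);          /* peek at the following character */
        if (next != '\n' && next != EOF)
            ungetc(next, fp);         /* not part of a pair: push it back */
        return '\n';                  /* '\r' always ends a line */
    }
    return c;                         /* ordinary character, '\n', or EOF */
}
```

A tokenizer loop could then call read_normalized(fp) wherever it previously called getc(fp) and only ever compare against '\n'.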
Unfortunately, a file can have mixed line endings if it's been passed around, or edited with editors that allow you to specify the line ending, or for any number of other similar reasons.
Determining "the" line ending style for a file could be a matter of taking a vote: whichever style X ends the most lines wins.
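As an illustration of the voting idea, here is a small sketch that counts each style in an in-memory buffer and returns the most common one. The enum, the function name dominant_eol, and the tie-breaking order (CRLF, then LF, then CR) are assumptions made for this example only.

```c
#include <stddef.h>

/* Hypothetical names for this sketch. */
enum eol_style { EOL_NONE, EOL_LF, EOL_CR, EOL_CRLF };

/* Counts "\r\n", lone '\r', and lone '\n' in buf and returns the style
 * that ends the most lines. Ties favor CRLF, then LF (arbitrary choice). */
enum eol_style dominant_eol(const char *buf, size_t len)
{
    size_t lf = 0, cr = 0, crlf = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\r') {
            if (i + 1 < len && buf[i + 1] == '\n') {
                crlf++;
                i++;                  /* skip the '\n' of the pair */
            } else {
                cr++;
            }
        } else if (buf[i] == '\n') {
            lf++;
        }
    }

    if (lf == 0 && cr == 0 && crlf == 0) return EOL_NONE;
    if (crlf >= lf && crlf >= cr) return EOL_CRLF;
    if (lf >= cr) return EOL_LF;
    return EOL_CR;
}
```

This is only useful when a single style has to be reported or written back; for parsing, handling all three styles as they occur avoids the need for a vote.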
What I've done is:
1. Treat \r as a newline. If the next char is \n, discard it. (If the next char is not \n, the \r still counts as a newline.)
2. Treat \n as a newline, unless you're throwing it away because of (1).
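The two rules above can be sketched without any lookahead by remembering whether the previous character was \r, which may fit a character-at-a-time FSM more naturally. The struct, the EOL_SKIP sentinel, and the function name are hypothetical and chosen for this example.

```c
#include <stdbool.h>

#define EOL_SKIP (-2)   /* hypothetical sentinel: "discard this character" */

struct eol_filter {
    bool prev_was_cr;   /* true if the previous input character was '\r' */
};

/* Feed one input character; returns '\n' for a line break, EOL_SKIP for
 * the '\n' half of a "\r\n" pair, or c unchanged otherwise. */
int eol_filter_step(struct eol_filter *f, int c)
{
    bool after_cr = f->prev_was_cr;
    f->prev_was_cr = (c == '\r');

    if (c == '\r')
        return '\n';                  /* rule 1: '\r' is a newline */
    if (c == '\n' && after_cr)
        return EOL_SKIP;              /* rule 1: discard '\n' after '\r' */
    return c;                         /* rule 2: lone '\n' passes through */
}
```

The caller initializes the struct with prev_was_cr set to false and ignores any EOL_SKIP result. Because each line break yields exactly one '\n', empty lines are preserved.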
My usual approach is to treat '\n' as the line terminator, and if the previous character was '\r', remove it (usually I end up overwriting one or the other with 0). If you also want to support legacy Mac text files ('\r'-only newlines), you can treat a lone '\r', a lone '\n', or the pair "\r\n" as a line break.
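A minimal sketch of the overwrite-with-0 variant, assuming the line has already been read into a writable buffer (for example by fgets). The name chomp is hypothetical. It covers "\n" and "\r\n" terminators only; files that use '\r' alone would never be split into lines by fgets in the first place.

```c
#include <string.h>

/* Hypothetical helper: strips one trailing '\n' and then one trailing
 * '\r' from line by overwriting them with 0. Returns the new length. */
size_t chomp(char *line)
{
    size_t len = strlen(line);

    if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';           /* drop the line terminator */
    if (len > 0 && line[len - 1] == '\r')
        line[--len] = '\0';           /* drop the '\r' of a "\r\n" pair */
    return len;
}
```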