Getting std::ifstream to handle LF, CR, and CRLF?
Specifically, I'm interested in istream& getline(istream& is, string& str). Is there an option to the ifstream constructor to tell it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.
Update: To clarify, I want to be able to write code that compiles almost anywhere and takes input from almost anywhere, including the rare files that have '\r' without '\n'. The goal is to minimize inconvenience for any users of the software.
It's easy to work around the issue, but I'm still curious about the right way, within the standard, to flexibly handle all text file formats.
getline reads a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n' that gets included in the string.
There are three types of line endings seen in text files: '\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' followed by '\n'.
The problem is that getline leaves the '\r' on the end of the string.
ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a line (possibly empty) was read
    // BUT, there might be an '\r' at the end now.
}
Edit: Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.
I can remove it manually myself (see edit of this question), which is easy for the Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!
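Both concerns can be checked with a small sketch. It uses std::istringstream rather than a file, so no platform text-mode translation is involved; the helper name read_lines is made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect every line that std::getline produces
// from the given text, using the default '\n' delimiter.
std::vector<std::string> read_lines(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

With "\r\n" input each returned line keeps its trailing '\r', and with '\r'-only input the whole text does come back as a single line.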
.. and that's not even considering Unicode :-)
.. maybe Boost has a nice way to consume one line at a time from any text-file type?
Edit: I'm using this to handle the Windows files, but I still feel I shouldn't have to! And this won't work for the '\r'-only files.
if(!line.empty() && *line.rbegin() == '\r') {
line.erase( line.length()-1, 1);
}
Comments (7)
As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."
However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):
And here is a test program:
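A minimal sketch of what such a function and a small test helper could look like. The names safe_getline and split_lines and the streambuf-based approach are assumptions for illustration, not necessarily the answerer's code. This version sets failbit when nothing could be extracted, as std::getline does, so it can be used directly as a loop condition.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read one line, accepting "\n", "\r\n" or a lone "\r"
// as the terminator. The terminator is consumed but not stored.
std::istream& safe_getline(std::istream& is, std::string& line) {
    line.clear();
    // The sentry checks the stream state; 'true' means "do not skip whitespace".
    std::istream::sentry guard(is, true);
    if (!guard) return is;

    std::streambuf* sb = is.rdbuf();
    bool extracted = false;  // becomes true once any character was consumed
    for (;;) {
        int c = sb->sbumpc();
        if (c == std::streambuf::traits_type::eof()) {
            // Like std::getline: always eofbit, plus failbit if nothing was read.
            is.setstate(extracted ? std::ios::eofbit
                                  : (std::ios::eofbit | std::ios::failbit));
            return is;
        }
        extracted = true;
        if (c == '\n') return is;
        if (c == '\r') {
            // Swallow the '\n' of a "\r\n" pair if it is there.
            if (sb->sgetc() == '\n') sb->sbumpc();
            return is;
        }
        line += static_cast<char>(c);
    }
}

// Hypothetical test helper: split a whole text into lines with safe_getline.
std::vector<std::string> split_lines(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> out;
    std::string line;
    while (safe_getline(in, line)) out.push_back(line);
    return out;
}
```

For real files, the stream would presumably be opened with std::ios::binary so that the C++ runtime does not translate line endings before this function sees them.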
The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:
Of course, if you are dealing with files from another platform, all bets are off.
As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return, you can examine the last character of the line string in the above code to see if it is \r and, if so, remove it before doing your application-specific processing. For example, you could provide yourself with a getline-style function that looks something like this (not tested; use of indexes, substr, etc. for pedagogical purposes only):
Are you reading the file in BINARY or in TEXT mode? In TEXT mode, on platforms that use that convention, the carriage return/line feed pair (CRLF) is interpreted as a single end-of-line character, but in BINARY mode you fetch only ONE byte at a time, which means that each of the two characters arrives as a separate byte and has to be handled by your own code!
Carriage return means, on a typewriter, that the carriage has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. Line feed then means that the paper roll is rotated up a little so the paper is in position to begin another line of typing. As far as I remember, one of the low codes in ASCII means move one character to the right without typing, the dead char, and of course \b means backspace: move the carriage one character back. That way you can add special effects, like underlining (type underscore), strikethrough (type minus), approximating different accents, or cancelling out (type X), without needing an extended keyboard, just by adjusting the position of the carriage along the line before entering the line feed. So you can use byte-sized ASCII codes to control a typewriter automatically without a computer in between.
When the automatic typewriter is introduced, AUTOMATIC means that once you reach the farthest edge of the paper, the carriage is returned to the left AND the line feed is applied; that is, the carriage is assumed to be returned automatically as the roll moves up! So you do not need both control characters, only one: the \n, new line, or line feed.
This has nothing to do with programming, but ASCII is older and, HEY, it looks like some people were not thinking when they began doing text things! The UNIX platform assumes an electrical automatic typewriter; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character (0x07, if I remember well)... Some forgotten texts must have been originally captured with control characters for electrically controlled typewriters, and that perpetuated the model...
Actually, the correct variation would be to just include the \n, the line feed, the carriage return being unnecessary, that is, automatic, hence:
would be the most correct way to handle all types of files. Note however that \n in TEXT mode on Windows is actually stored as the byte pair 0x0d 0x0a, and 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY mode, so \n and \r\n are equivalent... or should be. This is actually a very basic industry confusion, typical industry inertia, as the convention is to speak of CRLF on ALL platforms and then fall into different binary interpretations. Strictly speaking, files using ONLY 0x0d (carriage return) as the line terminator are malformed in TEXT mode (on a typewriter: just return the carriage and type over everything...) and are a non-line-oriented binary format (either \n or \r\n meaning line-oriented), so you are not supposed to read them as text! The code ought to fail, maybe with some user message. This depends not only on the OS but also on the C library implementation, adding to the confusion and possible variations (particularly with transparent UNICODE translation layers adding another point of articulation for confusing variations).
The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \n characters after \r (automatic typewriter text). It also assumes BINARY mode, where the C library is forced to ignore text interpretations (locale) and hand over the raw bytes. There should be no difference in the actual text characters between the two modes, only in the control characters, so generally speaking reading in BINARY mode is better than reading in TEXT mode. This solution is efficient for typical Windows text files read in BINARY mode, independently of C library variations, and inefficient for other platforms' text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer: test for \r vs \r\n line endings whichever way you like, then store the best-suited getline user code in the pointer and invoke it through that.
Incidentally, I remember finding some \r\r\n text files too... which translate into double-spaced text, as is still required by some consumers of printed text.
One solution would be to first search for all line endings and replace them with '\n', much as Git can do for CRLF when line-ending normalization (e.g. core.autocrlf) is enabled.
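A minimal sketch of such a normalization pass over text that is already in memory (for example, a file read in binary mode). The function name normalize_newlines is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return a copy of text in which "\r\n" and lone "\r"
// are both replaced by "\n".
std::string normalize_newlines(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            out += '\n';
            // Skip the '\n' of a "\r\n" pair so it is not emitted twice.
            if (i + 1 < text.size() && text[i + 1] == '\n') ++i;
        } else {
            out += text[i];
        }
    }
    return out;
}
```

After this pass, plain std::getline on a std::istringstream built from the result sees only '\n' terminators.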
Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check that the last character of the line, line[line.length() - 1], is not '\r' (after making sure the line is not empty). On Linux, this is superfluous for native files, as most lines will end with just '\n', meaning you'll lose a fair bit of time if this is in a loop. On Windows, in text mode, this is also superfluous. However, what about classic Mac files, whose lines end in '\r'? std::getline would not work for those files on Linux or Windows, because it splits only on '\n': both '\n' and '\r\n' end with '\n', but a lone '\r' does not, so the whole file would be read as a single line. Of course, there also exist the numerous EBCDIC systems, something that most libraries won't dare tackle.
Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows, as old-style Mac line endings shouldn't be around for much longer, split on '\n' only and remove the trailing '\r' character.
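Unfortunately, the accepted solution does not behave exactly like std::getline(). To obtain that behavior (according to my tests), the following change is necessary:
According to https://en.cppreference.com/w/cpp/string/basic_string/getline:
Extracts characters from input and appends them to str until one of the following occurs (checked in the order listed): (a) end-of-file condition on input, in which case getline sets eofbit; (b) the next available input character is delim, in which case the delimiter character is extracted from input but is not appended to str; (c) str.max_size() characters have been stored, in which case getline sets failbit and returns.
If no characters were extracted for whatever reason (not even the discarded delimiter), getline sets failbit and returns.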
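The quoted failbit rule can be observed directly with a small sketch; the helper name first_getline_fails is made up for illustration. A replacement for std::getline would need to reproduce this behavior to work as a loop condition.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: report whether the first std::getline call on the
// given text leaves the stream with failbit set.
bool first_getline_fails(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::getline(in, line);
    return in.fail();
}
```

An empty input fails immediately, while an input consisting of a single '\n' does not, because the discarded delimiter counts as an extracted character.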
If it is known how many items/numbers each line has, one could read one line with e.g. 4 numbers as
This also works with other line endings.