Ruby: reading a file line by line gives a different total than the file size

Posted 2024-07-14 22:12:20 | 633 characters | 6 views | 0 comments

I need to do something where the file sizes are crucial. This is producing strange results

filename = "testThis.txt"
total_chars = 0
file = File.new(filename, "r")
file_for_writing = nil
while (line = file.gets)
  total_chars += line.length
end
puts "original size #{File.size(filename)}"
puts "Totals #{total_chars}"

like this

original size 20121
Totals 20061

Why is the second one coming up short?

Edit: Answerers' hunches are right: the test file has 60 lines in it. If I change this line

  total_chars += line.length + 1

it works perfectly. But on *nix this change would be wrong?

Edit: Follow up is now here. Thanks!

Comments (3)

陌生 2024-07-21 22:12:21

There are special characters stored in the file that delineate the lines:

  • CR LF (0x0D 0x0A) (\r\n) on Windows/DOS and
  • 0x0A (\n) on UNIX systems.

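Ruby's gets hands you lines in the UNIX form. The conversion itself is done by text mode on Windows, not by gets: if you read a Windows file in text mode there, you lose 1 byte for every line you read, as each \r\n pair is converted to a single \n. On UNIX systems no such conversion takes place.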

Also, String#length is not a good measure of the size of a string in bytes. If the string is not ASCII, one character may be represented by more than one byte (for example in UTF-8). In Ruby 1.9 and later, length returns the number of characters in the string, not the number of bytes; String#bytesize returns the byte count.

To get the size of a file, use File.size(file_name).
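
If the per-line total has to match File.size, one option is to read in binary mode and count bytes rather than characters. The following is only a sketch: the helper name total_line_bytes and the temporary sample file are made up for illustration and are not part of the original question.

```ruby
require "tempfile"

# Sum the byte size of every line. Binary mode ("rb") disables newline
# translation, so \r\n stays two bytes on every platform.
def total_line_bytes(path)
  total = 0
  File.open(path, "rb") do |file|
    file.each_line { |line| total += line.bytesize }
  end
  total
end

# Hypothetical sample file: three lines with Windows-style \r\n endings.
sample = Tempfile.new("crlf_sample")
sample.binmode
sample.write("first\r\nsecond\r\nthird\r\n")
sample.close

puts "original size #{File.size(sample.path)}"
puts "Totals #{total_line_bytes(sample.path)}"
```

Counting this way should not need a "+ 1" per line, so the same code ought to behave identically on Windows and on *nix.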

怼怹恏 2024-07-21 22:12:21

My guess would be that you are on Windows, and your "testThis.txt" file has \r\n line endings. When the file is opened in text mode, each line ending will be converted to a single \n character. Therefore you'll lose 1 character per line.

Does your test file have 60 lines in it? That would be consistent with this explanation.
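
The numbers in the question fit: 20121 - 20061 = 60. One way to check the hunch on a real file is to count how many lines actually end in \r\n. This is a rough sketch; count_crlf_lines is a hypothetical helper, and the sample file merely stands in for testThis.txt.

```ruby
require "tempfile"

# Count the lines that end with a Windows-style \r\n pair. The file is
# opened in binary mode so the \r bytes are not stripped before we see them.
def count_crlf_lines(path)
  File.open(path, "rb") do |file|
    file.each_line.count { |line| line.end_with?("\r\n") }
  end
end

# Hypothetical stand-in for the test file: 60 lines with \r\n endings.
sample = Tempfile.new("sixty_lines")
sample.binmode
60.times { |i| sample.write("line #{i}\r\n") }
sample.close

puts "CRLF line endings: #{count_crlf_lines(sample.path)}"
```

If the count equals the gap between File.size and the character total, the missing bytes are the \r characters.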

不气馁 2024-07-21 22:12:21

The line-ending issue is the most likely culprit here.

It's also worth noting that if the character encoding of the text file is something other than ASCII, you will have a discrepancy between the two as well. If the file is UTF-8, the character count and the byte count match only for English and other text that uses just the standard ASCII characters. Beyond that, the file size and character count can vary wildly: a single character takes up to 4 bytes in UTF-8 (the original UTF-8 design allowed up to 6).

Relying on '1 character = 1 byte' is just asking for trouble as it is almost certainly going to fail at some point.
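
A small illustration of the gap, assuming Ruby 1.9 or later and UTF-8 strings. The helper char_and_byte_counts and the sample strings are invented for this example.

```ruby
# Return [character count, byte count] for a string.
def char_and_byte_counts(text)
  [text.length, text.bytesize]
end

# \u escapes keep the source file pure ASCII; the resulting strings are UTF-8.
samples = {
  "ASCII only"      => "hello",
  "accented letter" => "h\u00e9llo",             # e with acute accent
  "Japanese"        => "\u65e5\u672c\u8a9e",     # three CJK characters
  "emoji"           => "\u{1F600}"               # outside the Basic Multilingual Plane
}

samples.each do |label, text|
  chars, bytes = char_and_byte_counts(text)
  puts "#{label}: #{chars} characters, #{bytes} bytes"
end
```

Only the pure-ASCII string has equal counts; the others take 2, 3 and 4 bytes for a single character.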
