How can I find out how many characters a file has without reading the whole file?

If the file is a text file, and StreamReader can figure out the Encoding it uses, how can I find out how many characters it has without reading the whole file?

I'm reading 1GB CSV files, and it takes at least 4 seconds to read one with a StreamReader. File.ReadAllText().Length would cause a System.OutOfMemoryException.

I imagine that if I had FileInfo(filename).Length and the Encoding, I could calculate the number of characters.
Comments (5)
For ASCII, CP-437, CP-1252, ISO-8859-1, or similar single-byte code pages, the number of characters is the number of bytes.

If the file is in UTF-16, you cannot know the exact number of characters from the number of bytes, because characters outside the Basic Multilingual Plane take two 16-bit code units, but it will likely be close to the number of bytes / 2. In any case, you can calculate exactly how much memory is needed to hold the file in a .NET string: it will be the size of the file (since .NET uses UTF-16 internally) plus a constant overhead. The Length of such a string will be the number of bytes divided by 2.

If the file is in UTF-8 (or any other variable-width encoding), the number of characters can be anywhere from one per byte (pure ASCII) down to a quarter of the number of bytes (all 4-byte sequences). It just depends on the data.

If the file is in UTF-32 (which is extremely unlikely), the number of characters will be exactly the length of the file in bytes divided by four. But even though this is the exact number of characters, it does not tell you the length of the .NET string created from this file, since that string may need surrogate pairs for characters in the high planes. So the answer still depends on what you intend to do with the information.
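The per-encoding cases above can be turned into a rough bounds calculation. This is only a sketch: the class and method names are made up for illustration, it assumes well-formed input, and it ignores any byte order mark at the start of the file. It reports the range of possible .NET string lengths (UTF-16 code units), not Unicode code points.

```csharp
using System;
using System.Text;

public static class CharCountEstimator
{
    // Hypothetical helper: smallest and largest possible string.Length
    // for a file of byteLength bytes in the given encoding.
    public static (long Min, long Max) EstimateStringLength(long byteLength, Encoding encoding)
    {
        if (encoding.IsSingleByte)
            return (byteLength, byteLength);            // ASCII, ISO-8859-1, CP-1252, ...

        if (encoding is UnicodeEncoding)                // UTF-16, little or big endian
            return (byteLength / 2, byteLength / 2);

        if (encoding is UTF32Encoding)
            // 4 bytes per code point; code points above U+FFFF become 2 chars.
            return (byteLength / 4, byteLength / 2);

        if (encoding is UTF8Encoding)
            // 1- to 3-byte sequences yield 1 char, 4-byte sequences yield 2 chars.
            // Fewest chars: all 3-byte sequences. Most chars: all ASCII.
            return ((byteLength + 2) / 3, byteLength);

        throw new NotSupportedException("No rule for encoding " + encoding.WebName);
    }
}
```

For example, passing new FileInfo(filename).Length and Encoding.UTF8 gives a range whose upper end is the byte count itself, which is enough to size a buffer conservatively but not to get an exact count.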
I don't think it really can. Some encodings use a different number of bytes for different characters, so you'd really need to convert the bytes into characters to find the number of characters.

For example, in UTF-8, the characters from \u0000 to \u007F are represented in 1 byte only; those between \u0080 and \u07FF need 2 bytes, and so on.
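To see those widths in practice, here is a small sketch using Encoding.UTF8.GetByteCount. The class and method names are illustrative only.

```csharp
using System;
using System.Text;

public static class Utf8Widths
{
    // Number of bytes UTF-8 needs to encode the given text.
    public static int ByteCount(string text) => Encoding.UTF8.GetByteCount(text);

    // Side-by-side view of the UTF-16 length and the UTF-8 byte count.
    public static string Describe(string text) =>
        $"{text.Length} char(s), {ByteCount(text)} UTF-8 byte(s)";
}
```

ByteCount("A") is 1, ByteCount("\u00E9") is 2, ByteCount("\u20AC") is 3, and ByteCount("\U0001F600") is 4, while "\U0001F600".Length is 2 because that character is stored as a surrogate pair in a .NET string.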
For some encodings this works (ASCII, Windows-1252, IBM-850, etc.), but not for UTF-8 and UTF-7, since they encode some characters as 1 byte, some as 2 (and I believe some as even more than 2).
The problem with this is that if the file is UTF-8 encoded, each character can occupy between 1 and 4 bytes, so you have no way of 'calculating' the number of characters without processing the file in some way.

Other encodings may prove more fruitful.
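If the file is known to be UTF-8, "processing the file in some way" does not have to mean decoding it into strings. One possible approach, sketched below with made-up names, scans the raw bytes and counts every byte that is not a continuation byte (10xxxxxx). It assumes well-formed UTF-8 and counts a leading byte order mark as a character, which StreamReader would normally skip.

```csharp
using System;
using System.IO;

public static class Utf8Counter
{
    // Counts code points and UTF-16 code units in a UTF-8 byte stream
    // without allocating any strings. Assumes the input is well-formed.
    public static (long CodePoints, long Utf16Units) Count(Stream stream)
    {
        var buffer = new byte[1 << 16];
        long codePoints = 0, utf16Units = 0;
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[i];
                if ((b & 0xC0) == 0x80) continue;   // continuation byte, part of a previous character
                codePoints++;
                utf16Units += b >= 0xF0 ? 2 : 1;    // 4-byte sequence becomes a surrogate pair
            }
        }
        return (codePoints, utf16Units);
    }
}
```

Calling Utf8Counter.Count(File.OpenRead(filename)) still reads every byte, so it is bounded by disk throughput, but it uses a fixed 64 KB buffer instead of holding the text in memory.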
You can't. The reason is that some encodings (notably UTF-8) have variable character width: some characters take up only 1 byte (ASCII), a lot take up 2 bytes, and there are even cases with 3 or more bytes per character. Thus, without decoding the characters, it is impossible to know the length of the file in characters under such an encoding.

Also, all characters in C# strings are represented as UTF-16, AFAIK, so unless you have a very weird text (i.e. you're using many characters from outside plane 0), you can estimate the memory requirements in bytes rather easily by multiplying the character count by 2 (and vice versa, estimate the number of characters by halving the byte size).

Now, a better question is: why do you need the character count? What is it that you're doing with the CSV file later that makes you want to load it all into memory, and why would knowing its size help?
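If the exact count is what is needed and the real obstacle is the OutOfMemoryException, one option is to let StreamReader decode the file in fixed-size chunks and only add up the chunk sizes. This is a sketch with illustrative names. It still reads the whole file, so it will not be faster than the 4 seconds mentioned in the question, but memory use stays constant.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamingCharCounter
{
    // Decodes the stream chunk by chunk and returns the total number of
    // chars (UTF-16 code units). fallbackEncoding is used when no byte
    // order mark is found.
    public static long CountChars(Stream stream, Encoding fallbackEncoding)
    {
        using (var reader = new StreamReader(stream, fallbackEncoding,
                   detectEncodingFromByteOrderMarks: true,
                   bufferSize: 1 << 16,
                   leaveOpen: true))
        {
            var buffer = new char[1 << 16];
            long total = 0;
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                total += read;                      // keep the count, discard the text
            return total;
        }
    }
}
```

The result should be the same Length that decoding the whole file into one string would give, assuming the same encoding is detected in both cases.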