Reading text-based files with fread() - best practices

Posted on 2025-01-23 20:27:33


Consider this code to read a text-based file. This sort of fread() usage was briefly touched upon in the excellent book C Programming: A Modern Approach by K.N. King.
There are other methods of reading text-based files, but here I am concerned with fread() only.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // Declare file stream pointer.
    FILE *fp = fopen("Note.txt", "r");
    // fopen() call successful.
    if(fp != NULL)
    {
        // Navigate through to end of the file.
        fseek(fp, 0, SEEK_END);
        // Calculate the total bytes navigated.
        long filesize = ftell(fp);
        // Navigate to the beginning of the file so
        // it can be read.
        rewind(fp);
        // Declare array of char with appropriate size.
        char content[filesize + 1];
        // Set last char of array to contain NULL char.
        content[filesize] = '\0';
        // Read the file content.
        fread(content, filesize, 1, fp);
        // Close file stream pointer.
        fclose(fp);
        // Print file content.
        printf("%s\n", content);
    }
    // fopen() call unsuccessful.
    else
    {
        printf("File could not be read.\n");
    }
    return 0;
}

There are some problems I have with this method. My opinion is that this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?

To circumvent this issue, we may use a buffer size and keep on reading into a char array of that size. If filesize is less than the buffer size, then we simply perform fread() once as described in the above code. Otherwise, we divide the total file size by the buffer size and use the integer part of the result as the number of loop iterations, invoking fread() each time and appending the read buffer to a larger string. For the final fread(), which we perform after the loop, we have to read exactly (filesize % buffersize) bytes of data into an array of that size and finally append this array to the larger string (which we would have malloc-ed with filesize + 1 beforehand). I find that if we perform fread() for the last chunk of data using buffersize as its second parameter, then extra garbage data of size (buffersize - chunksize) will be read in and the data might become corrupted. Are my assumptions here correct? Please explain if/how I have overlooked something.

Also, there is the issue that non-ASCII characters might not have size of 1 byte. In that case I would assume the proper amount is being read, but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?


別甾虛僞 2025-01-30 20:27:33


this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?

fread() does not care about strings (null-character-terminated arrays). It reads data as if it were in multiples of unsigned char*1. If the stream was opened in binary mode, it has no special concern for the data content; in text mode, some processing (e.g. end-of-line, byte-order-mark) may occur.

Are my assumptions here correct?

Failed assumptions:

  • Assuming ftell() return value equals the sum of fread() bytes.
    The assumption can be false in text mode (as OP opened the file), and fseek() to the end is technically undefined behavior in binary mode.

  • Assuming not checking the return value of fread() is OK. Use the return value of fread() to know whether an error or end-of-file occurred and how many elements were read.

  • Assuming error checking is not required. ftell(), fread(), and fseek() (used instead of rewind()) all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.

  • Assuming no null characters are read. Reading an entire text file and appending a null character does not necessarily produce one string. Robust code detects and/or copes with embedded null characters.

  • Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF-8 sequences, perhaps after reading the entire file.

  • Extreme: Assuming a file length <= LONG_MAX, the max value returned from ftell(). Files may be larger.
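Several of the points above can be sketched in code. The following is a minimal sketch, not a definitive implementation. The helper name read_file_checked is hypothetical. It keeps OP's size-then-read approach but checks fseek(), ftell(), and fread(), allocates on the heap instead of using a variable-length array, and passes 1 as the element size so that fread() returns a byte count. It assumes the ftell() value is at least the number of bytes that can be read, which holds on common platforms but is not guaranteed for text streams; if the value is too small, the result is truncated rather than overflowed.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads an open, seekable stream from its start into
 * a heap buffer and appends '\0'. Returns NULL on any failure; on success
 * stores the number of bytes actually read in *out_len. */
char *read_file_checked(FILE *fp, size_t *out_len)
{
    if (fseek(fp, 0, SEEK_END) != 0)
        return NULL;                  /* stream is not seekable */
    long size = ftell(fp);
    if (size < 0)
        return NULL;                  /* ftell() failed */
    if (fseek(fp, 0, SEEK_SET) != 0)  /* unlike rewind(), reports errors */
        return NULL;

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL)
        return NULL;

    /* Element size 1, count size: the return value is a byte count. */
    size_t got = fread(buf, 1, (size_t)size, fp);
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[got] = '\0';                  /* terminate after what was read */
    *out_len = got;
    return buf;
}
```

The caller owns the returned buffer and releases it with free(). Because the terminator is placed at got rather than at size, a short read in text mode does not leave uninitialized bytes inside the string.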

but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?

fread() does not operate on multi-byte boundaries, only on multiples of unsigned char. A given fread() may end partway through a multi-byte character, and the next fread() will continue from the middle of that character.
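To illustrate the boundary problem, here is a minimal sketch that assumes the file is UTF-8 (other encodings need different logic). The function name utf8_incomplete_tail is hypothetical. It reports how many bytes at the end of a chunk belong to a multi-byte sequence that has been cut off, so the caller can carry those bytes over to the front of the next chunk.

```c
#include <stddef.h>

/* Hypothetical helper, UTF-8 only: returns the number of trailing bytes in
 * buf[0..len) that start a multi-byte sequence whose remaining bytes have
 * not arrived yet. Returns 0 if the chunk ends on a character boundary
 * (or if the tail is not a recognizable lead byte at all). */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    /* A sequence is at most 4 bytes, so an incomplete one is at most 3. */
    for (size_t i = 1; i <= len && i <= 3; i++) {
        unsigned char b = buf[len - i];
        size_t need;

        if ((b & 0xC0) == 0x80)
            continue;                 /* continuation byte: keep looking back */

        if (b < 0x80)                need = 1;  /* ASCII */
        else if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return 0;                /* invalid lead byte: not handled here */

        return (need > i) ? i : 0;    /* i bytes present, need bytes required */
    }
    return 0;
}
```

A chunked reader would process len minus the returned count, move the leftover bytes to the start of its buffer, and let the next fread() append after them.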


Instead of a 2-pass approach, consider a single pass

// Pseudo code
total_read = 0      
Allocate buffer, say 4096

forever
  if buffer full
    double buffer_size (`realloc()`)
  u = unused portion of buffer 
  fread u bytes into unused portion of buffer
  total_read += number_just_read
  if (number_just_read < u) 
    quit loop

Resize buffer total_read (+ 1 if appending a '\0')
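One possible C rendering of the pseudo code above follows. It is a sketch under stated assumptions: the function name read_all, the initial capacity of 4096, and the choice to return NULL on any failure are illustrative, not part of the original answer.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper following the pseudo code: reads fp to end-of-file
 * in a single pass, doubling the buffer whenever it fills. On success
 * returns a '\0'-terminated heap buffer and stores its length in *out_len. */
char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096;               /* initial size; an arbitrary choice */
    size_t total = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (total == cap) {          /* buffer full: double it */
            if (cap > SIZE_MAX / 2) { free(buf); return NULL; }
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) { free(buf); return NULL; }
            buf = bigger;
            cap *= 2;
        }
        size_t unused = cap - total;
        size_t got = fread(buf + total, 1, unused, fp);
        total += got;
        if (got < unused)
            break;                   /* end-of-file or error */
    }
    if (ferror(fp)) { free(buf); return NULL; }

    /* Resize to fit, with one extra byte for the terminator. */
    char *fit = realloc(buf, total + 1);
    if (fit == NULL) { free(buf); return NULL; }
    fit[total] = '\0';
    *out_len = total;
    return fit;
}
```

This version never calls fseek() or ftell(), so it also works on streams with no certain end, such as pipes. The length comes from the sum of fread() return values rather than from a size measured in advance.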

Alternatively, reconsider the need to read the entire file in before processing the data. I do not know the higher-level goal, but processing data as it arrives often has less resource impact and gives faster throughput.
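As a small illustration of processing data as it arrives, the sketch below counts newline bytes with a fixed-size buffer. The task (counting lines), the function name count_newlines, and the 4096-byte chunk size are assumptions chosen for the example. It also addresses the last-chunk concern from the question: only the bytes reported by fread()'s return value are examined, so a short final read never exposes stale buffer contents.

```c
#include <stdio.h>

/* Hypothetical example: counts '\n' bytes using a fixed-size buffer, so
 * memory use does not grow with the file. Returns -1 on a read error. */
long count_newlines(FILE *fp)
{
    char chunk[4096];
    long lines = 0;
    size_t got;

    /* Ask for a full chunk each time; fread() returns fewer at the end. */
    while ((got = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        for (size_t i = 0; i < got; i++)   /* only the bytes actually read */
            if (chunk[i] == '\n')
                lines++;
    }
    return ferror(fp) ? -1 : lines;
}
```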


Advanced

Text files may be simple ASCII only, defined by an 8-bit code page, or in one of various UTF encodings (with or without a byte-order-mark, etc.). The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.

ASCII and UTF-8 are the most common. IMO, handle 1 or both of those and error out on anything that does not meet their requirements.
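A minimal sketch of the "error out on anything else" idea, assuming the whole input is already in memory. The function name is_valid_utf8 is hypothetical. It accepts plain ASCII and well-formed UTF-8 and rejects truncated sequences, stray continuation bytes, overlong forms, surrogates, and values above U+10FFFF. It does not flag embedded null bytes, which are valid UTF-8; a caller that wants a C string would check for those separately.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra;        /* continuation bytes expected */
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed for this length */

        if (b < 0x80)                { i++; continue; }   /* ASCII */
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;   /* stray continuation or invalid lead byte */

        if (len - i - 1 < extra)
            return false;    /* sequence is cut off at the end of input */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong encodings, out-of-range values, and surrogates. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}
```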


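*1 fread() reads a number of elements (the 3rd argument), each of a given size in bytes (the 2nd argument), and returns the number of complete elements read. In OP's call the element size is filesize and the count is 1, so the return value is 0 or 1, not a byte count. Swapping the two arguments makes the return value the number of bytes read.

//             v ------------- size of each element in bytes
//                       v --- number of elements
fread(content, filesize, 1, fp);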