fseek / rewind in a loop

Posted 2024-07-14 14:11:33

I have a situation in a code where there is a huge function that parses records line-by-line, validates and writes to another file.

In case there are errors in the file, it calls another function that rejects the record and writes the reject reason.

Due to a memory leak in the program, it crashes with SIGSEGV. One solution to kind of "Restart" the file from where it crashed, was to write the last processed record to a simple file.

To achieve this the current record number in the processing loop needs to be written to a file. How do I make sure that the data is overwritten on the file within the loop?

Does using fseek to first position / rewind within a loop degrade the performance?

The number of records can be large at times (up to 500K).

Thanks.

EDIT: The memory leak has already been fixed. The restart solution was suggested as
an additional safety measure and means to provide a restart mechanism along with a SKIP n records solution. Sorry for not mentioning it earlier.

Comments (5)

与风相奔跑 2024-07-21 14:11:33

When faced with this kind of problem, you can adopt one of two methods:

  1. the method you suggested: for each record you read, write out the record number (or the position returned by ftell on the input file) to a separate bookmark file. To ensure that you resume exactly where you left off, as to not introduce duplicate records, you must fflush after every write (to both bookmark and output/reject files.) This, and unbuffered write operations in general, slow down the typical (no-failure) scenario significantly. For completeness' sake, note that you have three ways of writing to your bookmark file:
    • fopen(..., 'w') / fwrite / fclose - extremely slow
    • rewind / truncate / fwrite / fflush - marginally faster
    • rewind / fwrite / fflush - somewhat faster; you may skip truncate since the record number (or ftell position) will always be as long as or longer than the previous record number (or ftell position), and will completely overwrite it, provided you truncate the file once at startup (this answers your original question; see the sketch after this list)
  2. assume everything will go well in most cases; when resuming after failure, simply count the number of records already output (normal output plus rejects), and skip an equivalent number of records from the input file.
    • This keeps the typical (no-failure) scenarios very fast, without significantly compromising performance in case of resume-after-failure scenarios.
    • You do not need to fflush files, or at least not so often. You still need to fflush the main output file before switching to writing to the rejects file, and fflush the rejects file before switching back to writing to the main output file (probably a few hundred or thousand times for a 500k-record input.) Simply remove the last unterminated line from the output/reject files, everything up to that line will be consistent.
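
For illustration, here is a minimal sketch of the third variant (rewind / fwrite / fflush). The file name `bookmark.dat` and the helper name `save_bookmark` are placeholders, not anything from the original code:

```c
#include <stdio.h>

/* Overwrite the bookmark file in place. `bm` is opened once at startup with
 * fopen("bookmark.dat", "w"), which truncates it; after that each value is at
 * least as long as the previous one, so rewinding and rewriting fully covers
 * the old contents and no per-record truncate is needed. */
static int save_bookmark(FILE *bm, long input_pos)
{
    rewind(bm);                               /* back to offset 0                 */
    if (fprintf(bm, "%ld\n", input_pos) < 0)  /* record number or ftell() value   */
        return -1;
    return fflush(bm);                        /* push it past stdio buffering so a
                                                 crash of this process cannot lose it */
}
```

Inside the processing loop you would call something like `save_bookmark(bm, ftell(input))` after each record has been written to the output or rejects file.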

I strongly recommend method #2. The writing entailed by method #1 (whichever of the three possibilities) is extremely expensive compared to any additional (buffered) reads required by method #2 (fflush can take several milliseconds; multiply that by 500k and you get minutes - whereas counting the number of lines in a 500k-record file takes mere seconds and, what's more, the filesystem cache is working with, not against you on that.)


EDIT
Just wanted to clarify the exact steps you need to implement method 2:

  • when writing to the output and rejects files respectively, you only need to flush when switching from writing to one file to writing to the other. Consider the following scenario as an illustration of the necessity of doing these flushes-on-file-switch:

    • suppose you write 1000 records to the main output file, then
    • you have to write 1 line to the rejects file, without manually flushing the main output file first, then
    • you write 200 more lines to the main output file, without manually flushing the rejects file first, then
    • the runtime automatically flushes the main output file for you because you have accumulated a large volume of data in the buffers for the main output file, i.e. 1200 records
      • but the runtime has not yet automatically flushed the rejects file to disk for you, as the file buffer only contains one record, which is not sufficient volume to automatically flush
    • your program crashes at this point
    • you resume and count 1200 records in the main output file (the runtime flushed those out for you), but 0 (!) records in the rejects file (not flushed).
    • you resume processing the input file at record #1201, assuming you only had 1200 records successfully processed to the main output file; the rejected record would be lost, and the 1200'th valid record will be repeated
    • you do not want this!
  • now consider manually flushing after switching output/reject files:
    • suppose you write 1000 records to the main output file, then
    • you encounter one invalid record which belongs to the rejects file; the last record was valid; this means you're switching to writing to the rejects file: flush the main output file before writing to the rejects file
    • you now write 1 line to the rejects file, then
    • you encounter one valid record which belongs to the main output file; the last record was invalid; this means you're switching to writing to the main output file: flush the rejects file before writing to the main output file
    • you write 200 more lines to the main output file, without manually flushing the rejects file first, then
    • assume that the runtime did not automatically flush anything for you, because 200 records buffered since the last manual flush on the main output file are not enough to trigger an automatic flush
    • your program crashes at this point
    • you resume and count 1000 valid records in the main output file (you manually flushed those before switching to the rejects file), and 1 record in the rejects file (you manually flushed before switching back to the main output file).
    • you correctly resume processing the input file at record #1001, which is the first valid record immediately after the invalid record.
    • you reprocess the next 200 valid records because they were not flushed, but you get no missing records and no duplicates either
  • if you are not happy with the interval between the runtime's automatic flushes, you may also do manual flushes every 100 or every 1000 records. This depends on whether processing a record is more expensive than flushing or not (if processing is more expensive, flush often, maybe after each record; otherwise only flush when switching between output/rejects). A combined sketch of this flush-on-switch writing and the resume counting appears after this list.

  • resuming from failure

    • open the output file and the rejects file for both reading and writing, and begin by reading and counting each record (say in records_resume_counter) until you reach the end of file
    • unless you were flushing after each record you are outputting, you will also need to perform a bit of special treatment for the last record in both the output and rejects file:
      • before reading a record from the interrupted output/rejects file, remember the position you are at in the said output/rejects file (use ftell), let's call it last_valid_record_ends_here
      • read the record. validate that the record is not a partial record (i.e. the runtime has not flushed the file up to the middle of a record).
      • if you have one record per line, this is easily verified by checking that the last character in the record is a carriage return or line feed (`\r` or `\n`)
        • if the record is complete, increment the records counter and proceed with the next record (or end of file, whichever comes first.)
        • if the record is partial, fseek back to last_valid_record_ends_here, and stop reading from this output/rejects file; do not increment the counter; proceed to the next output or rejects file unless you've gone through all of them
    • open the input file for reading and skip records_resume_counter records from it
      • continue processing and outputting to the output/rejects file; this will automatically append to the output/rejects file where you left off reading/counting already processed records
      • if you had to perform special processing for partial record flushes, the next record you output will overwrite its partial information from the previous run (at last_valid_record_ends_here) - you will have no duplicate, garbage or missing records.
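
Below is a minimal sketch of method #2 as laid out above. The file names, the `is_valid()` check, and the fixed line-buffer size are assumptions for illustration, and error handling is kept to a minimum:

```c
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define MAXLINE 4096   /* assumed upper bound on record length */

/* Placeholder for the real validation logic. */
static bool is_valid(const char *record) { (void)record; return true; }

/* Count the complete (newline-terminated) records already present in an
 * output or rejects file, leaving the file position at the end of the last
 * complete record so that the next write overwrites any partial tail. */
static long count_complete_records(FILE *f)
{
    char line[MAXLINE];
    long count = 0;
    long last_valid_record_ends_here = ftell(f);

    while (fgets(line, sizeof line, f)) {
        size_t len = strlen(line);
        if (len > 0 && line[len - 1] == '\n') {      /* complete record            */
            count++;
            last_valid_record_ends_here = ftell(f);
        } else {                                     /* partial flush from a crash */
            break;
        }
    }
    fseek(f, last_valid_record_ends_here, SEEK_SET);
    return count;
}

int main(void)
{
    FILE *in      = fopen("input.dat",   "r");
    FILE *out     = fopen("output.dat",  "r+");
    FILE *rejects = fopen("rejects.dat", "r+");
    if (!out)     out     = fopen("output.dat",  "w+");   /* first run: create */
    if (!rejects) rejects = fopen("rejects.dat", "w+");
    if (!in || !out || !rejects) return 1;

    /* Resume: skip as many input records as were already written out. */
    long done = count_complete_records(out) + count_complete_records(rejects);
    char line[MAXLINE];
    for (long i = 0; i < done && fgets(line, sizeof line, in); i++)
        ;

    /* Normal processing, flushing only when switching between the two files. */
    FILE *last = NULL;
    while (fgets(line, sizeof line, in)) {
        FILE *dest = is_valid(line) ? out : rejects;
        if (last && last != dest)
            fflush(last);                 /* flush the file we are switching away from */
        fputs(line, dest);
        last = dest;
    }

    fclose(in);
    fclose(out);
    fclose(rejects);
    return 0;
}
```

Whether you also flush every N records only changes how much work is redone after a crash, not the correctness of the resume.
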
愁以何悠 2024-07-21 14:11:33

If you can change the code to have it write the last processed record to a file, why can't you change it to fix the memory leak?

It seems to me to be a better solution to fix the root cause of the problem rather than treat the symptoms.

fseek() and fwrite() will degrade the performance, but nowhere near as much as an open/write/close-type operation.

I'm assuming you'll be storing the ftell() value in the second file (so you can pick up where you left off). You should always fflush() the file as well to ensure that data is written from the C runtime library down to the OS buffers. Otherwise your SEGV will ensure the value isn't up to date.

温暖的光 2024-07-21 14:11:33

Rather than writing out the entire record, it would probably be easier to call ftell() at the beginning of each record and write the position of the file pointer. When you have to restart the program, fseek() to the last written position in the file and continue.

Of course, fixing the memory leak would be best ;)
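
As a rough sketch of the restart side of this suggestion (the checkpoint file name and format are assumptions; the per-record save would use the same rewind / fprintf / fflush pattern shown in the first answer):

```c
#include <stdio.h>

/* Seek the input file back to the last offset saved by the previous run,
 * if a checkpoint file exists. "checkpoint.dat" is a hypothetical name. */
static void restore_position(FILE *in)
{
    FILE *ckpt = fopen("checkpoint.dat", "r");
    long pos;

    if (ckpt) {
        if (fscanf(ckpt, "%ld", &pos) == 1)
            fseek(in, pos, SEEK_SET);   /* continue where the last run left off */
        fclose(ckpt);
    }
}
```

In the main loop you would save `ftell(in)` to the checkpoint before (or after) processing each record, depending on whether you prefer to reprocess or skip the record that was in flight when the crash happened.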

余厌 2024-07-21 14:11:33

If you write the last processed position for every record, this will have a noticeable impact on performance because you will need to commit the write (typically by closing the file) and then reopen the file again. In other words, the fseek is the least of your worries.

离线来电— 2024-07-21 14:11:33

I would stop digging a deeper hole and just run the program through Valgrind. Doing so should obviate the leak, as well as other problems.
