fseek / rewind in a loop
I have a situation in code where there is a huge function that parses records line by line, validates them, and writes them to another file.
If there are errors in a record, it calls another function that rejects the record and writes the reject reason.
Due to a memory leak in the program, it crashes with SIGSEGV. One solution to "restart" processing from where it crashed was to write the last processed record to a simple file.
To achieve this, the current record number in the processing loop needs to be written to a file. How do I make sure that the data in that file is overwritten on each iteration of the loop?
Does using fseek to the first position / rewind within a loop degrade performance?
The number of records can be large at times (up to 500K).
Thanks.
EDIT: The memory leak has already been fixed. The restart solution was suggested as an additional safety measure and a means to provide a restart mechanism along with a SKIP n records solution. Sorry for not mentioning it earlier.
When faced with this kind of problem, you can adopt one of two methods:
1. The method you proposed: for each record processed, write out the record number (or the position returned by ftell on the input file) to a separate bookmark file. To ensure that you resume exactly where you left off, so as not to introduce duplicate records, you must fflush after every write (to both the bookmark and output/reject files.) This, and unbuffered write operations in general, slows down the typical (no-failure) scenario significantly. For completeness' sake, note that you have three ways of writing to your bookmark file:
   - fopen(..., "w") / fwrite / fclose - extremely slow
   - rewind / truncate / fwrite / fflush - marginally faster
   - rewind / fwrite / fflush - somewhat faster; you may skip the truncate, since the record number (or ftell position) will always be as long as or longer than the previous record number (or ftell position) and will completely overwrite it, provided you truncate the file once at startup (this answers your original question.)

2. Assume everything will go well in most cases; when resuming after a failure, count the records already present in the output and reject files and skip that many records from the input file. With this method you do not need to fflush files, or at least not so often. You still need to fflush the main output file before switching to writing to the rejects file, and fflush the rejects file before switching back to writing to the main output file (probably a few hundred or thousand times for a 500k-record input.) Simply remove the last unterminated line from the output/reject files; everything up to that line will be consistent.

I strongly recommend method #2. The writing entailed by method #1 (whichever of the three possibilities) is extremely expensive compared to any additional (buffered) reads required by method #2 (fflush can take several milliseconds; multiply that by 500k and you get minutes, whereas counting the number of lines in a 500k-record file takes mere seconds and, what's more, the filesystem cache is working with you, not against you, on that.)

EDIT:
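The fastest bookmark variant above (rewind / fwrite / fflush, truncating once at startup) might be sketched like this; the function name, file name, and record count are illustrative:

```c
#include <stdio.h>

/* Process nrecords records, overwriting a one-line bookmark file with
 * the current record number on every iteration (method #1, variant 3). */
void checkpoint_loop(const char *bookmark_path, long nrecords)
{
    /* "w" truncates the bookmark file once at startup */
    FILE *bm = fopen(bookmark_path, "w");
    if (!bm)
        return;

    for (long recno = 1; recno <= nrecords; recno++) {
        /* ... parse, validate and write the record here ... */

        rewind(bm);                   /* back to offset 0               */
        fprintf(bm, "%ld\n", recno);  /* overwrite the previous number  */
        fflush(bm);                   /* push it down to the OS so it
                                         survives a crash of the process */
    }
    fclose(bm);
}
```

No truncate is needed inside the loop: the record number never gets shorter, so each write fully covers the previous one. Note that fflush() only protects against a crash of the process; surviving an OS or power failure would additionally require fsync() on the file descriptor.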
Just wanted to clarify the exact steps you need to implement method 2:
When writing to the output and rejects files respectively, you only need to flush when switching from writing to one file to writing to the other. To see why these flushes-on-file-switch are necessary, consider this scenario: you write a batch of records to the (buffered) main output file, then write one rejected record to the rejects file without flushing the output file first, and the program crashes. The reject record may have reached the disk while some of the earlier output records were still sitting in the output file's buffer; on resume, the counting logic would treat that reject record as accounting for one of the lost input records, so one record ends up skipped but never written, and the rejected record gets processed and written a second time.
If you are not happy with the interval between the runtime's automatic flushes, you may also do manual flushes every 100 or every 1000 records. This depends on whether processing a record is more expensive than flushing or not (if processing is more expensive, flush often, maybe after each record; otherwise only flush when switching between the output and rejects files.)
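The flush-on-switch rule can be sketched as a small helper; the function and its signature are made up for illustration, with the caller tracking which file was written last:

```c
#include <stdio.h>

/* Write one line to either the output or the rejects file; flush the
 * file we are leaving whenever the target changes, so the on-disk
 * contents of the two files stay mutually consistent. */
void write_record(FILE *out, FILE *rej, const char *line, int ok, FILE **last)
{
    FILE *target = ok ? out : rej;
    if (*last != NULL && *last != target)
        fflush(*last);      /* switching files: flush the one we leave */
    fputs(line, target);
    *last = target;
}
```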
Resuming from failure:

- open each output/reject file and read through it, counting complete records (incrementing records_resume_counter) until you reach the end of the file
- take note of the position at which the last complete record ends (ftell); let's call it last_valid_record_ends_here - you can easily validate that a record is complete by checking that it ends with a record terminator (\n or \r)
- if the last record is incomplete, fseek back to last_valid_record_ends_here and stop reading from this output/reject file; do not increment the counter; proceed to the next output or rejects file unless you've gone through all of them
- open the input file and skip records_resume_counter records from it
- resume processing from there, appending to each output/reject file at the position you fseeked back to (last_valid_record_ends_here) - you will have no duplicate, garbage or missing records.
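The scan described above could look roughly like this; the helper name is made up, and records are assumed to be '\n'-terminated lines:

```c
#include <stdio.h>
#include <string.h>

/* Count complete (newline-terminated) records in one output/reject file
 * and report the offset where the last complete record ends, so the
 * caller can fseek() there and append over any trailing partial line. */
long count_complete_records(const char *path, long *resume_offset)
{
    FILE *f = fopen(path, "r");
    char  buf[4096];
    long  count = 0;

    *resume_offset = 0;
    if (!f)
        return 0;

    while (fgets(buf, sizeof buf, f)) {
        if (strchr(buf, '\n')) {          /* record terminator seen    */
            count++;
            *resume_offset = ftell(f);    /* end of last good record   */
        }
        /* a final chunk without '\n' is a record cut off by the crash */
    }
    fclose(f);
    return count;
}
```

Summing this function's result over the output and reject files gives the number of input records to skip.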
If you can change the code to have it write the last processed record to a file, why can't you change it to fix the memory leak?
It seems to me to be a better solution to fix the root cause of the problem rather than treat the symptoms.
fseek() and fwrite() will degrade the performance, but nowhere near as much as an open/write/close-type operation.

I'm assuming you'll be storing the ftell() value in the second file (so you can pick up where you left off). You should always fflush() that file as well, to ensure that the data is written from the C runtime library down to the OS buffers. Otherwise your SEGV will ensure the value isn't up to date.
Rather than writing out the entire record, it would probably be easier to call ftell() at the beginning of each record and write out the position of the file pointer. When you have to restart the program, fseek() to the last written position in the file and continue.
Of course, fixing the memory leak would be best ;)
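A minimal sketch of that idea, assuming the input is line-oriented; the function name and file paths are illustrative. The offset returned by ftell() at the start of each record goes into a bookmark file, and a restart fseek()s straight back to it:

```c
#include <stdio.h>

/* Read records from input_path, saving the ftell() position of each
 * record's start to a bookmark file before processing it; on restart,
 * resume from the saved position. */
void process_with_bookmark(const char *input_path, const char *bookmark_path)
{
    FILE *in = fopen(input_path, "r");
    FILE *bm;
    long  pos = 0;
    char  line[1024];

    if (!in)
        return;

    /* restart: pick up the last saved position, if any */
    if ((bm = fopen(bookmark_path, "r")) != NULL) {
        if (fscanf(bm, "%ld", &pos) == 1)
            fseek(in, pos, SEEK_SET);
        fclose(bm);
    }

    bm = fopen(bookmark_path, "w");        /* truncate once at startup */
    if (!bm) { fclose(in); return; }

    while ((pos = ftell(in)), fgets(line, sizeof line, in) != NULL) {
        rewind(bm);
        fprintf(bm, "%ld\n", pos);         /* start of current record  */
        fflush(bm);
        /* ... process the record ... */
    }
    fclose(bm);
    fclose(in);
}
```

Saving the position before processing means the record in flight is reprocessed after a restart; save it after processing instead if a record must never be handled twice.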
If you write the last processed position for every record, this will have a noticeable impact on performance, because you will need to commit the write (typically by closing the file) and then reopen the file again. In other words, the fseek is the least of your worries.
I would stop digging a deeper hole and just run the program through Valgrind. Doing so should expose the leak, as well as other problems.