Remove all but the last 500,000 bytes from a file with the STL

Posted 2024-07-10 01:34:56

Our logging class, when initialised, truncates the log file to 500,000 bytes. From then on, log statements are appended to the file.

We do this to keep disk usage low; we're a commodity end-user product.

Obviously keeping the first 500,000 bytes is not useful, so we keep the last 500,000 bytes.

Our solution has some serious performance problems. What is an efficient way to do this?

8 Answers

梦醒时光 2024-07-17 01:35:09

Widefinder 2 has a lot of talk about efficient IO (or, more accurately, the links under the "Notes" column have a lot of information about efficient IO).

Answering your question:

  1. (Title) Remove first 500,000 bytes from a file with the [standard library]

The standard library is somewhat limited when it comes to filesystem operations. If you're not limited to the standard library you can end a file prematurely very easily (that is, say "everything after this point is no longer part of this file"), but it's very hard to start a file late ("everything before this point is no longer part of this file").

It would be efficient to simply seek 500,000 bytes into the file and then start a buffered copy to a new file. But once you've done that, the standard library doesn't give you much help swapping the result in: C's std::rename (in &lt;cstdio&gt;) exists, but its behaviour when the target already exists is implementation-defined. Native OS functions can rename files efficiently, as can Boost.Filesystem or STLSoft.
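
For instance, a minimal sketch of that seek-and-copy, assuming the log is at least 500,000 bytes; the temporary name "logfile.tmp" is mine, and the std::remove before std::rename is for platforms where rename won't replace an existing file:

#include <cstdio>   // std::remove, std::rename
#include <fstream>

void drop_first_500k(const char *path)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(500000, std::ios::beg);              // skip the first 500,000 bytes
    std::ofstream out("logfile.tmp", std::ios::binary);
    out << in.rdbuf();                            // buffered copy of the remainder
    in.close();
    out.close();
    std::remove(path);                            // rename may fail if the target exists
    std::rename("logfile.tmp", path);
}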

  2. (Actual question) Our logging class, on initialisation, seeks to 500,000 bytes before the end of the file, copies the rest to a std::string and then writes that back to the file.

In this case you're dropping the last bit of the file, and it's very easy to do outside the standard library. Simply use the filesystem operations to set the file size to 500,000 bytes (e.g., ftruncate, SetEndOfFile). Anything after that will be ignored.
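
A minimal sketch of that, assuming POSIX; truncate is the path-based sibling of ftruncate, and on Windows the equivalent is SetFilePointerEx followed by SetEndOfFile:

#include <unistd.h>  // truncate

int shrink_log(const char *path)
{
    return truncate(path, 500000);  // keep the first 500,000 bytes, drop the rest
}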

风渺 2024-07-17 01:35:08

I don't think this is anything computer-related; it's how you've written your logging class. It sounds strange to me that you read the last 500k into a string. Why would you do that?

Just append to the logfile.

  #include <fstream>

  std::fstream myfile;
  myfile.open("test.txt", std::ios::app);  // std::ios::app: every write is appended at the end

小红帽 2024-07-17 01:35:06

So you want the end of the file; you are copying that to some sort of buffer to do what with it? What do you mean by 'writes that back' to the file? Do you mean that it overwrites the file, truncating it on init to 500k bytes of the original plus what it adds?

Suggestions:

  • Rethink what you are doing. If this works and is what is desired, what is wrong with it? Why change? Is there a performance problem? Are you starting to wonder where all your log entries went? For this type of question it helps to describe the actual problem rather than just posting the existing behaviour. No one can fully comment on this unless they know the complete problem, because it is subjective.

  • If it were me and I were tasked with reworking your logging mechanism, I'd build in a mechanism to roll the log files over based on either age or size.

鯉魚旗 2024-07-17 01:35:05

An alternative solution would be to have the logging class detect when the log file size exceeds 500k, open a new log file, and close the old one.

Then the logging class would look at the old files and delete the oldest one.

The logger would have two configuration parameters.

  1. the size threshold (500k) at which to start a new log, and
  2. the number of old logs to keep around.

That way, the log file management would be self-maintaining.
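
A rough sketch of that scheme, assuming POSIX stat; the names rotate_logs, kMaxSize, kKeep and the "logfile.N" suffix convention are mine:

#include <cstdio>   // std::remove, std::rename
#include <string>
#include <sys/stat.h>

const long kMaxSize = 500000;  // threshold at which to start a new log
const int  kKeep    = 3;       // number of old logs to keep around (must be < 10 here)

void rotate_logs(const std::string &base)
{
    struct stat st;
    if (stat(base.c_str(), &st) != 0 || st.st_size < kMaxSize)
        return;                                // current log is still under the threshold
    // Shift base -> base.1 -> base.2 -> ..., dropping the oldest.
    for (int i = kKeep; i >= 1; --i) {
        std::string from = (i == 1) ? base : base + '.' + char('0' + i - 1);
        std::string to   = base + '.' + char('0' + i);
        std::remove(to.c_str());               // rename won't replace an existing file everywhere
        std::rename(from.c_str(), to.c_str());
    }
}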

笔芯 2024-07-17 01:35:04

If you can generate a logfile of several GB between reinitializations, it seems that truncating the file only at initialization will not really help.

I think that I would try to come up with a specialized text file format in order to always replace contents in place, with a pointer to the "current" line wrapping around. You would need a constant line width to allocate the disk space just once, and put the pointer at either the first or last line of this file.

This way, the file would never grow or shrink, and you would always have the last N entries.

Illustration with N=6 ("|" indicates space padding up to the fixed width):

#myapp logfile, lines = 6, width = 80, pointer = 4                              |
[2008-12-01 15:23] foo bakes a cake                                             |
[2008-12-01 16:15] foo has completed baking a cake                              |
[2008-12-01 16:16] foo eats the cake                                            |
[2008-12-01 16:17] foo tells bar: I have made you a cake, but I have eaten it   |
[2008-12-01 13:53] bar would like some cake                                     |
[2008-12-01 14:42] bar tells foo: sudo bake me a cake                           |
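
A minimal sketch of writing into such a file, assuming it was created once at its full fixed size; the header bookkeeping is elided and the names are illustrative:

#include <fstream>
#include <string>

const int kLines = 6, kWidth = 80;  // must match the header line

void append_line(std::fstream &f, int &pointer, const std::string &msg)
{
    std::string line = msg.substr(0, kWidth - 1);
    line.resize(kWidth - 1, ' ');                    // pad with spaces to the fixed width
    line += '\n';
    f.seekp((1 + pointer) * kWidth, std::ios::beg);  // +1 skips the header line
    f.write(line.data(), kWidth);
    pointer = (pointer + 1) % kLines;                // wrap around, overwriting the oldest entry
    // (A real implementation would also rewrite the pointer field in the header.)
}
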
半山落雨半山空 2024-07-17 01:35:02

If you happen to use Windows, don't bother copying parts around. Simply tell Windows you don't need the first bytes anymore by calling FSCTL_SET_SPARSE and FSCTL_SET_ZERO_DATA.
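
A minimal sketch, assuming Windows and a HANDLE opened with write access; error checking is omitted. Note this frees the storage without changing file offsets, so the file still appears to begin with zeros:

#include <windows.h>

void punch_hole(HANDLE h, LONGLONG keepFrom)
{
    DWORD ret = 0;
    // Mark the file sparse so zeroed ranges stop occupying disk space.
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &ret, NULL);
    // Deallocate everything before keepFrom.
    FILE_ZERO_DATA_INFORMATION z;
    z.FileOffset.QuadPart = 0;
    z.BeyondFinalZero.QuadPart = keepFrom;
    DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &z, sizeof(z), NULL, 0, &ret, NULL);
}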

辞取 2024-07-17 01:35:01

I would probably:

  • create a new file.
  • seek in the old file.
  • do a buffered read/write from old file to new file.
  • rename the new file over the old one.

To do the first three steps (error-checking omitted; for example, I can't remember what seekg does if the file is less than 500k big):

#include <fstream>

// Binary mode matters: seeking relative to the end is unreliable on text streams.
std::ifstream ifs("logfile", std::ios::binary);
ifs.seekg(-500*1000, std::ios_base::end);   // position 500,000 bytes before the end
std::ofstream ofs("logfile.new", std::ios::binary);
ofs << ifs.rdbuf();                         // buffered copy of everything from there on

Then I think you have to do something non-standard to rename the file over the old one (std::rename from <cstdio> exists, but whether it replaces an existing target is implementation-defined).

Obviously you need 500k disk space free for this to work, though, so if the reason you're truncating the log file is because it has just filled the disk, this is no good.

I'm not sure why the seek is slow, so I may be missing something; I would not expect seek time to depend on the size of the file. What may depend on the file is whether these functions handle 2GB+ files on 32-bit systems, which I'm not sure about.

If the copy itself is slow, then depending on platform you might be able to speed it up by using a bigger buffer, since this reduces the number of system calls and perhaps more importantly the number of times the disk head has to seek between the read point and the write point. To do this:

#include <vector>

const int bufsize = 64*1024; // or whatever
std::vector<char> buf(bufsize);
...
// NB: on most implementations pubsetbuf only takes effect if it is called
// before the file is opened, so set it up before opening rather than after.
ifs.rdbuf()->pubsetbuf(&buf[0], bufsize);

Test it with different values and see. You could also try increasing the buffer for the ofstream; I'm not sure whether that will make a difference.

Note that using my approach on a "live" logging file is hairy. For example, if a log entry is appended between the copy and the rename, then you lose it forever, and any open handles on the file you're trying to replace could cause problems (the rename will fail on Windows; on Linux it will replace the file, but the old one will still occupy space and still be written to until the handle is closed).

If the truncation is done from the same thread which is doing all the logging, then there's no problem and you can keep it simple. Otherwise you'll need to use a lock, or a different approach.

Whether this is entirely robust depends on the platform and filesystem: move-and-replace may or may not be an atomic operation (POSIX rename is atomic; Windows historically is not), so you may have to rename the old file out of the way, then rename the new file, then delete the old one, with error-recovery that on startup detects a left-over renamed old file and, if found, puts it back and restarts the truncation. The STL can't help you deal with platform differences, but there is boost::filesystem.
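
A rough sketch of that rename-out-of-the-way dance, using C's std::rename; the ".old"/".new" names are illustrative and error handling is elided:

#include <cstdio>  // std::rename, std::remove

void replace_log(const char *log)
{
    // Assumes "logfile.new" already holds the trimmed copy (see above).
    std::rename(log, "logfile.old");  // move the live file out of the way
    std::rename("logfile.new", log);  // put the trimmed copy in its place
    std::remove("logfile.old");       // success: discard the old data
    // On startup, if "logfile.old" still exists the last run died mid-swap:
    // rename it back over the log and redo the truncation.
}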

Sorry there are so many caveats here, but a lot depends on platform. If you're on a PC, then I'm mystified why copying a measly half meg of data takes any time at all.

信愁 2024-07-17 01:35:00
"I would probably create a new file, seek in the old file, do a buffered read/write from old file to new file, rename the new file over the old one."

I think you'd be better off simply:

#include <fstream>
#include <vector>

std::ifstream ifs("logfile", std::ios::binary);  // One call to start it all. . .
ifs.seekg(-512000, std::ios_base::end);          // One call to find it. . . (fails if the file is smaller)
std::vector<char> tmpBuffer(512000);             // heap, not stack: 512K would overflow many stacks
ifs.read(&tmpBuffer[0], 512000);                 // One call to read it all. . .
std::streamsize got = ifs.gcount();              // how many bytes were actually read
ifs.close();
std::ofstream ofs("logfile", std::ios::binary | std::ios::trunc);
ofs.write(&tmpBuffer[0], got);                   // And to the FS bind it.

This avoids the file rename stuff by simply copying the last 512K to a buffer, opening your logfile in truncate mode (clears the contents of the logfile), and spitting that same 512K back into the beginning of the file.

Note that the above code hasn't been tested, but I think the idea should be sound.

You could load the 512K into a buffer in memory, close the input stream, then open the output stream; this way, you wouldn't need two files, since you'd input, close, open, and output the 512K, then go. You avoid the rename / file-relocation magic this way.

If you don't have an aversion to mixing C with C++ to some extent, you could also perhaps:

(Note: error checking omitted; mmap's offset argument must be page-aligned, so the sketch below maps the whole file and indexes into the last 512K)

#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int myfd = open("mylog", O_RDONLY);  // Grab a file descriptor
struct stat st; fstat(myfd, &st);    // Find the file size
char *myptr = (char *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, myfd, 0); // mmap the whole file
std::string mystr(myptr + st.st_size - 512000, 512000); // last 512K straight into the std::string
munmap(myptr, st.st_size);           // Unmap the file
close(myfd);                         // Close the file descriptor

Depending on many things, mmap could be faster than seeking. Googling 'fseek vs mmap' yields some interesting reading about it, if you're curious.

HTH
