Using ifstream to process the same file in two threads
I have an input file in my application that contains a vast amount of information. Reading over it sequentially, and at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads with separate and distinct ifstreams reading from two unique file offsets of the same file. I can't just start one ifstream up and then make a copy of it using its copy constructor (since it's uncopyable). So, how do I handle this?
Immediately I can think of two ways:
- Construct a new ifstream for the second thread and open it on the same file.
- Share a single instance of an open ifstream across both threads (using, for instance, boost::shared_ptr<>). Seek to the file offset that the current thread is interested in when that thread gets a time slice.
Is one of these two methods preferred?
Is there a third (or fourth) option that I have not yet thought of?
Obviously I am ultimately limited by the hard drive having to seek back and forth, but what I would like to take advantage of, if possible, is OS-level disk caching at both file offsets simultaneously.
Thanks.
Comments (5)
Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.
If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.
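The two-stream approach can be sketched roughly as follows. This is only an illustration, not code from the answer: the helper names read_range and read_two_ranges, the offsets, and the byte counts are all assumptions made for the example. Each thread opens its own std::ifstream, so no stream object is shared and no locking is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: opens its own stream on `path`, seeks to `offset`,
// and reads up to `count` bytes. Nothing here is shared between threads.
std::vector<char> read_range(const std::string& path, std::streamoff offset,
                             std::size_t count) {
    assert(offset >= 0);
    std::vector<char> buf(count);
    std::ifstream in(path, std::ios::binary);
    if (!in) {                      // could not open the file
        buf.clear();
        return buf;
    }
    in.seekg(offset);
    in.read(buf.data(), static_cast<std::streamsize>(count));
    // Shrink to the bytes actually read (short read near end of file).
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Hypothetical driver: reads two ranges of the same file concurrently,
// one std::ifstream per thread.
void read_two_ranges(const std::string& path,
                     std::streamoff off_a, std::streamoff off_b,
                     std::size_t count,
                     std::vector<char>& out_a, std::vector<char>& out_b) {
    std::thread ta([&] { out_a = read_range(path, off_a, count); });
    std::thread tb([&] { out_b = read_range(path, off_b, count); });
    ta.join();
    tb.join();
}
```

Whether this actually beats a single sequential reader depends on the disk and the OS, so it is worth measuring on the real data.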
Between the two, I would prefer the second. Opening the same file twice might give the two streams inconsistent views of the file, depending on the underlying OS.
For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, a raw pointer or reference is fine.
Finally, note that on the vast majority of hardware the disk is the bottleneck, not the CPU, when loading large files. Using two threads will make this worse because you're turning a sequential file access into a random access. Typical hard disks can do maybe 100 MB/s sequentially, but top out at 3 or 4 MB/s for random access.
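A minimal sketch of the shared-stream option with borrowed ownership might look like this. The names read_at and read_from_two_threads are hypothetical, and the mutex is an assumption added for the example: the answer does not mention one, but seek and read have to happen as a single step if two threads use the same stream.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical helper: seek and read under one lock, so another thread
// cannot move the file position between the two calls.
std::string read_at(std::ifstream& in, std::mutex& guard,
                    std::streamoff offset, std::size_t count) {
    assert(offset >= 0);
    std::lock_guard<std::mutex> lock(guard);
    std::string buf(count, '\0');
    in.clear();                     // drop any eof/fail state left by an earlier read
    in.seekg(offset);
    in.read(&buf[0], static_cast<std::streamsize>(count));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// The caller owns `in`. The worker thread only borrows it by reference,
// so the stream must outlive the thread; join() guarantees that here.
void read_from_two_threads(std::ifstream& in,
                           std::streamoff off_a, std::streamoff off_b,
                           std::size_t count,
                           std::string& out_a, std::string& out_b) {
    std::mutex guard;
    std::thread worker([&] { out_b = read_at(in, guard, off_b, count); });
    out_a = read_at(in, guard, off_a, count);
    worker.join();
}
```

Because every read is serialized by the lock, this gives no I/O parallelism; it only keeps the two threads from corrupting each other's file position.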
Other option:
- [...] (istrstream is good for this; istringstream is not).
It really depends on your system. A modern system will generally read
ahead; seeking within the file is likely to inhibit this, so it should
definitely be avoided.
It might be worth experimenting with how read-ahead works on your system:
open the file, then read the first half of it sequentially, and see how
long that takes. Then open it, seek to the middle, and read the second
half sequentially. (On some systems I've seen in the past, a simple
seek, at any time, will turn off read-ahead.) Finally, open it, then
read every other record; this will simulate two threads using the same
file descriptor. (For all of these tests, use fixed length records, and
open in binary mode. Also take whatever steps are necessary to ensure
that any data from the file is purged from the OS's cache before
starting the test; under Unix, copying a file of 10 or 20 gigabytes
to /dev/null is usually sufficient for this.)
That will give you some ideas, but to be really certain, the best
solution would be to test the real cases. I'd be surprised if sharing a
single ifstream (and thus a single file descriptor), and constantly
seeking, won, but you never know.
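The three measurements described above could be driven by a small helper along these lines. This is a sketch under assumptions: the Timing struct, the timed_read name, and the record-size and stride parameters are invented for the illustration, and the code does not purge the OS cache, which still has to be done separately before each run.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

struct Timing {
    std::size_t bytes = 0;   // bytes actually read
    double seconds = 0.0;    // wall-clock time spent reading
};

// Hypothetical helper: reads `records` fixed-length records starting at
// byte `start`. With stride == 1 the read is purely sequential; with
// stride == 2 a relative seek skips every other record.
Timing timed_read(const std::string& path, std::streamoff start,
                  std::size_t records, std::size_t record_size,
                  std::size_t stride) {
    assert(record_size > 0 && stride > 0);
    std::ifstream in(path, std::ios::binary);
    std::vector<char> record(record_size);
    Timing t;
    const auto begin = std::chrono::steady_clock::now();
    if (start != 0) in.seekg(start);    // no seek at all for the first test
    for (std::size_t i = 0; i < records && in; ++i) {
        in.read(record.data(), static_cast<std::streamsize>(record_size));
        t.bytes += static_cast<std::size_t>(in.gcount());
        if (stride > 1) {
            const std::size_t skip = (stride - 1) * record_size;
            in.seekg(static_cast<std::streamoff>(skip), std::ios::cur);
        }
    }
    const auto end = std::chrono::steady_clock::now();
    t.seconds = std::chrono::duration<double>(end - begin).count();
    return t;
}
```

For a file of N records, the three tests would be timed_read(path, 0, N / 2, size, 1) for the first half, the same call with start set to the middle of the file for the second half, and timed_read(path, 0, N / 2, size, 2) for every other record.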
I'd also recommend system specific solutions like mmap, but if you've
got that much data, there's a good chance you won't be able to map it
all in one go anyway. (You can still use mmap, mapping sections of it
at a time, but it becomes a lot more complicated.)
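To give an idea of what mapping one section at a time involves, here is a rough POSIX-only sketch. The helper name read_window is an assumption for the example, and copying the bytes into a std::string is done only to keep the example small; real code would more likely work on the mapped memory directly. The main complication is that the file offset passed to mmap must be a multiple of the page size, so the requested offset has to be rounded down.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: returns up to `length` bytes starting at `offset`,
// mapping only the pages that cover that range.
std::string read_window(const std::string& path, off_t offset,
                        std::size_t length) {
    std::string result;
    const int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return result;

    struct stat st;
    if (::fstat(fd, &st) != 0 || offset < 0 || offset >= st.st_size ||
        length == 0) {
        ::close(fd);
        return result;
    }
    // Clamp the window to the end of the file.
    const std::size_t available = static_cast<std::size_t>(st.st_size - offset);
    if (length > available) length = available;

    // mmap needs a page-aligned file offset: round down and remember
    // how many leading bytes to skip inside the mapping.
    const off_t page = static_cast<off_t>(::sysconf(_SC_PAGESIZE));
    const off_t aligned = offset - (offset % page);
    const std::size_t lead = static_cast<std::size_t>(offset - aligned);
    const std::size_t map_len = lead + length;

    void* base = ::mmap(nullptr, map_len, PROT_READ, MAP_PRIVATE, fd, aligned);
    ::close(fd);                    // the mapping stays valid after close
    if (base == MAP_FAILED) return result;

    result.assign(static_cast<const char*>(base) + lead, length);
    ::munmap(base, map_len);
    return result;
}
```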
Finally, would it be possible to get the data already cut up into
smaller files? That might be the fastest solution of all. (Ideally,
this would be done where the data is generated or imported into the
system.)
My vote would be a single reader, which hands the data to multiple worker threads.
If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.
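A single reader feeding several workers could be sketched like this. Everything here is an assumption made for illustration: the ChunkQueue class, the process_file function, the chunk size, and the placeholder "work" of counting bytes are not from the answer. The queue is unbounded to keep the example short; a real implementation would probably cap its size so the reader cannot run arbitrarily far ahead of the workers.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal unbounded queue shared by one reader and several workers.
class ChunkQueue {
public:
    void push(std::string chunk) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(chunk));
        }
        cv_.notify_one();
    }
    void close() {                  // signals that no more chunks will arrive
        {
            std::lock_guard<std::mutex> lock(m_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until a chunk is available; returns false once the queue
    // is closed and fully drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

// The calling thread reads the file sequentially; `workers` threads consume
// the chunks. As a stand-in for real work, each worker counts its bytes.
std::size_t process_file(const std::string& path, std::size_t chunk_size,
                         unsigned workers) {
    assert(chunk_size > 0 && workers > 0);
    ChunkQueue queue;
    std::vector<std::size_t> counts(workers, 0);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        pool.emplace_back([&queue, &counts, i] {
            std::string chunk;
            while (queue.pop(chunk)) counts[i] += chunk.size();
        });
    }
    std::ifstream in(path, std::ios::binary);
    std::string buf(chunk_size, '\0');
    // The second condition lets the final, shorter chunk through.
    while (in.read(&buf[0], static_cast<std::streamsize>(chunk_size)) ||
           in.gcount() > 0) {
        queue.push(buf.substr(0, static_cast<std::size_t>(in.gcount())));
    }
    queue.close();
    for (auto& t : pool) t.join();
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}
```

With this layout the disk only ever sees one sequential access pattern, and the parallelism is limited to the processing of chunks already in memory.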