It could be a limitation of the current scheduler. Google "Galbraith's sched:autogroup patch" or "linux miracle patch" (yes really!). There's apparently a 200-line patch in the process of being refined and merged which adds group scheduling, about which Linus says:
I'm also very happy with just what it does to interactive performance. Admittedly, my "testcase" is really trivial (reading email in a web-browser, scrolling around a bit, while doing a "make -j64" on the kernel at the same time), but it's a test-case that is very relevant for me. And it is a huge improvement.

Before-and-after videos here.
Because copying a large file (bigger than the available buffer cache) usually involves bringing it through the buffer cache, which generally causes less recently used pages to be thrown out, which must then be brought back in.

Other processes which are doing small amounts of occasional IO (say, just stat'ing a directory) then get their caches blown away and must do physical reads to bring those pages back in.

Hopefully this can be fixed by a copy command that detects this kind of thing and advises the kernel accordingly (e.g. with posix_fadvise), so that a large one-off bulk transfer of a file that does not need to be subsequently read does not completely discard all clean pages from the buffer cache, which is what normally happens now.
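For illustration, here's a minimal sketch of what such a copy command could do on Linux. The file names and chunk size are arbitrary placeholders, and this is the idea, not what cp actually does:

    #define _XOPEN_SOURCE 600   /* for posix_fadvise */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical file names, for illustration only. */
        int in  = open("big-input-file",  O_RDONLY);
        int out = open("big-output-file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        /* Hint up front: we will read the input once, sequentially. */
        posix_fadvise(in, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[1 << 20];   /* copy in 1 MiB chunks */
        off_t done = 0;
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0) {
            if (write(out, buf, n) != n) { perror("write"); return 1; }
            done += n;
            /* Tell the kernel we're finished with these pages, so it can
             * drop them instead of evicting other processes' cached data. */
            posix_fadvise(in, 0, done, POSIX_FADV_DONTNEED);
            fdatasync(out);  /* DONTNEED only drops clean pages, so flush first */
            posix_fadvise(out, 0, done, POSIX_FADV_DONTNEED);
        }
        close(in);
        close(out);
        return n < 0 ? 1 : 0;
    }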
A high rate of IO operations usually means a high rate of interrupts that must be serviced by the CPU, which takes CPU time.
In the case of cp, it also uses a considerable amount of the available memory bandwidth, as each block of data is copied to and from userspace. This also tends to evict data required by other processes from the CPU's caches and TLB, which slows those processes down as they take cache misses.
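As an aside, one way a copy can avoid bouncing every block through a userspace buffer is sendfile(2), which moves the data inside the kernel (this needs a kernel recent enough that the destination may be a regular file rather than a socket). A rough sketch, again with placeholder file names:

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int in  = open("big-input-file",  O_RDONLY);
        int out = open("big-output-file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct stat st;
        if (in < 0 || out < 0 || fstat(in, &st) < 0) { perror("setup"); return 1; }

        off_t offset = 0;
        while (offset < st.st_size) {
            /* sendfile advances offset by the number of bytes transferred. */
            ssize_t n = sendfile(out, in, &offset, st.st_size - offset);
            if (n <= 0) { perror("sendfile"); return 1; }
        }
        close(in);
        close(out);
        return 0;
    }

This removes the per-block userspace copy, though it does not by itself address the buffer-cache eviction discussed above.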
To do with interrupts, I'm guessing that caf's hypothesis is right; the statistic you'd need to test it is the number of interrupts per second per CPU. I don't know whether it's possible to tie interrupts to a single CPU: see http://www.google.com/#q=cpu+affinity+interrupt for further details.

Here's something I don't understand (this is the first time I've looked at this question): perfmon on my laptop (running Windows Vista) is showing 2000 interrupts/second (1000 on each core) when it's almost idle (doing nothing but displaying perfmon). I can't imagine which device is generating 2000 interrupts/second, and I would have thought that's enough to blow away the CPU caches (my guess is that the CPU quantum for a busy thread is something like 50 msec). It's also showing an average of 350 DPCs/sec.
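On Linux, the per-CPU interrupt counts are exposed in /proc/interrupts, so one way to collect that statistic is to sample the file once a second and diff the totals. A rough sketch (the parsing assumes the usual column layout; for a quick look you can just watch the file directly):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_CPUS 64

    /* Sum each CPU's column over all IRQ lines. Returns the CPU count. */
    static int sample(unsigned long long totals[MAX_CPUS])
    {
        FILE *f = fopen("/proc/interrupts", "r");
        char line[4096];
        int ncpu = 0;

        if (!f) { perror("/proc/interrupts"); exit(1); }
        memset(totals, 0, MAX_CPUS * sizeof totals[0]);

        /* Header row: one "CPUn" label per online CPU. */
        if (fgets(line, sizeof line, f))
            for (char *p = strstr(line, "CPU"); p; p = strstr(p + 3, "CPU"))
                ncpu++;

        while (fgets(line, sizeof line, f)) {
            char *p = strchr(line, ':');     /* skip the IRQ label */
            if (!p) continue;
            p++;
            for (int c = 0; c < ncpu && c < MAX_CPUS; c++) {
                char *end;
                unsigned long long v = strtoull(p, &end, 10);
                if (end == p) break;         /* ERR:/MIS: lines have fewer columns */
                totals[c] += v;
                p = end;
            }
        }
        fclose(f);
        return ncpu;
    }

    int main(void)
    {
        unsigned long long prev[MAX_CPUS], cur[MAX_CPUS];
        int ncpu = sample(prev);

        for (;;) {                           /* stop with Ctrl-C */
            sleep(1);
            sample(cur);
            for (int c = 0; c < ncpu; c++) {
                printf("cpu%d: %llu int/s  ", c, cur[c] - prev[c]);
                prev[c] = cur[c];
            }
            putchar('\n');
        }
    }

Comparing its output while the machine is idle and while a large copy is running would show whether the interrupt rate actually spikes.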
Does high-end hardware suffer from similar issues?
One type of hardware difference might be the disk hardware and disk device driver, generating more or fewer interrupts and/or other contentions.