I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work; I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
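(For concreteness, here is a minimal sketch of that server-side publish step. The field layout and the RELEASE_MEMORY_BARRIER definition are assumptions; the real definitions are in the pastebin link below. The point is only the ordering: payload first, fence, then the flag.)

enum State { NOT_FILLED_YET, FILLED };

struct Data
{
    volatile int state_;   // stays NOT_FILLED_YET until the node is published
    long tv_sec_;          // server-side gettimeofday() stamp, read by log_latency
    long tv_usec_;
    // ... payload, plus an offset-style next_ pointer into the shared segment ...
};

#define RELEASE_MEMORY_BARRIER __sync_synchronize() // assumed GCC builtin

void publish(Data* node)
{
    // 1. Fill in the payload, the timestamps, and next_ first.
    // 2. Fence so those stores are visible before the flag flips.
    RELEASE_MEMORY_BARRIER;
    node->state_ = FILLED; // clients acquire-read this flag
}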
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
    // One cursor per channel, each starting at that list's current tail.
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            Data* current_item = channel_cursors[i];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}
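(For reference, log_latency isn't shown here; presumably the server stamps each node with gettimeofday() at fill time and the client diffs that against its own clock. A sketch under that assumption:)

#include <sys/time.h>
#include <iostream>

// Hypothetical log_latency: diff the server's fill-time stamp against
// the client's current gettimeofday().
void log_latency(long tv_sec, long tv_usec)
{
    timeval now;
    gettimeofday(&now, 0);
    long latency_us = (now.tv_sec - tv_sec) * 1000000L + (now.tv_usec - tv_usec);
    std::cout << "latency: " << latency_us << "us" << std::endl;
}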
Method 1 gave very low latency when the number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }
    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        // Spin on the single update list instead of scanning every channel.
        ACQUIRE_MEMORY_BARRIER;
        if(update_cursor->state_ == NOT_FILLED_YET) {
            continue;
        }

        ::uint32_t update_id = update_cursor->list_id_;
        Data* current_item = channel_cursors[update_id];
        if(current_item->state_ == NOT_FILLED_YET) {
            std::cerr << "This should never print." << std::endl; // it doesn't
            continue;
        }

        log_latency(current_item->tv_sec_, current_item->tv_usec_);
        channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
        update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
    }
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often give 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propagate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }
    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            std::size_t idx = i;

            ACQUIRE_MEMORY_BARRIER;
            if(update_cursor->state_ != NOT_FILLED_YET) {
                //std::cerr << "Found via update" << std::endl;
                i--; // this pass was spent on an update-list entry; revisit channel i next time
                idx = update_cursor->list_id_;
                update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
            }

            Data* current_item = channel_cursors[idx];
            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            found_an_update = true;
            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.
Comments (4)
Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that by having three different threads hammering this one memory location as hard as they can with read-access requests, you're approximately creating a denial-of-service attack on the server, occasionally preventing it from writing to that cache line for a few milliseconds.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
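(A sketch of that experiment under assumed names: stamp the server's publish of an update-list node and log any store that takes suspiciously long to complete. The barrier after the flag store is there to drain the store before taking the end stamp.)

#include <stdint.h>
#include <sys/time.h>
#include <iostream>

inline long long now_us()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

// Hypothetical server-side instrumentation around the update-list write.
void timed_publish(UpdateID* node, uint32_t list_id)
{
    long long before = now_us();
    node->list_id_ = list_id;
    __sync_synchronize();              // release the payload
    node->state_ = FILLED;
    __sync_synchronize();              // force the store out before timing
    long long after = now_us();
    if(after - before > 1000)          // anything over 1ms is suspicious
        std::cerr << "slow update write: " << (after - before) << "us" << std::endl;
}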
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
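(On Linux that experiment might look like the sketch below: pin the calling process to core 0 with sched_setaffinity and run it in both the client and the server. This is Linux-specific and the function name is illustrative.)

#include <sched.h>
#include <cstdio>

// Pin the calling process to core 0 so client and server share one cache.
void pin_to_core_zero()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if(sched_setaffinity(0, sizeof(set), &set) != 0) // 0 = calling process
        std::perror("sched_setaffinity");
}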
I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed, Method2 seems better here on paper, because the ID being smaller than the data would generally mean fewer round-trips to main memory (which is taxing). However, what can actually happen is false sharing: several consecutive update-list entries packed into a single cache line, roughly:

[ ID1 | ID2 | ID3 | ID4 | ... ]   <- one cache line, written by the server, spun on by the client

Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
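(A minimal sketch of that fix, assuming 64-byte cache lines; the stand-in UpdateID layout is illustrative, not the question's actual definition:)

#include <stdint.h>

// Illustrative stand-in for the shared-memory update-list entry.
struct UpdateID
{
    volatile int state_;
    uint32_t list_id_;
    // ... offset-style next_ pointer ...
};

// Pad each entry out to a full cache line so adjacent entries that the
// server writes and the client reads never falsely share a line.
// Assumes sizeof(UpdateID) < 64.
struct PaddedUpdateID : UpdateID
{
    char padding_[64 - sizeof(UpdateID)];
};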
Check out the other articles in the series while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer that I think could help with your update list :)
I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions? Either way, method 2 doesn't have any sleeps, which means the busy-wait loop on update_cursor->state_ is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi-threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.
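(A generic sketch of that pattern with a pthreads condition variable instead of the spin. The names are illustrative, and for the question's two-process setup the mutex and condvar would additionally need PROCESS_SHARED attributes and would have to live in shared memory.)

#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
int pending_updates = 0;

void producer_signal()   // server side, after publishing a node
{
    pthread_mutex_lock(&mtx);
    ++pending_updates;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
}

void consumer_wait()     // client side, sleeps instead of spinning
{
    pthread_mutex_lock(&mtx);
    while(pending_updates == 0)       // loop guards against spurious wakeups
        pthread_cond_wait(&cv, &mtx);
    --pending_updates;
    pthread_mutex_unlock(&mtx);
}

Note the question's first update, though: the asker reports that a condition-wait version didn't improve the latency of any of the methods.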
The answer was tricky to figure out, and to be fair it would be hard to get with the information I presented, though if anyone had actually compiled the source code I provided they'd have had a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scroll back in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag, because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case, where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:

A B A B A B ... A B

But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because, by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.