When looking at some of our logging, I noticed in the profiler that we were spending a lot of time in operator<< formatting ints and such. It looks like there is a shared lock that is taken whenever ostream::operator<< is called to format an int (and presumably a double). Upon further investigation I've narrowed it down to this example:
Loop1, which uses ostringstream to do the formatting:
    DWORD WINAPI doWork1(void* param)
    {
        int nTimes = *static_cast<int*>(param);
        for (int i = 0; i < nTimes; ++i)
        {
            ostringstream out;
            out << "[0";
            for (int j = 1; j < 100; ++j)
                out << ", " << j;
            out << "]\n";
        }
        return 0;
    }
Loop2, which uses the same ostringstream for everything except the int formatting, which is done with itoa (_itoa_s):
    DWORD WINAPI doWork2(void* param)
    {
        int nTimes = *static_cast<int*>(param);
        for (int i = 0; i < nTimes; ++i)
        {
            ostringstream out;
            char buffer[13];
            out << "[0";
            for (int j = 1; j < 100; ++j)
            {
                _itoa_s(j, buffer, 10);
                out << ", " << buffer;
            }
            out << "]\n";
        }
        return 0;
    }
For my test I ran each loop a number of times with 1, 2, 3 and 4 threads (I have a 4 core machine). The number of trials is constant. Here is the output:
doWork1: all ostringstream
    n    Total
    1      557
    2     8092
    3    15916
    4    15501

doWork2: use itoa
    n    Total
    1      200
    2      112
    3      100
    4      105
As you can see, the performance when using ostringstream is abysmal. It gets roughly 30 times worse as threads are added (557 to 15916 going from 1 to 3 threads), whereas the itoa version gets about 2 times faster.
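The harness that produced these numbers isn't shown, so the following is only a sketch of how such a measurement could be set up. It uses std::thread instead of the Win32 thread entry points in the loops above so that it is portable. The names formatWithStream and timeWithThreads are made up for illustration, and splitting a fixed total iteration count across the threads is an assumption about what "the number of trials is constant" means.

```cpp
#include <cassert>
#include <chrono>
#include <sstream>
#include <thread>
#include <vector>

// Same work as doWork1 above: build "[0, 1, ..., 99]\n" through ostringstream.
void formatWithStream(int nTimes) {
    for (int i = 0; i < nTimes; ++i) {
        std::ostringstream out;
        out << "[0";
        for (int j = 1; j < 100; ++j)
            out << ", " << j;
        out << "]\n";
    }
}

// Runs `work` on nThreads threads at once and returns the elapsed wall-clock
// time in milliseconds. ASSUMPTION: the total iteration count is split evenly
// across the threads, so every run does about the same amount of work.
long long timeWithThreads(void (*work)(int), int nThreads, int totalIterations) {
    assert(nThreads > 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back(work, totalIterations / nThreads);
    for (std::thread& th : threads)
        th.join();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

Calling timeWithThreads(formatWithStream, n, 100000) for n = 1 to 4 would give one column of a table like the one above. If the formatting path is free of shared locks, the time should drop as n grows rather than climb.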
One idea is to use _configthreadlocale(_ENABLE_PER_THREAD_LOCALE) as recommended by M$ in this article. That doesn't seem to help me. Here's another user who seems to be having a similar issue.
We need to be able to format ints in several threads running in parallel for our application. Given this issue, we either need to figure out how to make this work or find another formatting solution. I may code up a simple class with operator<< overloaded for the integral and floating-point types, plus a templated version that just calls operator<< on the underlying stream. A bit ugly, but I think I can make it work, though maybe not for a user-defined operator<<(ostream&, T), because the class is not an ostream.
I should also make clear that this is being built with Microsoft Visual Studio 2005, and I believe this limitation comes from their implementation of the standard library.
Comments (3)
If Visual Studio 2005's standard library implementation has bugs, why not try other implementations? Like:
or even Dinkumware, on which the Visual Studio 2005 standard library is based; maybe they have fixed the problem since 2005.
Edit: The other user you mentioned used Visual Studio 2008 SP1, which probably means Dinkumware has not fixed this issue.
Doesn't surprise me; MS has put "global" locks on a fair few shared resources. The biggest headache for us was the BSTR memory lock a few years back.
The best thing you can do is copy the code and replace the ostream lock and shared conversion memory with your own class. I have done that in a case where I write the stream using a printf-style logging system (i.e. I had to use a printf logger, and wrapped it with my stream operators). Once you've compiled that into your app, you should be as fast as itoa. When I'm in the office I'll grab some of the code and paste it for you.
EDIT:
as promised:
Sorry I can't let you have all of it, but those 3 methods show the basics: I allocate a buffer, resize it if needed (m_size is the buffer size, m_length is the current text length) and keep it for the lifetime of the logging object. The buffer contents get written to a file (or OutputDebugString, or a listbox) in the endl method. I also have a logging 'level' to restrict output at runtime. So you just replace your calls to ostringstream with this, and the Write() method pumps the buffer to a file and clears the length. Hope this helps.
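The pasted methods did not survive in this copy of the answer, so here is a rough reconstruction of the idea only, not the original code. LogStream, append(), length() and capacity() are hypothetical names; m_size, m_length and Write() follow the description above. The logging level and the endl handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical buffer-backed logger: text accumulates in one growable buffer
// that is kept for the lifetime of the object, and Write() flushes it.
class LogStream {
public:
    explicit LogStream(std::FILE* sink)
        : m_sink(sink), m_buffer(nullptr), m_size(0), m_length(0) {}
    ~LogStream() { std::free(m_buffer); }
    LogStream(const LogStream&) = delete;
    LogStream& operator=(const LogStream&) = delete;

    LogStream& operator<<(const char* text) {
        append(text, std::strlen(text));
        return *this;
    }

    // printf-style conversion into a small stack buffer, then append.
    LogStream& operator<<(int value) {
        char tmp[16];                        // enough for any 32-bit int
        const int n = std::snprintf(tmp, sizeof tmp, "%d", value);
        if (n > 0) append(tmp, static_cast<std::size_t>(n));
        return *this;
    }

    // Pumps the buffer to the sink and clears the length; the memory is kept.
    void Write() {
        if (m_length > 0) std::fwrite(m_buffer, 1, m_length, m_sink);
        m_length = 0;
    }

    std::size_t length() const { return m_length; }
    std::size_t capacity() const { return m_size; }

private:
    void append(const char* data, std::size_t n) {
        if (m_length + n > m_size) {         // grow by doubling when needed
            std::size_t newSize = m_size ? m_size : 256;
            while (newSize < m_length + n) newSize *= 2;
            char* p = static_cast<char*>(std::realloc(m_buffer, newSize));
            if (!p) return;                  // out of memory: drop this text
            m_buffer = p;
            m_size = newSize;
        }
        std::memcpy(m_buffer + m_length, data, n);
        m_length += n;
        assert(m_length <= m_size);
    }

    std::FILE* m_sink;
    char* m_buffer;
    std::size_t m_size;    // buffer size
    std::size_t m_length;  // current text length
};
```

With one such object per thread, the steady state involves no allocation and no stream locale machinery: a typical call is `log << "value=" << 42 << "\n"; log.Write();`.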
The problem could be memory allocation. malloc, which "new" uses, has an internal lock. You can see it if you step into it. Try a thread-local allocator and see if the bad performance disappears.