I think STL is tripling my application's memory usage

Posted 2024-07-10 19:58:21


I am inputting a 200 MB file into my application, and for a very strange reason its memory usage is more than 600 MB. I have tried vector and deque, as well as std::string and char *, to no avail. I need my application's memory usage to be almost the same as the size of the file I am reading, so any suggestions would be extremely helpful.
Is there a bug that causes so much memory consumption? Can you pinpoint the problem, or should I rewrite the whole thing?

Windows Vista SP1 x64, Microsoft Visual Studio 2008 SP1, 32-bit release build, Intel CPU

The whole application so far:

#include <string>
#include <vector>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <iterator>
#include <algorithm>
#include <time.h>



static unsigned int getFileSize (const char *filename)
{
    std::ifstream fs;
    fs.open (filename, std::ios::binary);
    fs.seekg(0, std::ios::beg);
    const std::ios::pos_type start_pos = fs.tellg();
    fs.seekg(0, std::ios::end);
    const std::ios::pos_type end_pos = fs.tellg();
    const unsigned int ret_filesize (static_cast<unsigned int>(end_pos - start_pos));
    fs.close();
    return ret_filesize;
}
void str2Vec (std::string &str, std::vector<std::string> &vec)
{
    int newlineLastIndex(0);
    for (int loopVar01 = str.size(); loopVar01 > 0; loopVar01--)
    {
        if (str[loopVar01]=='\n')
        {
            newlineLastIndex = loopVar01;
            break;
        }
    }
    int remainder(str.size()-newlineLastIndex);

    std::vector<int> indexVec;
    indexVec.push_back(0);
    for (unsigned int lpVar02 = 0; lpVar02 < (str.size()-remainder); lpVar02++)
    {
        if (str[lpVar02] == '\n')
        {
            indexVec.push_back(lpVar02);
        }
    }
    int memSize(0);
    for (int lpVar03 = 0; lpVar03 < (indexVec.size()-1); lpVar03++)
    {
        memSize = indexVec[(lpVar03+1)] - indexVec[lpVar03];
        std::string tempStr (memSize,'0');
        memcpy(&tempStr[0],&str[indexVec[lpVar03]],memSize);
        vec.push_back(tempStr);
    }
}
void readFile(const std::string &fileName, std::vector<std::string> &vec)
{
    static unsigned int fileSize = getFileSize(fileName.c_str());
    static std::ifstream fileStream;
    fileStream.open (fileName.c_str(),std::ios::binary);
    fileStream.clear();
    fileStream.seekg (0, std::ios::beg);
    const int chunks(1000); 
    int singleChunk(fileSize/chunks);
    int remainder = fileSize - (singleChunk * chunks);
    std::string fileStr (singleChunk, '0');
    int fileIndex(0);
    for (int lpVar01 = 0; lpVar01 < chunks; lpVar01++)
    {
        fileStream.read(&fileStr[0], singleChunk);
        str2Vec(fileStr, vec);
    }
    std::string remainderStr(remainder, '0');
    fileStream.read(&remainderStr[0], remainder);
    str2Vec(fileStr, vec);      
}
int main (int argc, char *argv[])
{   
        std::vector<std::string> vec;
        std::string inFile(argv[1]);
        readFile(inFile, vec);
}

Comments (14)

平生欢 2024-07-17 19:58:21


Your memory is being fragmented.

Try something like this:

  #include <windows.h>

  HANDLE heaps[1025];
  DWORD nheaps = GetProcessHeaps((sizeof(heaps) / sizeof(HANDLE)) - 1, heaps);

  for (DWORD i = 0; i < nheaps; ++i)
  {
    // 2 == HEAP_LFH: switch this heap to the low-fragmentation heap.
    ULONG HeapFragValue = 2;
    HeapSetInformation(heaps[i],
                       HeapCompatibilityInformation,
                       &HeapFragValue,
                       sizeof(HeapFragValue));
  }
鹤舞 2024-07-17 19:58:21


If I'm reading this right, the biggest issue is that this algorithm automatically doubles the required memory.

In readFile(), you read the whole file into a set of 'singleChunk'-sized strings (chunks), and then in the last loop of str2Vec() you allocate a temporary string for every newline-separated segment of the chunk. So you're doubling the memory right there.

You've also got a speed issue - str2Vec() makes 2 passes over the chunk to find all the newlines. There's no reason you can't do it in one pass, as sketched below.
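
A minimal single-pass version might look like this (str2VecOnePass is a hypothetical replacement for the asker's str2Vec; anything after the last newline is deliberately left for the caller to carry into the next chunk):

#include <string>
#include <vector>

void str2VecOnePass(const std::string &str, std::vector<std::string> &vec)
{
    std::string::size_type lineStart = 0;
    for (std::string::size_type i = 0; i < str.size(); ++i)
    {
        if (str[i] == '\n')
        {
            // One substr per line replaces the index vector plus memcpy.
            vec.push_back(str.substr(lineStart, i - lineStart));
            lineStart = i + 1;
        }
    }
    // Characters from lineStart to the end are a partial line; the caller
    // should prepend them to the next chunk, as the original remainder
    // logic intends.
}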

小嗷兮 2024-07-17 19:58:21


Another thing you could do is to load the entire file into one block of memory. Then make a vector of pointers to the first character of each line, and at the same time, replace the newline with a \0 so it's null-terminated. (Presuming of course that your strings aren't supposed to have \0 in them.)

It's not necessarily as convenient as having a vector of strings, but having a vector of const char* is potentially "just as good."
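
A rough sketch of that approach, assuming the file opens successfully and is non-empty (all names are illustrative):

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char *argv[])
{
    // Read the whole file into one contiguous buffer.
    std::ifstream file(argv[1], std::ios::binary);
    file.seekg(0, std::ios::end);
    std::vector<char> buffer(static_cast<std::size_t>(file.tellg()));
    file.seekg(0, std::ios::beg);
    file.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));

    // Terminate each line in place and keep only pointers: no per-line
    // string allocations, so the buffer is the only large allocation.
    std::vector<const char*> lines;
    const char *lineStart = &buffer[0];
    for (std::size_t i = 0; i < buffer.size(); ++i)
    {
        if (buffer[i] == '\n')
        {
            buffer[i] = '\0';
            lines.push_back(lineStart);
            lineStart = &buffer[i] + 1;
        }
    }
    std::printf("%lu lines\n", static_cast<unsigned long>(lines.size()));
    return 0;
}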

耳根太软 2024-07-17 19:58:21


The STL containers exist to abstract away memory operations. If you have a hard memory limit, then you can't really abstract those away.

I would recommend using mmap() to read the file in (or, in Windows, MapViewOfFile()).
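
On the asker's platform, that could look roughly like this (minimal error handling; a sketch rather than a drop-in replacement):

#include <windows.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    // Open, map, and view the file read-only; the OS pages it in on
    // demand, so the process never owns a second in-memory copy.
    HANDLE file = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL) { CloseHandle(file); return 1; }

    const char *data = static_cast<const char*>(
        MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0)); // 0,0,0 = whole file
    DWORD size = GetFileSize(file, NULL);

    // ... scan data[0..size) for '\n' and store offsets, not copies ...
    std::printf("mapped %lu bytes\n", static_cast<unsigned long>(size));

    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}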

帅的被狗咬 2024-07-17 19:58:21


Inside readFile you have at least 2 copies of your file - the ifstream, and the data copied into your std::vector . As long as you have the file open, and you're copying it like you are, it's going to be hard to get the total memory footprint down below double the file size.

ζ澈沫 2024-07-17 19:58:21


First, how are you determining memory usage? Task Manager is not a suitable tool for that, as what it shows isn't actually memory usage.

Second, apart from your (for some reason?) static variables, the only data that is not freed when you're done reading the file is the vector. So test its capacity, and test the capacity of each string it contains. Find out how much memory each of them uses. You have the tools to determine where the memory is being spent.
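
A small helper along those lines (a sketch; reportStorage is an illustrative name, and per-allocation heap overhead isn't counted). Calling it after readFile() returns shows whether the bytes live in the strings or in the vector's spare capacity:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Sum the storage reserved by the vector itself and by every string in it.
void reportStorage(const std::vector<std::string> &vec)
{
    std::size_t total = vec.capacity() * sizeof(std::string);
    for (std::size_t i = 0; i < vec.size(); ++i)
        total += vec[i].capacity();   // heap bytes owned by each string
    std::cout << "vector slots: " << vec.capacity()
              << ", approx bytes: " << total << "\n";
}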

徒留西风 2024-07-17 19:58:21


I think your attempt to write your own buffering strategy is misguided.

The streams have a very good buffering strategy already implemented. If you think you need a larger buffer you can install a basic buffer into the stream without any extra code to control the buffer.

Here is what I came up with:
NB: tested with a text version of the "King James Bible" that I found online.

#include <string>
#include <vector>
#include <list>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <iostream>

class Line: public std::string
{
};

std::istream& operator>>(std::istream& in,Line& line)
{
    // Relatively efficient way to copy a line into a string.
    return std::getline(in,line);
}
std::ostream& operator<<(std::ostream& out,Line const& line)
{
    return out << static_cast<std::string const&>(line) << "\n";
}

void readLinesFromStream(std::istream& stream,std::vector<Line>& lines)
{
    /*
     * Read into a list as this is flexible in memory usage and will not
     * allocate huge chunks of un-required space.
     *
     * Even with huge files the space for list will be insignificant
     * compared to the size of the data.
     *
     * This then allows us to reserve the correct size of the vector
     * Thus avoiding huge memory chunks being prematurely allocated that
     * are not required. It also prevents the internal structure from
     * being copied every time the container is re-sized.
     */
    std::list<Line>     data;
    std::copy(  std::istream_iterator<Line>(stream),
                std::istream_iterator<Line>(),
                std::inserter(data,data.end())
             );

    /*
     * Reserve the correct size in the vector.
     * then copy out of the list into the vector
     */
    lines.reserve(data.size());
    std::copy(  data.begin(),
                data.end(),
                std::back_inserter(lines)
             );
}

void readLinesFromFile(std::string const& name,std::vector<Line>& lines)
{
    /*
     * Set up the file stream and override the default buffer used by the stream.
     * Make it big because we think the istream buffer is insufficient!!!!
     */
    std::ifstream       file;
    std::vector<char>   buffer(10000);
    file.rdbuf()->pubsetbuf(&buffer[0],buffer.size());

    file.open(name.c_str());
    readLinesFromStream(file,lines);
}


int main(int argc,char* argv[])
{
    std::vector<Line>   lines;
    readLinesFromFile(argv[1],lines);

    // Un-comment if your file is larger than 1100 lines.

    // I tested with a copy of the King James bible. 
    // std::cout << "Lines: " << lines.size() << "\n";
    // std::copy(lines.begin() + 1000,lines.begin() + 1100,std::ostream_iterator<Line>(std::cout));
}
夜清冷一曲。 2024-07-17 19:58:21

  1. Do not use std::list. It'll require more memory than vector.
  2. Vector does what's called "doubling": when it runs out of space, it allocates twice the memory it currently has. To avoid that, use the std::vector::reserve() method; you can check the result with the std::vector::capacity() method (note that capacity() >= size()).

Since the number of lines is not known during execution, I see no simple algorithm to avoid the "doubling" issue. Per the comment by slavy13.myopenid.com, the solution is to move the information into another pre-reserved vector after you finish reading (the relevant question is How to downsize std::vector?); a sketch of that move follows.
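
The pre-C++11 idiom that question describes (VS2008 has no shrink_to_fit): copy-construct a temporary, which allocates only size() elements, then swap buffers with it.

#include <string>
#include <vector>

// "Swap trick": after this call, vec owns a buffer sized to its contents,
// and the old doubled buffer is freed when the temporary dies.
void shrinkToFit(std::vector<std::string> &vec)
{
    std::vector<std::string>(vec).swap(vec);
}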

羁客 2024-07-17 19:58:21


Try using a list instead of a vector. Vectors are (almost always) contiguous in memory.

Granted, the fact that the strings inside are (almost always) copy-on-modify and reference-counted should make that less of a problem, but it might help.

你的呼吸 2024-07-17 19:58:21


I don't know if this is relevant because I don't really know what your file looks like.

But you should be aware that std::string is likely to have a considerable space overhead when storing a very short string. And if you're individually new-ing up char* for very short strings, you're also going to see all the allocation block overhead.

How many strings are you putting into that vector, and what's their average length?
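
A quick probe for that overhead, if you want to measure rather than guess (numbers vary by compiler and build):

#include <iostream>
#include <string>

int main()
{
    std::string s("ab");   // a typical very short line
    std::cout << "sizeof(std::string): " << sizeof(std::string) << "\n"
              << "capacity reserved for 2 chars: " << s.capacity() << "\n";
    return 0;
}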

り繁华旳梦境 2024-07-17 19:58:21


Maybe you should elaborate on why you need to read the whole file into memory; I suspect there is a way to do what you want without reading it all in at once. If you really need this functionality, look into memory-mapped files, which are probably going to be more efficient than writing the equivalent yourself. Your internal data structure can then use offsets into the file. Btw, be sure to check whether you need to handle character encoding.

青巷忧颜 2024-07-17 19:58:21


I find that the best way to handle lines is to read-only memory-map the file. Do not bother writing in \0 for \n; instead use pairs of const char*s, like std::pair<const char*, const char*>, or a const char* and a count. If you need to edit the lines, a good way to do it is to make an object that can store either a pointer pair or a std::string with the modified line.

As for saving memory with STL vectors or deques, a good technique is to let the container double until you are done adding to it, then resize it to its real size, which should free the unused memory back to the heap allocator. The memory may still be allocated to the program, though I wouldn't worry about that. Also, instead of starting from the default size, get the file size in bytes first, divide by your best guess at the average characters per line, and reserve that much space up front, as sketched below.
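
For instance (reserveForLines and the parameters are illustrative; 40 is just a guess at average line length):

#include <cstddef>
#include <string>
#include <vector>

// Reserve from the file size and a guessed average line length so the
// vector rarely needs to double while lines are appended.
void reserveForLines(std::vector<std::string> &vec,
                     std::size_t fileSizeBytes,
                     std::size_t avgCharsPerLine)
{
    vec.reserve(fileSizeBytes / avgCharsPerLine);
}

// e.g. reserveForLines(vec, 200 * 1024 * 1024, 40); for a 200 MB file.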

请止步禁区 2024-07-17 19:58:21


You should know that because you declared fileStream as static, it never goes out of scope, meaning the file isn't closed until the very last moment of execution. That certainly ties up some memory. You could explicitly close it just before that last str2Vec to try to help the situation.

Also, you open and close the same file multiple times; just open it once and pass it around by reference (resetting the stream state if needed). Though I imagine you could pull off what you need with a single pass over the file.

Heck, I doubt you really need to know the file size like you do here; you could just read chunk-sized amounts until you get a short read (at which point you are done), as sketched below.

Why don't you explain the goal of the code? I feel a much simpler solution is possible.

小巷里的女流氓 2024-07-17 19:58:21


Growing vectors with push_back() will cause memory fragmentation and inefficient memory usage. I'd try using lists instead, and only create a vector (if you need one) once you know exactly how many elements it will require.
