C++ program runs out of memory when processing large data
I am trying to solve an issue in a C++ program I wrote: I am basically running out of memory. The program is a cache simulator. There is a file with memory addresses collected beforehand, like this:
Thread Address Type Size Instruction Pointer
0 0x7fff60000000 1 8 0x7f058c482af3
There can be 100-500 billion such entries. First, I try to read all of those entries and store them in a vector. While reading, I also build up a set of these addresses (using a map) and store the sequence numbers of each address. A sequence number simply means the position of the address entry in the file (one address can be seen multiple times). For large inputs the program fails during this step with a bad_alloc at around the 30 millionth entry, so I guess I am running out of memory. Please advise on how I can circumvent the problem; is there an alternative way to handle this kind of large data? Thank you very much, and sorry for the long post. I wanted to give some context along with the actual code I am writing.
Below is the relevant code. ParseTraceFile() reads each line and calls StoreTokens(), which extracts the address and size and calls AddAddress(), which actually stores the address in a vector and a map. The class declaration is also given below. The first try block in AddAddress() actually throws the bad_alloc exception.
void AddressList::ParseTraceFile(const char* filename) {
std::ifstream in_file;
std::cerr << "Reading Address Trace File..." << std::endl;
in_file.exceptions(std::ifstream::failbit | std::ifstream::badbit);
char *contents = NULL;
try {
in_file.open(filename, std::ifstream::in | std::ifstream::binary);
in_file.seekg(0, std::ifstream::end);
std::streampos length(in_file.tellg());
if (length < 0) {
std::cerr << "Can not read input file length" << std::endl;
throw ExitException(1);
}
contents = (new char[length]);
in_file.seekg(0, std::ifstream::beg);
in_file.read(contents, length);
in_file.close();
uint64_t linecount = 0, i = 0, lastline = 0, startline = 0;
while (i < static_cast<uint64_t>(length)) {
if (contents[i] == '\n') {  // EOF is not a character value and never appears in the buffer
contents[i] = '\0';
lastline = startline;
startline = i + 1;
++linecount;
if (linecount > 1) {
StoreTokens((contents + lastline), &linecount);
}
}
++i;
}
} catch (std::bad_alloc& e) {
delete [] contents;
std::cerr << "error allocating memory while parsing" << std::endl;
throw;
} catch (std::ifstream::failure &exc1) {
if (!in_file.eof()) {
delete[] contents;
std::cerr << "error in reading address trace file" << exc1.what()
<< std::endl;
throw ExitException(1);
}
}
delete [] contents;  // was leaked on the success path
std::cerr << "Done" << std::endl;
}
//=========================================================
void AddressList::StoreTokens(char* line, uint64_t * const linecount) {
uint64_t address = 0, size = 0;  // initialize in case a line is malformed
char *token = strtok(line, " \t");
uint8_t tokencount = 0;
while (NULL != token) {
++tokencount;
switch (tokencount) {
case 1:
break;
case 2:
address = strtoul(token, NULL, 16);
break;
case 3:
break;
case 4:
size = strtoul(token, NULL, 0);
break;
case 5:
break;
default:
break;
}
token = strtok(NULL, " \t");
}
AddAddress(address, size);
}
//================================================================
void AddressList::AddAddress(const uint64_t& byteaddr, const uint64_t& size) {
//allocate memory for the address vector
try {
// reserve ahead of the current size; reserve(kReserveCount) alone is a
// no-op once capacity already exceeds kReserveCount
if ((sequence_no_ % kReserveCount) == 0)
  address_list_.reserve(address_list_.size() + kReserveCount);
} catch (std::bad_alloc& e) {
std::cerr
<< "error allocating memory for address trace vector, address count"
<< sequence_no_ << std::endl;
throw;
}
uint64_t offset = byteaddr & (CacheParam::Instance()->LineSize() - 1);
//lineaddress = byteaddr >> CacheParam::Instance()->BitsForLine();
// this try block is for allocating memory for the address set and the queue it holds
try {
// splitter
uint64_t templinesize = 0;
do {
Address temp_addr(byteaddr + templinesize);
address_list_.push_back(temp_addr);
address_set_[temp_addr.LineAddress()].push(sequence_no_++);
templinesize = templinesize + CacheParam::Instance()->LineSize();
} while (size + offset > templinesize);
} catch (std::bad_alloc& e) {
address_list_.pop_back();
std::cerr
<< "error allocating memory for address trace set, address count"
<< sequence_no_ << std::endl;
throw;
}
}
//======================================================
typedef std::queue<uint64_t> TimeStampQueue;
typedef std::map<uint64_t, TimeStampQueue> AddressSet;
class AddressList {
public:
AddressList(const char* tracefilename);
bool Simulate(uint64_t *hit_count, uint64_t* miss_count);
~AddressList();
private:
void AddAddress(const uint64_t& byteaddr, const uint64_t& size);
void ParseTraceFile(const char* filename);
void StoreTokens(char* line, uint64_t * const linecount);
std::vector<Address> address_list_;
AddressSet address_set_;
uint64_t sequence_no_;
CacheMemory cache_;
AddressList (const AddressList&);
AddressList& operator=(const AddressList&);
};
The output is like this:
Reading Cache Configuration File...
Cache parameters read...
Reading Address Trace File...
error allocating memory for address trace set, address count 30000000
error allocating memory while parsing
3 Answers
As it seems your datasets will be much larger than your memory, you will have to write an on-disk index. It is probably easiest to import the whole thing into a database and let that build the indexes for you.
A map sorts its input while it is being populated, to optimize lookup times and to provide a sorted output. It sounds like you aren't using the lookup feature, so the optimal strategy is to sort the list using another method. Merge sorting is fantastic for sorting collections that don't fit into memory. Even if you are doing lookups, a binary search into a sorted file will be faster than a naive approach as long as each record is a fixed size.
Forgive me for stating what may be obvious, but the need to store and query large amounts of data in an efficient manner is the exact reason databases were invented. They have already solved all these issues in a better way than you or I could come up with in a reasonable amount of time. No need to reinvent the wheel.