System-portable way to change the maximum number of open files in C/C++
I have a C++ program that transposes a very large matrix. The matrix is too large to hold in memory, so I was writing each column to a separate temporary file, and then concatenating the temporary files once the whole matrix has been processed. However, I am now finding that I am running up against the problem of having too many open temporary files (i.e. the OS doesn't allow me to open enough temporary files). Is there a system portable method for checking (and hopefully changing) the maximum number of allowed open files?
I realise I could close each temp file and reopen only when needed, but am worried about the performance impact of doing this.
My code works as follows (pseudocode - not guaranteed to work):
const unsigned int Ncol = 5000;  // For example - could be much bigger.
const unsigned int Nrow = 50000; // For example - in reality much bigger.

// Stage 1 - create temp files
vector<ofstream *> tmp_files(Ncol);  // Vector of temp file pointers.
vector<string> tmp_filenames(Ncol);  // Vector of temp file names.
for (unsigned int ui = 0; ui < Ncol; ui++)
{
    string filename(tmpnam(NULL)); // Get temp filename.
    ofstream *tmp_file = new ofstream(filename.c_str());
    if (!tmp_file->good())
        error("Could not open temp file.\n"); // Call error function
    (*tmp_file) << "Column" << ui;
    tmp_files[ui] = tmp_file;
    tmp_filenames[ui] = filename;
}

// Stage 2 - read input file and write each column to temp file
ifstream input_file(input_filename.c_str());
for (unsigned int s = 0; s < Nrow; s++)
{
    int input_num;
    for (unsigned int ui = 0; ui < Ncol; ui++)
    {
        input_file >> input_num;
        ofstream *tmp_file = tmp_files[ui]; // Get temp file pointer
        (*tmp_file) << "\t" << input_num;   // Write entry to temp file.
    }
}
input_file.close();

// Stage 3 - concatenate temp files into output file and clean up.
ofstream output_file("out.txt");
for (unsigned int ui = 0; ui < Ncol; ui++)
{
    string tmp_line;
    // Close temp file and free the stream allocated in stage 1.
    ofstream *tmp_file = tmp_files[ui];
    (*tmp_file) << endl;
    tmp_file->close();
    delete tmp_file;
    // Read from temp file and write to output file.
    ifstream read_file(tmp_filenames[ui].c_str());
    if (!read_file.good())
        error("Could not open tmp file for reading."); // Call error function
    getline(read_file, tmp_line);
    output_file << tmp_line << endl;
    read_file.close();
    // Delete temp file.
    remove(tmp_filenames[ui].c_str());
}
output_file.close();
Many thanks in advance!
Adam
Comments (5)
There are at least two limits. One of them can be changed with ulimit, within the bounds allowed by the sysadmin.
A better solution is to avoid having so many open files. In one of my own programs, I wrote a wrapper around the file abstraction (this was in Python, but the principle is the same in C), which keeps track of the current file position in each file, and opens/closes files as needed, keeping a pool of currently-open files.
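A rough C++ sketch of the pool idea from this answer, under a few assumptions: `FilePool`, `max_open`, `append` and `open_count` are names made up for illustration, and the pool only handles append-only output. Because every file is reopened in append mode, the "current position" is always the end of the file and does not need to be tracked separately. The least recently used stream is closed when the pool is full.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <list>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper: keeps at most max_open output streams open at once.
// Files are always opened in append mode, so a stream that was evicted and
// later reopened continues writing at the end of its file.
class FilePool {
public:
    explicit FilePool(std::size_t max_open) : max_open_(max_open) {
        assert(max_open > 0);
    }

    // Append text to the named file, opening (or reopening) it if needed.
    void append(const std::string& path, const std::string& text) {
        stream_for(path) << text;
    }

    // Number of streams currently open (never exceeds max_open).
    std::size_t open_count() const { return lru_.size(); }

private:
    using Entry = std::pair<std::string, std::ofstream>;

    std::ofstream& stream_for(const std::string& path) {
        auto it = index_.find(path);
        if (it != index_.end()) {
            // Already open: move it to the front (most recently used).
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        if (lru_.size() == max_open_) {
            // Evict the least recently used stream; destroying the
            // ofstream flushes and closes it.
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(path, std::ofstream(path, std::ios::app));
        if (!lru_.front().second) {
            lru_.pop_front();
            throw std::runtime_error("could not open " + path);
        }
        index_[path] = lru_.begin();
        return lru_.front().second;
    }

    std::size_t max_open_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Note that with the write order in the question (one entry to every column file per input row), a pool smaller than the number of columns would reopen a file on almost every write. Buffering a batch of rows per column in memory and appending each batch in one call would probably be needed to make this pay off.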
There isn't a portable way to change the max number of open files. Limits like this tend to be imposed by the operating system and are therefore OS-specific.
Your best bet is to reduce the number of files you have open at any one time.
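To illustrate how OS-specific this is, here is a minimal sketch for POSIX systems only, using `getrlimit` and `setrlimit` with `RLIMIT_NOFILE` (roughly what `ulimit -n` does from a shell). The function names `current_open_file_limit` and `raise_open_file_limit` are made up for this example, and none of it compiles on Windows, where a different API would be needed.

```cpp
#include <sys/resource.h>  // POSIX only: getrlimit, setrlimit, RLIMIT_NOFILE

// Returns the current soft limit on open file descriptors for this process,
// or 0 if the limit could not be queried.
unsigned long long current_open_file_limit() {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;
    return lim.rlim_cur;
}

// Tries to raise the soft limit to `wanted`, capped at the hard limit set by
// the administrator. Returns the soft limit in effect afterwards.
unsigned long long raise_open_file_limit(unsigned long long wanted) {
    struct rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;

    rlim_t target = wanted;
    if (lim.rlim_max != RLIM_INFINITY && target > lim.rlim_max)
        target = lim.rlim_max;  // an unprivileged process cannot exceed this

    if (target > lim.rlim_cur) {
        lim.rlim_cur = target;
        setrlimit(RLIMIT_NOFILE, &lim);  // may still fail; re-query below
    }
    return current_open_file_limit();
}
```

Even when the call succeeds, the result is capped by the hard limit, so the program still has to cope with getting fewer descriptors than it asked for.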
You could normalize the input file into a temporary file, such that each entry occupies the same amount of characters. You might even consider saving that temporary file as binary (using 4/8 bytes per number instead of 1 byte per decimal digit). That way you can calculate the position of each entry in the file from its coordinates in the matrix. Then you can access specific entries by doing a std::istream::seekg and you don't have to concern yourself with a limit on the number of open files.
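A small sketch of this suggestion, assuming the entries fit in a 32-bit integer and the input is whitespace-separated text in row-major order. The function names (`normalize_to_binary`, `read_entry`, `write_transpose`) are hypothetical. Only two files are open at any time, whatever the number of columns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Stage 1: convert whitespace-separated text into fixed-width binary,
// 4 bytes per entry, keeping the row-major order of the input.
void normalize_to_binary(const std::string& text_path, const std::string& bin_path) {
    std::ifstream in(text_path);
    std::ofstream out(bin_path, std::ios::binary);
    if (!in || !out) throw std::runtime_error("could not open files");
    std::int32_t value;
    while (in >> value)
        out.write(reinterpret_cast<const char*>(&value), sizeof value);
}

// Reads entry (row, col) of a row-major matrix with ncol columns by
// computing its byte offset and seeking there.
std::int32_t read_entry(std::ifstream& bin, std::size_t ncol,
                        std::size_t row, std::size_t col) {
    assert(col < ncol);
    const auto offset =
        static_cast<std::streamoff>((row * ncol + col) * sizeof(std::int32_t));
    bin.seekg(offset);
    std::int32_t value = 0;
    if (!bin.read(reinterpret_cast<char*>(&value), sizeof value))
        throw std::runtime_error("read failed");
    return value;
}

// Stage 2: write the transpose as text, one original column per output line.
void write_transpose(const std::string& bin_path, const std::string& out_path,
                     std::size_t nrow, std::size_t ncol) {
    std::ifstream bin(bin_path, std::ios::binary);
    std::ofstream out(out_path);
    if (!bin || !out) throw std::runtime_error("could not open files");
    for (std::size_t c = 0; c < ncol; ++c) {
        for (std::size_t r = 0; r < nrow; ++r) {
            if (r != 0) out << '\t';
            out << read_entry(bin, ncol, r, c);
        }
        out << '\n';
    }
}
```

This does one seek and one 4-byte read per entry, which keeps the example short but is likely slow for a very large matrix; reading a block of rows or columns per seek would be the obvious refinement.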
How about just making 1 big file instead of many small temp files? Seek is a cheap operation. And your columns should all be the same size anyway. You should be able to position your file pointer right where you need it to access the column.
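One way this could look, as a sketch only: stream the input once and write each entry straight to its transposed position in a single binary file, so that column c occupies entries c*nrow to (c+1)*nrow - 1. It assumes 32-bit integer entries and that the dimensions are known up front; `transpose_into_single_file` and `read_column` are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a row-major text matrix and writes it column by column into one
// binary file, seeking to each entry's final position.
void transpose_into_single_file(const std::string& text_path,
                                const std::string& out_path,
                                std::size_t nrow, std::size_t ncol) {
    assert(nrow > 0 && ncol > 0);
    std::ifstream in(text_path);
    std::ofstream out(out_path, std::ios::binary | std::ios::trunc);
    if (!in || !out) throw std::runtime_error("could not open files");

    // Pre-size the output file by writing its last byte.
    const std::size_t total = nrow * ncol * sizeof(std::int32_t);
    out.seekp(static_cast<std::streamoff>(total - 1));
    out.put('\0');

    for (std::size_t r = 0; r < nrow; ++r) {
        for (std::size_t c = 0; c < ncol; ++c) {
            std::int32_t value;
            if (!(in >> value)) throw std::runtime_error("input ended early");
            out.seekp(static_cast<std::streamoff>((c * nrow + r) * sizeof value));
            out.write(reinterpret_cast<const char*>(&value), sizeof value);
        }
    }
}

// Reads one whole column back; it is contiguous in the transposed file.
std::vector<std::int32_t> read_column(const std::string& path,
                                      std::size_t nrow, std::size_t col) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("could not open " + path);
    std::vector<std::int32_t> column(nrow);
    in.seekg(static_cast<std::streamoff>(col * nrow * sizeof(std::int32_t)));
    if (!in.read(reinterpret_cast<char*>(column.data()),
                 static_cast<std::streamsize>(nrow * sizeof(std::int32_t))))
        throw std::runtime_error("read failed");
    return column;
}
```

Seeking before every 4-byte write defeats most of the stream's buffering, so in practice it may be worth collecting several rows in memory and writing a short run per column after each batch.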
"The matrix is too large to hold in memory". It's very likely that the matrix will fit in your address space, though. (If the matrix doesn't fit in 2^64 bytes, you'll need a very impressive file system to hold all those temporary files.) So, don't worry about temporary files. Let the OS handle how swap to disk works. You just need to make sure that you access memory in a way that's swap-friendly. In practice, that means you need to have some locality of reference. But with 16 GB of RAM, you can have ~4 million pages of RAM mapped in. If your number of columns is significantly smaller than that, there should be no problem.
(Don't use 32 bit systems for this; it's just not worth the pain)
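A closely related variant of this idea, sketched for POSIX systems only: instead of relying on swap, map the input and output files into the address space with `mmap` and let the OS page them in and out. This swaps in file-backed mappings for the anonymous memory the answer describes, and it assumes the matrix has already been stored as raw 32-bit integers in row-major order. `map_file`, `transpose_mapped` and the tile size of 64 are choices made for this example; working tile by tile is one simple way to get the locality of reference mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <stdexcept>
#include <string>
#include <sys/mman.h>
#include <unistd.h>

// Maps `bytes` bytes of a file. When writable is true the file is created
// if necessary and resized to exactly `bytes`.
void* map_file(const std::string& path, std::size_t bytes, bool writable) {
    const int fd = open(path.c_str(), writable ? (O_RDWR | O_CREAT) : O_RDONLY, 0600);
    if (fd < 0) throw std::runtime_error("open failed: " + path);
    if (writable && ftruncate(fd, static_cast<off_t>(bytes)) != 0) {
        close(fd);
        throw std::runtime_error("ftruncate failed: " + path);
    }
    void* p = mmap(nullptr, bytes, writable ? (PROT_READ | PROT_WRITE) : PROT_READ,
                   MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
    return p;
}

// Transposes a row-major nrow x ncol int32 matrix stored in in_path into
// out_path (ncol x nrow), one tile at a time to keep accesses local.
void transpose_mapped(const std::string& in_path, const std::string& out_path,
                      std::size_t nrow, std::size_t ncol) {
    assert(nrow > 0 && ncol > 0);
    const std::size_t bytes = nrow * ncol * sizeof(std::int32_t);
    auto* src = static_cast<const std::int32_t*>(map_file(in_path, bytes, false));
    auto* dst = static_cast<std::int32_t*>(map_file(out_path, bytes, true));

    const std::size_t tile = 64;  // arbitrary tile edge for this sketch
    for (std::size_t r0 = 0; r0 < nrow; r0 += tile)
        for (std::size_t c0 = 0; c0 < ncol; c0 += tile)
            for (std::size_t r = r0; r < std::min(r0 + tile, nrow); ++r)
                for (std::size_t c = c0; c < std::min(c0 + tile, ncol); ++c)
                    dst[c * nrow + r] = src[r * ncol + c];

    munmap(const_cast<std::int32_t*>(src), bytes);
    munmap(dst, bytes);
}
```

Only two descriptors are opened, and both are closed again right after mapping, so the open-file limit does not come into play. Error handling is kept minimal here; for instance, the input mapping is not released if mapping the output fails.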