std::execution interrupts file writing

Posted on 2025-01-26 20:09:59 | 4053 words | 1 view | 0 comments

When I use std::execution, the program seems to exit without waiting for the file system to apply the pending changes.

I recently tried PPL, TBB, OpenMP and std::execution to find the fastest parallel library for a particular job running on my machine. The job is to recursively convert some files into another form, which is basically:

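#include <algorithm>   // for_each
#include <cstdint>     // uint8_t
#include <execution>   // only needed for the std::execution variant
#include <iostream>
#include <filesystem>
#include <fstream>
#include <vector>
#include <iterator>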

using namespace std;
namespace fs = std::filesystem;

const auto CurrentPath = fs::current_path();

fs::path GetOutputPath(const fs::directory_entry& Entry)
{
    return CurrentPath / "Converted" / fs::relative(Entry, CurrentPath);
}

// Convert and save a file entry.
void ConvertFile(const fs::directory_entry& Entry)
{
    ifstream IStream(Entry.path(), ios::in | ios::binary);
    noskipws(IStream);

    ofstream OStream(GetOutputPath(Entry), ios::out | ios::trunc | ios::binary);

    // Read the data.
    // Intentional `uint8_t` for some special needs.
    vector<uint8_t> Data;
    Data.reserve(Entry.file_size());
    Data.assign(istream_iterator<uint8_t>(IStream), {});

    // Some changes to `Data`.
    // Left blank to simplify the problem.

    // Write to the output file.
    // `Data.data()` is safe even when the input file is empty.
    OStream.write(reinterpret_cast<const char *>(Data.data()), static_cast<streamsize>(Data.size()));
}

// Convert and save a directory entry.
void ConvertDirectory(const fs::directory_entry& Entry);

// Convert and save an entry.
void Convert(const fs::directory_entry& Entry)
{
    // Recursively loop directories
    if (Entry.is_directory())
    {
        ConvertDirectory(Entry);
    }
    else
    {
        ConvertFile(Entry);
    }
}

void ConvertDirectory(const fs::directory_entry& Entry)
{
    // Not using `recursive_directory_iterator` since its order is not guaranteed.
    // I think manual recursion can always minimize the number of calls to create a directory.
    vector<fs::directory_entry> SubEntries(fs::directory_iterator(Entry), {});
    fs::create_directory(GetOutputPath(Entry));

    // ** Parallel ** part.
    for_each(SubEntries.cbegin(), SubEntries.cend(), Convert);
}

int main(int argc, char *argv[])
{
    ConvertDirectory(fs::directory_entry(CurrentPath / "Test"));
}

With PPL (ppl.h), I change the ** Parallel ** part to:

    concurrency::parallel_for_each(SubEntries.cbegin(), SubEntries.cend(), Convert);

With TBB (oneapi/tbb.h):

    tbb::parallel_for_each(SubEntries.cbegin(), SubEntries.cend(), Convert);

With std::execution:

    for_each(execution::par, SubEntries.cbegin(), SubEntries.cend(), Convert);

Results:
Test 1: 700 files in 300 directories, 1.5 MB per file; each cell is the average of 10 runs
Test 2: 50k files in 5k directories, 0.15 MB per file; each cell is the average of 20 runs

	Test 1	Test 2
(Original)	2.8s	130s
PPL	0.59s	37s
TBB	0.58s	36.8s
std::execution	0.55s	32.6s
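
For reference, averages like the ones in the table could be collected with a small helper along these lines. This is only a sketch: the post does not say how the timings were taken, and `AverageSeconds` is a hypothetical name, not something from the program above. It measures wall-clock time with `std::chrono::steady_clock` and assumes the caller resets the output directory between runs.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not in the original program): runs `Job` `Runs` times
// and returns the mean wall-clock duration of one run, in seconds.
double AverageSeconds(int Runs, const std::function<void()>& Job)
{
    assert(Runs > 0);
    using Clock = std::chrono::steady_clock;

    double Total = 0.0;
    for (int i = 0; i < Runs; ++i)
    {
        const auto Start = Clock::now();
        Job();
        Total += std::chrono::duration<double>(Clock::now() - Start).count();
    }
    return Total / Runs;
}
```

A call would wrap the conversion in a lambda and pass it as `Job`, with 10 runs for Test 1 and 20 for Test 2.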

The improvements are quite obvious.
But the problem is that when I check the output files after each run, the std::execution version sometimes produces as few as 45k output files in Test 2. If I print a message after each write(), the problem still exists, while the number of messages is always 50k. So every file is processed, and the problem seems more likely to be that the write calls are not fully applied to the file system by the time the program exits.

The number of directories is always correct, and the PPL and TBB versions don't have such a problem.
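
One way to narrow this down is to check the stream state instead of only counting calls, since a message printed after write() appears even when the stream failed to open. The sketch below assumes the missing files might come from a failed open or write rather than from the file system lagging behind; that is only a guess, not a confirmed cause. `ConvertFileChecked`, `FailedOpens` and `FailedWrites` are hypothetical names introduced for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <ios>
#include <iterator>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical counters (not in the original program); atomic so that
// they can be incremented safely from several worker threads.
std::atomic<std::size_t> FailedOpens{0};
std::atomic<std::size_t> FailedWrites{0};

// Copies Input to Output unchanged and reports whether every step succeeded.
bool ConvertFileChecked(const fs::path& Input, const fs::path& Output)
{
    std::ifstream IStream(Input, std::ios::in | std::ios::binary);
    std::ofstream OStream(Output, std::ios::out | std::ios::trunc | std::ios::binary);
    if (!IStream.is_open() || !OStream.is_open())
    {
        ++FailedOpens;
        return false;
    }

    // Read the whole input file into memory.
    std::vector<char> Data((std::istreambuf_iterator<char>(IStream)),
                           std::istreambuf_iterator<char>());

    OStream.write(Data.data(), static_cast<std::streamsize>(Data.size()));
    OStream.close(); // flush here so that a failed flush shows up in the stream state
    if (OStream.fail())
    {
        ++FailedWrites;
        return false;
    }
    return true;
}
```

If both counters stay at zero after a run that still ends up with fewer than 50k files, the streams themselves are probably not the culprit.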

I'm using Visual Studio 2022.
Did I write anything wrong? What can I do to prevent this from happening with std::execution?
