Creating small parquet files with dates and timestamps from the C++ API

I am trying to create, from a C++ program, a .parquet file that is as small as possible.

I would like to use the parquet::StreamWriter.

TL;DR: What are the best compression / encoder settings and how should the columns (parquet::schema::PrimitiveNode) be declared to get the smallest file?

It should be possible to get a file roughly the same size as a CSV with the same data compressed with gzip.

My file contains a date, a datetime (or timestamp, I guess), an integer and a float per row. It will have to be opened from R, and the date and timestamp have to come through as date/datetime values without having to transform the columns afterwards.

For the time being I'm setting the date as a timestamp. I think I am being a bit thick, but I don't get the docs: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, and in the tests and examples: https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/arrow-reader-writer-test.cc there seems to be nothing like this, oddly enough.
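
From my reading of the LogicalTypes doc, a plain date would be an INT32 column holding days since the Unix epoch, flagged with the DATE converted type, so I imagine the declaration would look something like the sketch below ("date" is just a placeholder name, and I have not verified whether the StreamWriter will then accept the values through its int32 operator<<):

// Untested sketch: a real date column as INT32 days-since-epoch with the DATE
// converted type (per LogicalTypes.md). How to feed it through parquet::StreamWriter
// is exactly the part I am unsure about.
columns.push_back(parquet::schema::PrimitiveNode::Make(
    "date", parquet::Repetition::REQUIRED,
    parquet::Type::INT32, parquet::ConvertedType::DATE));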

A toy example:

#include <iostream>
#include <fstream>
#include <chrono>   // std::chrono::milliseconds
#include <memory>   // std::shared_ptr
#include <string>
#include <vector>

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

int main() {
    for (auto i : std::vector<int>({1, 2})) {
        std::cout << i << std::endl;
        std::string id("_gzip_compressor_delta_binary_packet_encoding");
        std::shared_ptr<::arrow::io::FileOutputStream> outfile;
        std::string file_path = "/tmp/test." + std::to_string(i) + id + ".parquet";
        PARQUET_ASSIGN_OR_THROW(outfile, ::arrow::io::FileOutputStream::Open(file_path));

        // Schema: unsigned integer, float, and two millisecond timestamps
        // (the date is stored as a timestamp for now).
        parquet::schema::NodeVector columns {};
        columns.push_back(parquet::schema::PrimitiveNode::Make("integer", parquet::Repetition::REQUIRED,
                                                               parquet::Type::INT64, parquet::ConvertedType::UINT_64));
        columns.push_back(
            parquet::schema::PrimitiveNode::Make("float", parquet::Repetition::REQUIRED, parquet::Type::FLOAT));
        columns.push_back(parquet::schema::PrimitiveNode::Make("timestamp", parquet::Repetition::REQUIRED,
                                                               parquet::Type::INT64,
                                                               parquet::ConvertedType::TIMESTAMP_MILLIS));
        columns.push_back(parquet::schema::PrimitiveNode::Make("date_as_timestamp", parquet::Repetition::REQUIRED,
                                                               parquet::Type::INT64,
                                                               parquet::ConvertedType::TIMESTAMP_MILLIS));

        auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
            parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columns));

        // File-wide writer properties: GZIP compression, DELTA_BINARY_PACKED encoding.
        parquet::WriterProperties::Builder builder;
        builder.compression(parquet::Compression::GZIP);
        builder.encoding(parquet::Encoding::DELTA_BINARY_PACKED);
        parquet::StreamWriter parquet_writer {parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

        // A CSV with the integer column, written alongside for the size comparison.
        std::ofstream csv_file("/tmp/test." + std::to_string(i) + ".csv");
        for (uint64_t number = 1; number <= 1000000; ++number) {
            parquet_writer << number;
            float float_number(number * 0.5);
            parquet_writer << float_number;
            std::chrono::milliseconds timestamp{1000ull + number};
            parquet_writer << timestamp;
            std::chrono::milliseconds date_as_timestamp{1000ull + number};
            parquet_writer << date_as_timestamp << parquet::EndRow;
            csv_file << number << std::endl;
        }
        csv_file.close();
        parquet_writer.EndRowGroup();
    }
    std::cout << "Ended ok" << std::endl;
    return 0;
}

The sizes I'm getting are:

    6888896 may 22 12:35 test.1.csv
    6888896 may 22 12:35 test.2.csv
    6859569 may 22 13:06 test.1_gzip_compressor_delta_binary_packet_encoding.parquet
    6859569 may 22 13:06 test.2_gzip_compressor_delta_binary_packet_encoding.parquet

Compressing with gzip -k test.2.csv:

    2129154 may 22 13:06 test.2.csv.gz

Opening and saving with R (saveRDS):

    8925791 may 22 13:11 test.1.rds

I have tried several compression and encoding settings with little success... I am pretty sure I am messing something up with the date and timestamp columns, because if I only use an integer column I get a very reasonable file size.
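
For example, one variant I was going to try sets the properties per column instead of file-wide, roughly like the sketch below; I am assuming here that the column-path overloads of encoding() / compression() and the compression_level() setter exist in my build and behave the way I think they do:

// Untested sketch: per-column encodings plus a higher gzip effort. The column-path
// overloads and compression_level() are assumptions on my part.
parquet::WriterProperties::Builder builder;
builder.compression(parquet::Compression::GZIP);
builder.compression_level(9);  // assuming 9 means maximum gzip effort
builder.encoding("integer", parquet::Encoding::DELTA_BINARY_PACKED);
builder.encoding("timestamp", parquet::Encoding::DELTA_BINARY_PACKED);
builder.encoding("date_as_timestamp", parquet::Encoding::DELTA_BINARY_PACKED);
builder.encoding("float", parquet::Encoding::PLAIN);  // leave the float bytes to gzip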

Example of output:

df <- read_parquet("/tmp/test.1_gzip_compressor_delta_binary_packet_encoding.parquet")
head(df)

  integer float           timestamp   date_as_timestamp
1       1   0.5 1970-01-01 00:00:01 1970-01-01 00:00:01
2       2   1.0 1970-01-01 00:00:01 1970-01-01 00:00:01
3       3   1.5 1970-01-01 00:00:01 1970-01-01 00:00:01
4       4   2.0 1970-01-01 00:00:01 1970-01-01 00:00:01
5       5   2.5 1970-01-01 00:00:01 1970-01-01 00:00:01
6       6   3.0 1970-01-01 00:00:01 1970-01-01 00:00:01
