Create a small file with date and timestamp columns from the C++ API
I am trying to create, from a C++ program, a .parquet file as small as possible.
I would like to use the parquet::StreamWriter.
TL;DR: What are the best compression / encoding settings, and how should the columns (parquet::schema::PrimitiveNode) be declared, to get the smallest file?
It should be possible to get a file roughly the same size as a gzip-compressed CSV containing the same data.
My file contains a date, a datetime (or timestamp, I guess), an integer and a float per row. It will have to be opened from R, and the date and timestamp have to come out as datetimes without transforming the columns afterwards.
For the time being I am setting the date as a timestamp. I think I am being a bit thick, but I don't get the docs: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, and in the tests and examples: https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/arrow-reader-writer-test.cc there seems to be nothing like this, oddly enough.
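From my reading of LogicalTypes.md, a plain date would be an INT32 column with the DATE converted type (days since the Unix epoch). A minimal, untested sketch of just that declaration (this is not what the toy example below uses, and I am not sure StreamWriter has an operator<< that matches it):

#include "parquet/schema.h"

// Untested sketch: per LogicalTypes.md, a DATE column is an INT32 physical
// column whose values are days since the Unix epoch (1970-01-01).
parquet::schema::NodePtr MakeDateColumn() {
  return parquet::schema::PrimitiveNode::Make(
      "date", parquet::Repetition::REQUIRED,
      parquet::Type::INT32, parquet::ConvertedType::DATE);
}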
A toy example:
#include <chrono>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

int main() {
  for (auto i : std::vector<int>({1, 2})) {
    std::cout << i << std::endl;
    std::string id("_gzip_compressor_delta_binary_packet_encoding");
    std::shared_ptr<::arrow::io::FileOutputStream> outfile;
    std::string file_path = "/tmp/test." + std::to_string(i) + id + ".parquet";
    PARQUET_ASSIGN_OR_THROW(outfile, ::arrow::io::FileOutputStream::Open(file_path));

    // Four columns: an unsigned integer, a float and two timestamps
    // (the date is stored as a timestamp for now).
    parquet::schema::NodeVector columns{};
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "integer", parquet::Repetition::REQUIRED, parquet::Type::INT64,
        parquet::ConvertedType::UINT_64));
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "float", parquet::Repetition::REQUIRED, parquet::Type::FLOAT));
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "timestamp", parquet::Repetition::REQUIRED, parquet::Type::INT64,
        parquet::ConvertedType::TIMESTAMP_MILLIS));
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "date_as_timestamp", parquet::Repetition::REQUIRED, parquet::Type::INT64,
        parquet::ConvertedType::TIMESTAMP_MILLIS));
    auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
        parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columns));

    // Writer settings: gzip compression, delta-binary-packed encoding for all columns.
    parquet::WriterProperties::Builder builder;
    builder.compression(parquet::Compression::GZIP);
    builder.encoding(parquet::Encoding::DELTA_BINARY_PACKED);

    parquet::StreamWriter parquet_writer{
        parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

    // Also write the integer column to a CSV for a size comparison.
    std::ofstream csv_file("/tmp/test." + std::to_string(i) + ".csv");
    for (uint64_t number = 1; number <= 1000000; ++number) {
      parquet_writer << number;
      float float_number(number * 0.5);
      parquet_writer << float_number;
      std::chrono::milliseconds timestamp{1000ull + number};
      parquet_writer << timestamp;
      std::chrono::milliseconds date_as_timestamp{1000ull + number};
      parquet_writer << date_as_timestamp << parquet::EndRow;
      csv_file << number << std::endl;
    }
    csv_file.close();
    parquet_writer.EndRowGroup();
  }
  std::cout << "Ended ok" << std::endl;
  return 0;
}
The sizes I'm getting are:
6888896 may 22 12:35 test.1.csv
6888896 may 22 12:35 test.2.csv
6859569 may 22 13:06 test.1_gzip_compressor_delta_binary_packet_encoding.parquet
6859569 may 22 13:06 test.2_gzip_compressor_delta_binary_packet_encoding.parquet
Compressing with gzip -k test.2.csv:
2129154 may 22 13:06 test.2.csv.gz
Opening and saving with R (saveRDS):
8925791 may 22 13:11 test.1.rds
I have tried several compression and encoding settings with little success... I am pretty sure I am messing up the date and timestamp columns, because if I use only an integer column I get a very reasonable output.
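By "settings" I mean variations along these lines (a sketch only, assuming the string-path per-column overloads of WriterProperties::Builder; the column names are the ones from the toy example above):

#include <memory>

#include "parquet/properties.h"

// Sketch of per-column writer settings (assumes the string-path overloads of
// WriterProperties::Builder; column names match the toy example above).
std::shared_ptr<parquet::WriterProperties> MakeProperties() {
  parquet::WriterProperties::Builder builder;
  builder.compression(parquet::Compression::GZIP);                      // default codec for every column
  builder.encoding("integer", parquet::Encoding::DELTA_BINARY_PACKED);  // delta-pack the monotonically increasing ints
  builder.disable_dictionary("float");                                  // plain encoding for the float column
  return builder.build();
}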
Example of output:
df <- read_parquet("/tmp/test.1_gzip_compressor_delta_binary_packet_encoding.parquet")
head(df)
integer float timestamp date_as_timestamp
1 1 0.5 1970-01-01 00:00:01 1970-01-01 00:00:01
2 2 1.0 1970-01-01 00:00:01 1970-01-01 00:00:01
3 3 1.5 1970-01-01 00:00:01 1970-01-01 00:00:01
4 4 2.0 1970-01-01 00:00:01 1970-01-01 00:00:01
5 5 2.5 1970-01-01 00:00:01 1970-01-01 00:00:01
6 6 3.0 1970-01-01 00:00:01 1970-01-01 00:00:01