Create a small file with date and timestamp columns from the C++ API
I am trying to create, from a C++ program, a .parquet file as small as possible.
I would like to use the parquet::StreamWriter.
TL;DR: What are the best compression / encoding settings, and how should the columns (parquet::schema::PrimitiveNode) be declared, to get the smallest file?
It should be possible to get a file roughly the same size as a gzip-compressed CSV containing the same data.
My file contains a date, a datetime (or timestamp, I guess), an integer and a float per row. It will have to be opened from R, and the date and timestamp have to come out as datetimes without transforming the columns afterwards.
For the time being I am setting the date as a timestamp. I think I am being a bit thick, but I don't get the docs: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, and in the tests and examples: https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/arrow-reader-writer-test.cc there seems to be nothing like this, oddly enough.
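From my reading of LogicalTypes.md, a plain date would be an INT32 column with the DATE converted type (days since the Unix epoch). A minimal, untested sketch of just that declaration (this is not what the toy example below uses, and I am not sure StreamWriter has an operator<< that matches it):

#include "parquet/schema.h"

// Untested sketch: per LogicalTypes.md, a DATE column is an INT32 physical
// column whose values are days since the Unix epoch (1970-01-01).
parquet::schema::NodePtr MakeDateColumn() {
  return parquet::schema::PrimitiveNode::Make(
      "date", parquet::Repetition::REQUIRED,
      parquet::Type::INT32, parquet::ConvertedType::DATE);
}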
A toy example:
#include <chrono>
#include <fstream>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

int main() {
  for (auto i : std::vector<int>({1, 2})) {
    std::cout << i << std::endl;
    std::string id("_gzip_compressor_delta_binary_packet_encoding");
    std::shared_ptr<::arrow::io::FileOutputStream> outfile;
    std::string file_path = "/tmp/test." + std::to_string(i) + id + ".parquet";
    PARQUET_ASSIGN_OR_THROW(outfile, ::arrow::io::FileOutputStream::Open(file_path));

    // Four columns: an unsigned integer, a float and two timestamps
    // (the date is stored as a timestamp for now).
    parquet::schema::NodeVector columns{};
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "integer", parquet::Repetition::REQUIRED, parquet::Type::INT64,
        parquet::ConvertedType::UINT_64));
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "float", parquet::Repetition::REQUIRED, parquet::Type::FLOAT));
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "timestamp", parquet::Repetition::REQUIRED, parquet::Type::INT64,
        parquet::ConvertedType::TIMESTAMP_MILLIS));
    columns.push_back(parquet::schema::PrimitiveNode::Make(
        "date_as_timestamp", parquet::Repetition::REQUIRED, parquet::Type::INT64,
        parquet::ConvertedType::TIMESTAMP_MILLIS));
    auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
        parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columns));

    // Writer settings: gzip compression, delta-binary-packed encoding for all columns.
    parquet::WriterProperties::Builder builder;
    builder.compression(parquet::Compression::GZIP);
    builder.encoding(parquet::Encoding::DELTA_BINARY_PACKED);

    parquet::StreamWriter parquet_writer{
        parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

    // Also write the integer column to a CSV for a size comparison.
    std::ofstream csv_file("/tmp/test." + std::to_string(i) + ".csv");
    for (uint64_t number = 1; number <= 1000000; ++number) {
      parquet_writer << number;
      float float_number(number * 0.5);
      parquet_writer << float_number;
      std::chrono::milliseconds timestamp{1000ull + number};
      parquet_writer << timestamp;
      std::chrono::milliseconds date_as_timestamp{1000ull + number};
      parquet_writer << date_as_timestamp << parquet::EndRow;
      csv_file << number << std::endl;
    }
    csv_file.close();
    parquet_writer.EndRowGroup();
  }
  std::cout << "Ended ok" << std::endl;
  return 0;
}
The sizes I'm getting are:
6888896 may 22 12:35 test.1.csv
6888896 may 22 12:35 test.2.csv
6859569 may 22 13:06 test.1_gzip_compressor_delta_binary_packet_encoding.parquet
6859569 may 22 13:06 test.2_gzip_compressor_delta_binary_packet_encoding.parquet
Compressing with gzip -k test.2.csv:
2129154 may 22 13:06 test.2.csv.gz
Opening and saving with R (saveRDS):
8925791 may 22 13:11 test.1.rds
I have tried several compression and encoding settings with little success... I am pretty sure I am messing up the date and timestamp columns, because if I use only an integer column I get a very reasonable output.
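By "settings" I mean variations along these lines (a sketch only, assuming the string-path per-column overloads of WriterProperties::Builder; the column names are the ones from the toy example above):

#include <memory>

#include "parquet/properties.h"

// Sketch of per-column writer settings (assumes the string-path overloads of
// WriterProperties::Builder; column names match the toy example above).
std::shared_ptr<parquet::WriterProperties> MakeProperties() {
  parquet::WriterProperties::Builder builder;
  builder.compression(parquet::Compression::GZIP);                      // default codec for every column
  builder.encoding("integer", parquet::Encoding::DELTA_BINARY_PACKED);  // delta-pack the monotonically increasing ints
  builder.disable_dictionary("float");                                  // plain encoding for the float column
  return builder.build();
}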
Example of output:
df <- read_parquet("/tmp/test.1_gzip_compressor_delta_binary_packet_encoding.parquet")
head(df)
integer float timestamp date_as_timestamp
1 1 0.5 1970-01-01 00:00:01 1970-01-01 00:00:01
2 2 1.0 1970-01-01 00:00:01 1970-01-01 00:00:01
3 3 1.5 1970-01-01 00:00:01 1970-01-01 00:00:01
4 4 2.0 1970-01-01 00:00:01 1970-01-01 00:00:01
5 5 2.5 1970-01-01 00:00:01 1970-01-01 00:00:01
6 6 3.0 1970-01-01 00:00:01 1970-01-01 00:00:01