Why is the protobuf-net deserializer in my code so much slower than stream-reading a CSV?
I store simple time series in the following format and am looking for the fastest way to read and parse them into "quote" objects:
DateTime, price1, price2
.
.
.
DateTime is in the following string format: YYYYmmdd HH:mm:ss:fff
price1 and price2 are numbers with 5 decimal places, stored as strings (e.g., 1.40505)
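To make the format concrete, here is a minimal sketch of how one such row could be parsed; the exact ParseExact format string and the use of InvariantCulture are my assumptions, and the sample row is made up:

// Parse one CSV row into the three values described above (needs System.Globalization).
// The format string and culture are assumptions based on the description, not verified.
string line = "20130104 09:30:00:123,1.40505,1.40510"; // made-up sample row
string[] parts = line.Split(',');
DateTime dt = DateTime.ParseExact(parts[0], "yyyyMMdd HH:mm:ss:fff", CultureInfo.InvariantCulture);
double px1 = double.Parse(parts[1], CultureInfo.InvariantCulture);
double px2 = double.Parse(parts[2], CultureInfo.InvariantCulture);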
I played with different ways to store and read the data and also toyed around with the protobuf-net library. One serialized file contained roughly 6 million rows; the raw CSV was serialized in the following way:
- a TimeSeries object holding a List<DataBlob>
- each DataBlob object holding a Header object and a List<Quote> (one blob contains the quotes for one single day)
- each Quote object holding a DateTime, double px1, and double px2
It took about 47 seconds to read the serialized binary from disk and deserialize it, which seemed awfully long. In contrast, I kept the time series in CSV string format, read each row into a List, and then parsed each row into DateTime dt, double px1, double px2, which I stuck into a newly created Quote object and added to a List.
This took about 10 seconds to read (12 seconds with GZip compression, which shrinks the file to roughly 1/9th of its size).
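Roughly, that CSV path looks like the following sketch (simplified, not my exact code; the file name is hypothetical, the GZip variant just wraps the file stream in a GZipStream, and the conversion of px2 into the long AskPrice spread uses an assumed scale of 100000 for the 5 decimal places):

// Simplified sketch of the ~10 second CSV path (needs System.IO, System.IO.Compression,
// System.Globalization). Quote is the protobuf-net class shown further down.
string csvFileName = "quotes_20130104.csv"; // hypothetical path
var quotes = new List<Quote>();
Stream input = File.OpenRead(csvFileName);
// GZip variant (~12 s, ~1/9th file size):
// Stream input = new GZipStream(File.OpenRead(csvFileName), CompressionMode.Decompress);
using (var reader = new StreamReader(input))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string[] parts = line.Split(',');
        DateTime dt = DateTime.ParseExact(parts[0], "yyyyMMdd HH:mm:ss:fff", CultureInfo.InvariantCulture);
        double px1 = double.Parse(parts[1], CultureInfo.InvariantCulture);
        double px2 = double.Parse(parts[2], CultureInfo.InvariantCulture);
        quotes.Add(new Quote
        {
            DateTime = dt,
            BidPrice = px1,
            AskPrice = (long)Math.Round((px2 - px1) * 100000) // assumed 1e-5 scale for the spread
        });
    }
}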
At first sight it looks like I am either handling protobuf-net incorrectly, or this particular kind of time series does not lend itself well to serialization/deserialization.
Any comments or help would be appreciated; Marc, if you read this, could you possibly chime in and add some of your thoughts? I find it hard to imagine that I should end up with such different performance numbers.
Some information: I do not need random access to the data. I only ever need to read full days, so storing one day's worth of data in an individual CSV file seemed to make sense for my purpose.
Any ideas what the fastest way to read this kind of data might be? I apologize for the simplistic language; I am not a programmer at heart.
Here are the sample objects I use for protobuf-net:
[ProtoContract]
class TimeSeries
{
    [ProtoMember(1)]
    public Header Header { get; set; }
    [ProtoMember(2)]
    public List<DataBlob> DataBlobs { get; set; }
}

[ProtoContract]
class DataBlob
{
    [ProtoMember(1)]
    public Header Header { get; set; }
    [ProtoMember(2)]
    public List<Quote> Quotes { get; set; }
}

[ProtoContract]
class Header
{
    [ProtoMember(1)]
    public string SymbolID { get; set; }
    [ProtoMember(2)]
    public DateTime StartDateTime { get; set; }
    [ProtoMember(3)]
    public DateTime EndDateTime { get; set; }
}

[ProtoContract]
class Quote
{
    [ProtoMember(1)]
    public DateTime DateTime { get; set; }
    [ProtoMember(2)]
    public double BidPrice { get; set; }
    [ProtoMember(3)]
    public long AskPrice { get; set; } // Expressed as spread to BidPrice
}
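One thing I have seen mentioned (I have not verified whether it matters here) is that protobuf-net buffers each length-prefixed sub-message before writing it, and that group encoding avoids that buffering. The snippet below is only that idea applied to the DataBlob contract above, as an untested variant:

// Untested variant of the DataBlob contract above (assumption): DataFormat.Group
// writes each Quote with start/end group tags instead of a length prefix, so the
// serializer does not have to buffer every Quote to measure its length first.
[ProtoContract]
class DataBlobGrouped
{
    [ProtoMember(1)]
    public Header Header { get; set; }
    [ProtoMember(2, DataFormat = DataFormat.Group)]
    public List<Quote> Quotes { get; set; }
}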
Here is the code used to serialize/deserialize:
public static void SerializeAll(string fileNameWrite, List<Quote> QuoteList)
{
    // Header (SymbolID, StartDateTime, EndDateTime are fields of the containing class)
    Header Header = new Header();
    Header.SymbolID = SymbolID;
    Header.StartDateTime = StartDateTime;
    Header.EndDateTime = EndDateTime;

    // Blob
    List<DataBlob> DataBlobs = new List<DataBlob>();
    DataBlob DataBlob = new DataBlob();
    DataBlob.Header = Header;
    DataBlob.Quotes = QuoteList;
    DataBlobs.Add(DataBlob);

    // Create TimeSeries
    TimeSeries TimeSeries = new TimeSeries();
    TimeSeries.Header = Header;
    TimeSeries.DataBlobs = DataBlobs;

    using (var file = File.Create(fileNameWrite))
    {
        Serializer.Serialize(file, TimeSeries);
    }
}

public static TimeSeries DeserializeAll(string fileNameBinRead)
{
    TimeSeries TimeSeries;
    using (var file = File.OpenRead(fileNameBinRead))
    {
        TimeSeries = Serializer.Deserialize<TimeSeries>(file);
    }
    return TimeSeries;
}
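For reference, timing the deserialization is just a matter of wrapping the call in a Stopwatch; this is only a sketch with a hypothetical file name, not the exact harness behind the numbers above:

// Sketch only: hypothetical file name, not the exact harness behind the 47 second figure.
string fileNameBinRead = "quotes_20130104.bin";
var sw = System.Diagnostics.Stopwatch.StartNew();
TimeSeries ts = DeserializeAll(fileNameBinRead);
sw.Stop();
Console.WriteLine("Deserialized {0} quotes in {1} ms",
    ts.DataBlobs[0].Quotes.Count, sw.ElapsedMilliseconds);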
1 Answer
The fastest way is a hand-coded binary serializer, especially if you transform price ticks. That is what I do, although my volume is slightly different (600 million items per day, around 200,000 symbols, with some being top-heavy). I store nothing in a way that needs parsing from text. The parser is handcrafted and I use a profiler to optimize it; it also handles size very well (a trade is down to 1 byte sometimes).
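To illustrate the idea, here is a generic sketch of a hand-coded binary round-trip for the Quote class from the question (this is my illustration, not the serializer described above; the 100000 price scale is an assumption based on the 5 decimal places mentioned in the question):

// Fixed-layout binary format: DateTime ticks as long, prices scaled to integers
// with an assumed factor of 100000 (5 decimal places). Needs System.IO.
static void WriteQuotes(string path, List<Quote> quotes)
{
    using (var bw = new BinaryWriter(File.Create(path)))
    {
        bw.Write(quotes.Count);
        foreach (var q in quotes)
        {
            bw.Write(q.DateTime.Ticks);                      // 8 bytes
            bw.Write((int)Math.Round(q.BidPrice * 100000));  // 4 bytes
            bw.Write((int)q.AskPrice);                       // 4 bytes, spread in 1e-5 units
        }
    }
}

static List<Quote> ReadQuotes(string path)
{
    using (var br = new BinaryReader(File.OpenRead(path)))
    {
        int count = br.ReadInt32();
        var quotes = new List<Quote>(count);
        for (int i = 0; i < count; i++)
        {
            quotes.Add(new Quote
            {
                DateTime = new DateTime(br.ReadInt64()),
                BidPrice = br.ReadInt32() / 100000.0,
                AskPrice = br.ReadInt32()
            });
        }
        return quotes;
    }
}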