ASCII file parsing speed

Published 2024-12-16 17:09:03


I have two types of files. One of them is an ASCII file, with data stored like this:

X Y Value 
0 0 5154,4
1 0 5545455;
. . ...
. . ...

The other one is a binary file.

I parse the first one with StreamReader and its ReadLine() method, then set the values into a double[,] array by splitting each line with Split(' ').

I parse the second one with BinaryReader.

Parsing the binary file is 3-4 times faster than parsing the ASCII one.

Question 1: Reading the ASCII file is slower than reading the binary one. Is that normal?

Question 2: Would you suggest another way to parse the ASCII file?
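The parsing approach described in the question can be sketched like this (a minimal sketch; the file path, the array dimensions, the header line, the trailing ';' handling, and the use of a comma-decimal culture such as "tr-TR" are all assumptions, not part of the original code):

```csharp
using System;
using System.Globalization;
using System.IO;

class AsciiParser
{
    // Reads an "X Y Value" file into a double[,] indexed by [x, y].
    // Assumes space-separated columns, one header line, and a culture
    // whose decimal separator is a comma (e.g. "5154,4").
    static double[,] Parse(string path, int width, int height)
    {
        var culture = CultureInfo.GetCultureInfo("tr-TR"); // assumption
        var values = new double[width, height];

        using (var reader = new StreamReader(path))
        {
            reader.ReadLine(); // skip the "X Y Value" header
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                int x = int.Parse(parts[0]);
                int y = int.Parse(parts[1]);
                values[x, y] = double.Parse(parts[2].TrimEnd(';'), culture);
            }
        }
        return values;
    }
}
```

Every line here pays for a string allocation, a Split, and a culture-aware double.Parse, which is where the 3-4x gap against the binary reader comes from.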


Comments (3)

醉南桥 2024-12-23 17:09:03


It's not so much that reading ASCII is slower, but how you do it.

Reading ASCII involves parsing: looking for newlines and separators, then converting bits of text to other formats. BinaryReader is basically a straight memory copy.

It's like the difference between fixed-length and CSV, or CSV and XML. The more metadata you add, the more you can get out of it, but the more it costs.

Reading an ASCII file character by character might work out faster than ReadLine and Split, in that you could optimise it for your specific file structure. It's a lot of work though, and very fragile, which makes it a dubious prospect. Offloading the loading to a separate thread, perhaps even processing the lines in parallel, might be more rewarding, and would definitely be more satisfying and reusable.
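The parallel suggestion above could be sketched with PLINQ, since each line can be parsed independently (a sketch only; the culture, the header skip, and the trailing ';' handling are assumptions carried over from the question's sample data):

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

class ParallelParser
{
    // Parses each line into (x, y, value) tuples across the thread pool.
    // Per-line work is independent, so PLINQ can fan it out safely.
    static (int X, int Y, double Value)[] Parse(string path)
    {
        var culture = CultureInfo.GetCultureInfo("tr-TR"); // assumption
        return File.ReadLines(path)
                   .Skip(1) // assumed "X Y Value" header
                   .AsParallel()
                   .Select(line =>
                   {
                       var p = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                       return (int.Parse(p[0]),
                               int.Parse(p[1]),
                               double.Parse(p[2].TrimEnd(';'), culture));
                   })
                   .ToArray();
    }
}
```

Note that AsParallel() does not preserve line order unless you add AsOrdered(); here the X/Y coordinates are carried in the tuple, so order doesn't matter.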

吃不饱 2024-12-23 17:09:03


Reading from an ASCII file and from a binary file is not the difference; the difference is the parsing. After reading the ASCII file, you have to parse each string into a double, and that costs processing time. In a binary file, the data stream you read already equals the binary representation of the double, so no parsing is needed.
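To illustrate the point above: with BinaryReader each value comes out of the stream as a direct 8-byte copy, with no text conversion in between (a minimal sketch, assuming the file is nothing but a flat sequence of little-endian doubles):

```csharp
using System.IO;

class BinaryLoader
{
    // Reads a flat sequence of doubles straight into an array.
    // No text parsing happens: each ReadDouble is essentially
    // an 8-byte copy from the stream.
    static double[] Load(string path)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            int count = (int)(reader.BaseStream.Length / sizeof(double));
            var values = new double[count];
            for (int i = 0; i < count; i++)
                values[i] = reader.ReadDouble();
            return values;
        }
    }
}
```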

药祭#氼 2024-12-23 17:09:03


Once a month we receive a 350 MB CSV file with 3.5 million rows. We used to read it one line at a time and build some indexes, which took approx. 60 seconds every time the service was restarted.
I wrote a program that boiled it down to 1.7 million rows and converted it to a binary format of approx. 24 MB.
That data is read directly into memory in 7 ms; the indexes are generated when needed, and the data is converted when used.
Memory consumption declined from 400 MB to 90 MB.
The point is that you should choose an appropriate format for your data if performance is an issue. Also note that this solution is only possible because the data is fairly static and is not retrieved more than a few million times in 24 hours.
I believe the new service actually answers a little faster now than it used to.
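A one-time text-to-binary conversion along these lines could look as follows for the question's X/Y/Value format (a sketch under assumptions: the file names, the comma-decimal culture, and the skip-malformed-rows policy are all illustrative, not from the answer):

```csharp
using System;
using System.Globalization;
using System.IO;

class CsvToBinary
{
    // One-time conversion: parse the text file once, then write the
    // values as raw ints/doubles so later startups skip text parsing.
    static void Convert(string textPath, string binPath)
    {
        var culture = CultureInfo.GetCultureInfo("tr-TR"); // assumption
        using (var writer = new BinaryWriter(File.Create(binPath)))
        {
            foreach (var line in File.ReadLines(textPath))
            {
                var p = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                if (p.Length < 3 ||
                    !double.TryParse(p[2].TrimEnd(';'), NumberStyles.Float, culture, out var v))
                    continue; // skip the header and malformed rows

                writer.Write(int.Parse(p[0])); // X
                writer.Write(int.Parse(p[1])); // Y
                writer.Write(v);               // Value
            }
        }
    }
}
```

After this runs once, every subsequent load can use BinaryReader on the 20-byte records and avoid string parsing entirely, which is exactly the trade this answer describes.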
