Best practice for reading a csv (variable number of lines) into a data structure
I'm writing a small program to read in a csv with a variable number of lines and have a question about best practices:
Is the best way to store the data from each line to make an array of data structures, one per line of the csv?
The size allocated to the array could be set to a large number (for example, more lines than there would ever reasonably be in the csv). I have seen this in many examples on the web.
Or... is there a smart way to tell how much space will be needed, such as counting the lines beforehand, or dynamically adding space by using a linked list as opposed to an array with static storage allocation? Any best practices? I don't think choosing an arbitrary number seems very slick...
Any thoughts would be greatly appreciated.
Comments (4)
Two best practices:
If you can process the data as you read it rather than saving it all and processing after, this would eliminate the problem.
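To make the "process as you read" idea concrete, here is a minimal sketch in Python (the thread does not name a language, so that choice is an assumption). The `column_total` helper and the sample data are hypothetical; the point is only that nothing but a running total is kept in memory, so the number of lines never matters.

```python
import csv
import io


def column_total(lines, column):
    """Sum one numeric column while reading; no rows are stored."""
    total = 0.0
    for row in csv.reader(lines):
        if row:  # csv.reader yields an empty list for a blank line
            total += float(row[column])
    return total


# io.StringIO stands in for an open file; any iterable of lines works.
sample = io.StringIO("3,apple\n4,pear\n5,plum\n")
result = column_total(sample, 0)
print(result)
```

With a real file you would pass the open file object instead, opened with `newline=""` as the Python `csv` documentation recommends.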
I avoid counting the lines first, as this requires reading the entire file twice. I suppose if the file is small the efficiency hit is not a big deal, but if you know that the file is small, then you could just allocate a big enough space.
So in general, my approach -- if I can't process the file a line at a time -- is to use a data structure that can grow, like a linked list. Then for each line I just allocate a new block. Depending on what you're up to, you might use a dynamic array: allocate an amount of space that ought to be enough for the normal case. If you fill it, allocate a bigger space, copy the first to the second, delete the first, and then continue working with the second. If you fill that, repeat the process. This can be a lot of data movement but the amount of space used in the end will be less than a linked list because you don't have the pointers, and it will be faster to traverse because you're not chasing pointers and possibly running all over virtual memory.
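The grow-and-copy scheme described above could look roughly like the following sketch. The class name `GrowableArray`, the starting capacity, and the choice to double on each growth are all assumptions for illustration; the answer only says to allocate "a bigger space". Python lists already grow this way internally, so this is a teaching sketch rather than something you would need in Python itself.

```python
class GrowableArray:
    """Fixed block of slots that is replaced by a bigger block when full."""

    def __init__(self, capacity=4):
        # Assumed starting size; at least one slot so doubling can work.
        self._slots = [None] * max(1, capacity)
        self._count = 0

    def append(self, item):
        if self._count == len(self._slots):
            # Full: allocate a bigger block (doubling is an assumption),
            # copy the old contents over, and drop the old block.
            bigger = [None] * (2 * len(self._slots))
            bigger[:self._count] = self._slots
            self._slots = bigger
        self._slots[self._count] = item
        self._count += 1

    def capacity(self):
        return len(self._slots)

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        return self._slots[index]


rows = GrowableArray(capacity=2)
for i in range(5):
    rows.append(("row", i))
print(len(rows), rows.capacity())
```

Because the block doubles each time, the total amount of copying stays proportional to the final number of rows, which is why this approach is usually cheap despite the data movement.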
There's really no "best practice". Keep in mind the particular structure of your data, how quickly you want to read it, store it, query it, sort it, find/eliminate/ignore duplicates, etc. A tree, a linked-list, hashing, ordered data, etc. are good options depending on the factors that I already mentioned.
I agree with the other fellows. No need to reinvent the wheel. There must be gazillions of samples about how to parse CSV.
However, when choosing your favorite library, a few words of caution:
3.1. Commas within strings. For example, "Smith, John" (one quoted field) is not the same as Smith, John (two fields).
3.2. Special characters within the strings, such as apostrophes, tabs, or quotes. How are they handled? For example, Microsoft typically uses doubled double quotes to represent a quote inside a string.
3.3. And, of course, be careful with the end of line format (Unix or Windows-style).
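As one way to check points 3.1 to 3.3 against a candidate library, here is a small sketch using Python's standard `csv` module (chosen only as an example; the thread does not recommend a specific library). The sample text is made up and mixes a quoted comma, doubled quotes, and both Windows and Unix line endings.

```python
import csv
import io

# One header line, one line with a quoted comma and doubled quotes
# (both ending in \r\n), then an unquoted line ending in \n.
raw = 'name,note\r\n"Smith, John","He said ""hi"""\r\nSmith,John\n'

# newline="" leaves line endings untouched so the csv module handles them.
rows = list(csv.reader(io.StringIO(raw, newline="")))

for row in rows:
    print(row)
```

Here the quoted "Smith, John" comes back as a single field, the doubled quotes collapse to one quote character, and the unquoted Smith,John splits into two fields.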
Be sure to take a look at a good amount of actual data. Never trust the users (nor the programmers :-).
Good luck. Luis.
Excel and Visual Basic used to generate
Use a library or count the lines beforehand. You could also use some kind of list data structure to avoid worrying about the line count.
+1 to Nissan Fan for recommending a library. In my opinion, unless you're trying to learn a lot about CSV parsing and its edge cases, this is always the way to go.
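For illustration, combining a library with a list structure might look like this in Python, using the standard `csv.DictReader`. The `load_records` helper and the sample columns are hypothetical. The list grows as rows arrive, so the line count never has to be known up front.

```python
import csv
import io


def load_records(lines):
    """Read every row into a list of dicts keyed by the header line."""
    return list(csv.DictReader(lines))


# io.StringIO stands in for an open file with a header row.
sample = io.StringIO("id,city\n1,Lima\n2,Quito\n3,Bogota\n")
records = load_records(sample)
print(len(records), records[0]["city"])
```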