Best practice for reading a CSV (with a variable number of lines) into a data structure

Posted 2024-09-14 22:45:19 · 398 words · 5 views

I'm writing a small program to read in a csv with a variable number of lines and have a question about best practices:

Is the best way to create storage for the data on each line to make an array that holds the data structures of the csv (one per line)?

Could the size of the array simply be set to a large number (for example, more lines than there would ever reasonably be in the csv)? I have seen this in many examples on the web.

Or... is there a smart way to tell how much space will be needed, such as counting the lines beforehand, or to add space dynamically by using a linked list instead of an array with static storage allocation? Any best practices? I don't think choosing an arbitrary number seems very slick...

Any thoughts would be greatly appreciated.

Comments (4)

长不大的小祸害 2024-09-21 22:45:19

A few best practices:

  1. Never expect input from the outside to be correct.
  2. Make it transactional (import everything or roll back).
  3. If possible, leverage a third-party API or library like this http://www.codeproject.com/KB/database/CsvReader.aspx or this http://sourceforge.net/projects/javacsv/ to avoid reinventing the wheel. If you're sticking to C but can use C++, consider this approach: How can I read and manipulate CSV file data in C++?
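
The "all or nothing" idea in point 2 could be sketched in C roughly as follows. This is only an illustration under stated assumptions: the `struct record` layout, the `id,value` line format, the 255-byte line limit, and the function name `load_all_or_nothing` are all hypothetical and not part of the original answer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical row layout: each line is assumed to look like "id,value". */
struct record {
    int id;
    double value;
};

/* Load every line from `in`, or nothing at all.
 * On success: returns 0 and stores the rows in *out and *count.
 * On a malformed line, read error, or allocation failure: returns -1,
 * frees any partial result, and leaves *out and *count untouched. */
int load_all_or_nothing(FILE *in, struct record **out, size_t *count)
{
    assert(in != NULL && out != NULL && count != NULL);

    char line[256];               /* assumption: lines fit in 255 bytes */
    struct record *rows = NULL;   /* staging area, not yet visible to caller */
    size_t len = 0, cap = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        struct record r;
        if (sscanf(line, "%d,%lf", &r.id, &r.value) != 2)
            goto fail;            /* never trust the input */
        if (len == cap) {
            size_t new_cap = cap ? cap * 2 : 8;
            struct record *p = realloc(rows, new_cap * sizeof *p);
            if (p == NULL)
                goto fail;
            rows = p;
            cap = new_cap;
        }
        rows[len++] = r;
    }
    if (ferror(in))
        goto fail;

    *out = rows;                  /* commit: publish the result in one step */
    *count = len;
    return 0;

fail:
    free(rows);                   /* rollback: discard the partial import */
    return -1;
}
```

The caller only ever sees a complete import or no change at all, which is the property the answer asks for.
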
寄风 2024-09-21 22:45:19

If you can process the data as you read it, rather than saving it all and processing it afterwards, the problem goes away entirely.
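
As a rough illustration of processing while reading, the sketch below sums the first column of a CSV without storing any rows. The function name `sum_first_column`, the numeric first column, and the 1023-byte line limit are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical example: add up the first column while reading, so memory
 * use stays constant no matter how many lines the file has.
 * Returns the number of lines read and stores the sum in *total. */
size_t sum_first_column(FILE *in, double *total)
{
    assert(in != NULL && total != NULL);

    char line[1024];              /* assumption: no line exceeds 1023 bytes */
    size_t count = 0;

    *total = 0.0;
    while (fgets(line, sizeof line, in) != NULL) {
        /* strtod stops at the first character that is not part of a
         * number, which here is the comma ending the first field. */
        *total += strtod(line, NULL);
        count++;
    }
    return count;
}
```

Because only one line buffer is ever allocated, the question of how large to make the array never comes up.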

I avoid counting the lines first, as this requires reading the entire file twice. I suppose if the file is small the efficiency hit is not a big deal, but if you know that the file is small, then you could just allocate a big enough space.
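
For completeness, the two-pass variant might look like the sketch below; `count_lines` is a hypothetical helper. It assumes a seekable stream (a regular file, not a pipe or stdin) and counts physical lines, so a quoted field containing a newline would be counted as an extra line.

```c
#include <assert.h>
#include <stdio.h>

/* First pass: count the lines, then rewind so the caller can make a
 * second pass over the same stream. */
long count_lines(FILE *in)
{
    assert(in != NULL);

    long lines = 0;
    int c;
    int last = '\n';              /* so an empty file counts as 0 lines */

    while ((c = getc(in)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')
        lines++;                  /* final line without a trailing newline */

    rewind(in);                   /* the second pass starts at the beginning */
    return lines;
}
```

The result can then be used to allocate an exactly sized array before the second pass, at the cost of reading the file twice.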

So in general, if I can't process the file a line at a time, my approach is to use a data structure that can grow, like a linked list; for each line I just allocate a new node. Depending on what you're up to, you might use a dynamic array instead: allocate enough space for the normal case. If you fill it, allocate a bigger block, copy the old contents into it, free the old block, and continue with the new one. If you fill that, repeat the process. This can be a lot of data movement, but the space used in the end will generally be less than with a linked list because there are no per-element pointers, and traversal will be faster because you're not chasing pointers that may be scattered all over virtual memory.
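
A minimal sketch of the growing array described above, assuming a hypothetical `struct row` record and a doubling growth policy (the answer does not say how much bigger the new block should be; doubling is a common choice because it keeps the total amount of copying proportional to the final size). `realloc` performs the allocate, copy, and free steps in one call.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record type: one parsed CSV line. */
struct row {
    int id;
    double value;
};

struct row_array {
    struct row *items;
    size_t len;                   /* rows in use */
    size_t cap;                   /* rows allocated */
};

/* Append one row, doubling the capacity when the array is full.
 * Returns 0 on success, or -1 on allocation failure, in which case the
 * array is left unchanged and still valid. */
int row_array_push(struct row_array *a, struct row r)
{
    assert(a != NULL);

    if (a->len == a->cap) {
        size_t new_cap = a->cap ? a->cap * 2 : 16;
        struct row *p = realloc(a->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        a->items = p;
        a->cap = new_cap;
    }
    a->items[a->len++] = r;
    return 0;
}

/* Release the storage and reset the array to its empty state. */
void row_array_free(struct row_array *a)
{
    free(a->items);
    a->items = NULL;
    a->len = 0;
    a->cap = 0;
}
```

Start from a zero-initialized `struct row_array` and call `row_array_push` once per parsed line; no line count is needed in advance.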

鸵鸟症 2024-09-21 22:45:19

There's really no "best practice". Keep in mind the particular structure of your data and how quickly you want to read it, store it, query it, sort it, find/eliminate/ignore duplicates, etc. A tree, a linked list, a hash table, ordered data, etc. are all good options depending on the factors I just mentioned.

I agree with the other fellows. No need to reinvent the wheel. There must be gazillions of samples about how to parse CSV.

However, when choosing your favorite library, a few words of caution:

  1. Best practice: never assume the data has a specific size, whether small or very large. Corollary: don't keep all the data in memory, only as much as you reasonably need, and assume that whatever the size of your array, the data may be bigger. Design around that possibility.
  2. Another best practice: test corner cases (no input, very large input, only one line or element, etc.).
  3. CSV files are not standardized. For example, some programs that generate CSV simply ignore the following cases:

3.1. Commas within strings. For example, "Smith, John" (one quoted field) is not the same as Smith, John (two unquoted fields).
3.2. Special characters within the strings, such as apostrophes, tabs, or quotes. How are they handled? For example, Microsoft typically uses a doubled double quote ("") to represent a quote inside a string.
3.3. And, of course, be careful with the end-of-line format (Unix or Windows style).
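
To make cases 3.1 to 3.3 concrete, here is a small field splitter in C. It is a sketch under assumptions, not a full CSV parser: the name `split_csv_line` is hypothetical, it handles one line at a time, it treats a doubled quote inside a quoted field as a literal quote, and it accepts both `\n` and `\r\n` line endings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split one CSV line in place. Pointers to the unescaped fields are
 * stored in `fields`; the return value is the number of fields found
 * (at most max_fields). The contents of `line` are modified. */
size_t split_csv_line(char *line, char **fields, size_t max_fields)
{
    assert(line != NULL && fields != NULL);

    size_t n = 0;
    char *src = line;

    while (n < max_fields) {
        char *dst = src;          /* unescaped text is written in place */
        fields[n++] = dst;

        if (*src == '"') {        /* quoted field: commas are literal */
            src++;
            while (*src != '\0') {
                if (*src == '"' && src[1] == '"') {
                    *dst++ = '"'; /* "" stands for one literal quote */
                    src += 2;
                } else if (*src == '"') {
                    src++;        /* closing quote */
                    break;
                } else {
                    *dst++ = *src++;
                }
            }
        }

        /* Unquoted text runs up to the next comma or line ending. */
        size_t rest = strcspn(src, ",\r\n");
        memmove(dst, src, rest);
        dst += rest;
        src += rest;

        char delim = *src;        /* read the delimiter before overwriting */
        *dst = '\0';
        if (delim != ',')
            break;                /* end of line or end of string */
        src++;
    }
    return n;
}
```

With this sketch, the quoted input `"Smith, John",42` yields two fields, while the unquoted `Smith, John,42` yields three, which is exactly the difference described in 3.1.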

Be sure to look at a good-sized sample of actual data. Never trust the users (nor the programmers :-).

Good luck. Luis.
Excel and Visual Basic used to generate

谎言 2024-09-21 22:45:19

Use a library or count the lines beforehand. You could also use some kind of list data structure to avoid worrying about the line count.
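
One possible shape for such a list in C is a singly linked list with a tail pointer, so that appending each line is O(1) and the line count never has to be known up front. The type and function names below (`line_list`, `line_list_append`, `line_list_free`) are hypothetical, and the sketch stores each line as a raw string copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct line_node {
    char *text;                   /* heap copy of one CSV line */
    struct line_node *next;
};

struct line_list {
    struct line_node *head;
    struct line_node *tail;       /* lets append run in constant time */
    size_t count;
};

/* Append a copy of `text`. Returns 0 on success, -1 on allocation
 * failure (the list is left unchanged). */
int line_list_append(struct line_list *list, const char *text)
{
    assert(list != NULL && text != NULL);

    struct line_node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;

    size_t size = strlen(text) + 1;
    node->text = malloc(size);
    if (node->text == NULL) {
        free(node);
        return -1;
    }
    memcpy(node->text, text, size);
    node->next = NULL;

    if (list->tail != NULL)
        list->tail->next = node;
    else
        list->head = node;        /* first node in an empty list */
    list->tail = node;
    list->count++;
    return 0;
}

/* Free every node and reset the list to its empty state. */
void line_list_free(struct line_list *list)
{
    struct line_node *node = list->head;
    while (node != NULL) {
        struct line_node *next = node->next;
        free(node->text);
        free(node);
        node = next;
    }
    list->head = NULL;
    list->tail = NULL;
    list->count = 0;
}
```

Start from a zero-initialized `struct line_list`, append each line as it is read, and walk from `head` afterwards.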

+1 to Nissan Fan for recommending a library. In my opinion, unless you're trying to learn a lot about CSV parsing and its edge cases, this is always the way to go.
