50 million+ rows of data - CSV or MySQL
I have a CSV file that is about 1 GB and contains about 50 million rows of data. I am wondering whether it is better to keep it as a CSV file or to store it in some form of database. I don't know enough about MySQL to argue for using it, or another database system, over just keeping the data as a CSV file. I am basically doing a breadth-first search with this dataset, so once I get the initial "seed" set from the 50 million rows, I use it as the first values in my queue.
Thanks,
Comments (5)
I would say there are many benefits to using a database rather than a CSV file for structured data this large, so I suggest learning enough to do so. However, based on your description, you might want to look at non-server, lighter-weight databases such as SQLite, or something similar to JavaDB/Derby. Depending on the structure of your data, a non-relational (NoSQL) database could also fit. Obviously you will need one with some kind of Python support, though.
If you want to search on something graph-ish (since you mention Breadth-First Search) then a graph database might prove useful.
Are you just going to slurp in everything all at once? If so, then CSV is probably the way to go. It's simple and works.
If you need to do lookups, then something that lets you index the data, like MySQL, would be better.
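As a rough sketch of what an indexed lookup looks like, the snippet below uses Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so the example runs without a server). The table name `friends`, the columns `a` and `b`, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# In-memory database for the demo; pass a file path to persist real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (a TEXT NOT NULL, b TEXT NOT NULL)")

# Hypothetical two-column CSV standing in for the real 50-million-row file.
sample = io.StringIO("alice,bob\nalice,carol\nbob,dave\n")
rows = (tuple(r[:2]) for r in csv.reader(sample) if len(r) >= 2)

with conn:  # one transaction for the bulk load, committed on exit
    conn.executemany("INSERT INTO friends (a, b) VALUES (?, ?)", rows)
    # The index lets "WHERE a = ?" avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_friends_a ON friends (a)")

def friends_of(person):
    """Return the sorted list of b values recorded for the given a."""
    cur = conn.execute(
        "SELECT b FROM friends WHERE a = ? ORDER BY b", (person,)
    )
    return [b for (b,) in cur]

result = friends_of("alice")
print(result)
```

Building the index after the bulk insert is usually faster than maintaining it row by row during the load.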
From your previous questions, it looks like you are doing social-network searches against Facebook friend data, so I presume your data is a set of "A is-friend-of B" statements and you are looking for the shortest connection between two individuals?
If you have enough memory, I would suggest parsing your CSV file into a dictionary of lists. See "Can this breadth-first search be made faster?"
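A minimal sketch of the dictionary-of-lists idea, assuming each CSV row holds one friendship as two columns and that friendships are symmetric. The function names `load_graph` and `shortest_path` and the sample data are made up for illustration.

```python
import csv
import io
from collections import defaultdict, deque

def load_graph(csv_file):
    """Build an adjacency map {person: [friends]} from two-column rows."""
    graph = defaultdict(list)
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        a, b = row[0], row[1]
        graph[a].append(b)
        graph[b].append(a)  # assumption: friendship goes both ways
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; return the path as a list, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in parent:
                continue
            parent[neighbour] = node
            if neighbour == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None

# Hypothetical sample in place of the real file.
sample = io.StringIO("alice,bob\nbob,carol\ncarol,dave\nerin,frank\n")
graph = load_graph(sample)
path = shortest_path(graph, "alice", "dave")
print(path)
```

For the real file, pass `open(path, newline="")` instead of the `StringIO` object. With 50 million rows, mapping names to integer IDs first would cut memory use noticeably.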
If you cannot hold all the data at once, a local-storage database like SQLite is probably your next-best alternative.
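If the graph does not fit in memory, one possible shape for the search is to keep only the frontier and the visited set in Python and ask SQLite for neighbours one level at a time. This is a sketch under assumptions: the `edges` table, its `src`/`dst` columns, and the helper names are hypothetical, and the demo uses an in-memory database where real data would live in a file.

```python
import sqlite3

def setup_demo(conn):
    """Create a tiny hypothetical edge table, storing both directions."""
    conn.execute("CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL)")
    pairs = [("alice", "bob"), ("bob", "carol"),
             ("carol", "dave"), ("erin", "frank")]
    both = pairs + [(b, a) for a, b in pairs]
    conn.executemany("INSERT INTO edges (src, dst) VALUES (?, ?)", both)
    conn.execute("CREATE INDEX idx_edges_src ON edges (src)")
    conn.commit()

def bfs_distance(conn, start, goal, batch=500):
    """Level-by-level BFS; return the hop count, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    frontier = [start]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Query in chunks to stay under SQLite's bound-parameter limit.
        for i in range(0, len(frontier), batch):
            chunk = frontier[i:i + batch]
            marks = ",".join("?" * len(chunk))
            query = f"SELECT DISTINCT dst FROM edges WHERE src IN ({marks})"
            for (dst,) in conn.execute(query, chunk):
                if dst == goal:
                    return depth
                if dst not in seen:
                    seen.add(dst)
                    next_frontier.append(dst)
        frontier = next_frontier
    return None

conn = sqlite3.connect(":memory:")  # use a file path for real data
setup_demo(conn)
hops = bfs_distance(conn, "alice", "dave")
print(hops)
```

The visited set still lives in memory, so this helps most when the edge list is the part that is too large to hold at once.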
There are also some python modules which might help:
How about a key-value store like MongoDB?