Handling large datasets
What is the best solution for handling a LARGE dataset?

I have the data as txt files, broken into multiple files that add up to about 100 GB. The files are nothing more than ID pairs, one pair per line:

uniqID1 uniqID2
etc.

I want to calculate things like:

1: the unique number of uniqIDs
2: the list of other IDs that uniqID1 is linked to

What is the best solution? And how do I load these into a database?

Thank you!
A relational database (one that supports SQL) and a table with an index on id1 and another index on id2 would do what you need. SQLite would do the job. So if you had a table with two columns, id1 and id2, and about five billion rows in it, you would get quick answers to both of the questions you list.
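A minimal sketch of that setup, using Python's built-in sqlite3 module. The table name `pairs` and the sample data are made up for illustration; only the column names id1/id2 and the two indexes come from the answer:

```python
import sqlite3

# In-memory database for illustration; point this at a file for 100 GB of data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (id1 TEXT, id2 TEXT)")
# One index per column, so lookups in either direction are fast.
conn.execute("CREATE INDEX idx_id1 ON pairs (id1)")
conn.execute("CREATE INDEX idx_id2 ON pairs (id2)")

# A few made-up ID pairs.
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("b", "c")],
)

# Question 1: unique number of IDs appearing in either column.
(unique_ids,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id1 FROM pairs UNION SELECT id2 FROM pairs)"
).fetchone()

# Question 2: list of other IDs that 'a' is linked to, in either direction.
linked = [row[0] for row in conn.execute(
    "SELECT id2 FROM pairs WHERE id1 = ? "
    "UNION SELECT id1 FROM pairs WHERE id2 = ? ORDER BY 1", ("a", "a")
)]

print(unique_ids)  # 3
print(linked)      # ['b', 'c']
```

Both queries use only the indexed columns, so at five billion rows the linked-ID lookup stays an index scan rather than a full table scan.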
EDIT: to import them, it would be best to separate the two values with some character that never occurs in the values, such as a comma, a pipe character, or a tab, one pair per line.
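A sketch of that import step under the same assumptions (tab-separated pairs, a `pairs(id1, id2)` table as above; the function name and batch size are mine):

```python
import csv
import sqlite3

def import_pairs(conn, path, batch_size=100_000):
    """Stream tab-separated ID pairs from a text file into the pairs table."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        batch = []
        for row in reader:
            batch.append((row[0], row[1]))
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO pairs VALUES (?, ?)", batch)
                batch.clear()
        if batch:
            conn.executemany("INSERT INTO pairs VALUES (?, ?)", batch)
    # One commit for the whole file: with SQLite, a single transaction
    # is far faster than committing per row.
    conn.commit()
```

Running this once per input file covers the multi-file layout described in the question; the sqlite3 command-line shell's `.import` command is another way to do the same bulk load.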
EDIT2: You don't need a relational database, but it doesn't hurt anything, and your data structure is more extensible if the db is relational.