Handling large datasets
What is the best solution for handling a LARGE dataset?

I have the data as txt files, broken into multiple files that add up to about 100 GB. The files are nothing more than ID pairs, one pair per line:

uniqID1 uniqID2
etc.

I want to calculate things like:

1: the unique number of uniqIDs
2: the list of other IDs that uniqID1 is linked to

What is the best solution? And how do I load these into a database?

Thank you!
A relational database (one that supports SQL) and a table with an index on id1 and another index on id2 would do what you need. SQLite would do the job. So if you had a table with two columns, id1 and id2, and about five billion rows in it, you would get quick answers to both of the questions you list.
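A minimal sketch of that setup, using Python's built-in sqlite3 module. The table name `pairs` and the sample data are made up for illustration; only the column names id1/id2 and the two indexes come from the answer:

```python
import sqlite3

# In-memory database for illustration; point this at a file for 100 GB of data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs (id1 TEXT, id2 TEXT)")
# One index per column, so lookups in either direction are fast.
conn.execute("CREATE INDEX idx_id1 ON pairs (id1)")
conn.execute("CREATE INDEX idx_id2 ON pairs (id2)")

# A few made-up ID pairs.
conn.executemany(
    "INSERT INTO pairs VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("b", "c")],
)

# Question 1: unique number of IDs appearing in either column.
(unique_ids,) = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id1 FROM pairs UNION SELECT id2 FROM pairs)"
).fetchone()

# Question 2: list of other IDs that 'a' is linked to, in either direction.
linked = [row[0] for row in conn.execute(
    "SELECT id2 FROM pairs WHERE id1 = ? "
    "UNION SELECT id1 FROM pairs WHERE id2 = ? ORDER BY 1", ("a", "a")
)]

print(unique_ids)  # 3
print(linked)      # ['b', 'c']
```

Both queries use only the indexed columns, so at five billion rows the linked-ID lookup stays an index scan rather than a full table scan.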
EDIT: to import them, it would be best to separate the two values with some character that never occurs in the values, such as a comma, a pipe character, or a tab, one pair per line.
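A sketch of that import step under the same assumptions (tab-separated pairs, a `pairs(id1, id2)` table as above; the function name and batch size are mine):

```python
import csv
import sqlite3

def import_pairs(conn, path, batch_size=100_000):
    """Stream tab-separated ID pairs from a text file into the pairs table."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        batch = []
        for row in reader:
            batch.append((row[0], row[1]))
            if len(batch) >= batch_size:
                conn.executemany("INSERT INTO pairs VALUES (?, ?)", batch)
                batch.clear()
        if batch:
            conn.executemany("INSERT INTO pairs VALUES (?, ?)", batch)
    # One commit for the whole file: with SQLite, a single transaction
    # is far faster than committing per row.
    conn.commit()
```

Running this once per input file covers the multi-file layout described in the question; the sqlite3 command-line shell's `.import` command is another way to do the same bulk load.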
EDIT2: You don't need a relational database, but it doesn't hurt anything, and your data structure is more extensible if the db is relational.