如何对大型数据集进行分组
我有一个简单的文本文件,其中包含两列,都是整数
1 5
1 12
2 5
2 341
2 12
等等。
我需要按第二个值对数据集进行分组, 这样输出就会是。
5 1 2
12 1 2
341 2
现在的问题是文件很大,大约 34 GB 在大小方面,我尝试编写一个 python 脚本将它们分组到一个字典中,其值为整数数组,但仍然需要很长时间。 (我猜分配array('i')
并在append
上扩展它们需要花费大量时间。
我现在计划编写一个我正在计划的猪脚本在伪分布式 hadoop 机器(Amazon EC3 高内存大型实例)上运行
data = load 'Net.txt';
gdata = Group data by $1; // I know it will lead to 5 (1,5) (2,5) but thats okay for this snippet
store gdata into 'res.txt';
我想知道是否有更简单的方法
更新: 。 在内存中保存这么大的文件是毫无疑问的,对于python解决方案,我计划是在第一次运行中进行4次运行,仅考虑从1到1000万的第二个col值,在下一次运行中考虑1000万到2000万等等。但事实证明这真的很慢。
pig / hadoop 解决方案很有趣,因为它将所有内容都保存在磁盘上[好吧,其中大部分]。
为了更好地理解该数据集包含约 4500 万 Twitter 用户的连接信息,文件中的格式意味着第二个数字给出的用户 ID 位于第一个数字之后。
我使用过的解决方案:
class AdjDict(dict):
"""
A special Dictionary Class to hold adjecancy list
"""
def __missing__(self, key):
"""
Missing is changed such that when a key is not found an integer array is initialized
"""
self.__setitem__(key,array.array('i'))
return self[key]
Adj= AdjDict()
for line in file("net.txt"):
entry = line.strip().split('\t')
node = int(entry[1])
follower = int(entry[0])
if node < 10 ** 6:
Adj[node].append(follower)
# Code for writting Adj matrix to the file:
I have simple text file containing two columns, both integers
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by second value,
such that the output will be.
5 1 2
12 1 2
341 2
Now the problem is that the file is very big around 34 Gb
in size, I tried writing a python script to group them into a dictionary with value as an array of integers, still it takes way too long. (I guess a large time is taken for allocating the array('i')
and extending them on append
.
I am now planning to write a pig script which I am planning to run on a pseudo distributed hadoop machine (An Amazon EC3 High Memory Large instance).
data = load 'Net.txt';
gdata = Group data by $1; // I know it will lead to 5 (1,5) (2,5) but thats okay for this snippet
store gdata into 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update:
keeping such a big file in memory is out of question, In case of python solution, what I planned was to conduct 4 runs in first run only second col values from 1 - 10 million are considered in next run 10 million to 20 million are considered and so on. but this turned out to be really slow.
The pig / hadoop solution is interesting because it keeps everything on disk [Well most of it].
For better understanding this dataset contains information about connectivity of ~45 Million twitter users and the format in file means that userid given by the second number is following the the first one.
Solution which I had used:
class AdjDict(dict):
"""
A special Dictionary Class to hold adjecancy list
"""
def __missing__(self, key):
"""
Missing is changed such that when a key is not found an integer array is initialized
"""
self.__setitem__(key,array.array('i'))
return self[key]
Adj= AdjDict()
for line in file("net.txt"):
entry = line.strip().split('\t')
node = int(entry[1])
follower = int(entry[0])
if node < 10 ** 6:
Adj[node].append(follower)
# Code for writting Adj matrix to the file:
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
假设每行大约有 17 个字符(我随机选择的数字是为了使数学更容易),则此文件中大约有 20 亿条记录。除非您在 64 位系统上运行大量物理内存,否则您将试图将所有这些内容保存在单个字典中,从而使页面文件崩溃。这只是将其作为数据结构读入 - 人们假设在构建此结构后,您计划实际上用它做一些事情。
对于如此简单的数据格式,我认为你最好用 C 而不是 Python 来做一些事情。破解这些数据应该不难,而且每个值的开销也会少得多。至少,仅容纳 20 亿个 4 字节整数就需要 8 GB(除非您可以对当前列为 1 和 2 的值的可能范围做出一些简化假设 - 如果它们适合一个字节或一个短整型,那么你可以使用较小的 int 变量,对于这种大小的数据集来说,这是值得的)。
Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.
With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just to hold 2 billion 4-byte integers would be 8 Gb (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).
如果我必须在当前的硬件上解决这个问题,我可能会编写一些小程序:
第一个程序将处理 500 MB 的文件块,交换列并将结果写入新文件。 (您将得到 70 个或更多。)(这不会占用太多内存。)
然后我会对每个小文件调用操作系统提供的
sort(1)
。 (这可能需要几GB内存。)然后我会编写一个合并排序程序,将所有 70 多个子文件中的行合并在一起。 (这不会占用太多内存。)
然后我会编写一个程序来运行大型排序列表;你会得到一堆像这样的行:
并且你需要返回:(
这不会占用太多内存。)
通过将其分成更小的块,希望你可以将 RSS 保持在适合合理机器的水平-- 它将需要更多的磁盘I/O,但是在除了令人惊叹的硬件之外的任何东西上,交换的使用都会阻止在一个大程序中处理这个问题的尝试。
If I had to solve this on my current hardware, I'd probably write a few small programs:
The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)
Then I'd call the OS-supplied
sort(1)
on each small file. (This might take a few gigs of memory.)Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)
Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:
and you'll need to return:
(This won't take much memory.)
By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
也许您可以多次遍历该文件。
每次遍历文件时执行一系列键,例如,如果您选择的范围大小为 100
第一遍 - 计算出 0-99 之间的所有键
第二遍 - 算出 100-199 的所有键
第三遍 - 算出 200-299 的所有键
第四遍 - 算出 300-399 的所有键
..等等。
对于您的示例,第 1 遍将输出
,第 4 遍将输出
选择范围大小,以便您创建的字典适合您的 RAM
我不会费心使用多处理来尝试通过使用多个核心来加速它,除非您有一个非常快的硬盘驱动器,这应该是 IO 绑定的,你最终只会破坏磁盘
Maybe you can do a multi-pass through the file.
Do a range of keys each pass through the file, for example if you picked a range size of 100
1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.
for your sample, the 1st pass would output
and the 4th pass would output
Choose the range size so that the dict you are creating fits into your RAM
I wouldn't bother using multiprocessing to try to speed it up by using multiple cores, unless you have a very fast harddrive this should be IO bound and you would just end up thrashing the disk
如果您正在使用 34 GB 文件,我假设硬盘驱动器在存储和访问时间方面都不是问题。如何按顺序读取对,当找到对 (x,y) 时,打开文件“x”,追加“y”并关闭文件“x”?最后,每个 Twitter 用户 ID 都会有一个文件,每个文件都包含该文件所连接的所有用户。如果您希望结果采用指定的输出格式,则可以连接所有这些文件。
话虽这么说,我确实认为:
(a) 对于如此大的数据集,精确分辨率是不合适的,并且
(b) 可能有一些更好的方法来衡量连接性,所以也许您想告诉我们您的最终目标。
事实上,你有一个非常大的图,并且已经设计了许多有效的技术来研究巨大图的形状和属性——这些技术中的大多数都是为了作为流式在线算法而构建的。
例如,一种称为“三角形计数”的技术与概率基数估计算法相结合,可以高效、快速地提供有关图形中包含的派系的信息。有关三角形计数方面的更好想法以及它与图形的关系,请参阅例如此(随机选择)文章。
If you are working with a 34 GB file, I'm assuming that hard drive, both in terms of storage and access-time, is not a problem. How about reading the pairs sequentially and when you find pair (x,y), open file "x", append " y" and close file "x"? In the end, you will have one file per Twitter userid, and each file containing all users this one is connected to. You can then concatenate all those files if you want to have your result in the output format you specified.
THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.
Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.
For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.
我有一个类似的要求,你只需要再一条 pig 语句来删除 5 (1,5) (2,5) 中的冗余。
I had a similar requirement and you just require one more pig statement to remove the redundancies in 5 (1,5) (2,5).