How to group a large dataset


I have a simple text file containing two columns, both integers:

1 5
1 12
2 5
2 341
2 12

and so on.

I need to group the dataset by the second value, so that the output would be:

5 1 2
12 1 2
341 2

Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the rows into a dictionary whose values are arrays of integers, but it still takes far too long. (I guess a lot of the time goes into allocating the array('i') objects and extending them on append.)

I am now planning to write a Pig script to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).

data = LOAD 'Net.txt';
gdata = GROUP data BY $1;  -- I know it will lead to 5 (1,5) (2,5), but that's okay for this snippet
STORE gdata INTO 'res.txt';

I wanted to know if there was any simpler way of doing this.

Update:
Keeping such a big file in memory is out of the question. For the Python solution, the plan was to do 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run 10 million to 20 million, and so on. But this turned out to be really slow.

The Pig/Hadoop solution is interesting because it keeps everything on disk [well, most of it].

For better understanding: this dataset contains connectivity information for ~45 million Twitter users, and the file format means that the user ID given by the second number is following the first one.

The solution I had used:

import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list.
    """
    def __missing__(self, key):
        """
        __missing__ is overridden so that when a key is not found,
        an integer array is initialized for it.
        """
        self[key] = array.array('i')
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing Adj to the file:
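The post stops at that comment; a minimal guess at what the omitted writing step might look like, using the same grouped output format shown above (the output file name is a placeholder):

# Hypothetical completion of the omitted step: one line per node,
# followed by all of its collected followers.
with open("adj_part.txt", "w") as out:
    for node in sorted(Adj):
        out.write("%d %s\n" % (node, " ".join(str(f) for f in Adj[node])))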


自在安然 2024-09-20 03:44:14

Assuming you have ~17 characters per line (a number I picked at random to make the math easier), you have about 2 billion records in this file. Unless you are running on a 64-bit system with a lot of physical memory, you will thrash your pagefile to death trying to hold all of this in memory in a single dict. And that's just to read it in as a data structure; one presumes that after this structure is built, you actually plan to do something with it.

With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just holding 2 billion 4-byte integers would take 8 GB (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2; if they fit within a byte or a short, you can use smaller int variables, which will be worth the trouble for a data set of this size).
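The per-value overhead is easy to see even without leaving Python; a rough illustration (assuming CPython on a 64-bit machine) of what a plain list of ints costs compared with the packed array typecodes that the "byte or a short" suggestion points at:

import array
import sys

n = 1_000_000
as_list = list(range(n))               # a full Python int object plus a pointer per value
as_int32 = array.array('i', range(n))  # 4 raw bytes per value

list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
print(list_bytes // n, "bytes per value as a list of ints")         # a few dozen bytes each
print(as_int32.itemsize, "bytes per value as array('i')")           # typically 4
print(array.array('h').itemsize, "bytes per value as array('h')")   # 2, if the IDs fit in a short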

明天过后 2024-09-20 03:44:14

If I had to solve this on my current hardware, I'd probably write a few small programs:

The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)

Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)

Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)

Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:

5 1
5 2
12 1
12 2

and you'll need to return:

5 1 2
12 1 2

(This won't take much memory.)

By breaking it into smaller chunks, hopefully you can keep the RSS down to something that fits on a reasonable machine -- it will take more disk I/O, but on anything short of astonishing hardware, swapping would kill any attempt to handle this in one big program.
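A minimal Python sketch of the last two steps (the merge and the collapse into one line per key), assuming the column-swapped chunk files were each sorted numerically on their first column (e.g. with sort -n) and are named chunk_*.sorted; both of those names are placeholders:

import glob
import heapq
import itertools

def pairs(path):
    # Yield (key, value) pairs from one column-swapped, numerically sorted chunk file.
    with open(path) as f:
        for line in f:
            k, v = line.split()
            yield int(k), int(v)

# Assumed naming for the 70-odd swapped-and-sorted chunk files.
chunks = [pairs(p) for p in sorted(glob.glob("chunk_*.sorted"))]

with open("grouped.txt", "w") as out:
    # heapq.merge keeps only one pending line per chunk in memory.
    merged = heapq.merge(*chunks)
    for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
        out.write("%d %s\n" % (key, " ".join(str(v) for _, v in group)))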

很酷不放纵 2024-09-20 03:44:14

Maybe you can make multiple passes through the file.

Handle a range of keys on each pass; for example, if you picked a range size of 100:

1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
...and so on.

For your sample, the 1st pass would output

5 1 2
12 1 2

and the 4th pass would output

341 2

Choose the range size so that the dict you are creating fits into your RAM.

I wouldn't bother using multiprocessing to try to speed it up with multiple cores; unless you have a very fast hard drive this should be I/O bound, and you would just end up thrashing the disk.
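A minimal sketch of this approach, assuming the whitespace/tab-separated format from the question; net.txt, the output file names, and the 10-million range size are placeholders, with the range size to be tuned so each pass's dict fits in RAM:

from collections import defaultdict

RANGE_SIZE = 10_000_000   # tune so that one pass's dict fits in RAM

def one_pass(lo, hi, infile="net.txt"):
    # Collect followers only for second-column keys in [lo, hi) on this pass.
    groups = defaultdict(list)
    with open(infile) as f:
        for line in f:
            follower, node = line.split()
            node = int(node)
            if lo <= node < hi:
                groups[node].append(follower)
    with open("grouped_%08d_%08d.txt" % (lo, hi), "w") as out:
        for node in sorted(groups):
            out.write("%d %s\n" % (node, " ".join(groups[node])))

# ~45 million user IDs in the real data set, so cover 0 .. 50 million.
for lo in range(0, 50_000_000, RANGE_SIZE):
    one_pass(lo, lo + RANGE_SIZE)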

未蓝澄海的烟 2024-09-20 03:44:14

If you are working with a 34 GB file, I'm assuming that the hard drive, in terms of both storage and access time, is not a problem. How about reading the pairs sequentially, and when you find a pair (x,y), opening file "x", appending " y" and closing file "x"? In the end you will have one file per Twitter user ID, each file containing all the users this one is connected to. You can then concatenate all those files if you want the result in the output format you specified.
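A minimal sketch of the idea, here keyed on the second column so the per-key files line up with the grouping asked for in the question; the groups/ directory name is an assumption, and the one-open-per-line strategy is as naive as it sounds:

import os

os.makedirs("groups", exist_ok=True)   # hypothetical output directory

with open("net.txt") as f:
    for line in f:
        follower, node = line.split()
        # Append this follower to the file named after the grouping key.
        # Opening and closing a file per input line needs almost no memory,
        # but pays for it in I/O.
        with open(os.path.join("groups", node), "a") as out:
            out.write(follower + " ")

# Each file groups/<key> now holds the first-column values for that key;
# prefixing its contents with "<key> " reproduces the requested output lines.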


THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.

Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.

For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.

も让我眼熟你 2024-09-20 03:44:14

I had a similar requirement; you just need one more Pig statement to remove the redundancy in 5 (1,5) (2,5).

a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);
b = GROUP a BY user;
x = FOREACH b GENERATE group.user, a.following;
STORE x INTO 'following-list';