Merge 2 very large text files, updating each line, without using memory

Posted 2024-12-02 03:53:05

Say I've got 2 text files with around 2 million lines each (~50-80MB file size each). The structure of both files is the same:

Column1 Column2 Column3
...

Column 1 never changes. For Column 2, the same value may not be in both files, and the rows won't be in the same order in the two files. Column 3 is a number and will be different in each file.

I need to be able to merge them both into one file, matched by Column 2. If Column2 exists in both files, update Column3 by adding the values of Column3 from both files together.

If the files weren't so huge, I could easily do this in PHP by reading each line of both files into arrays and going from there, but doing so easily overloads the memory available.

Is there a way to do this without loading each line into memory? I'm mostly familiar with PHP, but open to Python, Java or Shell scripts if they are not too complicated to understand.

Comments (5)

半山落雨半山空 2024-12-09 03:53:06

I'd go with the command line sort(1) to merge and sort the files. After that, it should be a simple script to compute the sums. I don't know PHP, so I'll give my example in Python:

sort -k2 <file1> <file2> | python3 -c "
import itertools, sys
allLines = (x.split() for x in sys.stdin)
groups = itertools.groupby(allLines, lambda x: x[1])
for k, lines in groups:
    lines = list(lines)
    print(lines[0][0], k, sum(int(x[2]) for x in lines))
"
浅暮の光 2024-12-09 03:53:06

Ok, so if I'm reading this right, you'll have:

file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  <-- common column 2
abc 12 67  <---common column 2
abc 23 1   <-- unique column 2

output should be:

abc 12 101
abc 90 99

If that's the case, then something like this (assuming they're .csv-formatted):

$f1 = fopen('file1.txt', 'rb');
$f2 = fopen('file2.txt', 'rb');
$fout = fopen('output.txt', 'wb');

$data = array();
while (1) {
    if (feof($f1) || feof($f2)) {
        break; // quit if we hit the end of either file
    }

    $line1 = fgetcsv($f1);
    if (isset($data[$line1[1]])) {
       // saw the col2 value earlier, so do the math for the output file:
       $col3 = $line1[2] + $data[$line1[1]][2];
       $output = array($line1[0], $line1[1], $col3);
       fputcsv($fout, $output);
       unset($data[$line1[1]]);
    } else {
       $data[$line1[1]] = $line1; // cache the line, if the col2 value wasn't seen already
    }

    $line2 = fgetcsv($f2);
    if (isset($data[$line2[1]])) {
       $col3 = $data[$line2[1]][2] + $line2[2];
       $newdata = array($line2[0], $line2[1], $col3);
       fputcsv($fout, $newdata);
       unset($data[$line2[1]]); // remove line from cache
    } else {
       $data[$line2[1]] = $line2;
    }
}

fclose($f1);
fclose($f2);
fclose($fout);

This is going off the top of my head, not tested, probably won't work, YMMV, etc...

It'd simplify things immensely if you pre-sort the two input files, so that column2 is used as the sort key. That'd keep the cache size down, as you'd know if you'd seen a matched value already and when to dump the earlier cached data.
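
For what it's worth, here's a rough sketch of that pre-sorted approach in PHP. It's untested; the file names (file1.sorted, file2.sorted, merged.txt) are placeholders, and readRow() is just a small helper for this sketch. With both inputs sorted by column 2 you only ever hold one line from each file in memory, and unmatched lines are copied through as-is (drop those two branches if you only want matched lines).

function readRow($fh) {
    // Returns array(col1, col2, col3) for the next line, or null at end of file.
    $line = fgets($fh);
    if ($line === false || trim($line) === '') {
        return null;
    }
    return preg_split('/\s+/', trim($line));
}

$f1   = fopen('file1.sorted', 'rb');
$f2   = fopen('file2.sorted', 'rb');
$fout = fopen('merged.txt', 'wb');

$r1 = readRow($f1);
$r2 = readRow($f2);
while ($r1 !== null || $r2 !== null) {
    if ($r2 === null || ($r1 !== null && strcmp($r1[1], $r2[1]) < 0)) {
        fwrite($fout, implode(' ', $r1) . "\n");   // only in file1
        $r1 = readRow($f1);
    } elseif ($r1 === null || strcmp($r2[1], $r1[1]) < 0) {
        fwrite($fout, implode(' ', $r2) . "\n");   // only in file2
        $r2 = readRow($f2);
    } else {
        // same column-2 value in both files: sum column 3
        fwrite($fout, $r1[0] . ' ' . $r1[1] . ' ' . ($r1[2] + $r2[2]) . "\n");
        $r1 = readRow($f1);
        $r2 = readRow($f2);
    }
}

fclose($f1);
fclose($f2);
fclose($fout);

This keeps memory constant no matter how large the files get: sort(1) does the heavy lifting on disk (e.g. sort -k2,2 file1 > file1.sorted), and the merge itself is a single sequential pass holding one line from each file.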

一杆小烟枪 2024-12-09 03:53:06

What may be throwing you off is that you are looking at two files. There's no need for that. To use Mark's excellent example:
file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  
abc 12 67  
abc 23 1  

then

sort file1 file2 > file3

yields
file3:

abc 12 34
abc 12 67  
abc 23 1
abc 56 78
abc 90 12
abc 90 87  

Second week of CS-101 to reduce that down to its final form.
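
In case that last step isn't obvious, here's one possible way to finish it in PHP (a sketch, not tested against real data; file3 and final.txt are placeholder names): walk the sorted file once, keeping only the previous line, and sum column 3 whenever column 2 repeats.

$in   = fopen('file3', 'rb');
$out  = fopen('final.txt', 'wb');
$prev = null; // array(col1, col2, running sum of col3)

while (($line = fgets($in)) !== false) {
    if (trim($line) === '') {
        continue; // skip blank lines
    }
    list($c1, $c2, $c3) = preg_split('/\s+/', trim($line));
    if ($prev !== null && $prev[1] === $c2) {
        $prev[2] += (int)$c3; // same column 2 as the previous line: accumulate
    } else {
        if ($prev !== null) {
            fwrite($out, implode(' ', $prev) . "\n");
        }
        $prev = array($c1, $c2, (int)$c3);
    }
}
if ($prev !== null) {
    fwrite($out, implode(' ', $prev) . "\n"); // flush the last group
}

fclose($in);
fclose($out);

Note this relies on identical column-2 values ending up on adjacent lines. Since column 1 never changes, a plain sort already does that, but sort -k2,2 file1 file2 > file3 makes the intent explicit.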

又怨 2024-12-09 03:53:06

You can solve it easily with Python's included sqlite3 module without using much memory (around 13 MB with 1 million rows):

import sqlite3

files = ("f1.txt", "f2.txt")    # Files to compare

# # Create test data
# for file_ in files:
#   f = open(file_, "w")
#   fld2 = 0
#   for fld1 in "abc def ghi jkl".split():
#       for fld3 in range(1000000 // 4):
#           fld2 += 1
#           f.write("%s %s %s\n" % (fld1, fld2, 1))
# 
#   f.close()

sqlite_file = "./join.tmp"      # or :memory: if you don't want to create a file

cnx = sqlite3.connect(sqlite_file)

for file_ in range(len(files)):     # Create & load tables
    table = "file%d" % (file_+1)
    cnx.execute("drop table if exists %s" % table)
    cnx.execute("create table %s (fld1 text, fld2 int primary key, fld3 int)" % table)

    for line in open(files[file_], "r"):
        cnx.execute("insert into %s values (?,?,?)" % table, line.split())

# Join & result
cur = cnx.execute("select f1.fld1, f1.fld2, (f1.fld3+f2.fld3) from file1 f1 join file2 f2 on f1.fld2==f2.fld2")
while True:
    row = cur.fetchone()
    if not row:
        break

    print(row[0], row[1], row[2])

cnx.close()
沐歌 2024-12-09 03:53:06

PHP's memory_limit is appropriate for its primary intended task of web server scripting. It is wildly inappropriate for batch processing data, like the job you're trying to do. The problem is PHP's configured memory_limit, not that you're trying to do something that needs "too much" memory. My phone has easily enough memory to load two 80 MB files into memory and do this the fast/easy way, let alone any sort of real computer, which should be able to load gigabytes (or at least 1 GB) of data without breaking a sweat.

Apparently you can set PHP's memory_limit (which is arbitrary and very small by today's standards) at runtime with ini_set, just for this script. Do you know how much memory you actually have available on the server? I know a lot of shared hosting providers do give you very small amounts of memory by today's standards, because they don't expect you to be doing much more than processing web page requests. But you probably can just do this directly in PHP the way you want to, without jumping through hoops (and massively slowing down the process) to try to avoid loading all of the files into memory at once.
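
For instance, something along these lines might be enough, assuming the server genuinely has a few hundred MB free (the 512M value and the file names are placeholders, not anything from the original question):

ini_set('memory_limit', '512M'); // raise the limit just for this script

$data = array();
foreach (file('file1.txt') as $line) {              // read file1 entirely
    if (trim($line) === '') continue;
    list($c1, $c2, $c3) = preg_split('/\s+/', trim($line));
    $data[$c2] = array($c1, (int)$c3);              // key the rows by column 2
}
foreach (file('file2.txt') as $line) {              // merge file2 into it
    if (trim($line) === '') continue;
    list($c1, $c2, $c3) = preg_split('/\s+/', trim($line));
    if (isset($data[$c2])) {
        $data[$c2][1] += (int)$c3;                  // column 2 seen before: add column 3
    } else {
        $data[$c2] = array($c1, (int)$c3);
    }
}

$fout = fopen('output.txt', 'wb');
foreach ($data as $c2 => $row) {
    fwrite($fout, $row[0] . ' ' . $c2 . ' ' . $row[1] . "\n");
}
fclose($fout);

Whether this is acceptable depends entirely on how much memory the host actually allows; if it isn't enough, the sort-and-merge approaches above stay within a constant footprint.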
