Faster way to find duplicates conditioned by time

On an AIX machine without Perl, I need to filter records: records are considered duplicates if they have the same id and were registered within four hours of each other.

I implemented this filter using AWK and it works pretty well, but I need a much faster solution:

# Generar lista de Duplicados
awk 'BEGIN {
    FS=","
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
# same id ($8) seen again within four hours (14400 s) => duplicate
/OK/ && x[$8]>1 && f[$8]-old[$8] < 4*3600 {
    print $0
}'

Any suggestions? Are there ways to improve the environment (preloading the file or something like that)?

The input file is already sorted.

With the corrections suggested by jj33 I made a new version with better handling of dates, while still keeping it simple enough to incorporate more operations:

awk 'BEGIN {
    FS=",";
    SECSPERMINUTE=60; SECSPERHOUR=3600; SECSPERDAY=86400;
    split("0 31 59 90 120 151 181 212 243 273 304 334", DAYSTOMONTH, " ");
    split("0 366 731 1096 1461 1827 2192 2557 2922 3288 3653 4018 4383 4749 5114 5479 5844 6210 6575 6940 7305", DAYSTOYEAR, " ");
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
/OK/ && x[$8]>1 && f[$8]-old[$8] < 4*SECSPERHOUR {
    print $0
}
# mktime(year, month, day, hour, minute, second): convert to a second count
# using the cumulative day tables, adding a leap day after February in leap years
function mktime(y, m, d, hh, mm, ss) {
    d2m = DAYSTOMONTH[m];
    if ( (m > 2) && ( ((y % 4 == 0) && (y % 100 != 0)) || (y % 400 == 0) ) ) {
        d2m = d2m + 1;
    }
    d2y = DAYSTOYEAR[ y - 1999 ];
    return ss + (mm*SECSPERMINUTE) + (hh*SECSPERHOUR) + (d*SECSPERDAY) + (d2m*SECSPERDAY) + (d2y*SECSPERDAY);
}'
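A quick way to check jj33's boundary case against the revised script: assume the awk program body above is saved to a file (dupfilter.awk here is just an illustrative name) and that the field layout is status, day, month, year, hour, minute, second, id, as the mktime($4, $3, $2, $5, $6, $7) call implies. Two records with the same id, 24 hours apart, should no longer be reported:

# With the corrected date math the two timestamps differ by 86400 s,
# which is outside the four-hour window, so nothing is printed.
printf 'OK,31,5,2008,12,0,0,ID1\nOK,1,6,2008,12,0,0,ID1\n' | awk -f dupfilter.awk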

Comments (6)

最近可好 2024-07-11 03:05:42

How is the input file sorted? Like, cat file|sort, or sorted via a single specific field, or multiple fields? If multiple fields, what fields and what order? It appears the hour fields are a 24 hour clock, not 12, right? Are all the date/time fields zero-padded (would 9am be "9" or "09"?)

Without taking performance into account, it looks like your code has problems with month boundaries since it assumes all months are 30 days long. Take the two dates 2008-05-31/12:00:00 and 2008-06-01/12:00:00. Those are 24 hours apart, but your code produces the same time code for both (63339969600).
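For reference, the 30-day behaviour is easy to reproduce. The original mktime() body isn't shown in the question, but a conversion that treats every month as 30 days and every year as 365 days yields exactly the 63339969600 value quoted above for both dates (naive_mktime below is a hypothetical stand-in, not the poster's actual function):

awk 'function naive_mktime(y, mo, d, hh, mi, ss) {
         # fixed 30-day months and 365-day years, no real calendar handling
         return ss + mi*60 + hh*3600 + d*86400 + mo*30*86400 + y*365*86400
     }
     BEGIN {
         print naive_mktime(2008, 5, 31, 12, 0, 0)   # 63339969600
         print naive_mktime(2008, 6,  1, 12, 0, 0)   # 63339969600 again, although 24 h later
     }'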

您的好友蓝忘机已上羡 2024-07-11 03:05:42

I think you would need to consider leap years. I didn't do the math, but I think during a leap year, with 28 days hard-coded for February, a comparison of noon on 2/29 and noon on 3/1 would result in the same duplicate time stamp as before. Although it looks like you didn't implement it like that. The way you implemented it, I think you still have the problem, but it's between dates on 12/31 of $leapyear and 1/1 of $leapyear+1.

I think you might also have some collisions during time changes (DST) if your code has to handle time zones that observe them.

The file doesn't really seem to be sorted in any useful way. I'm guessing that field $1 is some sort of status (the "OK" you're checking for). So it's sorted by record status, then by DAY, then MONTH, YEAR, HOURS, MINUTES, SECONDS. If it was year,month,day I think there could be some optimizations there. Still might be but my brain's going in a different direction right now.

If there are a small number of duplicate keys in proportion to total number of lines, I think your best bet is to reduce the file your awk script works over to just duplicate keys (as David said). You could also preprocess the file so the only lines present are the /OK/ lines. I think I would do this with a pipeline where the first awk script only prints the lines with duplicate IDs and the second awk script is basically the one above but optimized to not look for /OK/ and with the knowledge that any key present is a duplicate key.
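A sketch of that pipeline, under the same assumptions as the question (comma-separated fields, the id in $8, "OK" somewhere in the record); input.csv is an illustrative name and check_window.awk stands for the poster's script stripped of the /OK/ and x[$8]>1 tests:

# The file is read twice: the first pass counts ids on OK lines, the second
# pass prints only the OK lines whose id occurs more than once.  The second
# awk script then only has to do the four-hour comparison.
awk -F, 'NR==FNR { if (/OK/) seen[$8]++; next }
         /OK/ && seen[$8] > 1' input.csv input.csv |
awk -F, -f check_window.awk > duplicados.txt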

If you know ahead of time that all or most lines will have repeated keys, it's probably not worth messing with the pre-filter. I'd bite the bullet and write it in C: many more lines of code, but much faster than the awk script.

多彩岁月 2024-07-11 03:05:42

On many unixen, you can get sort to sort by a particular column, or field. So by sorting the file by the ID, and then by the date, you no longer need to keep the associative array of when you last saw each ID at all. All the context is there in the order of the file.

On my Mac, which has GNU sort, it's:

sort -t, -k 8 < input.txt > output.txt

to sort on the ID field. You can sort on further fields too by giving sort additional -k keys, but juggling several date fields gets clumsy. So a unix-style time_t timestamp might not be a bad idea in the file - it's easy to sort, and saves you all those date calculations. Also (at least in GNU awk), there is a mktime function that builds the time_t for you from the components.
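For the comma-separated layout in the question (day, month, year, hour, minute, second in $2..$7, the id in $8), that could look roughly like the sketch below; the file names are illustrative, and the second half only applies if GNU awk happens to be available on the box:

# Group by id, then order each group chronologically; -t, makes sort use the
# comma-separated fields and the n modifier copes with non-zero-padded numbers.
sort -t, -k8,8 -k4,4n -k3,3n -k2,2n -k5,5n -k6,6n -k7,7n input.txt > sorted.txt

# GNU awk's built-in mktime("YYYY MM DD HH MM SS") returns a time_t, so only
# the previous record of each id group needs to be remembered.
gawk -F, '/OK/ {
    t = mktime($4 " " $3 " " $2 " " $5 " " $6 " " $7)
    if ($8 == prev_id && t - prev_t < 4*3600) print $0
    prev_id = $8; prev_t = t
}' sorted.txt > duplicados.txt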

梦萦几度 2024-07-11 03:05:42

@AnotherHowie, I thought the whole preprocessing could be done with sort and uniq. The problem is that the OP's data seems to be comma-delimited and (Solaris 8's) uniq doesn't give you any way to specify the field separator, so there wasn't a super clean way to do the preprocessing using standard unix tools. I don't think it would be any faster, so I'm not going to look up the exact options, but you could do something like:

cut -d, -f8 <infile.txt | sort | uniq -d | xargs -i grep {} infile.txt >outfile.txt

That's not very good because it executes grep once for every duplicated key. You could probably massage the uniq output into a single regexp to feed to grep, but the benefit would only be known if the OP posts the expected ratio of lines containing suspected duplicate keys to total lines in the file.
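One way to avoid running grep once per key while sticking to standard tools (file names are illustrative): collect the duplicated ids into a small file and let a single awk pass keep only the matching rows.

cut -d, -f8 infile.txt | sort | uniq -d > dupkeys.txt
# First read the key list, then keep only the rows of infile.txt whose $8 is in it.
awk -F, 'NR==FNR { dup[$1]; next } $8 in dup' dupkeys.txt infile.txt > outfile.txt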

很酷不放纵 2024-07-11 03:05:42

If your data file contains all your records (i.e. it includes records that do not have duplicate ids within the file), you could pre-process it and produce a file that only contains the records with duplicated ids.

If that is the case, it would reduce the size of the file you need to process with your AWK program.

傲性难收 2024-07-11 03:05:42

This sounds like a job for an actual database. Even something like SQLite could probably help you reasonably well here. The big problem I see is your definition of "within 4 hours". That's a sliding window problem, which means you can't simply quantize all the data into 4-hour segments... you have to compare each element against its "nearby" elements separately. Ugh.
