A faster way to find duplicates conditioned by time
On an AIX machine without Perl, I need to filter records that are considered duplicates if they have the same id and were registered within a period of four hours.
I implemented this filter using AWK and it works pretty well, but I need a much faster solution:
# Generate the list of duplicates
awk 'BEGIN { FS="," }
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
/OK/ && x[$8] > 1 && f[$8] - old[$8] < 14400 { print $0 }'

Any suggestions? Are there ways to improve the environment (preloading the file or something like that)?
The input file is already sorted.
With the corrections suggested by jj33 I made a new version with better handling of dates, while still keeping it lightweight so that more operations can be incorporated:
awk 'BEGIN {
    FS = ",";
    SECSPERMINUTE = 60;
    SECSPERHOUR   = 3600;
    SECSPERDAY    = 86400;
    # cumulative days before each month (non-leap year)
    split("0 31 59 90 120 151 181 212 243 273 304 334", DAYSTOMONTH, " ");
    # cumulative days from 2000-01-01 to January 1st of year 1999 + index
    split("0 366 731 1096 1461 1827 2192 2557 2922 3288 3653 4018 4383 4749 5114 5479 5844 6210 6575 6940 7305", DAYSTOYEAR, " ");
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
/OK/ && x[$8] > 1 && f[$8] - old[$8] < 14400 { print $0 }
function mktime(y, m, d, hh, mm, ss) {
    d2m = DAYSTOMONTH[m];
    if ( (m > 2) && ( ((y % 4 == 0) && (y % 100 != 0)) || (y % 400 == 0) ) ) {
        d2m = d2m + 1;
    }
    d2y = DAYSTOYEAR[y - 1999];
    return ss + (mm * SECSPERMINUTE) + (hh * SECSPERHOUR) + (d * SECSPERDAY) + (d2m * SECSPERDAY) + (d2y * SECSPERDAY);
}'
Comments (6)
How is the input file sorted? Like, cat file|sort, or sorted via a single specific field, or multiple fields? If multiple fields, what fields and what order? It appears the hour fields are a 24 hour clock, not 12, right? Are all the date/time fields zero-padded (would 9am be "9" or "09"?)
Performance aside, it looks like your code has problems with month boundaries, since it assumes all months are 30 days long. Take the two dates 2008-05-31 12:00:00 and 2008-06-01 12:00:00: those are 24 hours apart, but your code produces the same time code for both (63339969600).
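A minimal sketch of the effect being described, assuming a fixed 30-day month (an illustration only, not the OP's actual mktime, which is not shown in the post):

    awk 'BEGIN {
        # day index = month * 30 + day of month
        print (5 * 30) + 31;   # 2008-05-31 -> 181
        print (6 * 30) + 1;    # 2008-06-01 -> 181
    }'

Both dates collapse to day 181, so any timestamp built on that day index comes out identical for the two records.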
I think you would need to consider leap years. I didn't do the math, but I think that during a leap year, with February hard-coded to 28 days, a comparison of noon on 2/29 and noon on 3/1 would result in the same duplicate timestamp as before. Although it looks like you didn't implement it like that. The way you implemented it, I think you still have the problem, but it's between dates on 12/31 of $leapyear and 1/1 of $leapyear+1.
I think you might also get some collisions around daylight-saving time changes if your code has to handle time zones that observe them.
The file doesn't really seem to be sorted in any useful way. I'm guessing that field $1 is some sort of status (the "OK" you're checking for). So it's sorted by record status, then by DAY, then MONTH, YEAR, HOURS, MINUTES, SECONDS. If it was year,month,day I think there could be some optimizations there. Still might be but my brain's going in a different direction right now.
If there are a small number of duplicate keys in proportion to the total number of lines, I think your best bet is to reduce the file your awk script works over to just duplicate keys (as David said). You could also preprocess the file so the only lines present are the /OK/ lines. I think I would do this with a pipeline where the first awk script only prints the lines with duplicate IDs and the second awk script is basically the one above, but optimized to not look for /OK/ and with the knowledge that any key present is a duplicate key.
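A rough sketch of that pre-filter, with a hypothetical file name and the ID assumed to be field 8: the same file is given to awk twice, so the first pass can count IDs and the second pass keeps only the /OK/ lines whose ID repeats. The result can then be fed to the existing script with its /OK/ test and x[] counter removed.

    # Pass 1 (NR == FNR): count how often each ID appears on /OK/ lines.
    # Pass 2: print only the /OK/ lines whose ID occurred more than once.
    awk -F, 'NR == FNR { if (/OK/) count[$8]++; next }
             /OK/ && count[$8] > 1' input.txt input.txt > dup_candidates.txt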
If you know ahead of time that all or most lines will have repeated keys, it's probably not worth messing with. I'd bite the bullet and write it in C. Tons more lines of code, much faster than the awk script.
On many unixen, you can get sort to sort by a particular column, or field. So by sorting the file by the ID, and then by the date, you no longer need to keep the associative array of when you last saw each ID at all. All the context is there in the order of the file.
On my Mac, which has GNU sort, it's a single sort invocation to sort on the ID field. You can sort on a second field too, by saying (e.g.) 8,3 instead, but only two fields. So a unix-style time_t timestamp might not be a bad idea in the file - it's easy to sort, and it saves you all those date calculations. Also (again, at least in GNU awk), there is a mktime function that builds the time_t for you from the components.
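A sketch of the kind of pipeline being described here, assuming GNU sort and gawk, day/month/year/hour/minute/second in fields 2-7, the ID in field 8, and hypothetical file names:

    # Prepend a time_t built by gawk's mktime ("YYYY MM DD HH MM SS"),
    # sort by ID and then by that timestamp, and compare each record only
    # with the previous one -- no per-ID arrays needed.
    gawk -F, -v OFS=, '/OK/ { print mktime($4" "$3" "$2" "$5" "$6" "$7), $0 }' input.txt |
    sort -t, -k9,9 -k1,1n |
    gawk -F, '{
        if ($9 == prev_id && $1 - prev_t < 14400) {   # same ID within 4 hours
            line = $0
            sub(/^[^,]*,/, "", line)                  # strip the helper time_t
            print line
        }
        prev_id = $9; prev_t = $1
    }'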
@AnotherHowie, I thought the whole preprocessing could be done with sort and uniq. The problem is that the OP's data seems to be comma delimited and (Solaris 8's) uniq doesn't give you any way to specify the record separator, so there wasn't a super clean way to do the preprocessing using standard unix tools. I don't think it would be any faster, so I'm not going to look up the exact options, but you could do something like:
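A rough guess at the shape of the pipeline meant here, with hypothetical file names and the ID assumed to be field 8:

    # List the IDs that occur more than once, then grep each one back out
    # of the original file -- one grep invocation per duplicate ID.
    cut -d, -f8 input.txt | sort | uniq -d |
    while read id; do
        grep "$id" input.txt    # plain substring match; may also hit other fields
    done > dup_candidates.txt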
That's not very good because it executes grep for every line containing a duplicate key. You could probably massage the uniq output into a single regexp to feed to grep, but the benefit would only be known if the OP posts the expected ratio of lines containing suspected duplicate keys to total lines in the file.
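One way to do that massaging, sketched with hypothetical file names: join the duplicated IDs into a single alternation pattern so grep scans the input only once.

    # Build "id1|id2|id3" from the duplicated IDs and hand it to grep -E.
    cut -d, -f8 input.txt | sort | uniq -d | paste -s -d'|' - > dup_ids.regex
    grep -E -f dup_ids.regex input.txt > dup_candidates.txt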
If your data file contains all your records (i.e. it includes records that do not have duplicate ids within the file), you could pre-process it and produce a file that only contains records with duplicate ids.
If this is the case, it would reduce the size of the file you need to process with your AWK program.
This sounds like a job for an actual database. Even something like SQLite could probably help you reasonably well here. The big problem I see is your definition of "within 4 hours". That's a sliding window problem, which means you can't simply quantize all the data to 4 hour segments... you have to compute all "nearby" elements for every other element separately. Ugh.