How to replace pairs of strings in two files with the same IDs?
[Update2] As it often happens, the scope of the task expanded quite a bit as I understood it better. The obsolete parts are crossed out, and you can find the updated explanation below. [/Update2]
I have a pair of rather large log files with very similar content, except that some strings are different between the two. A couple of examples:
UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a
JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e
That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. [Update] There are about 40 distinct pairs of such identifiers. [/Update]
UnifiedClassLoader3@19518cc | UnifiedClassLoader3@d0357a
JBossRMIClassLoader@13c2d7f | JBossRMIClassLoader@191777e
Logi18n@177060f | Logi18n@12ef4c6
LogFactory$1@15e3dc4 | LogFactory$1@2942da
That is, wherever the first file contains UnifiedClassLoader3@19518cc, the second contains UnifiedClassLoader3@d0357a, and so on. Note that all these strings are inside long lines of text, and they appear in many rows, intermixed with each other. There are about 4000 distinct pairs of such identifiers, and the size of each file is about 34 MB. So performance became an issue as well.
I want to replace these with identical IDs so that I can spot the really important differences between the two files. I.e. I want to replace all occurrences of UnifiedClassLoader3@19518cc in file1 and of UnifiedClassLoader3@d0357a in file2 with UnifiedClassLoader3@1; all occurrences of Logi18n@177060f in file1 and of Logi18n@12ef4c6 in file2 with Logi18n@2; etc. The counters 1 and 2 are arbitrary choices - the only requirement is that there is a one-to-one mapping between the old and new strings (i.e. the same string is always replaced by the same value, and no two different strings are replaced by the same value).
Using the Cygwin shell, so far I managed to list all different identifiers occurring in one of the files with
grep -o -e 'ClassLoader[0-9]*@[0-9a-f][0-9a-f]*' file1.log | sort | uniq
grep -o -e '[A-Z][A-Za-z0-9]*\(\$[0-9][0-9]*\)*@[0-9a-f][0-9a-f]*' file1.log | sort | uniq
However, now the original order is lost, so I don't know which ID in the other file is the pair of which. With grep -n I can get the line number, so the sort would preserve the order of appearance, but then I can't weed out the duplicate occurrences. Unfortunately grep cannot print only the first match of a pattern.
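One common workaround (an awk idiom, not from the original post) is to let awk drop repeats while keeping the input order, so grep's output stays in order of first appearance; the file name ids1.txt is an arbitrary choice:

```shell
# List identifiers in order of first appearance: awk's seen[] array
# prints a line only the first time it occurs, without sorting.
grep -o -e '[A-Z][A-Za-z0-9]*\(\$[0-9][0-9]*\)*@[0-9a-f][0-9a-f]*' file1.log \
  | awk '!seen[$0]++' > ids1.txt
```

Running the same command on file2.log should then yield the paired identifiers on the same line numbers of the output, provided the two files really mention them in the same order.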
I figured I could save the list of identifiers produced by the above command into a file, then iterate over the patterns in the file with grep -n | head -n 1, concatenate the results and sort them again. The result would be something like
2 ClassLoader3@19518cc
137 ClassLoader@13c2d7f
563 ClassLoader3@1267649
...
Then I could (using sed itself) massage this into a sed command like
sed -e 's/ClassLoader3@19518cc/ClassLoader3@2/g' \
    -e 's/ClassLoader@13c2d7f/ClassLoader@137/g' \
    -e 's/ClassLoader3@1267649/ClassLoader3@563/g' \
    file1.log > file1_processed.log
and similarly for file2.
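The massaging step can be sketched with awk instead of sed; this assumes the numbered list is saved as ids_numbered.txt (file names are placeholders), and writing the substitutions into one script file avoids passing thousands of -e options:

```shell
# Turn the "lineno identifier" list into a sed script: each identifier
# maps to its prefix plus the line number of its first appearance.
awk '{ split($2, p, "@"); print "s/" $2 "/" p[1] "@" $1 "/g" }' ids_numbered.txt > replace.sed
# Apply all substitutions in a single sed pass.
sed -f replace.sed file1.log > file1_processed.log
```

With 4000 pairs, a single sed -f pass should also be noticeably faster than invoking sed with 4000 separate expressions on the command line.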
However, before I start, I would like to verify that my plan is the simplest possible working solution to this.
Is there any flaw in this approach? Is there a simpler way?
1 Answer
I think this does the trick, or at least comes close. Let me know if you have questions on how it works and why.
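The first-appearance renumbering at the heart of this approach can be sketched as a single awk pass (a sketch using the identifier pattern from the question, not necessarily the exact script used here):

```shell
# Rewrite every identifier to its prefix plus a counter assigned in
# order of first appearance, in one pass over the file.
awk '{
  out = ""
  while (match($0, /[A-Z][A-Za-z0-9]*([$][0-9]+)*@[0-9a-f]+/)) {
    id = substr($0, RSTART, RLENGTH)
    if (!(id in map)) {            # first sighting: assign next counter
      split(id, p, "@")
      map[id] = p[1] "@" (++n)
    }
    out = out substr($0, 1, RSTART - 1) map[id]
    $0 = substr($0, RSTART + RLENGTH)
  }
  print out $0
}' file1.log > file1_processed.log
```

Run the same command on file2.log; as long as the identifiers first appear in the same order in both files, the paired identifiers receive the same counter.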
Here's my test data and the output:
file1.log:
file2.log (Similar patterns except the "C" set repeats the "A" set)
And after processing you get file1_processed.log and file2_processed.log.