Uncertain separators: parsing a messy log with sed
I'm working on huge text files (from 100 MB to 1 GB), and I have to parse them to extract some particular data. The annoying thing is that the files do not have a clearly defined separator.
For example:
"element" 123124 16758 "12.4" "element" "element with white spaces inside" "element"
I have to delete the white space inside strings delimited by " (double quotes). The problem is that I must not remove the white space outside the quotes, otherwise some numbers would merge.
I can't find a decent sed solution. Can someone help me with this?
Comments (3)
Use awk, not sed. There is certainly no need to write your own C program, because awk is already an excellent C program for file processing, even on gigabyte-sized files. So here's a one-liner to do the job.
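As an illustration of the field-splitting idea that makes this kind of job easy (this is a sketch, not the one-liner from this answer): if the double quote is treated as the separator, the pieces alternate between text outside quotes and text inside quotes, so spaces only need to be removed from every second piece. A minimal Python version, assuming quotes are balanced and never escaped; the names `strip_quoted_spaces` and `sample` are made up for the example:

```python
def strip_quoted_spaces(line: str) -> str:
    """Remove spaces inside double-quoted strings, leave the rest alone."""
    # After splitting on '"', indexes 0, 2, 4, ... hold text outside quotes
    # and indexes 1, 3, 5, ... hold text inside quotes.
    parts = line.split('"')
    for i in range(1, len(parts), 2):
        parts[i] = parts[i].replace(" ", "")
    return '"'.join(parts)


# Sample line taken from the question.
sample = ('"element" 123124 16758 "12.4" "element" '
          '"element with white spaces inside" "element"')
print(strip_quoted_spaces(sample))
# "element" 123124 16758 "12.4" "element" "elementwithwhitespacesinside" "element"
```

The numbers outside the quotes keep their separating spaces because those pieces are never touched.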
I can't come up with a sed solution; however, you might be better off just writing a small application to do this.
Then run:
./a.out < test.log > new.out
DISCLAIMER
This will completely choke if you have escaped quotes on a line, or quoted content that spans multiple lines.
For instance,
"The word \"word\" is weird"
and things to that effect will cause problems.
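To make the disclaimer concrete, here is a minimal Python sketch of a "small application" of this kind. It is not the program from this answer; it assumes a simple approach where every double quote toggles an inside/outside flag and the flag is reset on each line. The function name `strip_spaces_in_quotes` is hypothetical.

```python
import sys


def strip_spaces_in_quotes(line: str) -> str:
    """Drop spaces while inside double quotes; copy everything else."""
    out = []
    in_quotes = False  # reset for every line, so multiline quotes break it
    for ch in line:
        if ch == '"':
            # Every quote toggles the flag, including escaped ones (\"),
            # which is exactly the weakness the disclaimer describes.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == " " and in_quotes:
            continue  # skip spaces inside a quoted string
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Stream line by line so memory use stays flat on very large files.
    for raw in sys.stdin:
        sys.stdout.write(strip_spaces_in_quotes(raw))
```

Saved under a made-up name such as strip_quoted.py, it would be run the same way: `python strip_quoted.py < test.log > new.out`. With an escaped quote the flag flips at the wrong place: the input `"a \"b c\" d" 1 2` comes out as `"a\"b c\"d" 1 2`, with the space between `b` and `c` left in place.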
Like Jamie, I don't think sed is the right tool for the job, although it could be that my sed skills are simply not good enough. Here is a solution that is essentially the same as Jamie's, but in Python:
Save this script to a file, say rmspaces.py. You can then invoke the script from the command line:
Note that the script assumes the data is in a file called data. You can modify the script to taste.
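For reference, a script along these lines could look like the sketch below. It is an assumption about the general shape, not the script from this answer: it uses a regular expression to find each quoted string, and it takes the file name as a command-line argument (or reads standard input) instead of hardcoding `data`. It only removes plain spaces and does not handle escaped quotes.

```python
import fileinput
import re
import sys

# A double quote, any run of non-quote characters, then a closing quote.
QUOTED = re.compile(r'"[^"]*"')


def remove_spaces_in_quotes(line: str) -> str:
    """Remove spaces from every double-quoted string in the line."""
    return QUOTED.sub(lambda m: m.group(0).replace(" ", ""), line)


def main() -> None:
    # fileinput reads the files named on the command line, or stdin if
    # none are given, one line at a time, so large files are not loaded
    # into memory at once.
    for line in fileinput.input():
        sys.stdout.write(remove_spaces_in_quotes(line))


if __name__ == "__main__":
    main()
```

With the file names used above, this would be invoked as `python rmspaces.py data > new.out`.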