Undefined separator: parsing a messy log with sed

Posted 2024-09-06 15:37:39, 290 characters, 5 views, 0 comments

I'm working on huge text files (from 100 MB to 1 GB) and have to parse them to extract some particular data. The annoying thing is that the files do not have a clearly defined separator.

For example:

"element" 123124 16758 "12.4" "element" "element with white spaces inside" "element"

I have to delete the white spaces in strings delimited by " (double quotes). The problem is that I must not erase the white spaces outside the quotes, otherwise some numbers would merge.
I can't find a decent sed solution. Can someone help me with this?

青萝楚歌 2024-09-13 15:37:39

Use awk, not sed. There's certainly no need to write your own C program: awk is itself an excellent C program for file processing, even on gigabyte-sized files. Here is a one-liner that does the job.

$ more file
"element" 123124 16758 "12.4" "element" "element with white spaces inside" "element"

$ awk -F'"' '{for(i=2;i<=NF;i+=2) {gsub(/ +/,"",$i)}}1' OFS='"' file
"element" 123124 16758 "12.4" "element" "elementwithwhitespacesinside" "element"

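For anyone who wants to see what the one-liner is doing: it splits each line on the `"` character, so the even-numbered awk fields (2, 4, ...) are exactly the pieces that sat between a pair of quotes. Below is a minimal Python sketch of that same idea, not a replacement for the awk command. The function name `strip_spaces_in_quotes` is made up for illustration, and like the awk version it assumes quotes are balanced and never escaped.

```python
def strip_spaces_in_quotes(line):
    """Remove spaces from the quoted parts of a line, leaving the rest alone."""
    fields = line.split('"')
    # After splitting on the quote character, the fields at odd indexes
    # (1, 3, ...) are the ones that were enclosed in quotes. awk counts
    # fields from 1, so these correspond to its fields 2, 4, ...
    for i in range(1, len(fields), 2):
        fields[i] = fields[i].replace(" ", "")
    # Re-join with the quote character, like OFS='"' in the awk command.
    return '"'.join(fields)


if __name__ == "__main__":
    sample = '"element" 123124 16758 "12.4" "element with white spaces inside"'
    print(strip_spaces_in_quotes(sample))
```

One small difference: awk's `/ +/` pattern and `str.replace(" ", "")` both remove plain spaces only, so tabs inside quotes are left alone in either version.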
尐籹人 2024-09-13 15:37:39

I can't come up with a sed solution; you might be better off just writing a small program to do this.

#include <iostream>
#include <string>
using namespace std;

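int main() {
    string line;
    // Read one line at a time so memory use does not grow with file size.
    while(getline(cin,line)) {
        bool inquot = false;  // quote state is reset at the start of each line
        for(string::iterator i = line.begin(); i != line.end(); i++) {
            char c = *i;
            // Each double quote toggles the inside/outside state.
            if (c == '"') inquot = !inquot;

            // Print everything except spaces that fall inside quotes.
            if (c != ' ' || !inquot) cout << c;
        }
        // '\n' instead of endl avoids flushing the stream on every line.
        cout << '\n';
    }
    return 0;
}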

Then run:

./a.out < test.log > new.out

DISCLAIMER

This will break completely if you have escaped quotes within a line, or quoted content that spans multiple lines.

For instance,
"The word \"word\" is weird"
and things to that effect will cause problems.
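If escaped quotes do turn up in the data, the state machine needs one extra flag. Here is a rough Python sketch of that, under the assumption (mine, not something the question states) that the log uses backslash escaping, so that `\"` is a literal quote and does not open or close a string. The function name is hypothetical, and quoted content spanning several lines is still not handled.

```python
def strip_spaces_in_quotes_escaped(line):
    """Remove spaces inside double quotes, treating a backslash-escaped
    quote as an ordinary character (assumed escaping convention)."""
    out = []
    in_quote = False
    escaped = False  # True when the previous character was an active backslash
    for ch in line:
        if escaped:
            # The character right after a backslash never toggles the state.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_quote = not in_quote
        # Skip spaces only while inside a quoted string.
        if ch == " " and in_quote:
            continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(strip_spaces_in_quotes_escaped(r'"The word \"word\" is weird" 1 2'))
```

The backslashes themselves are kept in the output, so the result can still be parsed by whatever reads the file next.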

梦罢 2024-09-13 15:37:39

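Like Jamie, I don't think sed is the right tool for this job, though it could be that my sed skills are just not good enough. Here is a solution that is essentially the same as Jamie's, but in Python: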

#!/usr/bin/env python

# Script to delete spaces within the double quotes, but not outside.

QUOTE = '"'
SPACE = ' '

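# Iterate over the file line by line so large inputs are never loaded whole.
with open('data', 'r') as infile:
    for line in infile:
        line = line.rstrip('\r\n')
        newline = ''
        inside_quote = False
        for char in line:
            # Every double quote flips the inside/outside state.
            if char == QUOTE:
                inside_quote = not inside_quote
            # Drop a space only while inside a quoted string.
            if not (char == SPACE and inside_quote):
                newline += char
        print(newline)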

Save this script to a file, say rmspaces.py. You can then invoke the script from the command line:

python rmspaces.py

Note that the script assumes the data is in a file named data. You can modify the script to taste.
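As one possible way of modifying it, here is a sketch that takes the input file name from the command line instead of hard-coding it. The command-line interface and the helper names `remove_quoted_spaces` and `process` are my own choices for illustration. The logic is the same as the script above, and it still assumes quotes are balanced on each line.

```python
import sys


def remove_quoted_spaces(line):
    """Return the line with spaces deleted inside double quotes only."""
    kept = []
    inside_quote = False
    for char in line:
        if char == '"':
            inside_quote = not inside_quote
        if not (char == " " and inside_quote):
            kept.append(char)
    return "".join(kept)


def process(source, sink):
    """Stream lines from source to sink, one at a time, so that memory
    use does not depend on the size of the input."""
    for line in source:
        sink.write(remove_quoted_spaces(line.rstrip("\r\n")) + "\n")


if __name__ == "__main__":
    # Hypothetical interface: exactly one argument, the input file path.
    if len(sys.argv) != 2:
        print("usage: rmspaces.py FILE", file=sys.stderr)
    else:
        with open(sys.argv[1], "r") as source:
            process(source, sys.stdout)
```

With this version you would run something like `python rmspaces.py data > new.out`. Collecting characters in a list and joining once is also a bit kinder to very long lines than repeated string concatenation.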
