Efficient splitting of elements in a field

Posted on 2024-08-28 12:51:29

I have a field in a text file exported from a database. The field contains addresses, but sometimes they are quite long and the database allows them to contain multiple lines. When exported, the newline character gets replaced with a dollar sign, like this:

first part of very long address$second part of very long address$third part of very long address

Not every address has multiple lines and no address contains more than three lines. The length of each line is variable.

I'm massaging the data for import into MS Access, which is used for a mail merge. I want to split the field on the $ sign if it's there, but if the field contains only one line, I want to set my two extra output fields to zero-length strings so that I don't wind up with blank lines in the address when it gets printed.

I have an awk file that works correctly on all the other data in the text file, but I need to get this last bit working. I tried the code below. Aside from the fact that I get a syntax error at the else, I'm not sure this is a good way to do what I want. This is being done with gawk on Windows.

BEGIN { FS = "|" }
$1 != "HEADER" {
    if ($6 ~ /\$/)
        split($6, arr, "$")
        address = arr[1]
        addresstwo = arr[2]
        addressthree = arr[3]
        addressLength = length(address)
        addressTwoLength = length(addresstwo)
        addressThreeLength = length(addressthree)

    else {
        address = $6
        addressLength = length($6)
        addresstwo = ""
        addressTwoLength = length(addresstwo)
    addressthree = ""
        addressThreeLength = length(addressthree)
        }

    printf("%*s\t%*s\t\%*s\n",
          addressLength, address, addressTwoLength, addresstwo, addressThreeLength, addressthree)
}

EDIT:
Sorry about that. Here's a sample

HEADER|0000000130|0000527350|0000171250|0000058000|0000756600|0000814753|0000819455|100106
rec1|ILL/COLORADO COLLEGE$TUTT LIBRARY|1021 N CASCADE$COLORADO SPRINGS, CO 80903|
rec2|ILL /PIKES PEAK LIBRARY DISTRICT|20 N. CASCADE AVE. / PO BOX 1579$COLORADO SPRINGS, CO 80903|
rec3|DOE,JOHN|PO Box 8034|
rec4|ILL/GEORGIA INSTITUTE OF TECHNOLOGY|INFORMATION DELIVERY DEPT$704 CHERRY ST$ATLANTA, GA 30332-0900

I match only lines without HEADER in them. I need to split the text strings on the $ signs. The strings between the pipes should not be padded (which is why I was trying to get the length in my original code). For this example, there are 6 output fields, and any field for which there is no data is simply an empty string (also what I was trying to do in the code). The desired output:

rec1|ILL/COLORADO COLLEGE|TUTT LIBRARY|1021 N CASCADE|COLORADO SPRINGS, CO 80903||
rec2|ILL /PIKES PEAK LIBRARY DISTRICT||20 N. CASCADE AVE. / PO BOX 1579|COLORADO SPRINGS, CO 80903||
rec3|DOE,JOHN||PO Box 8034|||
rec4|ILL/GEORGIA INSTITUTE OF TECHNOLOGY||INFORMATION DELIVERY DEPT|704 CHERRY ST|ATLANTA, GA 30332-0900|

Hope that helps! Let me know if this still isn't clear.

Comments (1)

桃气十足 2024-09-04 12:51:29
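BEGIN { FS = "|" }
$1 != "HEADER" {
    # gsub() returns the number of "$" signs it replaced with tabs
    for(i = gsub(/\$/, "\t", $6); i < 3; i++)
        $6 = $6 "\t"    # pad so that every record ends up with three tabs
    print $6
}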

I'm not really sure if I got your requirements right though.
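
One possible way to get the six-field output shown in the question, offered as a sketch rather than a definitive answer: awk's split() deletes every element of the target array before filling it, so parts that are missing simply read back as empty strings and the if/else is not needed. (The syntax error itself comes from the if having no braces: only the first statement after it belongs to the if, so the later else has nothing to attach to.) The sketch assumes the layout of the sample, with the name in $2 and the address in $3, not the $6 of the original script. The shell function name split_addresses is made up for illustration; on Windows the awk program between the quotes could instead be saved to a file and run with gawk -f.

```sh
# Hypothetical wrapper: reads the export from the named files or from stdin.
split_addresses() {
    awk '
    BEGIN { FS = "|"; OFS = "|" }
    $1 != "HEADER" {
        # split() empties its target array first, so any part that is
        # absent reads back as "" and no if/else is required.
        split($2, name, /\$/)   # assumed: field 2 holds up to two name lines
        split($3, addr, /\$/)   # assumed: field 3 holds up to three address lines
        # The trailing "" makes print emit the closing pipe after the last field.
        print $1, name[1], name[2], addr[1], addr[2], addr[3], ""
    }' "$@"
}
```

Because OFS is set to "|" and the values are passed to print unchanged, nothing is padded, so the length() calls and the %*s widths from the original attempt are not needed.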
