Implementing tr and sed functionality in awk

Posted on 2025-02-12 20:48:43

I need to process a text file, a big CSV, to correct its formatting. The CSV has a field that contains XML data formatted to be human-readable: broken up into multiple lines and indented with spaces. I need every record on one line, so I am using awk to join the lines, then tr to eliminate unwanted "\r" characters, and then sed to get rid of the extra spaces between XML tags.
(The first field is always 8 digits and the field separator is the pipe character: "|".)

The awk script (join4.awk) is:

BEGIN {
  # initialise "line" variable. Maybe unnecessary
  line=""
}

{
  # check if this line is the beginning of a new record
  # (regex literal with an escaped pipe: a bare | would mean alternation)
  if ( $0 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ ) {
    # if it is a new record, then print stuff already collected
    # then update line variable with $0
    print line
    line = $0
  } else {
    # if it is not, then just attach $0 to the line
    line = line $0
  }
}

END {
  # print out the last record kept in line variable
  if (line) print line
}

and the command line is:

cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/>  *</></g'   > corrected_data.csv

My question: is there an efficient way to implement the tr and sed functionality inside the awk script? This is not Linux, so I have no gawk, just plain old awk and nawk.

thanks,

--Trifo

东走西顾 2025-02-19 20:48:43

  tr -d "\r"

is just gsub(/\r/, "").

  sed 's/>  *</></g'

is just gsub(/>  *</, "><").
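The two gsub calls can be tried on a single line before touching the real script. This is only a sketch: the sample record and the wrapper name clean_record are made up for illustration, and it assumes an awk that provides gsub (nawk does).

```shell
# clean_record: hypothetical wrapper name, reads lines on stdin
clean_record() {
  awk '{
    gsub(/\r/, "")        # same effect as: tr -d "\r"
    gsub(/>  *</, "><")   # same effect as: sed "s/>  *</></g"
    print
  }'
}

# Made-up sample: one record with stray CRs and spaces between tags
printf '12345678|<a>\r   <b>x</b>\r</a>\n' | clean_record
```

With no third argument, gsub works on $0, so each call rewrites the current line in place. The carriage returns are removed first so that they cannot sit between a ">" and the spaces that follow it.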

月依秋水 2025-02-19 20:48:43

mawk NF=NF RS='\r?\n' FS='> *<' OFS='><'

孤星 2025-02-19 20:48:43

Thank you all folks!

You gave me the inspiration to get to a solution. It is like this:

BEGIN {
  # initialize "line" variable. Maybe unnecessary.
  line=""
}

{
  # if the line begins with 8 digits and a pipe char (the format of the first field)...
  # (regex literal: the pipe is escaped so it is not read as alternation)
  if ( $0 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ ) {

    # ... then the previous record is ready. We can post-process it, then print it out

    # workarounds for the missing gsub function
    # removing extra spaces between xml tags
    # removing extra \r characters the same way
    while ( line ~ "\r") { sub( /\r/,"",line) }
    # "<text text>      <tag tag>" should look like "<text text><tag tag>"
    while ( line ~ ">  *<") { sub( />  *</,"><",line) }

    # then print the record and update line var with the beginning of the new record
    print line
    line = $0
  } else {

    # just keep extending the record with the actual line
    line = line $0
  }
}

END {
  # print the last record kept in line var
  if (line) {
    while ( line ~ "\r") { sub( /\r/,"",line) }
    while ( line ~ ">  *<") { sub( />  *</,"><",line) }
    print line
  }
}

And yes, it is efficient: the embedded version runs about 33% faster.

And yes, it would be nicer to create a function for the post-processing of the records in the "line" variable. Now I have to write the same code twice to process the last record in the END section. But it works, it creates the same output as the chained commands and it is way faster.
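For reference, a function-based variant might look roughly like the sketch below. It assumes an awk with user-defined functions and gsub (such as nawk); the names join_records and clean are hypothetical and not part of the original script. Unlike the script above, it does not print an empty line before the first record.

```shell
# join_records: hypothetical wrapper name, reads the CSV on stdin
join_records() {
  awk '
    # clean(s): hypothetical helper, strips CRs and the spaces between tags
    function clean(s) {
      gsub(/\r/, "", s)
      gsub(/>  *</, "><", s)
      return s
    }

    # a new record starts with 8 digits followed by a literal pipe
    /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ {
      if (line != "") print clean(line)
      line = $0
      next
    }

    # any other line continues the current record
    { line = line $0 }

    # flush the last record
    END { if (line != "") print clean(line) }
  '
}

# usage, with the file names from the question:
# join_records < inputdata.csv > corrected_data.csv
```

The post-processing now lives in one place, so the main rule and the END block share the same code.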

So, thanks for the inspiration again!

--Trifo
