Implementing tr and sed functionality inside awk
I need to process a text file - a big CSV - to correct the formatting in it. The CSV has a field which contains XML data formatted to be human readable: broken up into multiple lines and indented with spaces. I need to have every record on one line, so I am using awk to join the lines, after that sed to get rid of the extra spaces between the XML tags, and after that tr to eliminate the unwanted "\r" characters.
(The first field of every record is always 8 digits and the field separator is the pipe character: "|".)
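For illustration only (hypothetical data, not from the original post), a record might look like this in the input:

12345678|some field|<root>
    <item>value</item>
</root>|another field

and like this after joining and cleanup:

12345678|some field|<root><item>value</item></root>|another field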
The awk script (join4.awk) is:
BEGIN {
    # initialise the "line" accumulator (optional - awk variables default to "")
    line = ""
}
{
    # a new record starts with 8 digits followed by the "|" field separator
    if ( $0 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ ) {
        # print the record collected so far (skip the empty initial value),
        # then start a new record with the current input line
        if (line != "") print line
        line = $0
    } else {
        # a continuation line: append it to the current record
        line = line $0
    }
}
END {
    # print the last record still held in the accumulator
    if (line != "") print line
}
and the command line is:
cat inputdata.csv | awk -f join4.awk | tr -d "\r" | sed 's/> *</></g' > corrected_data.csv
My question is: is there an efficient way to implement the tr and sed functionality inside the awk script? This is not Linux, so I have no gawk, just plain old awk and nawk.
thanks,
--Trifo
Comments (3)
The tr is just
gsub(/\r/, "")
and the sed is just
gsub(/> *</, "><")
谢谢大家!
您给了我灵感来解决解决方案。就是这样:
是的,它是有效的:嵌入式版本运行的速度快33%。
是的,创建一个在“行”变量中创建记录后处理的函数会更好。现在,我必须两次编写相同的代码才能处理末尾部分中的最后一个章节。但是它有效,它可以创建与链式命令相同的输出,并且速度更快。
因此,再次感谢您的灵感!
- trifo
Thank you all folks!
You gave me the inspiration to get to a solution. It is like this:
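A minimal sketch of the embedded version, assuming the two gsub calls are applied to the collected record right before each print (the postprocessing is repeated in the END block, as noted below):

BEGIN {
    line = ""
}
{
    if ( $0 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|/ ) {
        # flush the previous record: strip "\r" and the spaces between tags,
        # exactly what tr -d "\r" and sed 's/> *</></g' used to do
        if (line != "") {
            gsub(/\r/, "", line)
            gsub(/> *</, "><", line)
            print line
        }
        line = $0
    } else {
        # continuation line of the same record
        line = line $0
    }
}
END {
    # the same postprocessing, repeated for the last record
    if (line != "") {
        gsub(/\r/, "", line)
        gsub(/> *</, "><", line)
        print line
    }
}

With this, the whole pipeline reduces to a single process:

awk -f join4.awk inputdata.csv > corrected_data.csv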
And yes, it is efficient: the embedded version runs about 33% faster.
And yes, it would be nicer to create a function for the postprocessing of the record held in the "line" variable. Right now I have to write the same code twice, to handle the last record in the END section. But it works, it creates the same output as the chained commands, and it is way faster.
So, thanks for the inspiration again!
--Trifo