How can I replace not-sampled values in a time-series data file with the previous sampled value using a bash script?
I have time-series data in which measurement values from different sensors have been captured asynchronously in the same ASCII file. The values are whitespace-separated.
The original file looks like this:
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:46:30 NOTSAMPLED NOTSAMPLED CLOSE
2022-04-03 21:47:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:48:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:49:30 NOTSAMPLED NOTSAMPLED CLOSE
2022-04-03 21:50:30 10.19 NOTSAMPLED NOTSAMPLED
2022-04-03 21:51:30 NOTSAMPLED 460 NOTSAMPLED
2022-04-03 21:52:30 NOTSAMPLED NOTSAMPLED OPEN
2022-04-03 21:53:30 10.19 NOTSAMPLED NOTSAMPLED
Whenever a sensor has no measurement at a given timestamp (because only another sensor was sampled then), I need to replace the string "NOTSAMPLED" with that sensor's most recent previous value. A field stays "NOTSAMPLED" only while its sensor has not yet been sampled at all, as shown below.
2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN
Can this be achieved using sed/awk or any other bash shell scripting commands?
UPDATE 2: @Fravadona: yours is faster, but it has a flaw. You forgot to print the line you consumed with getline, so the output is one line short.
UPDATE 1: benchmarking results using a 2.66 GB, 60.4 million row synthetic version of the sample data:
Input throughput: 75.0 MB/s (~1.66 million rows/sec)
Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk.
--
The 4Chan Teller
First off: this is not an efficient answer, but it gets the job done and shows what is going on.
The file text.txt contains the example input; run against it, the script produces the desired output shown in the question.
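The step-by-step idea can be sketched as a plain read loop. This is only an illustration, not the code originally posted: the function name fill_previous and the variables last_a, last_b and last_c are hypothetical, and it assumes exactly three sensor columns after the date and time fields.

```shell
# Hypothetical helper: carry each sensor's last reading forward, line by line.
# Assumes every line is "<date> <time> <sensor1> <sensor2> <sensor3>".
fill_previous() {
    # Remembered value per sensor; stays NOTSAMPLED until the first reading.
    last_a=NOTSAMPLED
    last_b=NOTSAMPLED
    last_c=NOTSAMPLED
    while read -r day clock a b c; do
        # A real reading replaces the remembered value ...
        if [ "$a" != NOTSAMPLED ]; then last_a=$a; fi
        if [ "$b" != NOTSAMPLED ]; then last_b=$b; fi
        if [ "$c" != NOTSAMPLED ]; then last_c=$c; fi
        # ... and the remembered values are what get printed.
        printf '%s %s %s %s %s\n' "$day" "$clock" "$last_a" "$last_b" "$last_c"
    done
}
```

It reads standard input, so it could be called as fill_previous < text.txt. A loop like this starts no external process per line, but shell reads are still slow on large files.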
Here's an awk solution that fills down all the NOTSAMPLED fields (starting from field #3).
edit: moved NR==1{split($0, filldown)} to the BEGIN block, as it slowed down the processing of big files by a factor of two.
Here's the bash version implementing the same logic. Text processing with bash is slow and isn't considered good practice, but you might understand it better if you're not familiar with awk.
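A minimal sketch of the awk fill-down described above, wrapped in a shell function so it can be reused. The wrapper name fill_down_awk is hypothetical, and seeding filldown with getline inside BEGIN is one reading of the edit note, not necessarily the exact code that was posted. It assumes every row has the same number of fields as the first one.

```shell
# Hypothetical wrapper around an awk fill-down; reads files given as
# arguments, or standard input when none are given.
fill_down_awk() {
    awk '
    BEGIN {
        # Seed filldown from the first record so that every field index
        # already holds a value (possibly "NOTSAMPLED" itself).
        if ((getline) > 0) {
            split($0, filldown)
            print    # the line consumed by getline must still be emitted
        }
    }
    {
        # Fields 1 and 2 are the date and time; sensors start at field 3.
        for (i = 3; i <= NF; i++) {
            if ($i == "NOTSAMPLED")
                $i = filldown[i]     # reuse the last value seen in this column
            else
                filldown[i] = $i     # remember the new reading
        }
        print
    }
    ' "$@"
}
```

Typical use would be fill_down_awk data.txt > filled.txt, where data.txt is a placeholder file name. Assigning to a field makes awk rebuild the record with single spaces between fields, so any other original spacing is not preserved on modified lines.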