How to replace non-sampled values with the previous instance value in a time series data file using a bash script?

Posted on 2025-01-18 17:38:53

I have time series data in which measurement values from different sensors have been captured asynchronously in the same ASCII file. The values are whitespace-separated.

Original file looks like below.

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:46:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:47:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:48:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:49:30  NOTSAMPLED NOTSAMPLED      CLOSE
2022-04-03 21:50:30  10.19      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:51:30  NOTSAMPLED 460  NOTSAMPLED
2022-04-03 21:52:30  NOTSAMPLED NOTSAMPLED      OPEN
2022-04-03 21:53:30  10.19      NOTSAMPLED      NOTSAMPLED

Now, wherever a sensor was not sampled at a given time while another sensor was, I need to replace the string "NOTSAMPLED" with that sensor's most recent previous value, as shown below.

2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  10.20      460     NOTSAMPLED
2022-04-03 21:46:30  10.20      460     CLOSE
2022-04-03 21:47:30  10.20      460     CLOSE
2022-04-03 21:48:30  10.20      460     CLOSE
2022-04-03 21:49:30  10.20      460     CLOSE
2022-04-03 21:50:30  10.19      460     CLOSE
2022-04-03 21:51:30  10.19      460     CLOSE
2022-04-03 21:52:30  10.19      460     OPEN
2022-04-03 21:53:30  10.19      460     OPEN

Can this be achieved using sed/awk or any other bash shell scripting commands?

我也只是我 2025-01-25 17:38:53

UPDATE 2: @Fravadona: yours is faster, but it has a flaw: you forgot to print the line consumed by getline, so the output is one line short.

f='sample3.txt'; echo; 

 ( time (pvE0 < "${f}" | mawk2 '

       BEGIN { ____["NOTSAMPLED"]
             OFS=sprintf("%c",(___=++_+(++_))^--___)
       } {
           if (NR<___) {
                     NF = split($!_,__)
       } else {    _=NF
               do { __[_]=$_=($_ in ____)? \
                    __[_]:$_ } while(___<--_) } }_' ) | pvE9)  | LC_ALL=C  gwc -lc | ggXy3 | lgp3


     out9:  339MiB 0:00:06 [55.6MiB/s] [55.6MiB/s] [                                 <=>           ]
      in0:  521MiB 0:00:06 [85.4MiB/s] [85.4MiB/s] [=============================>] 100%            
( pvE 0.1 in0 < "${f}" | mawk2 ; )  6.03s user 0.27s system 102% cpu 6.131 total
10000000 356000019

j % f='sample3.txt'; echo; 

 ( time (pvE0 < "${f}" | mawk2 '                                             
    
    BEGIN { getline; split($0, filldown) }
    {                            
        for (i = 3; i <= NF; i++)  
            if ($i != "NOTSAMPLED")
                filldown[i] = $i
            else
                $i = filldown[i]                             
    } 1
          '     ) | pvE9)  | LC_ALL=C  gwc -lc | ggXy3 | lgp3

     out9: 5.50MiB 0:00:00 [55.0MiB/s] [55.0MiB/s] [<=>                                            ]
      in0:  521MiB 0:00:05 [96.8MiB/s] [96.8MiB/s] [=============================>] 100%            
     out9:  339MiB 0:00:05 [63.0MiB/s] [63.0MiB/s] [                                        <=>    ]
( pvE 0.1 in0 < "${f}" | mawk2 ; )  5.31s user 0.26s system 103% cpu 5.409 total
9999999 355999971
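
The one-line difference in the counts above (10000000 versus 9999999 lines) can be reproduced on a tiny input. This is only a minimal sketch with made-up three-line input, not part of the benchmark; it assumes any POSIX awk:

```shell
# Three input lines; getline in BEGIN consumes the first one.
printf 'a\nb\nc\n' | awk 'BEGIN { getline } 1'          # first line is read but never printed
printf 'a\nb\nc\n' | awk 'BEGIN { getline; print } 1'   # printing it in BEGIN restores the line
```

The first command prints only b and c; the second prints all three lines.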

UPDATE 1:

Benchmarking results using a 2.66GB, 60.4-million-row synthetic version of the sample data:

 in0: 2.66GiB 0:00:36 [75.0MiB/s] [75.0MiB/s] [===..====>] 100%            
out9: 2.01GiB 0:00:36 [56.5MiB/s] [56.5MiB/s] [ <=> ]

   60,406,830 lines 2054.705 MB (2154514147) /dev/stdin

% pvE0 < sample3.txt | mawk2 '

       BEGIN { ____["NOTSAMPLED"]
             OFS=sprintf("%c",(___=++_+(++_))^--___)
       } {
           if (NR<___) {
                     NF = split($!_,__)
       } else {    _=NF
               do { __[_]=$_=($_ in ____)? \
                    __[_]:$_ } while(___<--_) } }_' | pvE9 | wc4

Input throughput:

75.0 MiB/s
~1.66 million rows/sec

=================================================

    < sample2.txt gawk -e '

      BEGIN { ____["NOTSAMPLED"]
              OFS=sprintf("%c",(___=++_+(++_))^--___)
          } {
              if (NR<___) {
                        NF = split($!_,__)
          } else {    _=NF
                  do { __[_]=$_=($_ in ____) ? \
                       __[_]:$_ } while(___<--_) } }_'

2022-04-03  21:42:30    10.20   NOTSAMPLED  NOTSAMPLED
2022-04-03  21:45:30    10.20   460     NOTSAMPLED
2022-04-03  21:46:30    10.20   460     CLOSE
2022-04-03  21:47:30    10.20   460     CLOSE
2022-04-03  21:48:30    10.20   460     CLOSE
2022-04-03  21:49:30    10.20   460     CLOSE
2022-04-03  21:50:30    10.19   460     CLOSE
2022-04-03  21:51:30    10.19   460     CLOSE
2022-04-03  21:52:30    10.19   460     OPEN
2022-04-03  21:53:30    10.19   460     OPEN

Tested and confirmed working on gawk 5.1.1, mawk 1.3.4, mawk 1.996, and macOS nawk
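
For readers who find the underscore-only variable names hard to follow, here is a readable sketch of what the program above appears to do: it sets the output separator to a tab, remembers the first row, and from the second row on fills fields 3..NF from the last value seen in the same column. The function name fill_down and the variable names missing and last are illustrative and do not come from the original answer.

```shell
# Readable sketch of the fill-down logic; the function name is illustrative.
fill_down() {
    awk -v OFS='\t' -v missing='NOTSAMPLED' '
        NR == 1 {
            split($0, last)   # remember the first row as the initial "previous" values
            $1 = $1           # force $0 to be rebuilt with OFS
            print
            next
        }
        {
            for (i = 3; i <= NF; i++) {
                if ($i == missing) $i = last[i]   # reuse the last value seen in this column
                else               last[i] = $i   # remember the new value
            }
            $1 = $1           # rebuild $0 even when nothing was replaced
            print
        }
    ' "$@"
}

# Usage on the first three rows of the sample data
fill_down <<'EOF'
2022-04-03 21:42:30  10.20      NOTSAMPLED      NOTSAMPLED
2022-04-03 21:45:30  NOTSAMPLED 460     NOTSAMPLED
2022-04-03 21:46:30  NOTSAMPLED NOTSAMPLED      CLOSE
EOF
```

A NOTSAMPLED field in the first row stays as it is, because there is no earlier value to copy, which matches the expected output in the question.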

-- The 4Chan Teller


音栖息无 2025-01-25 17:38:53

First off: this is not an efficient answer, but it gets the job done and shows what is going on.

The file test.txt contains the example input.

#!/bin/bash
#set -x

# Start with the placeholder as the "previous" value of every field,
# so nothing changes until a real value has been seen.
oldfield1="NOTSAMPLED"
oldfield2="NOTSAMPLED"
oldfield3="NOTSAMPLED"
oldfield4="NOTSAMPLED"
oldfield5="NOTSAMPLED"
NOTSAMPLED="NOTSAMPLED"

# -r keeps backslashes in the input literal
while read -r line; do

        # ${line} is left unquoted on purpose: word splitting collapses runs
        # of whitespace to single spaces, which cut -d ' ' relies on.
        field1=$(echo ${line}| cut -d ' ' -f 1)
        field2=$(echo ${line}| cut -d ' ' -f 2)
        field3=$(echo ${line}| cut -d ' ' -f 3)
        field4=$(echo ${line}| cut -d ' ' -f 4)
        field5=$(echo ${line}| cut -d ' ' -f 5)

        # replace a placeholder with the last value seen in the same column
        [[ ${field1} == "${NOTSAMPLED}" ]] && field1=${oldfield1}
        [[ ${field2} == "${NOTSAMPLED}" ]] && field2=${oldfield2}
        [[ ${field3} == "${NOTSAMPLED}" ]] && field3=${oldfield3}
        [[ ${field4} == "${NOTSAMPLED}" ]] && field4=${oldfield4}
        [[ ${field5} == "${NOTSAMPLED}" ]] && field5=${oldfield5}

        echo "${field1} ${field2} ${field3} ${field4} ${field5}"

        # remember the (possibly filled-in) values for the next line
        oldfield1="${field1}"
        oldfield2="${field2}"
        oldfield3="${field3}"
        oldfield4="${field4}"
        oldfield5="${field5}"
done <test.txt

Output:

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN


杀手六號 2025-01-25 17:38:53

Here's an awk solution that fills down all the NOTSAMPLED fields (starting from field #3):

(Edit: moved NR==1{split($0, filldown)} into the BEGIN block, because testing NR==1 on every line slowed down the processing of big files by a factor of two.)

awk '
    BEGIN { getline; split($0, filldown); print }
    {
        for (i = 3; i <= NF; i++)
            if ($i != "NOTSAMPLED")
                filldown[i] = $i
            else
                $i = filldown[i]
    } 1
' file.txt

2022-04-03 21:42:30 10.20 NOTSAMPLED NOTSAMPLED
2022-04-03 21:45:30 10.20 460 NOTSAMPLED
2022-04-03 21:46:30 10.20 460 CLOSE
2022-04-03 21:47:30 10.20 460 CLOSE
2022-04-03 21:48:30 10.20 460 CLOSE
2022-04-03 21:49:30 10.20 460 CLOSE
2022-04-03 21:50:30 10.19 460 CLOSE
2022-04-03 21:51:30 10.19 460 CLOSE
2022-04-03 21:52:30 10.19 460 OPEN
2022-04-03 21:53:30 10.19 460 OPEN

Here's the bash version implementing the same logic. Text processing with bash is slow and isn't considered a good practice, but you might understand it better if you're not familiar with awk:

#!/bin/bash
{
    read -ra filldown

    while read -ra fields
    do
        for ((i = 2; i < ${#fields[@]}; i++))
        do
            if [[ ${fields[i]} != NOTSAMPLED ]]
            then
                filldown[i]=${fields[i]}
            else
                fields[i]=${filldown[i]}
            fi
        done
        printf '%s\n' "${fields[*]}"
    done
} < file.txt
