Using awk to compare a file against two separate lookup files

Posted 2025-01-20 20:28:28

Basically, I want to check whether the strings present in lookup_1 and lookup_2 exist in my xyz.txt file, then perform an action and redirect the output to an output file. Also, my code currently substitutes every occurrence of the strings in lookup_1, even when they appear as a substring, but I need it to substitute only when there is a whole word match.
Can you please help tweak the code to achieve this?

code

awk '
FNR==NR { if ($0 in lookups)    
             next                            
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {         
              oldstr=$i
              newstr=""
              while (oldstr) {               
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)   
              }
              ndx=index(lookups[$0],$i)   
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }

        { for (i in lookups) { 
              ndx=index($0,i)                
              while (ndx > 0) {
                    $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                    ndx=index($0,i)                    
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt

lookup_1

ha
achine
skhatw
at
ree
ter
man
dun

lookup_2

United States
CDEXX123X
Institution

xyz.txt

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file 
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States

current output

[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States

desired output

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file 
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##

Answer by 邮友, 2025-01-27 20:28:28

We can make a couple of changes to the current code:

feed the output of cat lookup_1 lookup_2 into awk via process substitution (available in bash, ksh and zsh) so that awk sees it as a single file (see the last line of the new code)
use the GNU awk word-boundary operators (\< and \>) to build the regexes used to perform the replacements (see the 2nd half of the new code)
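
If process substitution is not available in your shell, a rough equivalent is to pipe the concatenated lookup files into awk and name standard input ("-") as the first file. The sketch below only counts lines to show that both lookup files land in the FNR==NR block; the helper name count_inputs and the scratch files are made up for illustration and are not part of the original code.

```shell
# Hypothetical sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' ter man      > "$tmp/lookup_1"
printf '%s\n' Institution  > "$tmp/lookup_2"
printf '%s\n' 'user ter' 'Institution code' > "$tmp/xyz.txt"

# count_inputs is an illustrative helper, not part of the original answer.
count_inputs() {
    cat "$1" "$2" | awk '
    FNR==NR { n++; next }     # true for every line arriving on stdin ("-")
            { m++ }           # lines of the data file
    END     { print n " lookup lines, " m " data lines" }
    ' - "$3"
}

count_inputs "$tmp/lookup_1" "$tmp/lookup_2" "$tmp/xyz.txt"
# prints: 3 lookup lines, 2 data lines
```

Because awk treats "-" as one input file, FNR and NR stay equal until the first line of xyz.txt is read.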

The new code:

awk '
        # the FNR==NR block of code remains the same

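FNR==NR { if ($0 in lookups)                # skip lookup lines that were already seen
             next
          lookups[$0]=$0                    # start from the unmasked line; fields are masked one by one below
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {              # mask the field in chunks of 3 characters
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)   # keep 1st character, mask up to 2 more
                    oldstr=substr(oldstr,4)                                   # move on to the next chunk
              }
              ndx=index(lookups[$0],$i)     # swap the masked field into the stored line
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }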

        # complete rewrite of the following block to perform replacements based on a regex using word boundaries

        { for (i in lookups) {
              regex= "\\<" i "\\>"            # build regex
              gsub(regex,lookups[i])          # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt            # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code

This generates:

[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
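
To check what a given lookup value turns into, the masking rule from the FNR==NR block can be pulled out into a small function. This is only a sketch: the names mask and mask_words are mine, and the loop tests word != "" instead of the truthiness of the string, which is an assumption on my part so that a field consisting of just 0 would also be masked.

```shell
# mask_words is an illustrative helper: it masks every field of every input line.
mask_words() {
    awk '
    # Keep the first character of every 3-character chunk, mask the rest.
    function mask(word,   out) {
        out = ""
        while (word != "") {
            out  = out substr(word, 1, 1) substr("##", 1, length(word) - 1)
            word = substr(word, 4)
        }
        return out
    }
    { for (i = 1; i <= NF; i++) $i = mask($i); print }
    '
}

printf '%s\n' ter skhatw Institution 'United States' | mask_words
# prints:
# t##
# s##a##
# I##t##u##o#
# U##t## S##t##
```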

NOTES:

the word-boundary operators (\< and \>) match the empty string at the start and at the end of a word, respectively; in GNU awk a word is defined as a sequence of letters, digits and underscores; see GNU awk - regex operators for more details
all of the sample lookup values fall within the definition of an awk word, so this new code works as desired
your previous question includes lookup values that cannot be considered an awk 'word' (eg, @vanti Finserv Co., 11:11 - Capital, MS&CO(NY)); this new code may fail to replace lookup values like these
for lookup values that contain non-word characters it is not clear how you would define a 'whole word match': you would also need to decide when a non-word character (eg, @) is to be treated as part of a lookup string and when it is to be treated as a word boundary
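
A related point, as far as I can tell: because each lookup value is used as a regex, characters such as the dot in Co. are regex operators and not literal text. The sketch below compares a regex replacement with a literal search via index(); the helper name regex_vs_literal and the sample line are made up for illustration.

```shell
# regex_vs_literal is an illustrative helper, not part of the original answer.
regex_vs_literal() {
    printf '%s\n' 'CoX and Co.' | awk -v key='Co.' '
    {
        as_regex = $0
        gsub(key, "C##", as_regex)          # key is interpreted as a regex: "." matches any character
        pos = index($0, key)                # key is searched for literally
        as_literal = substr($0, 1, pos - 1) "C##" substr($0, pos + length(key))
        print "regex:   " as_regex
        print "literal: " as_literal
    }'
}

regex_vs_literal
# prints:
# regex:   C## and C##
# literal: CoX and C##
```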

If you need to replace lookup values that contain (awk) non-word characters you could try replacing the word-boundary operators with \W (which matches a single non-word character), though this then causes problems for the lookup values that are (awk) 'words'.

One possible workaround may be to run a dual set of regex matches for each lookup value, eg:

awk '
FNR==NR { ... no changes to this block of code ... }

        { for (i in lookups) {
              regex= "\\<" i "\\>"
              gsub(regex,lookups[i])
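              regex= "(^|\\W)" i "(\\W|$)"                      # \W consumes the neighbouring character, so capture it ...
              $0=gensub(regex,"\\1" lookups[i] "\\2","g")       # ... and put it back around the replacement (gensub() is GNU awk only)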
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt

You'll need to determine if the 2nd regex breaks your 'whole word match' requirement.
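
Another option that avoids regexes altogether is to search for each lookup value literally with index() and check the neighbouring characters yourself. The sketch below assumes one particular definition of 'whole word match', which you may or may not agree with: the character just before and just after the match must not be a letter, digit or underscore. All function names and the sample files are made up for illustration.

```shell
tmp=$(mktemp -d)
printf '%s\n' ter man 'MS&CO(NY)' > "$tmp/lookups"
printf '%s\n' 'user ter and mandeep at MS&CO(NY)' > "$tmp/data"

# mask_whole_matches is an illustrative helper: $1 = lookup file, $2 = data file.
mask_whole_matches() {
    awk '
    # Keep the first character of every 3-character chunk, mask the rest.
    function mask(word,   out) {
        out = ""
        while (word != "") {
            out  = out substr(word, 1, 1) substr("##", 1, length(word) - 1)
            word = substr(word, 4)
        }
        return out
    }
    # Mask each whitespace-separated word of a lookup line.
    function mask_line(s,   n, parts, j, out) {
        n = split(s, parts, " ")
        for (j = 1; j <= n; j++) out = out (j > 1 ? " " : "") mask(parts[j])
        return out
    }
    function isword(c) { return c ~ /^[A-Za-z0-9_]$/ }
    # Replace literal occurrences of old that are not glued to a word character.
    function replace_whole(line, old, new,   out, off, pos, start, before, after) {
        out = ""; off = 1                              # off: start of the unscanned part
        while ((pos = index(substr(line, off), old)) > 0) {
            start  = off + pos - 1                     # absolute position of the match
            before = (start > 1) ? substr(line, start - 1, 1) : ""
            after  = substr(line, start + length(old), 1)
            if (!isword(before) && !isword(after)) {   # delimited on both sides: replace
                out = out substr(line, off, start - off) new
                off = start + length(old)
            } else {                                   # part of a longer word: skip one character
                out = out substr(line, off, start - off + 1)
                off = start + 1
            }
        }
        return out substr(line, off)
    }
    FNR==NR { if ($0 != "") masks[$0] = mask_line($0); next }
            { for (k in masks) $0 = replace_whole($0, k, masks[k]); print }
    ' "$1" "$2"
}

mask_whole_matches "$tmp/lookups" "$tmp/data"
# prints: user t## and mandeep at M##C##N##
```

Since nothing here is treated as a regex, lookup values such as MS&CO(NY) need no escaping, and the sketch does not rely on GNU awk extensions. The order in which awk walks the lookups array is unspecified, so lookup values that can match inside an already masked string would need extra care.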
