Using awk to compare a file against two separate lookup files
Basically, I want to check whether the strings present in lookup_1 and lookup_2 exist in my xyz.txt file, then perform the masking action and redirect the output to an output file. Also, my code currently substitutes every occurrence of the strings in lookup_1, even when they appear as a substring, but I need it to substitute only when there is a whole-word match.
Can you please help me tweak the code to achieve this?
code
awk '
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                  len=length(oldstr)
                  newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                  oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
        { for (i in lookups) {
              ndx=index($0,i)
              while (ndx > 0) {
                  $0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
                  ndx=index($0,i)
              }
          }
          print
        }
' lookup_1 xyz.txt > output.txt
lookup_1
ha
achine
skhatw
at
ree
ter
man
dun
lookup_2
United States
CDEXX123X
Institution
xyz.txt
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States
current output
[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States
desired output
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
Comments (1)
We can make a couple of changes to the current code:
- feed the results of cat lookup_1 lookup_2 into awk such that it looks like a single file to awk (see last line of new code)
- use the word-boundary operators (\< and \>) to build regexes with which to perform the replacements (see 2nd half of new code)
The new code:
This generates the desired output shown in the question.
NOTES:
- the word-boundary operators (\< and \>) match the empty string at the start and end of a word; in awk a word is defined as a sequence of numbers, letters and underscores; see GNU awk - regex operators for more details
- the lookup values in lookup_1 and lookup_2 start and end with awk word characters, so this new code works as desired
- some lookup values may not fit the definition of an awk 'word' (eg, "@vanti Finserv Co.", "11:11 - Capital", "MS&CO(NY)"), in which case this new code may fail to replace them
- for such values it has to be decided whether a non-word character (eg, @) is to be treated as part of a lookup string vs being treated as a word boundary
- if you need to replace lookup values that contain (awk) non-word characters you could try replacing the word-boundary operators with \W, though this then causes problems for the lookup values that are (awk) 'words'
One possible workaround may be to run a dual set of regex matches for each lookup value (for example, one built with the word-boundary operators and a second built with \W).
You'll need to determine if the 2nd regex breaks your 'whole word match' requirement.
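If regexes become awkward for lookup values with non-word characters, one alternative is to compare plain strings and check the neighbouring characters by hand. The sketch below is an illustration under stated assumptions, not part of the answer above: it treats a match as a whole word when the characters immediately before and after it are not letters, digits or underscores. The lookup_3 file and the last line of xyz.txt are hypothetical, and mask, is_word_char and replace_whole are helper names made up for this example.

```shell
# Sketch only: plain string functions, so any POSIX awk should do.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' ha achine skhatw at ree ter man dun > lookup_1
printf '%s\n' 'United States' CDEXX123X Institution > lookup_2
# Hypothetical third lookup file holding the non-word values from the notes
printf '%s\n' '@vanti Finserv Co.' '11:11 - Capital' 'MS&CO(NY)' > lookup_3
cat > xyz.txt <<'EOF'
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States
Paid via MS&CO(NY) to @vanti Finserv Co. for 11:11 - Capital
EOF

cat lookup_1 lookup_2 lookup_3 | awk '
# Keep the first character of every 3-character chunk of each
# space-separated word; replace the other characters with "#".
function mask(s,    out, p, k, c) {
    k = 0
    for (p = 1; p <= length(s); p++) {
        c = substr(s, p, 1)
        if (c == " " || c == "\t") { out = out c; k = 0 }
        else                       { out = out ((k % 3 == 0) ? c : "#"); k++ }
    }
    return out
}
# True for letters, digits and underscore; false for "" (start or end of line).
function is_word_char(c) { return c ~ /[A-Za-z0-9_]/ }

# Replace old with new wherever old is not glued to a word character.
function replace_whole(line, old, new,    out, start, rel, pos, prev, nxt) {
    start = 1
    while ((rel = index(substr(line, start), old)) > 0) {
        pos  = start + rel - 1
        prev = (pos > 1) ? substr(line, pos - 1, 1) : ""
        nxt  = substr(line, pos + length(old), 1)
        if (!is_word_char(prev) && !is_word_char(nxt)) {
            out   = out substr(line, start, pos - start) new
            start = pos + length(old)
        } else {                      # embedded in a longer word: skip one character
            out   = out substr(line, start, pos - start + 1)
            start = pos + 1
        }
    }
    return out substr(line, start)
}
# First input (all lookups on stdin): remember each non-empty value once.
FNR == NR { if ($0 != "" && !($0 in lookups)) lookups[$0] = mask($0); next }
# Second input (xyz.txt): apply every lookup, then print the line.
{
    for (key in lookups) $0 = replace_whole($0, key, lookups[key])
    print
}
' - xyz.txt > output.txt

cat output.txt
```

Since nothing is compiled as a regex, characters such as ( ) . & in a lookup value need no escaping. Whether a leading @ should count as part of the value or as a boundary is still a decision for the data owner; this sketch takes the first reading.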