Find and replace many words
I frequently need to make many replacements within files. To solve this problem, I have created two files, old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.

- All of my files use UTF-8 and make use of various languages.

I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the word from that line in input.txt with the corresponding word from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:

- On line 6, the sed command does not know where $number ends.
- The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
- The line with awk does not appear to be doing anything more than copying input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well; however, I use this script a lot and it takes many hours to finish. So I am offering a bounty for a solution which can complete these replacements much more quickly. A solution in Bash, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
Try quoting the variable with double quotes
Do this instead:

awk won't take variables outside its scope. User-defined variables in awk need to be either defined when they are used or predefined in awk's BEGIN statement. You can include shell variables by using the -v option.

Here is a solution in bash that would do what you need.

Bash Solution:

This solution reads one line at a time from the substitution file and the replacement file, and performs an in-line sed substitution.
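To illustrate the -v option mentioned above, here is a minimal sketch (the sample words are made up; this is not the answer's original script):

```shell
#!/bin/sh
# Pass shell variables into awk with -v instead of embedding them
# in the awk program text, where they would never be expanded.
old="cat"
new="dog"
echo "a cat sat" | awk -v o="$old" -v n="$new" '{ gsub(o, n); print }'
# -> a dog sat
```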
Why not to
?
I love this kind of question, so here is my answer:

First, for the sake of simplicity, why not use a single file with both the source and the translation? I mean: (filename changeThis)

Then you can define a proper separator in the script. (file replaceWords.sh)

Take this example (file changeMe)

Call it with

And you will get

Take note of the "i" usage with sed: "-i" means replace in the source file, and "I" in the s// command means ignore case (a GNU extension; check your sed implementation).

Of course, note that a bash while loop is horrendously slower than Python or a similar scripting language. Depending on your needs you can do a nested while: one over the source file, and one inside it looping over the translations (changes). Echo everything to stdout for pipe flexibility.
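A minimal sketch of the combined-file approach described above (the "/" separator, the file contents, and the names changeThis/changeMe are assumptions; the original listings were not preserved in this copy):

```shell
#!/bin/bash
# changeThis holds "old/new" pairs, one per line; changeMe is the target file.
tmp=$(mktemp -d)
printf 'hello/goodbye\nworld/planet\n' > "$tmp/changeThis"
printf 'Hello world\n' > "$tmp/changeMe"
while IFS=/ read -r old new; do
  # -i edits the file in place; the I flag ignores case (a GNU sed extension)
  sed -i "s/$old/$new/gI" "$tmp/changeMe"
done < "$tmp/changeThis"
cat "$tmp/changeMe"   # -> goodbye planet
rm -rf "$tmp"
```

Because of the I flag, "Hello" is replaced even though the pair file lists lowercase "hello".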
This Python 2 script forms the old words into a single regular expression, then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct; this distinctness is enforced by surrounding the word in r'\b', which is the regular-expression word boundary.

Input is from the command line (there is a commented alternative I used for development in IDLE). Output is to stdout.

The main text is scanned only once in this solution. With the input from Jaypal's answer, the output is the same.

I just did some stats on a ~100K byte text file:

The text was 250 paragraphs of lorem ipsum generated from here. I just took the ten most frequently occurring words and replaced them with the strings ONE to TEN, in order.

The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.

The Python solution will replace words followed by a newline character or by punctuation, as well as by any whitespace (including tabs etc).

Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages, and the technique shown in the Python could be replicated in Ruby or Perl.
A general Perl solution that I have found to work well for replacing the keys in a map with their associated values is this:

You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
I'm not sure about the quoting, but ${number}p will work - maybe "${number}p".
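A quick check of that quoting fix (the three-line sample file is made up): with double quotes, the shell expands ${number} before sed parses the script, so "2p" prints line 2.

```shell
#!/bin/sh
# Double quotes let ${number} expand before sed sees the script.
number=2
printf 'first\nsecond\nthird\n' | sed -n "${number}p"   # -> second
```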
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval = evil).

In general, I would recommend using one file with

and so on, one sed command per line, which is IMHO easier to take care of - sort it alphabetically, and use it with:

So you can always easily compare the mappings.

Explicit delimiters might be preferred over blanks, because \b (word boundary) matches start/end of line and punctuation characters.
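A sketch of the one-command-per-line idea above (the file name and sample mappings are made up; \b is a GNU sed extension): keep every substitution in a single sed script and apply them all in one pass with -f.

```shell
#!/bin/sh
# mappings.sed holds one substitution per line, sorted alphabetically;
# \b keeps "cat" from matching inside "category".
tmp=$(mktemp -d)
cat > "$tmp/mappings.sed" <<'EOF'
s/\bcat\b/feline/g
s/\bdog\b/canine/g
EOF
printf 'the cat chased the dog\n' | sed -f "$tmp/mappings.sed"
rm -rf "$tmp"
```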
This should reduce the time by some means as this avoids unnecessary loops.
Merge two input files:

Let's assume you have two input files, old.text containing all the substitutions and new.text containing all the replacements.

We will create a new text file which will act as a sed script for your main file, using the following awk one-liner:

Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.

Using merged file as sed script:

Once your merged file has been created, we will use the -f option of the sed utility. You can redirect the output into another file using the > operator.
operator.这可能对你有用:
This might work for you:
Here is a Python 2 script that should be both space and time efficient:
Here it is in action:
EDIT: Hat tip to @Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combine your input word lists into one list: each line containing the map of old and new words.
Old words file:
New words file:
Input file:
Create output:
I'm not sure why most of the previous posters insist on using regular expressions to solve this task; I think this will be faster than most (if not the fastest method).
Use: perl script.pl /file/to/modify - the result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)

I believe that this Perl script, although not using fancy sed or awk thingies, does the job fairly quickly...

I did take the liberty to use another format for old_word to new_word: the CSV format. If it is too complicated to do, let me know and I'll add a script that takes your old.txt and new.txt and builds the CSV file.

Take it for a run and let me know!

By the way - if any of you Perl gurus here can suggest a more Perlish way to do something I do here, I would love to read the comment: