Find and replace many words
I frequently need to make many replacements within files. To solve this problem, I have created two files, old.text and new.text. The first contains a list of words which must be found. The second contains the list of words which should replace those.

- All of my files use UTF-8 and make use of various languages.

I have built this script, which I hoped could do the replacement. First, it reads old.text one line at a time, then replaces the word from that line in input.txt with the corresponding word from the new.text file.
#!/bin/sh
number=1
while read linefromoldwords
do
echo $linefromoldwords
linefromnewwords=$(sed -n '$numberp' new.text)
awk '{gsub(/$linefromoldwords/,$linefromnewwords);print}' input.txt >> output.txt
number=$number+1
echo $number
done < old.text
However, my solution does not work well. When I run the script:

- On line 6, the sed command does not know where $number ends.
- The $number variable is changing to "0+1", then "0+1+1", when it should change to "1", then "2".
- The line with awk does not appear to be doing anything more than copying input.txt exactly as is to output.txt.
Do you have any suggestions?
Update:
The marked answer works well; however, I use this script a lot and it takes many hours to finish. So I am offering a bounty for a solution which can complete these replacements much more quickly. A solution in Bash, Perl, or Python 2 will be okay, provided it is still UTF-8 compatible. If you think some other solution using software commonly available on Linux systems would be faster, then that might be fine too, so long as huge dependencies are not required.
Try quoting the variable with double quotes
Do this instead:

awk won't take variables outside its scope. User-defined variables in awk need to be either defined when they are used or predefined in awk's BEGIN statement. You can include shell variables by using the -v option.

Here is a solution in bash that would do what you need.

Bash Solution:

This solution reads one line at a time from the substitution file and the replacement file, and performs an in-line sed substitution.
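To illustrate the -v option mentioned above, here is a minimal sketch (the sample words are made up; this is not the answer's original script):

```shell
#!/bin/sh
# Pass shell variables into awk with -v instead of embedding them
# in the awk program text, where they would never be expanded.
old="cat"
new="dog"
echo "a cat sat" | awk -v o="$old" -v n="$new" '{ gsub(o, n); print }'
# -> a dog sat
```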
Why not to
?
I love this kind of question, so here is my answer:

First, for the sake of simplicity, why not use a single file with both the source and the translation? I mean: (filename changeThis)

Then you can define a proper separator in the script. (file replaceWords.sh)

Take this example (file changeMe)

Call it with

And you will get

Take note of the "i" usage with sed: "-i" means replace in the source file, and "I" in the s// command means ignore case (a GNU extension; check your sed implementation).

Of course, note that a bash while loop is horrendously slower than Python or a similar scripting language. Depending on your needs you can do a nested while: one over the source file, and one inside it looping over the translations (changes). Echo everything to stdout for pipe flexibility.
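A minimal sketch of the combined-file approach described above (the "/" separator, the file contents, and the names changeThis/changeMe are assumptions; the original listings were not preserved in this copy):

```shell
#!/bin/bash
# changeThis holds "old/new" pairs, one per line; changeMe is the target file.
tmp=$(mktemp -d)
printf 'hello/goodbye\nworld/planet\n' > "$tmp/changeThis"
printf 'Hello world\n' > "$tmp/changeMe"
while IFS=/ read -r old new; do
  # -i edits the file in place; the I flag ignores case (a GNU sed extension)
  sed -i "s/$old/$new/gI" "$tmp/changeMe"
done < "$tmp/changeThis"
cat "$tmp/changeMe"   # -> goodbye planet
rm -rf "$tmp"
```

Because of the I flag, "Hello" is replaced even though the pair file lists lowercase "hello".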
This Python 2 script forms the old words into a single regular expression, then substitutes the corresponding new word based on the index of the old word that matched. The old words are matched only if they are distinct; this distinctness is enforced by surrounding the word in r'\b', which is the regular-expression word boundary.

Input is from the command line (there is a commented alternative I used for development in IDLE). Output is to stdout.

The main text is scanned only once in this solution. With the input from Jaypal's answer, the output is the same.

I just did some stats on a ~100K byte text file:

The text was 250 paragraphs of lorem ipsum generated from here. I just took the ten most frequently occurring words and replaced them with the strings ONE to TEN, in order.

The Python regexp solution is an order of magnitude faster than the currently selected best solution by Jaypal.

The Python solution will replace words followed by a newline character or by punctuation, as well as by any whitespace (including tabs etc).

Someone commented that a C solution would be both simple to create and fastest. Decades ago, some wise Unix fellows observed that this is not usually the case and created scripting tools such as awk to boost productivity. This task is ideal for scripting languages, and the technique shown in the Python could be replicated in Ruby or Perl.
A general Perl solution that I have found to work well for replacing the keys in a map with their associated values is this:

You would have to read your two files into the map first (obviously), but once that is done you only have one pass over each line, and one hash lookup for every replacement. I've only tried it with relatively small maps (around 1,000 entries), so no guarantees if your map is significantly larger.
I'm not sure about the quoting, but ${number}p will work - maybe "${number}p".
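A quick check of that quoting fix (the three-line sample file is made up): with double quotes, the shell expands ${number} before sed parses the script, so "2p" prints line 2.

```shell
#!/bin/sh
# Double quotes let ${number} expand before sed sees the script.
number=2
printf 'first\nsecond\nthird\n' | sed -n "${number}p"   # -> second
```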
Arithmetic integer evaluation in bash can be done with $(( )) and is better than eval (eval = evil).

In general, I would recommend using one file with

and so on, one sed command per line, which is IMHO easier to take care of - sort it alphabetically, and use it with:

So you can always easily compare the mappings.

Explicit delimiters might be preferred over blanks, because \b (word boundary) matches start/end of line and punctuation characters.
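A sketch of the one-command-per-line idea above (the file name and sample mappings are made up; \b is a GNU sed extension): keep every substitution in a single sed script and apply them all in one pass with -f.

```shell
#!/bin/sh
# mappings.sed holds one substitution per line, sorted alphabetically;
# \b keeps "cat" from matching inside "category".
tmp=$(mktemp -d)
cat > "$tmp/mappings.sed" <<'EOF'
s/\bcat\b/feline/g
s/\bdog\b/canine/g
EOF
printf 'the cat chased the dog\n' | sed -f "$tmp/mappings.sed"
rm -rf "$tmp"
```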
This should reduce the time by some means as this avoids unnecessary loops.
Merge two input files:

Let's assume you have two input files, old.text containing all the substitutions and new.text containing all the replacements.

We will create a new text file which will act as a sed script for your main file, using the following awk one-liner:

Note: This formatting of substitution and replacement is based on your requirement of having spaces between the words.

Using merged file as sed script:

Once your merged file has been created, we will use the -f option of the sed utility. You can redirect the output into another file using the > operator.
operator.这可能对你有用:
This might work for you:
Here is a Python 2 script that should be both space and time efficient:
Here it is in action:
EDIT: Hat tip to @Paddy3118 for whitespace handling.
Here's a solution in Perl. It can be simplified if you combine your input word lists into one list: each line containing the map of old and new words.
Old words file:
New words file:
Input file:
Create output:
I'm not sure why most of the previous posters insist on using regular expressions to solve this task; I think this will be faster than most (if not the fastest method).
Use: perl script.pl /file/to/modify - the result is printed to stdout.
EDIT - I just noticed that two answers like mine are already here... so you can just disregard mine :)

I believe that this Perl script, although not using fancy sed or awk thingies, does the job fairly quickly...

I did take the liberty to use another format for old_word to new_word: the CSV format. If it is too complicated to do, let me know and I'll add a script that takes your old.txt and new.txt and builds the CSV file.

Take it for a run and let me know!

By the way - if any of you Perl gurus here can suggest a more Perlish way to do something I do here, I would love to read the comment: