Text processing with Unix tools: search and replace all text that is NOT between certain lines

Posted 2024-12-09 01:44:56


I'm looking to do some text processing on a bunch of *.org files. I would like to change the following in each file:

[my description](link) to [[link][my description]],

`some text` to =some text=,

## some heading to ** some heading,

*some italics* to /some italics/, and

**some bold** to *some bold*.

Yes, this IS markdown syntax to org-mode syntax. I AM aware of pandoc. The caveat is that I want the above changes, except when they occur in the following block:

#+BEGIN_EXAMPLE
don't want above changes to take place in this block
...
#+END_EXAMPLE

Hence, I can't use pandoc. I'd like to process these files according to the above requirements using some kind of unix script: awk, sed, python, perl, bash, etc. Once I have a working script, I can modify it and learn from it.

Thanks for your help!


琉璃繁缕 2024-12-16 01:44:56

Perl Solution

This is the result of the simplifying changes I suggested for @jkerian’s script: use the flipflop operator and -p. I’ve also fixed his regexes to use the correct $1 and $2 in the RHS, altered delimiters from s/// to s::: to avoid LTS (“Leaning Toothpick Syndrome”), and added /x to improve readability. There was a logic error in dealing with bold and italics, which I’ve fixed. I added comments showing what the transform should be in each case, corresponding to the original problem description, and aligned the RHS of the transforms to make them easier to read.

#!/usr/bin/perl -p
#
# the -p option makes this a pass-through filter
#####################################################

# omit protected region
next if /^#\+BEGIN_EXAMPLE/ .. /^#\+END_EXAMPLE/;

# `some text`                      ⇒   =some text=
s: ` ( [^`]* ) `                       :=$1=:gx;

# [desc](link)                     ⇒   [[link][desc]]
s: \[ ( [^]]* ) \] \( ( [^)]* ) \)     :[[$2][$1]]:gx;

# ^## some heading ⇒ ** some heading
#      NB: can't use /x here or would have to use ugly \#
s:^##:**:;   

# *some italics*                   ⇒   /some italics/
s: (?<! \* ) \* ( [^*]+ ) \* (?! \*)   :/$1/:gx;

# **some bold**                    ⇒   *some bold*
s: \*{2} ( [^*]+ ) \*{2}               :*$1*:gx;

See how easy that is? Just 6 simple lines of eminently readable code in Perl. It’s easy in Perl because Perl is specifically designed to make writing this sort of filter super-easy, and Python is not. Python has separate design goals.

Although you could certainly rewrite this in Python, it wouldn’t be worth the bother because Python simply isn’t designed for this sort of thing. Python is missing the -p “make-me-a-filter” flag for an implicit loop and implicit print. Python is missing the implicit accumulator variable. Python is missing regexes built into the language syntax; you have to go through the re module. Python is missing the s/// operator. And Python is missing the stateful flipflop operator. All those contribute to making the Perl solution much much much easier to read, write, and maintain than the Python solution.
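For readers who have not met the flipflop operator, here is a minimal Python sketch of something similar. The class name FlipFlop and its call interface are made up for illustration and are not part of any library. It is only an approximation: it never tests the stop pattern on the line that opens the range, which is closer to Perl's three-dot form than to the two-dot form used above.

```python
import re


class FlipFlop:
    """Rough stand-in for Perl's line-oriented range operator (hypothetical helper)."""

    def __init__(self, start, stop):
        self.start = re.compile(start)
        self.stop = re.compile(stop)
        self.active = False

    def __call__(self, line):
        if self.active:
            # Inside the range: the line matching `stop` still counts as inside.
            if self.stop.search(line):
                self.active = False
            return True
        if self.start.search(line):
            # The line matching `start` opens the range and counts as inside.
            self.active = True
            return True
        return False


in_example = FlipFlop(r'^#\+BEGIN_EXAMPLE', r'^#\+END_EXAMPLE')
lines = ["a\n", "#+BEGIN_EXAMPLE\n", "b\n", "#+END_EXAMPLE\n", "c\n"]
flags = [in_example(line) for line in lines]
print(flags)  # [False, True, True, True, False]
```

The object carries its own state between calls, which is the part Perl gives you for free with a single operator.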

However, you should not get the idea that this always holds. It doesn’t. In other areas, you can come up with problems where Python comes out ahead. But not here. It’s because this filter thing is a focused specialty area for Perl, and it isn’t for Python.

The Python solution would consequently be much longer, noisier, and harder to read — and therefore harder to maintain — than this easy Perl version, all because Perl was designed to make easy things easy, and this is one of its target application areas. Try rewriting this in Python and notice how nasty it is. Sure it’s possible, but not worth the hassle, or the maintenance nightmare.

Python Version

#!/usr/bin/env python3.2

from __future__ import print_function

import sys
import re

if (sys.version_info[0] == 2):
    sys.stderr.write("%s: legacy Python detected! Please upgrade to v3+\n"
                   % sys.argv[0] )
    ##sys.exit(2)

if len(sys.argv) == 1:
    sys.argv.append("/dev/stdin")

flip_rx = re.compile(r'^#\+BEGIN_EXAMPLE')
flop_rx = re.compile(r'^#\+END_EXAMPLE')

#EG# `some text`  -->   =some text=
lhs_backticks = re.compile(r'` ( [^`]* ) `', re.VERBOSE)
rhs_backticks =            r'=\1='

#EG# [desc](link)  -->  [[link][desc]]
lhs_desclink  = re.compile(r' \[ ( [^]]* ) \] \( ( [^)]* ) \) ', re.VERBOSE)
rhs_desclink  =            r'[[\2][\1]]'

#EG# ^## some heading  -->  ** some heading
lhs_header    = re.compile(r'^##')
rhs_header    =            r'**'

#EG# *some italics*  -->  /some italics/
lhs_italics   = re.compile(r' (?<! \* ) \* ( [^*]+ ) \* (?! \*)  ', re.VERBOSE)
rhs_italics   =            r'/\1/'

## **some bold**  -->  *some bold*
lhs_bold      = re.compile(r'\*{2} ( [^*]+ ) \*{2}', re.VERBOSE)
rhs_bold      =            r'*\1*'

errcnt = 0

flipflop = "flip"

for filename in sys.argv[1:]:
    try:
        filehandle = open(filename, "r")
    except IOError as oops:
        errcnt = errcnt + 1
        sys.stderr.write("%s: can't open '%s' for reading: %s\n"
                      % ( sys.argv[0],    filename,        oops) )
    else:
        try:
            for line in filehandle:

                new_flipflop = None

                if flipflop == "flip":
                    if flip_rx.search(line):
                        new_flipflop = "flop"
                elif flipflop == "flop":
                    if flop_rx.search(line):
                        new_flipflop = "flip"
                else:
                    raise RuntimeError("unexpected flipflop state: %r" % flipflop)

                if flipflop != "flop":
                    line = lhs_backticks . sub ( rhs_backticks, line)
                    line = lhs_desclink  . sub ( rhs_desclink,  line)
                    line = lhs_header    . sub ( rhs_header,    line)
                    line = lhs_italics   . sub ( rhs_italics,   line)
                    line = lhs_bold      . sub ( rhs_bold,      line)                        
                print(line, end="")

                if new_flipflop is not None:
                    flipflop = new_flipflop

        except IOError as oops:
            errcnt = errcnt + 1
            sys.stderr.write("%s: can't read '%s': %s\n"
              % ( sys.argv[0],    filename,        oops) )
        finally:
            try:
                filehandle.close()
            except IOError as oops:
                errcnt = errcnt + 1
                sys.stderr.write("%s: can't close '%s': %s\n"
                  % ( sys.argv[0],    filename,        oops) )

if errcnt == 0:
    sys.exit(0)
else:
    sys.exit(1)

Summary

It’s important to use the right tool for the right job. For this task, that tool is Perl, which took only 7 lines. There are only 7 things to do, but don’t try telling Python that. It’s like being back to assembly language with too many interrupt stacks. Python at 72 lines is clearly not cut out for this kind of work, and all the painful complexity and noisy unreadable code shows you just exactly why. Bug rate per line of code is the same no matter the language, so if you have your choice between writing N lines of code or 10*N lines of code, there is no choice.


赏烟花じ飞满天 2024-12-16 01:44:56

I think you're looking for something like the following perl script

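while (<>) {
    # pass lines in the protected region through unchanged
    if (/#\+BEGIN_EXAMPLE/ .. /#\+END_EXAMPLE/) {
        print;
        next;
    }
    s/`([^`]*)`/=$1=/g;                       # `text`       -> =text=
    s/\[([^]]*)\]\(([^)]*)\)/[[$2][$1]]/g;    # [desc](link) -> [[link][desc]]
    s/^##/**/;                                # ## heading   -> ** heading
    s/\*([^\*]+)\*/\/$1\//g;                  # *x* -> /x/   (and **x** -> */x/*)
    s/\*\/([^\/]+)\/\*/*$1*/g;                # */x/* -> *x* (restores bold)
    print;
}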

Run it with cat testfile | perl scriptname.pl

And here is a non-silly Python version. Note: Perl is the right tool for the job, but tchrist's python version is such a bad joke that it had to be fixed.

from __future__ import print_function
import fileinput
import re
import sys

sys.tracebacklimit=0    #For those desperate to hide tracebacks in one-off scripts
example = 0
for line in fileinput.input():
    if example==0 and re.match(r'^#\+BEGIN_EXAMPLE',line):
        example+=1
    elif example>=1:
        if re.match(r'^#\+END_EXAMPLE',line): example-=1
    else:
        line = re.sub(r'` ( [^`]* ) `',                      r'=\1=',       line, flags=re.VERBOSE)
        line = re.sub(r'\[ ( [^]]* ) \] \( ( [^)]* ) \) ',   r'[[\2][\1]]', line, flags=re.VERBOSE)
        line = re.sub(r'^\#\#',                              r'**',         line, flags=re.VERBOSE)
        line = re.sub(r'(?<! \* ) \* ( [^*]+ ) \* (?! \*)',  r'/\1/',       line, flags=re.VERBOSE)
        line = re.sub(r'\*{2} ( [^*]+ ) \*{2}',              r'*\1*',       line, flags=re.VERBOSE)
    print(line, end="")
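The italics and bold rules are the fiddly part, so here is a small sketch of how they interact. It assumes the italics rule is meant to use a negative lookbehind, written `(?<!\*)`; the spelling `(?!<\*)` is a negative lookahead for the literal text `<*` and does not check the preceding character at all. The function names convert_emphasis and wrong_order are made up for this illustration.

```python
import re

# Single asterisks only: not preceded and not followed by another asterisk.
italics = re.compile(r'(?<!\*)\*([^*]+)\*(?!\*)')
# Double asterisks on both sides.
bold = re.compile(r'\*{2}([^*]+)\*{2}')


def convert_emphasis(text):
    """Italics first, then bold, so freshly made *bold* is not re-read as italics."""
    text = italics.sub(r'/\1/', text)   # *x*   -> /x/
    return bold.sub(r'*\1*', text)      # **x** -> *x*


def wrong_order(text):
    """Bold first: its output then looks like markdown italics and gets converted again."""
    return italics.sub(r'/\1/', bold.sub(r'*\1*', text))


print(convert_emphasis("*some italics* and **some bold**"))
# /some italics/ and *some bold*
print(wrong_order("**some bold**"))
# /some bold/
```

The order of the substitutions matters as much as the patterns themselves, which is why the scripts above run the italics rule before the bold rule.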


香草可樂 2024-12-16 01:44:56

Just for grins, here's my version of the python solution:

from __future__ import print_function
import fileinput, functools, re, sys

# For those desperate to hide tracebacks in one-off scripts
sys.tracebacklimit = 0
# Precompile all our patterns for speed
begin_example = re.compile(r'^#\+BEGIN_EXAMPLE').match
end_example = re.compile(r'^#\+END_EXAMPLE').match
# Use partial to eliminate lookups inside our loop
# The tuples are wrapped in a list: Python 3 rejects a bare tuple after "in" here
fixes = [ functools.partial(re.compile(x[0], x[2]).sub, x[1]) for x in [
          (r'` ( [^`]* ) `',                      r'=\1=',       re.VERBOSE),
          (r'\[ ( [^]]* ) \] \( ( [^)]* ) \) ',   r'[[\2][\1]]', re.VERBOSE),
          (r'^\#\#',                              r'**',         re.VERBOSE),
          (r'(?<! \* ) \* ( [^*]+ ) \* (?! \*)',  r'/\1/',       re.VERBOSE),
          (r'\*{2} ( [^*]+ ) \*{2}',              r'*\1*',       re.VERBOSE),
          ] ]

inside = False
for line in fileinput.input():
    if inside:
        if end_example(line):
            inside = False
    else:
        if begin_example(line):
            inside = True
        for fixup in fixes:
            line = fixup(line)
    print(line, end='')
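If you want to check any of these scripts against the examples in the question, something like the following sketch may help. It packages the same five rules as a generator and runs them on a tiny sample document. The names RULES, md_to_org and SAMPLE are made up for illustration, and the heading rule is deliberately applied last, which is my own assumption rather than what the scripts above do: it keeps the new leading `**` from being picked up by the bold rule when the heading itself contains bold text.

```python
import re

# (pattern, replacement) pairs, applied in this order to every unprotected line.
RULES = [
    (re.compile(r'`([^`]*)`'), r'=\1='),                    # `text`       -> =text=
    (re.compile(r'\[([^]]*)\]\(([^)]*)\)'), r'[[\2][\1]]'),  # [desc](link) -> [[link][desc]]
    (re.compile(r'(?<!\*)\*([^*]+)\*(?!\*)'), r'/\1/'),      # *italics*    -> /italics/
    (re.compile(r'\*{2}([^*]+)\*{2}'), r'*\1*'),             # **bold**     -> *bold*
    (re.compile(r'^##'), '**'),                              # ## heading   -> ** heading
]


def md_to_org(lines):
    """Yield converted lines, leaving #+BEGIN_EXAMPLE ... #+END_EXAMPLE untouched."""
    in_example = False
    for line in lines:
        if in_example:
            if line.startswith('#+END_EXAMPLE'):
                in_example = False
        elif line.startswith('#+BEGIN_EXAMPLE'):
            in_example = True
        else:
            for pattern, repl in RULES:
                line = pattern.sub(repl, line)
        yield line


SAMPLE = """\
## some heading
See [my description](link) and `some text`.
*some italics* and **some bold**
#+BEGIN_EXAMPLE
**left** `alone`
#+END_EXAMPLE
"""

result = ''.join(md_to_org(SAMPLE.splitlines(keepends=True)))
print(result, end='')
```

Everything outside the example block comes out in org-mode syntax, and the two lines inside it are printed exactly as they went in.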
