Bash script to convert HTML entities to characters

Posted on 2024-11-05 11:23:45

I'm looking for a way to turn this:

hello &lt; world

to this:

hello < world

I could use sed, but how can this be accomplished without using cryptic regex?

Comments (14)

夜光 2024-11-12 11:23:45

Try recode (archived page; GitHub mirror; Debian page):

$ echo '&lt;' | recode html..ascii
<

Install on Debian, Ubuntu and other apt-based systems:

$ sudo apt-get install recode

Install on macOS using Homebrew:

$ brew install recode
坐在坟头思考人生 2024-11-12 11:23:45

With perl:

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

With php from the command line:

cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'
许久 2024-11-12 11:23:45

An alternative is to pipe through a web browser like w3m:

echo '&#33;' | w3m -dump -T text/html

This worked great for me in Cygwin, where downloading and installing distributions is difficult.

This answer was found in this comment.

长不大的小祸害 2024-11-12 11:23:45

Using xmlstarlet:

echo 'hello &lt; world' | xmlstarlet unesc
丿*梦醉红颜 2024-11-12 11:23:45

A Python 3.4+ version (html.unescape was added in Python 3.4):

cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
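
The one-liner above wraps print in a list comprehension only to fit a loop into a single expression. As a minimal sketch of the same idea written out as a script, assuming Python 3.4+ (the function name unescape_stream is hypothetical, not part of any library), a plain for loop decodes one line at a time without collecting anything:

```python
import html
import io


def unescape_stream(src, dst):
    """Decode HTML entities line by line, writing each line as soon as it is read."""
    for line in src:
        dst.write(html.unescape(line))


# Demo on in-memory streams; with real data, pass sys.stdin and sys.stdout instead.
out = io.StringIO()
unescape_stream(io.StringIO("hello &lt; world\n&quot;quoted&quot; &amp; done\n"), out)
print(out.getvalue(), end="")
```

Because only the current line is held at any time, memory use stays flat regardless of the input size.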
望喜 2024-11-12 11:23:45

This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:

sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'

Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a bash script that web-scrapes SE answers and compares them to local code files, described here: Ask Ubuntu - Code Version Control between local files and Ask Ubuntu answers


Edit June 26, 2017

Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.

Here's the function:

LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
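
One detail worth watching with a fixed list of substitutions like this is the order. The function above decodes the ampersand entity early, so text that was escaped twice (for example &amp;lt;) ends up decoded twice. The following is only a sketch of the same table-driven idea in Python, with the ampersand moved to the end; the names REPLACEMENTS and html_to_text are made up for illustration and the table covers just the entities handled above:

```python
# Hypothetical table mirroring the substitutions in HTMLtoText.
# The ampersand entity is deliberately last, so "&amp;lt;" decodes to "&lt;", not "<".
REPLACEMENTS = [
    ("&nbsp;", " "),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&quot;", '"'),
    ("&#39;", "'"),
    ("&ldquo;", '"'),
    ("&rdquo;", '"'),
    ("&amp;", "&"),
]


def html_to_text(line):
    """Apply each fixed substitution in order and return the decoded line."""
    for entity, char in REPLACEMENTS:
        line = line.replace(entity, char)
    return line


print(html_to_text("if (a &lt; b &amp;&amp; c &gt; d)"))  # if (a < b && c > d)
print(html_to_text("&amp;lt;"))                           # &lt;
```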
梦幻之岛 2024-11-12 11:23:45

On macOS, you can use the built-in command textutil (which is a handy utility in general):

echo '👋 hello &lt; world 🌐' | textutil -convert txt -format html -stdin -stdout

outputs:

👋 hello < world 🌐
冷情妓 2024-11-12 11:23:45

I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.

cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'

But it produced an unequal number of lines on plain text files (and I don't know Perl well enough to debug it).

I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --

python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'

but it builds a list [ ... for l in sys.stdin] in memory (one entry per input line), which is prohibitive for large files.

Here is another easy pythonic way without buffering in memory: using awkg.

$ echo 'hello &lt; &#x3a; &quot; world' | \
   awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world

awkg is a Python-based, awk-like line processor. You can install it with pip (https://pypi.org/project/awkg/):

pip install awkg

-b plays the role of awk's BEGIN{} block and runs once at the beginning.
Here we just did from html import unescape.

Each input line is available in the R0 variable, on which we called
print(unescape(R0))

Disclaimer:
I am the maintainer of awkg

往日情怀 2024-11-12 11:23:45

To support the unescaping of all HTML entities only with sed substitutions would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities.

But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):

#!/bin/sh

htmlEscDec2Hex() {
    file=$1
    [ ! -r "$file" ] && file=$(mktemp) && cat >"$file"

    printf -- \
        "$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
        $(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')

    [ x"$1" != x"$file" ] && rm -f -- "$file"
}

htmlHexUnescape() {
    printf -- "$(
        sed 's/\\/\\\\/g;s/%/%%/g
            ;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
            ;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
            ;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}

htmlEscDec2Hex "$1" | htmlHexUnescape \
    | sed -f named_entities.sed

Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.

This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:

<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
  const referenceURL = 'https://html.spec.whatwg.org/entities.json';

  function writeln(element, text) {
    element.appendChild( document.createTextNode(text) );
    element.appendChild( document.createElement("br") );
  }

  (async function(container) {
    const json = await (await fetch(referenceURL)).json();
    container.innerHTML = "";
    writeln(container, "#!/usr/bin/sed -f");
    const addLast = [];
    for (const name in json) {
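      const characters = json[name].characters
        .replace(/\\/g, "\\\\")
        .replace(/\//g, "\\/")
        .replace(/&/g, "\\&")    // a bare & in a sed replacement stands for the whole match
        .replace(/\n/g, "\\n");  // keep &NewLine; on one line (\n in a replacement is a GNU sed feature)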
      const command = "s/" + name + "/" + characters + "/g";
      if ( name.endsWith(";") ) {
        writeln(container, command);
      } else {
        addLast.push(command);
      }
    }
    for (const command of addLast) { writeln(container, command); }
  })( document.getElementById("sed-script") );
</script>
</body></html>

Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
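
If a browser is not at hand, roughly the same sed script can be generated from Python's built-in entity table. This is only a sketch under a few assumptions: it uses html.entities.html5 from the standard library instead of fetching entities.json, the function names are invented for illustration, and the \n escape in a replacement relies on GNU sed. Note that html5 stores names without the leading ampersand, so it is added back here:

```python
from html.entities import html5  # maps names such as "lt;" and "lt" to characters


def sed_escape(text):
    """Escape characters that are special in the replacement part of s/.../.../."""
    return (text.replace("\\", "\\\\")
                .replace("/", "\\/")
                .replace("&", "\\&")
                .replace("\n", "\\n"))


def sed_command(name, characters):
    """Build one substitution command for an entity name such as "lt;"."""
    return "s/&" + name + "/" + sed_escape(characters) + "/g"


def generate_script():
    """Return all commands, semicolon-terminated names first, legacy names last."""
    names = sorted(html5, key=lambda name: (not name.endswith(";"), name))
    return [sed_command(name, html5[name]) for name in names]


print(sed_command("lt;", "<"))   # s/&lt;/</g
print(sed_command("amp;", "&"))  # s/&amp;/\&/g
```

Writing "\n".join(generate_script()) to a file gives something usable with sed -f, in the same spirit as the page above.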

Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.

For example, if for some reason the data needs to be processed in chunks (which might be the case if printf is not a shell built-in and the data to process is large), one could use it as:

nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
    do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done
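
For comparison, the same chunking idea can be sketched in Python. This is an illustration only: unescape_in_chunks is a hypothetical helper, and html.unescape stands in for the shell script; the point is just that no more than chunk_size lines are held at once:

```python
import html
from itertools import islice


def unescape_in_chunks(lines, chunk_size=20):
    """Yield decoded text chunk by chunk, each built from at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        yield html.unescape("".join(chunk))


sample = ["a &lt; b\n", "c &gt; d\n", "e &amp; f\n"]
for decoded in unescape_in_chunks(sample, chunk_size=2):
    print(decoded, end="")
```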

Explanation of the script follows.

There are three types of escape sequences that need to be supported:

  1. &#D; where D is the decimal value of the escaped character’s Unicode code point;

  2. &#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;

  3. &N; where N is the name of one of the named entities for the escaped character.

The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.

The central piece of this method for supporting the code point escapes is the printf utility, which is able to:

  1. print numbers in hexadecimal format, and

  2. print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).

The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.

The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.
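
The two numeric forms described above can also be illustrated outside the shell. The following Python sketch (the names NUMERIC_ENTITY and decode_numeric are made up for this example) handles only the &#D; and &#xH; escapes, converting each code point directly with chr() instead of going through printf:

```python
import re

# Matches &#xH; (1 to 8 hex digits) or &#D; (1 to 10 decimal digits).
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]{1,8})|([0-9]{1,10}));")


def decode_numeric(text):
    """Replace decimal and hexadecimal character references with their characters."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave the escape untouched if it is beyond the Unicode range.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return NUMERIC_ENTITY.sub(replace, text)


print(decode_numeric("&#33; &#x3a; &#xA7;"))  # ! : §
```

Named entities are not touched by this sketch; they would still need a table such as named_entities.sed.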

喜你已久 2024-11-12 11:23:45

I have created a sed script based on the list of entities, so it should handle most of them.

sed -f htmlentities.sed < file.html
阿楠 2024-11-12 11:23:45

My original answer got some comments saying that recode does not work for UTF-8 encoded HTML files. This is correct: recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0:

$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML

The default encoding for HTML 4 is Latin-1. This has changed in HTML 5, whose default encoding is UTF-8. This is why recode does not work for HTML 5 files.

HTML 5 defines its own list of entities in the specification.

The definition includes a machine-readable version in JSON format: https://html.spec.whatwg.org/entities.json

The JSON file can be used to perform a simple text replacement. The following example is a self-modifying Perl script, which caches the JSON specification in its DATA section.

Note: For some obscure compatibility reasons, the specification allows some entities without a terminating semicolon. Because of that, the entities are sorted by length, longest first, so that the longer entities are replaced first and do not get destroyed by the shorter ones without the ending semicolon.

#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);

my $entities;

INIT {
  if (eof DATA) {
    my $data = tell DATA;
    open DATA, '+<', $0;
    seek DATA, $data, 0;
    my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
    print DATA $entities_json;
    truncate DATA, tell DATA;
    close DATA;
    $entities = parse_json ($entities_json);
  } else {
    local $/ = undef;
    $entities = parse_json (<DATA>);
  }
}

local $/ = undef;
my $html = <>;

for my $entity (sort { length $b <=> length $a } keys %$entities) {
  my $characters = $entities->{$entity}->{characters};
  $html =~ s/$entity/$characters/g;
}

print $html;

__DATA__

Example usage:

$ echo '???? &amp; ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
???? & ٱلْعَرَبِيَّة
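
To see why the longest-first ordering matters, here is a small Python sketch of the same replacement loop. It assumes Python's built-in html.entities.html5 table in place of the downloaded JSON, and replace_named is a hypothetical name. The legacy name not (no semicolon) is a prefix of notin;, so replacing it first corrupts the longer entity:

```python
from html.entities import html5  # maps names such as "lt;" and "lt" to characters


def replace_named(text, longest_first=True):
    """Replace named entities by plain text substitution, in length order."""
    names = sorted(html5, key=len, reverse=longest_first)
    for name in names:
        text = text.replace("&" + name, html5[name])
    return text


print(replace_named("x &notin; S"))                       # x ∉ S
print(replace_named("x &notin; S", longest_first=False))  # x ¬in; S
```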
疧_╮線 2024-11-12 11:23:45

I also found this solution:

#!/bin/bash

LineOut=""      # Make global
HTMLtoText () {
    LineOut=$1  # Parm 1= Input line
    # Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g; 
    # s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; 
    # s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
    LineOut="${LineOut//&nbsp;/ }"
    LineOut="${LineOut//&amp;/&}"
    LineOut="${LineOut//&lt;/<}"
    LineOut="${LineOut//&gt;/>}"
    LineOut="${LineOut//&quot;/'"'}"
    LineOut="${LineOut//&#39;/"'"}"
    LineOut="${LineOut//&apos;/"'"}"
    LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
    LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()

[ -n "$1" ] && HTMLtoText "$1" && echo "$LineOut" && exit 0 || echo "no arg(s)" && echo chk && exit 1
随心而道 2024-11-12 11:23:45

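Here is one in pure bash. Note that it decodes percent-encoded (URI) escapes such as %3C rather than HTML entities such as &lt;:

#!/bin/bash
uri_decode() {
    local encoded_str="$1"
    printf -v decoded_str '%b' "${encoded_str//%/\\x}"
    echo "$decoded_str"
}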

decoded_str=$(uri_decode "$1")

Have some bash fun :-)
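
Since percent-encoding and HTML entities are easy to mix up, a short Python comparison may help. It uses only standard-library calls (urllib.parse.unquote and html.unescape); the sample strings are made up:

```python
import html
from urllib.parse import unquote

print(unquote("hello%20%3C%20world"))     # percent-decoding: hello < world
print(html.unescape("hello &lt; world"))  # entity decoding:  hello < world
print(unquote("hello &lt; world"))        # unchanged:        hello &lt; world
```

The last line shows that a URI decoder leaves HTML entities alone, so it does not answer the original question on its own.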

魂ガ小子 2024-11-12 11:23:45

With Xidel:

echo 'hello &lt; &#x3a; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world