Convert HTML Character Entity Encoding in R
Is there a way in R to convert HTML character entity encodings?
I would like to convert HTML character entities such as &amp; to & or &gt; to >.
For Perl there is the package HTML::Entities, which can do that, but I couldn't find anything similar in R.
I also tried iconv() but couldn't get satisfying results. Maybe there is also a way using the XML package, but I haven't figured it out yet.
Unescape XML/HTML values using the xml2 package. Examples:
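A minimal sketch of such helpers, assuming the xml2 package is installed. The idea is to wrap the string in a dummy <x> element so the parser sees a complete document, parse it with read_xml() or read_html(), and read the decoded text back with xml_text(). The name unescape_xml is illustrative; unescape_html is the name used elsewhere in this thread.

```r
library(xml2)

# Wrap the input in a dummy element, parse it, and extract the decoded text.
# Both helpers expect a single string (a character vector of length one).
unescape_xml <- function(str) {
  xml_text(read_xml(paste0("<x>", str, "</x>")))
}

unescape_html <- function(str) {
  xml_text(read_html(paste0("<x>", str, "</x>")))
}

unescape_xml("3 &lt; x &amp; x &gt; 9")
# [1] "3 < x & x > 9"
unescape_html("&euro; 2.99")
# [1] "€ 2.99"
```

The HTML variant knows the named HTML entities (such as &euro;), while the XML variant only accepts the five entities predefined by XML plus numeric character references.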
Update: this answer is outdated. Please see the answer based on the newer xml2 package.
Try something along the lines of:
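A rough sketch of what an html2txt() helper based on the older XML package might look like, assuming that package is installed. The function body is an assumption built from XML::htmlParse(), xpathApply() and xmlValue(), not necessarily the exact code of this answer.

```r
library(XML)

# Parse the string as an HTML fragment and return the value of the first
# text node found under <body>. Works on a single string only.
html2txt <- function(str) {
  xpathApply(htmlParse(str, asText = TRUE), "//body//text()", xmlValue)[[1]]
}

html2txt("3 &lt; x &amp; x &gt; 9")
```

Because only the first text node is returned, input that contains markup of its own may come back truncated.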
UPDATE: Edited the html2txt() function so it applies to more situations
While the solution by Jeroen does the job, it has the disadvantage that it is not vectorised and is therefore slow when applied to a large number of strings. It only works with a character vector of length one, so one has to use sapply for a longer character vector.
To demonstrate this, I first create a large character vector:
And apply the function:
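A small sketch of this experiment, assuming the xml2 package and a single-string unescape_html() helper as described above. The sample strings and the vector length of 1000 are illustrative choices; timings will differ between machines.

```r
library(xml2)

# Single-string helper: one parse per call.
unescape_html <- function(str) {
  xml_text(read_html(paste0("<x>", str, "</x>")))
}

# Build a large character vector by sampling a few example strings.
set.seed(123)
strings <- c("abcd", "&amp; &gt;", "&amp;", "&euro; &lt;")
many_strings <- sample(strings, 1000, replace = TRUE)

# sapply() calls the helper once per element, so the HTML parser is
# started 1000 times.
timing <- system.time(
  res <- sapply(many_strings, unescape_html, USE.NAMES = FALSE)
)
timing
head(res)
```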
It is much faster if all the strings in the character vector are combined into a single, large string, so that read_html() and xml_text() need only be called once. The strings can then easily be separated again using strsplit().
Of course, you need to be careful that the string used to combine the individual strings in str ("#_|" in my example) does not appear anywhere in str. Otherwise, you will introduce an error when the large string is split again at the end.
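The join, parse once, split approach could be sketched as follows, assuming the xml2 package. The sep argument and the guard that checks for the separator are additions for illustration and are not part of the description above.

```r
library(xml2)

unescape_html2 <- function(str, sep = "#_|") {
  # Guard against the caveat above: the separator must not occur in the input.
  stopifnot(!any(grepl(sep, str, fixed = TRUE)))
  # Join everything into one document so the parser runs only once.
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  parsed <- xml_text(read_html(html))
  # Split the decoded text back into the individual strings.
  strsplit(parsed, sep, fixed = TRUE)[[1]]
}

strings <- c("abcd", "&amp; &gt;", "&amp;", "&euro; &lt;")
res <- unescape_html2(strings)
res
```

Note that strsplit() drops a trailing empty string, so a vector whose last element is "" would come back one element shorter.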
Based on Stibu's answer, I benchmarked the functions.
Here I vectorize Jeroen's unescape_html function with purrr::map_chr. So far, this just confirms Stibu's claim that unescape_html2 is indeed many times faster! It is even much faster than the textutils::HTMLdecode function.
But I also found that the xml version could be even faster.
However, this function fails when dealing with the many_strings object (maybe because read_xml cannot read the Euro symbol), so I had to try a different way of benchmarking.
We can also try it on hex ones. Here the xml version is even faster than the html version.
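A hedged sketch of such a comparison, assuming the xml2 package and using base system.time() rather than a dedicated benchmarking package. The names unescape_joined and unescape_xml2 are hypothetical, and "hex ones" is read here as hexadecimal character references. No timings are shown because they depend on the machine.

```r
library(xml2)

# Join, parse once, split again; `parser` is either read_html or read_xml.
unescape_joined <- function(str, parser, sep = "#_|") {
  doc <- parser(paste0("<x>", paste0(str, collapse = sep), "</x>"))
  strsplit(xml_text(doc), sep, fixed = TRUE)[[1]]
}
unescape_html2 <- function(str) unescape_joined(str, read_html)
unescape_xml2  <- function(str) unescape_joined(str, read_xml)

# Named entities such as &euro; are defined by HTML, not by XML, so the
# XML parser rejects them.
xml_fails <- inherits(try(unescape_xml2("&euro; 2.99"), silent = TRUE),
                      "try-error")

# Hexadecimal character references are valid in both HTML and XML.
hex_strings <- rep(c("abcd", "&#x26; &#x3E;", "&#x20AC; &#x3C;"), times = 1000)

t_html <- system.time(res_html <- unescape_html2(hex_strings))
t_xml  <- system.time(res_xml  <- unescape_xml2(hex_strings))

# Elapsed seconds for each variant, plus a check that the results agree.
c(html = t_html[["elapsed"]], xml = t_xml[["elapsed"]])
all(res_html == res_xml)
```

For a more careful comparison, the two calls could be repeated many times with a benchmarking package instead of being timed once.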