Using wildcards in R to filter words between semantic tags

Posted on 2025-01-16 16:36:51

I have a dataset with a feature, body, whose values are all text taken from an HTML file and include semantic tags, like so:

</strong>earned six <a href="http://www.vox.com/2016/5/3/11576244/tony-award-nominations-hamilton">Tony nominations</a> this week, including one for Nyong'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the <a href="http://www.vox.com/identities/2016/5/3/11578062/eclipsed-play-tony-nomination">first Broadway play</a> to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!-- ######## BEGIN SNIPPET ######## -->

I would like to remove all text between semantic tags using wildcards. Is there a way to do so?

<!-- .--> My logic here is to remove the comment tag with everything inside of it.

楠木可依 2025-01-23 16:36:51

Supposing your data frame looks like this

df <- data.frame(text = '</strong>earned six <a href="http://www.vox.com/2016/5/3/11576244/tony-award-nominations-hamilton">Tony nominations</a> this week, including one for Nyong\'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the <a href="http://www.vox.com/identities/2016/5/3/11578062/eclipsed-play-tony-nomination">first Broadway play</a> to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!--  ########  BEGIN SNIPPET  ########  -->')

Then you could use

df$new_text <- gsub("<!--.*-->", "", df$text)

to get your desired output in a new column new_text.
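
One caveat worth sketching: the pattern above is greedy, so if a single string holds more than one comment, it removes everything from the first <!-- to the last -->, including any text in between. A lazy quantifier avoids that. The sample string x below is hypothetical (it is not taken from the question's data), and the final tag-stripping step is only an assumption about what "remove the semantic tags" might mean for this dataset.

```r
# Hypothetical input with two HTML comments in one string
x <- "<p>keep<!-- a --> this<!-- b --> too</p>"

# Greedy match: spans from the first "<!--" to the last "-->",
# so the text between the two comments is lost as well
greedy <- gsub("<!--.*-->", "", x)
print(greedy)   # "<p>keep too</p>"

# Lazy match: ".*?" stops at the nearest "-->", so each comment is
# removed on its own; "(?s)" (with perl = TRUE) lets "." match
# newlines, which covers comments that span several lines
lazy <- gsub("(?s)<!--.*?-->", "", x, perl = TRUE)
print(lazy)     # "<p>keep this too</p>"

# Optional follow-up: drop the remaining tags but keep their text
no_tags <- gsub("<[^>]+>", "", lazy)
print(no_tags)  # "keep this too"
```

The same lazy pattern can be applied to the data frame column, for example gsub("(?s)<!--.*?-->", "", df$text, perl = TRUE). For anything beyond simple cleanup, a real HTML parser is generally more robust than regular expressions.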
