Filtering words between semantic tags using wildcards in R
I have a dataset with a feature, body, whose values are all text from an HTML file and include semantic tags, like so:
</strong>earned six <a href="http://www.vox.com/2016/5/3/11576244/tony-award-nominations-hamilton">Tony nominations</a> this week, including one for Nyong'o (Best Actress in a Leading Role). <em>Eclipsed</em> is also significant for being the <a href="http://www.vox.com/identities/2016/5/3/11578062/eclipsed-play-tony-nomination">first Broadway play</a> to feature a cast and creative team that is entirely black, female, and of African descent. (The play was written by Danai Gurira, who also plays Michonne on <em>The Walking Dead</em>.)</p> \n<p><!-- ######## BEGIN SNIPPET ######## -->
I would like to remove all text between semantic tags using wildcards. Is there a way to do so?
<!-- .-->
My logic here is to remove the comment tag with everything inside of it.
Comments (1)
Supposing your data frame holds the HTML in a character column, you could use a regular-expression replacement to get your desired output in a new column, `new_text`.
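The code from this answer did not survive on this page, so the following is only a minimal sketch of one way to do it in base R, not the answerer's original code. The data frame `df`, its column `text`, the sample strings, and the extra column `plain_text` are all assumptions made for illustration.

```r
# Hypothetical data frame: the column name `text` and its contents are assumptions.
df <- data.frame(
  text = c(
    "<p>first paragraph</p> <!-- ######## BEGIN SNIPPET ######## --><p>second</p>",
    "no comment here"
  ),
  stringsAsFactors = FALSE
)

# Remove each HTML comment, including the <!-- and --> delimiters.
# (?s) lets . match newlines; .*? is lazy, so it stops at the first -->.
df$new_text <- gsub("(?s)<!--.*?-->", "", df$text, perl = TRUE)

# Optional: also drop the remaining tags, keeping the text between them.
df$plain_text <- gsub("<[^>]+>", "", df$new_text)

print(df$new_text)
```

The pattern `<!-- .-->` from the question matches only a comment with a single character after the space, because `.` stands for exactly one character. Writing `.*?` extends it to any run of characters while stopping at the nearest `-->`. For anything beyond simple cleanup, an HTML parser is usually more robust than regular expressions.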