Remove the text outside a tag's contents

Posted 2024-09-05 06:56:15

The opposite (removing a tag together with its contents) can be achieved with pyparsing as follows:

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)

How can I keep only the contents of the "table" tag?

UPDATE 0:

I tried:
# keep only the tables
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print(data)

and I get something like this...

garbages
<input type="hidden" name="cassstx"   value="en_US:frontend"></form></td></tr></table></span></td></tr></table> 

{<"table"> SkipTo:(</"table">) </"table">} 
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">} 
</div> 
even more garbages

UPDATE 2:

Thanks Martelli. What I need is:

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'

tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]

print(thetable)

Comments (1)

用心笑 2024-09-12 06:56:16

You could first extract the table (much as you now match the script, but without removing anything, of course ;-), obtaining a thetable string; then match the script and use replaceWith(thetable) instead of replaceWith(''). Alternatively, you could write a more elaborate parse action, but the simple two-phase approach looks more straightforward to me. For example (preserving specifically the contents of the table, not the table tags):

from pyparsing import replaceWith, makeHTMLTags, SkipTo

data = 'before<script>ciao<table>buh</table>bye</script>after'

# Phase 1: find the first table; index 2 of the match is the text captured by SkipTo.
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = tableBody.searchString(data)[0][2]

# Phase 2: replace every script element with the table contents found above.
removeText = replaceWith(thetable)
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = scriptBody.transformString(data)

print(data)

This prints beforebuhafter (what's outside the script tag, with the contents of the table tag sandwiched inside), hopefully "as desired".
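
The "more elaborate parse action" mentioned above could look roughly like the sketch below. It is only an illustration, not the answerer's code: the function name keep_table_contents, the helper replace_script and the results names "body" and "script_text" are assumptions introduced here. It uses the same camelCase pyparsing names as the snippets above, and it replaces each script element with the contents of every table found inside it, in a single transformString pass.

```python
from pyparsing import SkipTo, makeHTMLTags


def keep_table_contents(html):
    """Replace each <script> element with the contents of the tables inside it.

    Hypothetical helper for illustration; not part of pyparsing.
    """
    tableOpen, tableClose = makeHTMLTags("table")
    # Name the skipped text so it can be read back without a numeric index.
    tableBody = tableOpen + SkipTo(tableClose)("body") + tableClose

    scriptOpen, scriptClose = makeHTMLTags("script")
    scriptBody = scriptOpen + SkipTo(scriptClose)("script_text") + scriptClose

    def replace_script(tokens):
        # Search only the text between <script> and </script> for tables.
        found = tableBody.searchString(tokens["script_text"])
        # Join the contents of all tables; an empty string drops the script.
        return "".join(match["body"] for match in found)

    scriptBody.setParseAction(replace_script)
    return scriptBody.transformString(html)


if __name__ == "__main__":
    sample = 'before<script>ciao<table>buh</table>bye</script>after'
    print(keep_table_contents(sample))
```

Naming the SkipTo results avoids relying on the [2] index, and a script element that contains no table is simply removed. For the sample string used above this should give the same beforebuhafter result as the two-phase version.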
