Remove text outside the content of a tag

Posted on 2024-09-05 06:56:15


The opposite (removing a tag together with its contents) can be achieved with pyparsing as follows:

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)

How could I keep the contents of the tag "table"?

UPDATE 0:

I tried:
# keep only the tables
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print(data)

and I get something like this...

garbages
<input type="hidden" name="cassstx"   value="en_US:frontend"></form></td></tr></table></span></td></tr></table> 

{<"table"> SkipTo:(</"table">) </"table">} 
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">} 
</div> 
even more garbages
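The `{<"table"> SkipTo:(</"table">) </"table">}` text in that output is the string form of the `tableBody` expression itself: `replaceWith` was handed the parser object rather than the matched text, and `transformString` only rewrites the matches while leaving everything else in place. A minimal sketch of keeping just the tables, assuming `scanString` is an acceptable substitute for `transformString` here (the helper name `keep_only_tables` and the sample string are made up for illustration):

```python
from pyparsing import makeHTMLTags, SkipTo


def keep_only_tables(html):
    """Return the concatenated source text of every <table>...</table> span."""
    tableOpen, tableClose = makeHTMLTags("table")
    tableBody = tableOpen + SkipTo(tableClose) + tableClose
    # scanString yields (tokens, start, end) for each match; slicing the
    # input with start/end recovers the matched text verbatim.
    # Assumption: the input contains no tab characters (scanString expands
    # tabs by default, which would shift the offsets).
    return "".join(html[start:end] for _tokens, start, end in tableBody.scanString(html))


# Hypothetical sample input
sample = 'garbage<table><tr><td>1</td></tr></table>more<div><table>2</table></div>end'
result = keep_only_tables(sample)
print(result)
```

Note that `SkipTo(tableClose)` stops at the first closing tag, so nested tables would be cut short, which is probably where the stray `</td></tr></table>` fragments above come from.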

UPDATE 2:

Thanks Martelli. What I need is:

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'

tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]

print(thetable)


用心笑 2024-09-12 06:56:16

You could first extract the table (similarly to the way you're now extracting the script but without the removal of course;-), obtaining a thetable string; then, you extract the script, replaceWith(thetable) instead of replaceWith(''). Alternatively, you could prepare a more elaborate parse action, but the simple two-phase approach looks more straightforward to me. E.g. (to preserve specifically the contents of the table, not the table tags):

from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'

tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]

removeText = replaceWith(thetable)
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)

print(data)

This prints beforebuhafter (what's outside the script tag, with the contents of the table tag sandwiched inside), hopefully "as desired".
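For comparison, here is a rough sketch of what the "more elaborate parse action" could look like: a single pass in which each script block is replaced by the contents of whatever tables it contains. The results names `content` and `body` and the function `keep_table_contents` are my own additions for illustration, not part of the answer above.

```python
from pyparsing import makeHTMLTags, SkipTo

tableOpen, tableClose = makeHTMLTags("table")
# Name the skipped text so it can be read as match.content
tableBody = tableOpen + SkipTo(tableClose)("content") + tableClose

scriptOpen, scriptClose = makeHTMLTags("script")
# Name the script's inner text so the parse action can read tokens.body
scriptBody = scriptOpen + SkipTo(scriptClose)("body") + scriptClose


def keep_table_contents(tokens):
    # Search the script body for tables and join their contents;
    # a script without any table is replaced by an empty string.
    return "".join(match.content for match in tableBody.searchString(tokens.body))


scriptBody.setParseAction(keep_table_contents)

data = 'before<script>ciao<table>buh</table>bye</script>after'
result = scriptBody.transformString(data)
print(result)
```

Unlike the two-phase version, this handles several script blocks independently, each getting its own table contents. On pyparsing 3 the same calls are also available under snake_case names such as `make_html_tags` and `transform_string`.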
