Remove text outside of a tag's contents
The opposite, removing a tag together with its contents, can be done with pyparsing as follows:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
removeText = replaceWith("")
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(removeText)
data = (scriptBody).transformString(data)
How could I keep the contents of the tag "table"?
UPDATE 0:
I tried:
# keep only the tables
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
f = replaceWith(tableBody)
tableBody.setParseAction(f)
data = (tableBody).transformString(data)
print data
and I get something like this...
garbages
<input type="hidden" name="cassstx" value="en_US:frontend"></form></td></tr></table></span></td></tr></table>
{<"table"> SkipTo:(</"table">) </"table">}
<div id="asbnav" style="padding-bottom: 10px;">{<"table"> SkipTo:(</"table">) </"table">}
</div>
even more garbages
UPDATE 2:
Thanks Martelli. What I need is:
from pyparsing import Suppress, replaceWith, makeHTMLTags, SkipTo
#...
data = 'before<script>ciao<table>buh</table>bye</script>after'
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose) + tableClose
thetable = (tableBody).searchString(data)[0][2]
print thetable
Comments (1)
You could first extract the table (similarly to the way you're now extracting the script, but without the removal of course ;-), obtaining a thetable string; then you extract the script and use replaceWith(thetable) instead of replaceWith('').

Alternatively, you could prepare a more elaborate parse action, but the simple two-phase approach looks more straightforward to me.

E.g., to preserve specifically the contents of the table, not the table tags: applied to the sample string 'before<script>ciao<table>buh</table>bye</script>after', the two-phase approach prints beforebuhafter (what's outside the script tag, with the contents of the table tag sandwiched inside), hopefully "as desired".
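The two-phase approach described above can be sketched roughly as follows. This is a reconstruction from the description, not necessarily the answerer's exact code. It assumes Python 3 (hence print() as a function) and a pyparsing version that still provides the camelCase names used in the question. The results name "body" and the variable name result are illustrative choices made here; the question itself picks the table contents out by position with [0][2].

```python
from pyparsing import SkipTo, makeHTMLTags, replaceWith

data = 'before<script>ciao<table>buh</table>bye</script>after'

# Phase 1: find the first <table>...</table> and keep only what is inside it.
# "body" is a results name chosen for this sketch, so that the contents can be
# looked up by name rather than by token position.
tableOpen, tableClose = makeHTMLTags("table")
tableBody = tableOpen + SkipTo(tableClose)("body") + tableClose
thetable = tableBody.searchString(data)[0]["body"]

# Phase 2: replace each whole <script>...</script> span with the table contents
# instead of replacing it with the empty string.
scriptOpen, scriptClose = makeHTMLTags("script")
scriptBody = scriptOpen + SkipTo(scriptClose) + scriptClose
scriptBody.setParseAction(replaceWith(thetable))
result = scriptBody.transformString(data)

print(result)
```

Note that this sketch only uses the first table it finds; a document with several script or table tags would need the lookup done per script match, which is where the more elaborate parse action mentioned above would come in.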