JavaScript-aware HTML parser for Python
<html>
<head>
<script type="text/javascript">
document.write('<a href="http://www.google.com">f*** js</a>');
document.write("f*** js!");
</script>
</head>
<body>
<script type="text/javascript">
document.write('<a href="http://www.google.com">f*** js</a>');
document.write("f*** js!");
</script>
<div><a href="http://www.google.com">f*** js</a></div>
</body>
</html>
I want to use XPath to select all the <a> tags in the HTML page above:
In [1]: import lxml.html as H
In [2]: f = open("test.html","r")
In [3]: c = f.read()
In [4]: doc = H.document_fromstring(c)
In [5]: doc.xpath('//a')
Out[5]: [<Element a at a01d17c>]
In [6]: a = doc.xpath('//a')[0]
In [7]: a.getparent()
Out[7]: <Element div at a01d41c>
I only get the one that is not generated by JavaScript, but Firefox's XPath Checker finds all of the tags!?
How can I do that? Thanks!
<html>
<head>
</head>
<body>
<script language="javascript">
function over(){
a.innerHTML="mouse me"
}
function out(){
a.innerHTML="<a href='http://www.google.com'>google</a>"
}
</script>
<li id="a" onmouseover="over()" onmouseout="out()">mouse me</li>
</body>
</html>
Comments (3)
I don't have a clue about a JavaScript-aware parser in Python, but you can use ANTLR to do the job. The idea is not mine, so I'm leaving you the link.
It's actually quite cool, because you can tune your parser to selectively choose which instructions need to be parsed (and executed).
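To make the "selectively choose which instructions to parse" idea concrete, here is a minimal sketch. It is not the ANTLR approach: it swaps in a plain regex that only recognises document.write() calls whose argument is a single string literal, then parses that literal as HTML. The names LinkCollector and find_links are made up for this example, and only the standard library is used instead of lxml.

```python
import re
from html.parser import HTMLParser

# Matches document.write('...') or document.write("...") with one plain string literal.
# Assumption: no concatenation, no variables, no escaped quotes inside the literal.
WRITE_CALL = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""", re.S)


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and the raw text of <script> blocks."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))
        elif tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text can arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def find_links(html):
    """Return hrefs from static markup, then from literal document.write() output."""
    page = LinkCollector()
    page.feed(html)
    page.close()
    hrefs = list(page.hrefs)
    for script in page.scripts:
        for match in WRITE_CALL.finditer(script):
            fragment = LinkCollector()
            fragment.feed(match.group(2))  # parse the written string as HTML
            fragment.close()
            hrefs.extend(fragment.hrefs)
    return hrefs
```

On the first page from the question, find_links should return three hrefs: the static one in the div plus one from each script. It will not see anything produced by event handlers or innerHTML assignments (as in the second page), because no JavaScript is actually executed.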
In Java there is Cobra. I don't know of any JavaScript-aware HTML parser for Python.
Searching Google for "javascript standalone runtime", I found jslibs: a "standalone JavaScript development runtime environment for using JavaScript as a general-purpose scripting language", based on the "SpiderMonkey library that is Gecko's JavaScript engine".
Sounds great! I haven't tested it yet, but it looks like it would let you run the JavaScript code you find in the page. I don't know how tricky that would be, though.
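A rough sketch of what "run the JavaScript you find in the page" could look like from Python. Big assumptions: it uses Node.js (`node -e`, `console.log`) as the external runtime rather than jslibs, which I have not tried, and it fakes only `document.write` with a tiny shim, so scripts that touch anything else in the DOM will fail. The names SHIM, build_program and run_scripts are invented for this example.

```python
import json
import shutil
import subprocess

# Shim: a fake `document` whose write() buffers its argument instead of touching a DOM.
SHIM = "var __out = []; var document = { write: function (s) { __out.push(String(s)); } };"


def build_program(scripts):
    """Wrap page scripts so everything passed to document.write is printed as a JSON list."""
    body = "\n".join(scripts)
    return SHIM + "\n" + body + "\nconsole.log(JSON.stringify(__out));"


def run_scripts(scripts, runtime="node"):
    """Run the scripts in an external runtime; return the written fragments, or None if it is not installed."""
    exe = shutil.which(runtime)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-e", build_program(scripts)],
        capture_output=True,
        text=True,
        timeout=10,
        check=True,
    )
    return json.loads(result.stdout)
```

The returned fragments are HTML strings that you could then feed to lxml and query with XPath. Keep in mind this executes page scripts on your machine with no sandbox, so only try it on pages you trust.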