A JavaScript-aware HTML parser for Python

Posted on 2024-10-09 12:48:58 · 1530 characters · 7 views · 0 comments

<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
</head>
<body>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
<div><a href="http://www.google.com">f*** js</a></div>
</body>
</html>

I want to use XPath to capture all the <a> tags in the HTML page above...

In [1]: import lxml.html as H

In [2]: f = open("test.html","r")

In [3]: c = f.read()

In [4]: doc = H.document_fromstring(c)

In [5]: doc.xpath('//a')
Out[5]: [<Element a at a01d17c>]

In [6]: a = doc.xpath('//a')[0]

In [7]: a.getparent()
Out[7]: <Element div at a01d41c>

I only get the one that is not generated by JavaScript, but Firefox's XPath checker finds all of the tags!?

https://i.sstatic.net/0hSug.png

How can I do that? Thanks!

<html>
<head>
</head>
<body>
<script language="javascript">
function over(){
a.innerHTML="mouse me"
}
function out(){
a.innerHTML="<a href='http://www.google.com'>google</a>"
}
</script>
<li id="a" onmouseover="over()" onmouseout="out()">mouse me</li>
</body>
</html>
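
Some context on why the lxml session in the question returns a single element: a plain HTML parser treats the body of a <script> element as opaque text and never executes it, so markup that would be produced by document.write does not exist as far as the parser is concerned. Below is a minimal sketch of that behaviour. It uses the standard library's html.parser instead of lxml, purely to avoid a third-party dependency; the names PAGE and AnchorCounter are made up for this illustration.

```python
from html.parser import HTMLParser

# The first test page from the question, embedded so the sketch is self-contained.
PAGE = """<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
</head>
<body>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
<div><a href="http://www.google.com">f*** js</a></div>
</body>
</html>"""


class AnchorCounter(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Script bodies are delivered as raw text, so this is never
        # called for the <a ...> inside the document.write strings.
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


counter = AnchorCounter()
counter.feed(PAGE)
counter.close()
static_hrefs = counter.hrefs
print(len(static_hrefs))  # only the static link inside <div> is found
```

A browser reports more links because it runs the scripts first and then evaluates XPath against the resulting DOM, not against the original source text.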
Comments (3)

最美的太阳 2024-10-16 12:48:58

I don't know of a JavaScript-aware parser for Python, but you could use ANTLR to do the job. The idea is not mine, so I'm leaving you the link.

It's actually quite neat, because you can tune your parser to choose selectively which instructions need to be parsed (and executed).
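
To illustrate the "selectively choose what to handle" idea without a real grammar, here is a rough sketch that only recognises document.write calls whose argument is a single plain string literal, and re-parses that literal as HTML. This is a regular-expression shortcut, not ANTLR and not a JavaScript interpreter: it will miss concatenated strings, escaped quotes, variables, innerHTML assignments and anything triggered by events. The names WRITE_CALL, LinkFinder and links_written_by are assumptions made up for this example.

```python
import re
from html.parser import HTMLParser

# One script body from the question's page.
SCRIPT = """
document.write('<a href="http://www.google.com">f*** js</a>');
document.write("f*** js!");
"""

# Matches document.write('...') or document.write("...") where the argument
# is one plain string literal. Anything more complex is deliberately ignored.
WRITE_CALL = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""", re.DOTALL)


class LinkFinder(HTMLParser):
    """Hypothetical helper: collects href values of <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def links_written_by(script_source):
    """Return hrefs of links found in literal document.write arguments."""
    finder = LinkFinder()
    for match in WRITE_CALL.finditer(script_source):
        # group(2) is the text between the quotes, i.e. the written HTML.
        finder.feed(match.group(2))
    finder.close()
    return finder.hrefs


written = links_written_by(SCRIPT)
print(written)
```

For pages that build their markup in less trivial ways, an approach like this breaks down quickly, and actually executing the scripts becomes the only reliable option.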

反差帅 2024-10-16 12:48:58

In Java there is Cobra. I don't know of any JavaScript-aware HTML parser for Python.

九八野马 2024-10-16 12:48:58

Searching Google for "javascript standalone runtime", I found jslibs: a "standalone JavaScript development runtime environment for using JavaScript as a general-purpose scripting language", based on the "SpiderMonkey library that is Gecko's JavaScript engine".

Sounds great! I haven't tested it yet, but it looks like this would let you run the JavaScript code you find in the page. I don't know how tricky that would be, though.
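
Whatever runtime ends up executing the code, the first step is getting the inline scripts out of the page. A minimal sketch of that step only, using the standard library; the ScriptCollector class and the shortened PAGE constant are made up for this example, and handing the collected source to an external JavaScript engine is left out because I have not tried it.

```python
from html.parser import HTMLParser

# A shortened variant of the question's page, for illustration only.
PAGE = """<html><head>
<script type="text/javascript">
document.write("f*** js!");
</script>
</head><body>
<script type="text/javascript">
document.write('<a href="http://www.google.com">f*** js</a>');
</script>
<div><a href="http://www.google.com">f*** js</a></div>
</body></html>"""


class ScriptCollector(HTMLParser):
    """Hypothetical helper: gathers the source text of inline <script> blocks."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self._chunks = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script bodies may arrive in several pieces, so buffer them.
        if self._in_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._chunks).strip())
            self._in_script = False


collector = ScriptCollector()
collector.feed(PAGE)
collector.close()
scripts = collector.scripts
for source in scripts:
    print(source)
```

Note that this only covers inline scripts; scripts referenced through a src attribute would have to be downloaded separately, and the runtime would still need some kind of document object for document.write to work against.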
