Extracting data from JavaScript (Python scraper)
I'm currently using a combination of urllib2, pyquery, and json to scrape a site, and now I find that I need to extract some data from JavaScript. One thought would be to use a JavaScript engine (like V8), but that seems like overkill for what I need. I would use regular expressions, but the expression for this seems way too complex.
JavaScript:
(function(){DOM.appendContent(this, HTML("<html>"));;})
I need to extract the <html>, but I'm not entirely sure how to do so. The <html> itself can contain basically every character under the sun, so [^"] won't work.
Any thoughts?
Comments (2)
Why regex? Can't you just use two substrings, since you know how many characters you want to trim off the beginning and end?
As well as being quicker than a regex, it then doesn't matter whether quotes inside <html> are escaped or not.
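The substring approach can be sketched in Python roughly as follows. This is a minimal sketch, assuming the wrapper around the HTML is always exactly the text shown in the question; the names PREFIX, SUFFIX, and extract_html_literal are hypothetical and not part of any library.

```python
# Assumed fixed wrapper text, copied from the sample snippet in the question.
PREFIX = '(function(){DOM.appendContent(this, HTML("'
SUFFIX = '"));;})'


def extract_html_literal(js_source):
    """Return the raw text between the assumed prefix and suffix."""
    js_source = js_source.strip()
    long_enough = len(js_source) >= len(PREFIX) + len(SUFFIX)
    if not (long_enough
            and js_source.startswith(PREFIX)
            and js_source.endswith(SUFFIX)):
        raise ValueError("unexpected JavaScript wrapper")
    # Slice off the known number of characters at both ends.
    return js_source[len(PREFIX):len(js_source) - len(SUFFIX)]


sample = '(function(){DOM.appendContent(this, HTML("<p class=\\"x\\">hi</p>"));;})'
raw = extract_html_literal(sample)
```

The result is still the body of a JavaScript string literal, so any escape sequences such as \" remain in it and may need decoding afterwards.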
If every occurrence of " inside the HTML code is escaped as \" (it is a JavaScript string after all), you could use a regex to get the parameter to HTML into the first capturing group.
Note that the regex is not yet escaped to be a JavaScript string itself.
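One possible shape for such a regex, written for Python's re module, is sketched below. The pattern is an illustration and not necessarily the one the answer had in mind; HTML_ARG and extract_html are hypothetical names. Decoding the captured text with json.loads is an extra assumption that only holds when the literal uses JSON-compatible escapes (\", \\, \n, \uXXXX and so on).

```python
import json
import re

# Hypothetical pattern for a double-quoted string literal passed to HTML(...).
# Inside the quotes it accepts either a character that is neither a quote
# nor a backslash, or a backslash followed by any single character.
HTML_ARG = re.compile(r'HTML\("((?:[^"\\]|\\.)*)"\)')


def extract_html(js_source):
    """Return the decoded HTML argument, or None if no HTML("...") call is found."""
    match = HTML_ARG.search(js_source)
    if match is None:
        return None
    raw = match.group(1)  # still contains escapes such as \"
    # Assumption: the literal only uses escapes that JSON also understands.
    return json.loads('"' + raw + '"')


sample = r'(function(){DOM.appendContent(this, HTML("<a href=\"/x\">say \"hi\"</a>"));;})'
html = extract_html(sample)
```

Here json.loads raises ValueError for JavaScript-only escapes such as \x41 or \', so a real scraper may need a dedicated unescaping step for those cases.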