Ruby, Nokogiri, and RestClient: scraping a JavaScript variable
I'm using RestClient and Nokogiri to parse some HTML, which works great, but one piece of information is stored in a JS (jQuery) variable that I need to return, and I'm not sure how to parse it. I can use Nokogiri to parse the JavaScript block, but I only need a subset of it, which is probably simple, but I'm not sure how to do it. I could probably regex it, but I'm assuming there's an easier way to just ask for it using JS.
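require 'rest-client'
require 'nokogiri'

# Fetch the page and print the content of every <script> element.
@resource = RestClient.get 'http://example.com'
doc = Nokogiri::HTML(@resource)
doc.css('script').each do |script|
  puts script.content
end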
What I'm trying to get:
<script type="text/javascript">
$(function(){
//this is it
$.Somenamespace.theCurrency = 'EUR';
//a lot more stuff
Comments (2)
Not sure if that fits, but you could retrieve it as follows:
irb(main):017:0>
irb(main):018:0>
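The irb commands from this answer did not survive in the text above. As a rough sketch of one way to do it (not necessarily what the answerer typed), assuming the script text has already been pulled out with script.content as in the question, a regular expression can capture the quoted value. The helper name extract_currency and the sample string are made up for illustration.

```ruby
# Hypothetical helper: pull the value assigned to $.Somenamespace.theCurrency
# out of a chunk of JavaScript source. Returns nil when no assignment is found.
def extract_currency(js_source)
  # Group 1 is the opening quote, group 2 the value, \1 the matching close quote.
  match = js_source.match(/\$\.Somenamespace\.theCurrency\s*=\s*(['"])(.*?)\1/)
  match && match[2]
end

# Sample input standing in for script.content from the question.
sample_js = <<~'JS'
  $(function(){
  //this is it
  $.Somenamespace.theCurrency = 'EUR';
  //a lot more stuff
  });
JS

puts extract_currency(sample_js)
```

A regex like this is brittle if the page changes how the assignment is written, so it is worth handling the nil case when no match is found.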
Nokogiri is an XML and HTML parser. It doesn't parse the CDATA or text content of nodes, but it can give you the content, letting you use string parsing or regex to get at the data you want.
In the case of JavaScript, if it's embedded in the page then you can get the text of the parent node. Often that is simple if there is the usual <script> tag in the <head> block of the page. If there are multiple script tags you have to extend the accessor to retrieve the right node, then process away.
It gets more exciting when the scripts are loaded dynamically, but you can still get the data by parsing the URL from the script's src attribute, then retrieving it, and processing away again.
Sometimes JavaScript is embedded in the links of other tags, but it's just another spin on the previous two methods to get the script and process it.