Extracting some JSON using Nokogiri

Posted on 2024-12-18 07:38:29

require 'open-uri'
require 'json'
require 'nokogiri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://www.highcharts.com/demo/"))

puts doc

But I want to extract the JSON from this web page. Regular expressions don't seem to work, so how can I extract the JSON with XPath?



溺ぐ爱和你が 2024-12-25 07:38:29

Here's how to access the inline script tags (those that don't reference an external file) on a page fetched from a URL:

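require 'open-uri'
require 'nokogiri'

# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri.HTML(URI.open('http://www.highcharts.com/demo/'))

# Select only <script> elements without a src attribute, i.e. inline scripts
inline_script = doc.xpath('//script[not(@src)]')
inline_script.each do |script|
  puts "-"*50, script.text
end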

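Now you just need to find the script block you want and extract the data from it (using a regex). Without more details, it's hard to guess what you want and what you can rely on.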

Here's a fairly fragile regex that finds what I'm guessing you were looking for:

inline = doc.xpath('//script[not(@src)]').map(&:text)
data   = inline.map{ |js| js[/new Highcharts\.Chart\((.+?\})\);/m,1] }.compact[0]
puts data
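
The non-greedy pattern above stops at the first `});` it meets, so a config that contains a nested callback would be cut short. As a rough alternative, the sketch below walks the text and counts braces instead. `extract_object_literal` and `sample_js` are made-up names for illustration, not part of Nokogiri or Highcharts, and the scanner ignores JavaScript comments and regex literals, so treat it as a starting point only.

```ruby
# Hypothetical helper: returns the balanced {...} literal that follows `marker`,
# or nil when the marker or a matching closing brace is not found.
def extract_object_literal(js, marker)
  start = js.index(marker)
  return nil unless start

  open_at = js.index('{', start + marker.length)
  return nil unless open_at

  depth = 0
  quote = nil      # quote character of the string literal we are inside, if any
  escaped = false  # true right after a backslash inside a string literal

  (open_at...js.length).each do |i|
    ch = js[i]
    if quote
      if escaped
        escaped = false
      elsif ch == '\\'
        escaped = true
      elsif ch == quote
        quote = nil
      end
    elsif ch == '"' || ch == "'"
      quote = ch
    elsif ch == '{'
      depth += 1
    elsif ch == '}'
      depth -= 1
      return js[open_at..i] if depth.zero?
    end
  end
  nil
end

# Assumed stand-in for one inline script's text, so the sketch runs offline.
sample_js = <<~JS
  chart = new Highcharts.Chart({
    title: { text: 'Braces } in strings' },
    tooltip: { formatter: function() { done(function() { return 1; }); return this.x; } }
  });
JS

literal = extract_object_literal(sample_js, 'new Highcharts.Chart(')
puts literal
```

On the real page you would pass each inline script's text in place of `sample_js` and keep the first non-nil result.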

Here's what that regex extracts:

{
  chart: {
    renderTo: 'container',
    defaultSeriesType: 'line',
    marginRight: 130,
    marginBottom: 25
  },
  title: {
    text: 'Monthly Average Temperature',
    x: -20 //center
  },
  subtitle: {
    text: 'Source: WorldClimate.com',
    x: -20
  },
  xAxis: {
    categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
      'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
  },
  yAxis: {
    title: {
      text: 'Temperature (°C)'
    },
    plotLines: [{
      value: 0,
      width: 1,
      color: '#808080'
    }]
  },
  tooltip: {
    formatter: function() {
                return '<b>'+ this.series.name +'</b><br/>'+
        this.x +': '+ this.y +'°C';
    }
  },
  legend: {
    layout: 'vertical',
    align: 'right',
    verticalAlign: 'top',
    x: -10,
    y: 100,
    borderWidth: 0
  },
  series: [{
    name: 'Tokyo',
    data: [7.0, 6.9, 9.5, 14.5, 18.2, 21.5, 25.2, 26.5, 23.3, 18.3, 13.9, 9.6]
  }, {
    name: 'New York',
    data: [-0.2, 0.8, 5.7, 11.3, 17.0, 22.0, 24.8, 24.1, 20.1, 14.1, 8.6, 2.5]
  }, {
    name: 'Berlin',
    data: [-0.9, 0.6, 3.5, 8.4, 13.5, 17.0, 18.6, 17.9, 14.3, 9.0, 3.9, 1.0]
  }, {
    name: 'London',
    data: [3.9, 4.2, 5.7, 8.5, 11.9, 15.2, 17.0, 16.6, 14.2, 10.3, 6.6, 4.8]
  }]
}

Note that this is not JSON; it is a string of JavaScript code containing object, string, array, numeric, and function literals.
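
Because the whole literal cannot go through a JSON parser, one option is to pull out only the parts that happen to be valid JSON, such as the numeric `data` arrays. The sketch below is a minimal illustration under assumptions: the `data` string is a shortened stand-in for the extracted text, and the pattern assumes each series lists `name` immediately before `data` with single-quoted names, as in the output above.

```ruby
require 'json'

# Assumption: a shortened copy of the extracted JavaScript literal, used here
# so the sketch runs without network access.
data = <<~JS
  {
    series: [{
      name: 'Tokyo',
      data: [7.0, 6.9, 9.5]
    }, {
      name: 'New York',
      data: [-0.2, 0.8, 5.7]
    }]
  }
JS

# Pair each series name with its numeric array. The arrays contain only
# numbers, so each one is valid JSON on its own and JSON.parse can read it.
series = data.scan(/name:\s*'([^']*)',\s*data:\s*(\[[^\]]*\])/).map do |name, values|
  { 'name' => name, 'data' => JSON.parse(values) }
end

puts JSON.pretty_generate(series)
```

Calling `JSON.parse(data)` on the full string raises `JSON::ParserError`, since the keys are unquoted and the strings use single quotes.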


待天淡蓝洁白时 2024-12-25 07:38:29

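require 'open-uri'
require 'json'
# Works only if the URL returns a pure JSON document; an HTML page such as
# this demo page raises JSON::ParserError.
doc = JSON.parse(URI.open("http://www.highcharts.com/demo/").read)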


About the author

心在旅行
