使用 YQL 提取 HTML 内容？

发布于 2024-12-12 02:54:27 字数 729 浏览 0 评论 0原文

假设我想从带有以下标记的网页中提取数据：

<table>
  <tr>
    <td><a href="Link 1">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  <tr>
    <td><a href="Link 2">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  ...
</table>

为 JSON 格式：

[
  {
    link: 'Link 1',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  },
  {
    link: 'Link 2',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  }
]

我们可以使用 YQL 来实现吗？如果是，请给我一个示例查询。

任何帮助将不胜感激！

原文

Let say I want to extract data from a web page with the following markup:

<table>
  <tr>
    <td><a href="Link 1">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  <tr>
    <td><a href="Link 2">Column 1 Text</a></td>
    <td>Column 2 Text</td>
    <td>Column 3 Text</td>
  </tr>
  ...
</table>

to JSON format :

[
  {
    link: 'Link 1',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  },
  {
    link: 'Link 2',
    text: 'Column 1 Text',
    data: 'Column 3 Text'
  }
]

Can we make it with YQL? If yes then please give me an example query.

Any helps would be appreciated!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

情话难免假 2024-12-19 02:54:27

这是一个很好的起点，它使用 HTML 表以及一些 XPath 查询（请参阅使用 XPath 提取 HTML 内容了解有关此技术的更多详细信息）：

select * from html where url="http://cantoni.org/test/table.html" and xpath='//table/tr'

其中生成如下 JSON 结果：

{
 "query": {
  "count": 2,
  "created": "2012-01-06T20:16:46Z",
  "lang": "en-US",
  "results": {
   "tr": [
    {
     "td": [
      {
       "a": {
        "href": "Link%201",
        "content": "Column 1 Text"
       }
      },
      {
       "p": "Column 2 Text"
      },
      {
       "p": "Column 3 Text"
      }
     ]
    },
    {
     "td": [
      {
       "a": {
        "href": "Link%202",
        "content": "Column 1 Text"
       }
      },
      {
       "p": "Column 2 Text"
      },
      {
       "p": "Column 3 Text"
      }
     ]
    }
   ]
  }
 }
}

Here's a query that's a good starting point, using the HTML table along with some XPath query (see Extracting HTML Content With XPath for more details on this technique):

select * from html where url="http://cantoni.org/test/table.html" and xpath='//table/tr'

Which produces JSON results like this:

{
 "query": {
  "count": 2,
  "created": "2012-01-06T20:16:46Z",
  "lang": "en-US",
  "results": {
   "tr": [
    {
     "td": [
      {
       "a": {
        "href": "Link%201",
        "content": "Column 1 Text"
       }
      },
      {
       "p": "Column 2 Text"
      },
      {
       "p": "Column 3 Text"
      }
     ]
    },
    {
     "td": [
      {
       "a": {
        "href": "Link%202",
        "content": "Column 1 Text"
       }
      },
      {
       "p": "Column 2 Text"
      },
      {
       "p": "Column 3 Text"
      }
     ]
    }
   ]
  }
 }
}

回复收藏 0 原文

~没有更多了~