有没有办法让YQL返回HTML？

发布于 2024-08-28 01:15:43 字数 683 浏览 12 评论 0原文

我正在尝试使用 YQL 从一系列网页中提取 HTML 的一部分。页面本身的结构略有不同（因此 Yahoo Pipes“获取页面”及其“剪切内容”功能效果不佳），但我感兴趣的片段始终具有相同的 class 属性。

如果我有一个像这样的 HTML 页面：

<html>
  <body>
    <div class="foo">
      <p>Wolf</p>
      <ul>
        <li>Dog</li>
        <li>Cat</li>
      </ul>
    </div>
  </body>
</html>

并使用像这样的 YQL 表达式：

SELECT * FROM html 
WHERE url="http://example.com/containing-the-fragment-above" 
AND xpath="//div[@class='foo']"

我返回的是（显然是无序的？）DOM 元素，其中我想要的是 HTML 内容本身。我也尝试过SELECT content，但这只选择文本内容。我想要 HTML。这可能吗？

原文

I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structure (so a Yahoo Pipes "Fetch Page" with its "Cut content" feature does not work well) but the fragment I am interested in always has the same class attribute.

If I have an HTML page like this:

<html>
  <body>
    <div class="foo">
      <p>Wolf</p>
      <ul>
        <li>Dog</li>
        <li>Cat</li>
      </ul>
    </div>
  </body>
</html>

and use a YQL expression like this:

SELECT * FROM html 
WHERE url="http://example.com/containing-the-fragment-above" 
AND xpath="//div[@class='foo']"

what I get back are the (apparently unordered?) DOM elements, where what I want is the HTML content itself. I've tried SELECT content as well, but that only selects textual content. I want HTML. Is this possible?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

剧终人散尽 2024-09-04 01:15:43

您可以编写一些 Open Data Table 来发送正常的 YQL html 表查询和 < em>将结果字符串化。如下所示：

<?xml version="1.0" encoding="UTF-8" ?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <sampleQuery>select * from {table} where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'</sampleQuery>
    <description>Retrieve HTML document fragments</description>
    <author>Peter Cowburn</author>
  </meta>
  <bindings>
    <select itemPath="result.html" produces="JSON">
      <inputs>
        <key id="url" type="xs:string" paramType="variable" required="true"/>
        <key id="xpath" type="xs:string" paramType="variable" required="true"/>
      </inputs>
      <execute><![CDATA[
var results = y.query("select * from html where url=@url and xpath=@xpath", {url:url, xpath:xpath}).results.*;
var html_strings = [];
for each (var item in results) html_strings.push(item.toXMLString());
response.object = {html: html_strings};
]]></execute>
    </select>
  </bindings>
</table>

然后，您可以使用 YQL 查询对该自定义表进行查询，例如：

use "http://url.to/your/datatable.xml" as html.tostring;
select * from html.tostring where 
  url="http://finance.yahoo.com/q?s=yhoo" 
  and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li'

编辑： 刚刚意识到这是一个非常老的问题，已经被提出了；对于任何遇到这个问题的人来说，至少最终会有一个答案。 :)

You could write a little Open Data Table to send out a normal YQL html table query and stringify the result. Something like the following:

<?xml version="1.0" encoding="UTF-8" ?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <sampleQuery>select * from {table} where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'</sampleQuery>
    <description>Retrieve HTML document fragments</description>
    <author>Peter Cowburn</author>
  </meta>
  <bindings>
    <select itemPath="result.html" produces="JSON">
      <inputs>
        <key id="url" type="xs:string" paramType="variable" required="true"/>
        <key id="xpath" type="xs:string" paramType="variable" required="true"/>
      </inputs>
      <execute><![CDATA[
var results = y.query("select * from html where url=@url and xpath=@xpath", {url:url, xpath:xpath}).results.*;
var html_strings = [];
for each (var item in results) html_strings.push(item.toXMLString());
response.object = {html: html_strings};
]]></execute>
    </select>
  </bindings>
</table>

You could then query against that custom table with a YQL query like:

use "http://url.to/your/datatable.xml" as html.tostring;
select * from html.tostring where 
  url="http://finance.yahoo.com/q?s=yhoo" 
  and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li'

Edit: Just realised this is a pretty old question that was bumped; at least an answer is here, eventually, for anyone stumbling on the question. :)

回复收藏 0 原文