Is there a way to get YQL to return HTML?
I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structures (so a Yahoo Pipes "Fetch Page" with its "Cut content" feature does not work well), but the fragment I am interested in always has the same class attribute.
If I have an HTML page like this:
<html>
<body>
<div class="foo">
<p>Wolf</p>
<ul>
<li>Dog</li>
<li>Cat</li>
</ul>
</div>
</body>
</html>
and use a YQL expression like this:
SELECT * FROM html
WHERE url="http://example.com/containing-the-fragment-above"
AND xpath="//div[@class='foo']"
what I get back are the (apparently unordered?) DOM elements, whereas what I want is the HTML content itself. I've tried SELECT content as well, but that only selects textual content. I want HTML. Is this possible?
Comments (3)
You could write a little Open Data Table that sends out a normal YQL html table query and stringifies the result. You could then query against that custom table with an ordinary YQL query.
Edit: Just realised this is a pretty old question that was bumped; at least an answer is here, eventually, for anyone stumbling on the question. :)
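The select-then-stringify idea in this answer can be sketched outside YQL. The following is only an illustration in Python using the standard library's ElementTree, not the Open Data Table the answer refers to; the names PAGE and fragments_as_markup are made up for the example, and it assumes the page is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

# The sample page from the question (assumed to be well-formed XML).
PAGE = """<html>
<body>
<div class="foo">
<p>Wolf</p>
<ul>
<li>Dog</li>
<li>Cat</li>
</ul>
</div>
</body>
</html>"""


def fragments_as_markup(page: str, class_name: str) -> list[str]:
    """Find each <div> whose class equals class_name and return it as a markup string.

    Hypothetical helper: it mirrors "run the XPath, then stringify the nodes".
    """
    root = ET.fromstring(page)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    matches = root.findall(f".//div[@class='{class_name}']")
    # tostring() also emits the whitespace that follows the element, hence strip().
    return [ET.tostring(node, encoding="unicode").strip() for node in matches]


if __name__ == "__main__":
    for fragment in fragments_as_markup(PAGE, "foo"):
        print(fragment)
```

The string that comes back is a serialization of the parsed nodes, so child order is preserved, but it is not guaranteed to match the source byte for byte.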
I had this exact same problem. The only way I have gotten around it is to avoid YQL and just use regular expressions to match the start and end tags :/. Not the best solution, but if the HTML is relatively unchanging, and the pattern just runs from, say, <div class='name'> to <div class='just_after'>, then you can get away with that. Then you can get the HTML between.
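A rough sketch of that regular-expression workaround, in Python. The sample page and the helper name html_between are invented for illustration, and the approach only holds up under the answer's own assumption that the two marker tags appear literally and do not change.

```python
import re

# Hypothetical page: the wanted fragment sits between two known marker tags.
SAMPLE = (
    "<body><div class='name'><p>Wolf</p><ul><li>Dog</li><li>Cat</li></ul></div>"
    "<div class='just_after'>other content</div></body>"
)

START_TAG = "<div class='name'>"
END_MARKER = "<div class='just_after'"


def html_between(page, start, end):
    """Return the raw text between the literal start and end markers, or None."""
    # re.escape treats the markers as literal text; DOTALL lets "." span newlines.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
    match = pattern.search(page)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(html_between(SAMPLE, START_TAG, END_MARKER))
```

Because the end marker is the following sibling element, the captured text still ends with the closing </div> of the first element, and any change to quoting or attribute order in the markers makes the match fail.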
YQL converts the page into XML, then does your XPath on it, then takes the DOMNodeList and serializes that back to XML for your output (and then converts to JSON if needed). You can't access the original data.
Why can't you deal with XML instead of HTML?
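To illustrate why the original markup is not recoverable after such a parse-and-serialize round trip, here is a small Python sketch. It uses the standard library's ElementTree as a stand-in and does not claim to reproduce YQL's actual serializer; SOURCE and round_trip are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical source fragment: single-quoted attribute and a compact empty tag.
SOURCE = "<div class='foo'><p>Wolf</p><br/></div>"


def round_trip(markup: str) -> str:
    """Parse markup into a tree, then serialize the tree back to a string."""
    return ET.tostring(ET.fromstring(markup), encoding="unicode")


if __name__ == "__main__":
    print("source:    ", SOURCE)
    print("round trip:", round_trip(SOURCE))
```

The two printed lines describe the same tree, but details such as attribute quoting come from the serializer rather than from the page, which is the sense in which the original data is no longer accessible.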