YQL and cURL - quote characters not returned correctly

Posted 2024-09-06 11:55:47

I am using YQL for some screen scraping, and any quote-like characters are not being returned properly.

For example, markup on the page being scraped is:

There should not be a “split between what we think and what we do,”  

This is returned by YQL as:

There should not be a �split between what we think and what we do,� 

This also happens with ticks and back-ticks.

My JS is like:

var qurlString = '&url=' + encodeURIComponent(url);
$.ajax({
  type: "POST",
  url: "/k_sys/qurl.php",
  dataType: "xml",
  data: qurlString,
  success: function(data) {
    //do something
  }
});

And my qurl.php is like:

  $BASE_URL = "http://query.yahooapis.com/v1/public/yql";
  // URL of the page to scrape, taken from the "url" field posted by the JS above
  $url = $_POST['url'];
  $yql_query = "select * from html where url='$url'";
  $yql_query_url = $BASE_URL . "?q=" . urlencode($yql_query) . "&format=xml";
  $session = curl_init($yql_query_url);
  // Return the response as a string instead of printing it directly
  curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
  $xml = curl_exec($session);
  curl_close($session);
  echo $xml;

Is this a cURL issue or a YQL issue, and what do I need to do to fix it?

Thanks!

Comments (2)

冰之心 2024-09-13 11:55:47

This sounds like a character encoding issue. The site you are scraping may be declaring its character set with a meta tag in the head element instead of having the server identify the encoding in the HTTP Content-Type header. Find out which character encoding the site uses (you should be able to find this in your browser's view menu) and add the charset key to your YQL query.

Example from the YQL guide:

select * from html where url='http://example.com' and charset='iso-8859-1'
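
As a rough sketch, the charset key could be folded into the request URL like this. The helper name `buildYqlUrl` is made up for illustration, and `iso-8859-1` is only a placeholder: pass whatever encoding the scraped page actually uses. The same string building can be done in qurl.php.

```javascript
// Base endpoint taken from the question.
const BASE_URL = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: builds a YQL request URL with an explicit charset key.
function buildYqlUrl(pageUrl, charset) {
  // Percent-encode single quotes so the page URL cannot end the YQL string literal early.
  const safeUrl = pageUrl.replace(/'/g, "%27");
  const query =
    "select * from html where url='" + safeUrl + "' and charset='" + charset + "'";
  return BASE_URL + "?q=" + encodeURIComponent(query) + "&format=xml";
}

console.log(buildYqlUrl("http://example.com", "iso-8859-1"));
```
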
合约呢 2024-09-13 11:55:47

The source pages are served by IIS and ASP. I ended up having to do a simple search and replace like this:

str_ireplace(chr(145), chr(39), $html)
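
For context, chr(145) is byte 0x91, which is the left single curly quote in Windows-1252, so the replace above handles only that one character. Below is a sketch of the same idea extended to all four curly quotes, written in JavaScript over raw bytes. It assumes the scraped bytes really are Windows-1252, which the thread does not confirm, and `straightenQuotes` is a made-up name.

```javascript
// Assumption: input bytes are Windows-1252, where 0x91-0x94 are the curly quotes.
const CP1252_QUOTES = {
  0x91: 0x27, // left single quote  -> '
  0x92: 0x27, // right single quote -> '
  0x93: 0x22, // left double quote  -> "
  0x94: 0x22, // right double quote -> "
};

// Hypothetical helper: returns a copy of the bytes with curly quotes
// replaced by their plain ASCII equivalents; all other bytes pass through.
function straightenQuotes(bytes) {
  return Uint8Array.from(bytes, (b) => (b in CP1252_QUOTES ? CP1252_QUOTES[b] : b));
}

// The word hi wrapped in Windows-1252 curly double quotes.
const input = Uint8Array.from([0x93, 0x68, 0x69, 0x94]);
const output = straightenQuotes(input);
console.log(String.fromCharCode(...output)); // prints "hi" with straight quotes
```

Replacing bytes like this loses the typographic quotes; converting the whole response from its real encoding to UTF-8 would preserve them instead.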