CFHTTP encoding problem

Posted on 2024-09-03 20:09:17

I am trying to fetch a page with cfhttp so that I can parse information out of it. The response headers of the page I am calling are:

Content-Encoding: gzip
Connection: Keep-Alive
Content-Length: 19066
Server: IBM_HTTP_Server
Vary: Accept-Encoding, User-Agent
Content-Language: en-US
Cache-Control: no-cache="set-cookie, set-cookie2"
Content-Type: text/html;charset=ISO-8859-1

I set the charset to ISO-8859-1, but I am getting the following in FileContent (only a small sample is shown below, but I think it gets the point across).

EðÑq·Oã?·Ì\ZóL¯þ´Vú5ðbä£ÿæ¾_HÉÒñQãO\Çþãë85ÁÜ
à±°ùÖ}&bßý?,u?2SùQyk5g?UÛ3Ѹfã×ARÃi_iûRã
_ òCA¿-ß."b /¯ßíWÝÆ´}w~,°iøÜCáÇþ@ÃZ5¤ïsÁ8½°ì*
ZÜéjOÝK/Ë4§ÈG5×ä*¬6ÚwÇ0]ã:àÑþé¬G"ÅÁl/t°
jlá»5¶&¯lìYìºØ'yDð½|#ý<ñìTé%¾ï¬ùƪx¶}«±o9»ë¼ÂÆÒï'w8Y?
÷ðxsllû
6íqüGÞsÜóÀx·ªk®XºàåZ{íÁ½åo÷mbq¥ÝÃ8M

I tried other charsets and suspect that the gzip encoding is causing the problem, but I am unsure how to test whether that is the issue. Any suggestions or help would be greatly appreciated.

Below is my code:

<cfhttp 
    METHOD="get"
    throwonerror="yes" 
    CHARSET="ISO-8859-1"
    URL="http://www.cars.com/for-sale/searchresults.action?sf1Dir=DESC&prMn=1&crSrtFlds=stkTypId-feedSegId-pseudoPrice&rd=100000&zc=44203&PMmt=0-0-0&stkTypId=28881&sf2Dir=ASC&sf1Nm=price&sf2Nm=miles&feedSegId=28705&searchSource=UTILITY&pgId=2102&rpp=10">

    <cfhttpparam type="Header" name="Accept-Encoding" value="deflate;q=0">
    <cfhttpparam type= "Header" name= "TE" value= "deflate;q=0" >
</cfhttp>

<cfset listings = #cfhttp.FileContent#>
<cfoutput>
    #listings#
</cfoutput>

I have also tried the headers:

    <cfhttpparam type="Header" name="Accept-Encoding" value="*">
    <cfhttpparam type= "Header" name= "TE" value= "deflate;q=0" >

And tried removing the 'Accept-Encoding' header and just leaving the TE.

UPDATE:
I still haven't figured it out, but I found something that might help someone help me. When I ran file_get_contents against the same page from a test PHP server of mine, it worked fine. When I then pointed the same cfhttp code at that PHP page (which in turn fetches the page I need), it also worked fine. Thanks for the suggestions so far.


Comments (3)

暮光沉寂 2024-09-10 20:09:17

The issue with cars.com seems to be that they're gzipping the output twice (based on this thread)

So, we need to unzip the content... again...

First, we need to get the content as binary, so the CFHTTP call needs to include

getasbinary="yes"

Then, we need to unzip it.

We can use java.util.zip to do it. The gunzip is a modified version of this cflib.org function:

<cfhttp
    getasbinary="yes"
    METHOD="get"
    throwonerror="yes"
    CHARSET="ISO-8859-1"
    URL="http://www.cars.com/for-sale/searchresults.action?sf1Dir=DESC&prMn=1&crSrtFlds=stkTypId-feedSegId-pseudoPrice&rd=100000&zc=44203&PMmt=0-0-0&stkTypId=28881&sf2Dir=ASC&sf1Nm=price&sf2Nm=miles&feedSegId=28705&searchSource=UTILITY&pgId=2102&rpp=10" >

    <cfhttpparam type="Header" name="Accept" value="application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5">
    <cfhttpparam type="Header" name="User-Agent" value="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.41">
    <cfhttpparam type="Header" name="Accept-Encoding" value="deflate">
    <cfhttpparam type="Header" name="TE" value="deflate, chunked, identity, trailers">

</cfhttp>

<cfset unzippedHTML = gunzip(cfhttp.FileContent)>

<cfoutput>
    #unzippedHTML#
</cfoutput>

<cfscript>

    // Decompress a gzip-compressed byte array and return the result as a string.
    // Returns an empty string if the input cannot be decompressed.
    function gunzip(inBytes) {
        var gzInStream = createObject('java','java.util.zip.GZIPInputStream');
        var outStream = createObject('java','java.io.ByteArrayOutputStream');
        var inStream = createObject('java','java.io.ByteArrayInputStream');
        var buffer = repeatString(" ",1024).getBytes();
        var length = 0;
        var rv = "";

        try {
            inStream.init(inBytes);
            gzInStream.init(inStream);
            outStream.init();
            // read() returns -1 once the end of the compressed stream is reached
            do {
                length = gzInStream.read(buffer,0,1024);
                if (length neq -1) outStream.write(buffer,0,length);
            } while (length neq -1);
            // decode with the charset the server declared instead of the platform default
            rv = outStream.toString("ISO-8859-1");
        }
        catch (any e) {
            rv = "";
        }
        // close each stream on its own so that one failure does not skip the others
        try { outStream.close(); } catch (any e) {}
        try { gzInStream.close(); } catch (any e) {}
        try { inStream.close(); } catch (any e) {}
        return rv;
    }
</cfscript>

Be sure to double-check the var scoping of the function. I might have missed something.
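
If the response might be compressed more than once, one option is to keep decompressing for as long as the bytes still begin with the gzip magic number (0x1F 0x8B). The following is only a sketch under assumptions: it expects the body as a byte array (as returned with getasbinary="yes"), and isGzipped, gunzipBytes, and gunzipAll are hypothetical helper names of mine, not part of ColdFusion or of the cflib.org function. The layer limit is an arbitrary safeguard against looping on unexpected data.

```cfml
<cfscript>
    // Hypothetical helper: true when the data starts with the gzip magic number 0x1F 0x8B.
    function isGzipped(bytes) {
        if (not isBinary(bytes)) return false;
        if (arrayLen(bytes) lt 2) return false;
        // Java bytes are signed, so mask each one to 0-255 before comparing.
        return (bitAnd(bytes[1], 255) eq 31) and (bitAnd(bytes[2], 255) eq 139);
    }

    // Hypothetical helper: removes one gzip layer and returns the raw bytes.
    // Errors are left to propagate so the caller can see why decompression failed.
    function gunzipBytes(inBytes) {
        var inStream = createObject("java","java.io.ByteArrayInputStream").init(inBytes);
        var gzInStream = createObject("java","java.util.zip.GZIPInputStream").init(inStream);
        var outStream = createObject("java","java.io.ByteArrayOutputStream").init();
        var buffer = repeatString(" ",1024).getBytes();
        var length = gzInStream.read(buffer,0,1024);

        while (length neq -1) {
            outStream.write(buffer,0,length);
            length = gzInStream.read(buffer,0,1024);
        }
        gzInStream.close();
        return outStream.toByteArray();
    }

    // Hypothetical helper: strips up to maxLayers gzip layers, then decodes the text.
    function gunzipAll(inBytes, maxLayers, charset) {
        var bytes = inBytes;
        var layers = 0;

        while (layers lt maxLayers and isGzipped(bytes)) {
            bytes = gunzipBytes(bytes);
            layers = layers + 1;
        }
        return charsetEncode(bytes, charset);
    }
</cfscript>
```

With the cfhttp call above, this could be used as `<cfset unzippedHTML = gunzipAll(cfhttp.FileContent, 3, "ISO-8859-1")>`. Checking the magic number first means the same call also works if the server later stops compressing twice.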

无远思近则忧 2024-09-10 20:09:17

Per the headers, what you are seeing is the gzipped content of the file. It needs to be decompressed before it is useful to you. I assume you can do this with cfzip, but I have no experience doing it.

This post seems to indicate that you can add a header in your request to have it unzipped/deflated before being returned:

<cfhttp ...>
    <cfhttpparam type="Header" name="Accept-Encoding" value="deflate;q=0">
    <cfhttpparam type="Header" name="TE" value="deflate;q=0">
</cfhttp>
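
To check whether the body really is still gzip-compressed, one approach is to fetch it as binary and look at both the Content-Encoding response header and the first two bytes, which are 0x1F 0x8B for gzip data. This is a minimal sketch, not a tested recipe: pageUrl, resp, body, and looksGzipped are placeholder names of mine, and it assumes FileContent comes back as a byte array when getasbinary="yes" is set.

```cfml
<!--- pageUrl is a placeholder; substitute the address being fetched --->
<cfset pageUrl = "http://www.example.com/">

<cfhttp method="get" url="#pageUrl#" getasbinary="yes" throwonerror="yes" result="resp">

<cfset body = resp.FileContent>

<!--- Java bytes are signed, so mask to 0-255 before comparing with 0x1F (31) and 0x8B (139) --->
<cfset looksGzipped = isBinary(body)
    and (arrayLen(body) gte 2)
    and (bitAnd(body[1], 255) eq 31)
    and (bitAnd(body[2], 255) eq 139)>

<cfoutput>
    Content-Encoding header:
    <cfif structKeyExists(resp.responseHeader, "Content-Encoding")>
        #resp.responseHeader["Content-Encoding"]#
    <cfelse>
        (none)
    </cfif>
    <br>
    Body starts with gzip magic bytes: #yesNoFormat(looksGzipped)#
</cfoutput>
```

If the magic bytes are present, the garbled characters come from compressed data being read as text, and no charset setting will fix them.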
逆光下的微笑 2024-09-10 20:09:17

The first thing I would do is make sure that it's not the source content/server that's the problem by trying your same code against other pages. If they work fine, then it's likely the server/content that you're trying to consume. If they have the same problem, then the issue is in your code. It would also be helpful if you posted your code.
