Node.js proxy: handling gzip decompression

Posted on 2024-10-10 22:38:06

I'm currently working on a proxy server that has to modify the data passing through it, using regular expressions.

In most cases it works fine, except (I think) for websites that use gzip as their content-encoding. I came across a module called compress and tried pushing the chunks I receive through a decompress/gunzip stream, but it isn't turning out as I expected (see the code below).

Here is some code to support my problem. This is the proxy, which gets loaded with MVC (Express):

module.exports = {
    index: function(request, response) {
        var iframe_url = "www.nu.nl"; // site with gzip encoding

        var http = require('http');
        var httpClient = http.createClient(80, iframe_url);
        var headers = request.headers;
        headers.host = iframe_url;

        var remoteRequest = httpClient.request(request.method, request.url, headers);

        request.on('data', function(chunk) {
            remoteRequest.write(chunk);
        });

        request.on('end', function() {
            remoteRequest.end();
        });

        remoteRequest.on('response', function(remoteResponse) {
            var body_regexp = new RegExp("<head>"); // regex to find first head tag
            var href_regexp = new RegExp('\<a href="(.*)"', 'g'); // regex to find hrefs

            response.writeHead(remoteResponse.statusCode, remoteResponse.headers);

            remoteResponse.on('data', function(chunk) {
                var body = doDecompress(new compress.GunzipStream(), chunk);
                body = body.replace(body_regexp, "<head><base href=\"http://" + iframe_url + "/\">");
                body = body.replace(href_regexp, '<a href="#" onclick="javascript:return false;"');

                response.write(body, 'binary');
            });

            remoteResponse.on('end', function() {
                response.end();
            });
        });
    }
};

At the var body line I want to read the body and, in this case, neutralize every link by replacing its href with #. The problem, of course, is that when a site is gzip-encoded the body is gibberish and the regexps can't be applied.

I've already tried to mess around with the node-compress module:

 doDecompress(new compress.GunzipStream(), chunk);

which refers to

function doDecompress(decompressor, input) {
  var d1 = input.substr(0, 25);
  var d2 = input.substr(25);

  sys.puts('Making decompression requests...');
  var output = '';
  decompressor.setInputEncoding('binary');
  decompressor.setEncoding('utf8');
  decompressor.addListener('data', function(data) {
    output += data;
  }).addListener('error', function(err) {
    throw err;
  }).addListener('end', function() {
    sys.puts('Decompressed length: ' + output.length);
    sys.puts('Raw data: ' + output);
  });
  decompressor.write(d1);
  decompressor.write(d2);
  decompressor.close();
  sys.puts('Requests done.');
}

But it fails because the chunk input is an object, so I tried supplying chunk.toString() instead, which also fails, with invalid input data.

Am I heading in the right direction at all?

Comments (1)

不羁少年 2024-10-17 22:38:06

The decompressor expects binary-encoded input. The chunk that your response handler receives is an instance of Buffer, whose toString() method returns a UTF-8 encoded string by default.
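
To illustrate the encoding point, here is a small sketch that can be run on a current Node.js release. It assumes the built-in zlib module only as a convenient way to produce gzip bytes; the sample markup and the variable names are made up for the example and are not from the question.

```javascript
const zlib = require('zlib');

// Stand-in for one chunk of a gzip-encoded response body (hypothetical sample data).
const chunk = zlib.gzipSync('<head><a href="/news">news</a>');

// A Buffer is not a string, so string methods such as substr do not exist on it.
const hasSubstr = typeof chunk.substr === 'function';

// Decoding as UTF-8 replaces invalid byte sequences, so the round trip loses data.
const viaUtf8 = Buffer.from(chunk.toString(), 'utf8');
const utf8RoundTripIntact = viaUtf8.equals(chunk);

// The 'binary' (latin1) encoding maps each byte to one character, so nothing is lost.
const viaBinary = Buffer.from(chunk.toString('binary'), 'binary');
const binaryRoundTripIntact = viaBinary.equals(chunk);

console.log({ hasSubstr, utf8RoundTripIntact, binaryRoundTripIntact });
```

The second byte of every gzip stream is 0x8b, which is not valid on its own in UTF-8, so the default toString() already damages the header before the decompressor sees it.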

So you have to use chunk.toString('binary') to make it work; this can also be seen in the demo.
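
As an alternative on current Node.js versions, the following is a minimal sketch that swaps node-compress for the built-in zlib module, which is an assumption on my part and not what the question uses. It collects the chunks as Buffers, decompresses the complete body once, and only then applies the two rewrites. The helper name rewriteBody and the sample HTML are hypothetical, and the href pattern is changed from the greedy (.*) to [^"]* so that it stops at the closing quote of each link.

```javascript
const zlib = require('zlib');

// Hypothetical helper: decode a complete response body, then apply the two
// rewrites from the question to the resulting text.
function rewriteBody(chunks, contentEncoding, baseHost) {
  const raw = Buffer.concat(chunks);
  const decoded = contentEncoding === 'gzip' ? zlib.gunzipSync(raw) : raw;
  return decoded
    .toString('utf8')
    .replace(/<head>/, '<head><base href="http://' + baseHost + '/">')
    .replace(/<a href="[^"]*"/g, '<a href="#" onclick="javascript:return false;"');
}

// Stand-in for the chunks a gzip-encoded upstream response would deliver.
const compressed = zlib.gzipSync('<html><head></head><body><a href="/x">x</a></body></html>');
const chunks = [compressed.subarray(0, 10), compressed.subarray(10)];

const rewritten = rewriteBody(chunks, 'gzip', 'www.nu.nl');
console.log(rewritten);
```

Working on the whole body also avoids missing a match that happens to straddle two chunks. If the proxy sends the rewritten text uncompressed, it would presumably also need to drop the upstream content-encoding and content-length headers before calling writeHead, since neither describes the new body any more.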
