Node.js代理，处理gzip解压缩

发布于 2024-10-10 22:38:06 字数 2646 浏览 7 评论 0原文

我目前正在开发一个代理服务器，在这种情况下我们必须修改我们推送的数据（通过使用正则表达式）。

在大多数情况下，它工作得很好，除了使用 gzip 作为内容编码的网站（我认为），我遇到过一个名为压缩并尝试通过解压缩/gunzip流推送我收到的块，但结果并没有真正达到我的预期。（请参阅下面的代码）

我想发布一些代码来支持我的问题，这是使用 mvc （express）加载的代理：

module.exports = {
index: function(request, response){
    var iframe_url = "www.nu.nl"; // site with gzip encoding    

    var http = require('http');     
    var httpClient = http.createClient(80, iframe_url);
    var headers = request.headers;
    headers.host = iframe_url;

    var remoteRequest = httpClient.request(request.method, request.url, headers);

    request.on('data', function(chunk) {
        remoteRequest.write(chunk);
    });

    request.on('end', function() {
        remoteRequest.end();
    });

    remoteRequest.on('response', function (remoteResponse){         
        var body_regexp = new RegExp("<head>"); // regex to find first head tag
        var href_regexp = new RegExp('\<a href="(.*)"', 'g'); // regex to find hrefs

        response.writeHead(remoteResponse.statusCode, remoteResponse.headers);

        remoteResponse.on('data', function (chunk) {
    var body = doDecompress(new compress.GunzipStream(), chunk);
            body = body.replace(body_regexp, "<head><base href=\"http://"+ iframe_url +"/\">");
            body = body.replace(href_regexp, '<a href="#" onclick="javascript:return false;"');             

            response.write(body, 'binary');
        });

        remoteResponse.on('end', function() {

            response.end();
            });
        });
    }
};

在 var 主体部分我想读取主体，例如在本例中删除所有通过将其替换为 # 来指定 href。当然，这里的问题是，当我们有一个经过 gzip 编码/压缩的网站时，它都是乱码，我们无法应用正则表达式。

现在我已经厌倦了使用节点压缩模块：

 doDecompress(new compress.GunzipStream(), chunk);

它指的是

function doDecompress(decompressor, input) {
  var d1 = input.substr(0, 25);
  var d2 = input.substr(25);

  sys.puts('Making decompression requests...');
  var output = '';
  decompressor.setInputEncoding('binary');
  decompressor.setEncoding('utf8');
  decompressor.addListener('data', function(data) {
    output += data;
  }).addListener('error', function(err) {
    throw err;
  }).addListener('end', function() {
    sys.puts('Decompressed length: ' + output.length);
    sys.puts('Raw data: ' + output);
  });
  decompressor.write(d1);
  decompressor.write(d2);
  decompressor.close();
  sys.puts('Requests done.');
}

但它失败了，因为块输入是一个对象，所以我尝试将它作为 chunk.toString() 提供，它也因无效输入而失败数据。

我想知道我是否正朝着正确的方向前进？

原文

I'm currently working on a proxy server where we in this case have to modify the data (by using regexp) that we push through it.

In most cases it works fine except for websites that use gzip as content-encoding (I think), I've come across a module called compress and tried to push the chunks that I receive through a decompress / gunzip stream but it isn't really turning out as I expected. (see below for code)

figured i'd post some code to support my prob, this is the proxy that gets loaded with mvc (express):

module.exports = {
index: function(request, response){
    var iframe_url = "www.nu.nl"; // site with gzip encoding    

    var http = require('http');     
    var httpClient = http.createClient(80, iframe_url);
    var headers = request.headers;
    headers.host = iframe_url;

    var remoteRequest = httpClient.request(request.method, request.url, headers);

    request.on('data', function(chunk) {
        remoteRequest.write(chunk);
    });

    request.on('end', function() {
        remoteRequest.end();
    });

    remoteRequest.on('response', function (remoteResponse){         
        var body_regexp = new RegExp("<head>"); // regex to find first head tag
        var href_regexp = new RegExp('\<a href="(.*)"', 'g'); // regex to find hrefs

        response.writeHead(remoteResponse.statusCode, remoteResponse.headers);

        remoteResponse.on('data', function (chunk) {
    var body = doDecompress(new compress.GunzipStream(), chunk);
            body = body.replace(body_regexp, "<head><base href=\"http://"+ iframe_url +"/\">");
            body = body.replace(href_regexp, '<a href="#" onclick="javascript:return false;"');             

            response.write(body, 'binary');
        });

        remoteResponse.on('end', function() {

            response.end();
            });
        });
    }
};

at the var body part i want to read the body and for example in this case remove all hrefs by replacing them with an #. The problem here of course is when we have an site which is gzip encoded/ compressed it's all jibberish and we can't apply the regexps.

now I've already tired to mess around with the node-compress module:

 doDecompress(new compress.GunzipStream(), chunk);

which refers to

function doDecompress(decompressor, input) {
  var d1 = input.substr(0, 25);
  var d2 = input.substr(25);

  sys.puts('Making decompression requests...');
  var output = '';
  decompressor.setInputEncoding('binary');
  decompressor.setEncoding('utf8');
  decompressor.addListener('data', function(data) {
    output += data;
  }).addListener('error', function(err) {
    throw err;
  }).addListener('end', function() {
    sys.puts('Decompressed length: ' + output.length);
    sys.puts('Raw data: ' + output);
  });
  decompressor.write(d1);
  decompressor.write(d2);
  decompressor.close();
  sys.puts('Requests done.');
}

But it fails on it since the chunk input is an object, so i tried supplying it as an chunk.toString() which also fails with invalid input data.

I was wondering if I am at all heading in the right direction?

分享到QQ

分享到微博