What causes a server to abruptly close a TCP/IP connection with a RESET (RST flag)?



TL;DR

For quite some time we have been facing a weird issue on all of our systems (including Prod!). On a regular basis, the TCP connection to the server is closed abruptly by the server (or, to be exact, somewhere on the way from the server to the client).
This leads to failed requests and is most prominent in file uploads, which always fail for bigger files (where "bigger" just means >100 kB).
Additionally, the same requests fail much less frequently (but still fail sometimes!) when routed through an nginx reverse proxy.

Setup

We (let's call us MyCompany) are developing software (a Java/Spring Boot service) for CustomerCompany. The software is shipped as a Docker container and hosted either locally, in a private cloud provided by CloudCompany, or in two different Azure Kubernetes clusters.
The software communicates with an SAP system hosted by SAPHostingCompany. There are actually multiple SAP systems for different stages.

The software communicates (depending on stage/environment) either directly with the SAP system or through an nginx reverse proxy (hosted on a MyCompany machine).
The reasoning behind the nginx reverse proxy is that every IP communicating with the SAP system has to be whitelisted by SAPHostingCompany. Especially for local development, this would have been quite cumbersome to maintain.

The problem

Starting a few weeks back, we noticed that requests sometimes fail (seemingly) at random. This happens on all stages. Supposedly, no changes whatsoever were made that could have caused this new behavior...

While this is merely an annoyance for most requests (they can simply be retried when they fail), it completely prevents larger files from being uploaded, "larger" meaning just >100 kB in this context.

We tried to investigate the problem and noticed in tcpdump that, upon failure, the server sends a TCP RST packet, thus aborting the connection (admittedly, we cannot be 100% sure whether it is the server itself sending the RST or some intermediate component).
The RST is sent at different stages within the TCP connection, so there is no single packet (or combination of packets) that immediately causes the server to close the connection.
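One thing that might help to pin down where the RST comes from is comparing the IP header of the RST (in particular the TTL) with packets that demonstrably came from the server earlier in the same connection; an intermediate box injecting RSTs often shows a different TTL. A minimal capture sketch for the client side (the hostname below is a placeholder for the real, whitelisted SAP endpoint):

# Print RSTs to/from the SAP endpoint; -v includes the IP header, including the TTL
sudo tcpdump -i any -nn -v 'host sap.example.com and port 8043 and tcp[tcpflags] & tcp-rst != 0'

# In parallel, write a full capture of one failing request for later analysis in Wireshark
sudo tcpdump -i any -nn -s0 -w failing-request.pcap 'host sap.example.com and port 8043'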

Most interestingly, this failure happens far less often (but still does!) in the setup with the intermediate nginx reverse proxy.

Nginx reverse proxy

The nginx config looks like this:

events {
worker_connections 1024;
}

http {
log_format combined_with_requesttime '$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_time $upstream_response_time $pipe';
log_format combined_with_token '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_de_comdirect_cif_globalRequestId"';
log_format combined_with_token_host '$remote_addr $host $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_de_comdirect_cif_globalRequestId"';
log_format xcombined '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$ssl_client_s_dn"';

sendfile    on;
server_tokens on;
types_hash_max_size 1024;
types_hash_bucket_size 512;
server_names_hash_bucket_size 64;
server_names_hash_max_size 512;
keepalive_timeout  65;
tcp_nodelay        on;

client_max_body_size    10m;
client_body_buffer_size 128k;
proxy_redirect          off;
proxy_connect_timeout   90;
proxy_send_timeout      90;
proxy_read_timeout      90;
proxy_buffers           32 4k;
proxy_buffer_size       8k;
proxy_set_header        Host $host;
proxy_set_header        X-Real-IP $remote_addr;
proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_headers_hash_bucket_size 64;

server {
    listen                0.0.0.0:8080 default_server;
    server_name           _;
    resolver              127.0.0.11 valid=30s;

    access_log            /dev/stdout combined_with_token_host;
    error_log             /dev/stdout debug;

    underscores_in_headers on; # Fuer Uebertragung der Header an SAP
    large_client_header_buffers 4 16k;
    proxy_buffer_size           16k;
    proxy_buffers               4 16k;
    real_ip_header              <blurred>;
    set_real_ip_from            0.0.0.0/0;

    location /sap1/ {
        rewrite ^ $request_uri;
        rewrite ^/sap1/(.*) $1 break;
        return 400; #if the second rewrite won't match
        proxy_pass            https://SAPHostingCompany.sap1:8043/$uri;
        proxy_read_timeout    130;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_buffering       off;
        client_max_body_size  30m;
    }

    location /sap2/ {
        rewrite ^ $request_uri;
        rewrite ^/sap2/(.*) $1 break;
        return 400; #if the second rewrite won't match
        proxy_pass            https://SAPHostingCompany.sap2:8043/$uri;
        proxy_read_timeout    130;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_buffering       off;
        client_max_body_size  50m;
    }

    location /sap3/ {
        rewrite ^ $request_uri;
        rewrite ^/sap3/(.*) $1 break;
        return 400; #if the second rewrite won't match
        proxy_pass            https://SAPHostingCompany.sap3:8043/$uri;
        proxy_read_timeout    130;
        proxy_connect_timeout 90;
        proxy_redirect        off;
        proxy_buffering       off;
        client_max_body_size  50m;
    }
}
}

The server accepts only TLS-secured connections. One difference between the two setups is how the TLS connection is established:

software <-TLS-secured-> SAP vs software <-unsecured-> nginx <-TLS-secured-> SAP
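In the direct setup the Java service performs the TLS handshake against SAP itself; in the proxy setup nginx does. To take the application out of the picture, the handshake alone can be reproduced from a whitelisted host with openssl; a sketch, using the sap1 host/port as they appear in the nginx config above (anonymized names, not the real ones):

# Run only the TLS handshake against the SAP endpoint and print every handshake message
openssl s_client -connect SAPHostingCompany.sap1:8043 -servername SAPHostingCompany.sap1 -state -msg </dev/null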

Here is an example of a successful request:
[Screenshot: packet capture of the successful request]

And here the same request aborted with an RST flag:
[Screenshot: packet capture of the same request, aborted with an RST]

In this capture the connection is aborted immediately after the client sends Certificate, Client Key Exchange, Change Cipher Spec and Encrypted Handshake Message, but it might fail at any point. For example, in most file-upload errors ~10-20 data packets are sent successfully before the connection is aborted.
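To reproduce the upload failure outside the Java service, a plain curl upload along these lines could be compared on both paths (the test file, upload path and proxy hostname are placeholders, not our real ones):

# Direct upload to the SAP endpoint; this is the path that reliably fails for payloads >100 kB
curl -v --insecure --data-binary @test-200kb.bin https://SAPHostingCompany.sap1:8043/some/upload/path

# The same upload routed through the nginx reverse proxy (fails far less often for us)
curl -v --data-binary @test-200kb.bin http://nginx-proxy:8080/sap1/some/upload/path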

Conclusion

We are at a complete loss as to what else to investigate and how to narrow this down. Unfortunately, SAPHostingCompany is not very forthcoming in this bug hunt :(
We, of course, think it must be some kind of infrastructure problem on their side, since the error appeared on all stages/environments simultaneously, while they blame us because the nginx solution seems to work...

So if anybody has a clue as to what might be going on here, I would be very grateful.

Related question

During this quest I stumbled upon this question. That user was facing regular RSTs after a constant amount of time (which is not what we are experiencing).
Some of the proposed solutions do sound promising, but SAPHostingCompany assures us that none of them apply (again, communication between MyCompany and SAPHostingCompany is quite difficult)...
Unfortunately, we lack the know-how required to determine which of those solutions might actually explain and fix our problem.
