Slow Elasticsearch query responses on iSCSI disks on Oracle Cloud

Posted on 2025-01-21 16:55:58

I'm migrating Elasticsearch from version 7.1 on AWS to version 8 on Oracle Cloud (OCI).
I took a snapshot and the index was restored successfully, but Elasticsearch takes a long time to respond when there are many simultaneous connections.

The machines on AWS work perfectly. Here are their specs:

AWS: 3 nodes

8 GB RAM, 2 CPUs, NVMe SSD disk, JVM heap size 5 GB
Elasticsearch version 7.1 *Query time 500 ms*
iowait (AWS 15.92%, disk: NVMe SSD)

AWS disk: NVMe SSD (physical)

[root@es-4-node-1_subnet-1 ec2-user]# fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
fio-2.14
Starting 1 process
Jobs: 1 (f=1): [R(1)] [18.4% done] [246.0MB/0KB/0KB /s] [246/0/0 iops] [eta 00m:31s]
Jobs: 1 (f=1): [R(1)] [30.0% done] [246.0MB/0KB/0KB /s] [246/0/0 iops] [eta 00m:28s]
Jobs: 1 (f=1): [R(1)] [41.5% done] [245.0MB/0KB/0KB /s] [245/0/0 iops] [eta 00m:24s]
Jobs: 1 (f=1): [R(1)] [53.7% done] [239.0MB/0KB/0KB /s] [239/0/0 iops] [eta 00m:19s]
Jobs: 1 (f=1): [R(1)] [65.9% done] [247.0MB/0KB/0KB /s] [247/0/0 iops] [eta 00m:14s]
Jobs: 1 (f=1): [R(1)] [78.0% done] [242.0MB/0KB/0KB /s] [242/0/0 iops] [eta 00m:09s]
Jobs: 1 (f=1): [R(1)] [88.1% done] [241.0MB/0KB/0KB /s] [241/0/0 iops] [eta 00m:05s]
Jobs: 1 (f=1): [R(1)] [100.0% done] [251.0MB/0KB/0KB /s] [251/0/0 iops] [eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=29174: Thu Apr 14 04:52:41 2022
  read : io=10240MB, bw=255246KB/s, iops=249, runt= 41081msec
    slat (usec): min=26, max=41738, avg=3994.68, stdev=6172.41
    clat (msec): min=9, max=181, avg=123.92, stdev=22.70
     lat (msec): min=9, max=189, avg=127.91, stdev=23.31
    clat percentiles (msec):
     |  1.00th=[   13],  5.00th=[   99], 10.00th=[  106], 20.00th=[  116],
     | 30.00th=[  123], 40.00th=[  126], 50.00th=[  128], 60.00th=[  129],
     | 70.00th=[  131], 80.00th=[  137], 90.00th=[  145], 95.00th=[  151],
     | 99.00th=[  159], 99.50th=[  167], 99.90th=[  180], 99.95th=[  180],
     | 99.99th=[  182]
    lat (msec) : 10=0.02%, 20=2.03%, 50=0.87%, 100=2.49%, 250=94.59%
  cpu          : usr=0.11%, sys=1.15%, ctx=9640, majf=0, minf=8204
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=10240/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32
 
Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=255245KB/s, minb=255245KB/s, maxb=255245KB/s, mint=41081msec, maxt=41081msec
 
Disk stats (read/write):
  nvme0n1: ios=46378/222, merge=0/30, ticks=1544352/5552, in_queue=1500556, util=99.15%

My problem is on the OCI machines, where the 7.1 snapshot was restored into Elasticsearch 8. The request response time is far too long. I don't know whether the cause is the kind of virtualized disk that OCI uses, with its slow reads and writes. My Elasticsearch data is 2 TB in size, with more than 5 billion documents.

Oracle: 3 nodes

16 GB RAM, 4 CPUs, iSCSI disk, JVM heap size 10 GB
Elasticsearch version 8 *Query time up to 10 seconds / 20 seconds / 30 seconds / more than 1 minute*
(The time only keeps increasing; it either never returns an answer or takes far too long.)
iowait (Oracle 39.71%, disk: iSCSI "network storage")

Oracle disk: iSCSI (network storage)

root@es-master-1:/home# fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
 
Jobs: 1 (f=1): [R(1)][16.7%][r=239MiB/s][r=239 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][31.0%][r=234MiB/s][r=234 IOPS][eta 00m:29s] 
Jobs: 1 (f=1): [R(1)][45.2%][r=196MiB/s][r=196 IOPS][eta 00m:23s] 
Jobs: 1 (f=1): [R(1)][59.5%][r=237MiB/s][r=237 IOPS][eta 00m:17s] 
Jobs: 1 (f=1): [R(1)][73.8%][r=264MiB/s][r=264 IOPS][eta 00m:11s] 
Jobs: 1 (f=1): [R(1)][88.1%][r=251MiB/s][r=251 IOPS][eta 00m:05s] 
Jobs: 1 (f=1): [R(1)][100.0%][r=190MiB/s][r=190 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=14554: Thu Apr 14 04:52:48 2022
  read: IOPS=238, BW=239MiB/s (250MB/s)(10.0GiB/42923msec)
    slat (usec): min=12, max=275, avg=26.39, stdev=12.34
    clat (msec): min=15, max=350, avg=134.02, stdev=99.43
     lat (msec): min=15, max=350, avg=134.05, stdev=99.43
    clat percentiles (msec):
     |  1.00th=[   24],  5.00th=[   40], 10.00th=[   51], 20.00th=[   53],
     | 30.00th=[   55], 40.00th=[   58], 50.00th=[   73], 60.00th=[   94],
     | 70.00th=[  245], 80.00th=[  259], 90.00th=[  266], 95.00th=[  288],
     | 99.00th=[  313], 99.50th=[  330], 99.90th=[  347], 99.95th=[  347],
     | 99.99th=[  351]
   bw (  KiB/s): min=151552, max=417792, per=99.70%, avg=243557.27, stdev=53597.76, samples=85
   iops        : min=  148, max=  408, avg=237.84, stdev=52.35, samples=85
  lat (msec)   : 20=0.31%, 50=10.01%, 100=51.48%, 250=10.07%, 500=28.12%
  cpu          : usr=0.14%, sys=0.88%, ctx=8661, majf=0, minf=8203
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
 
Run status group 0 (all jobs):
   READ: bw=239MiB/s (250MB/s), 239MiB/s-239MiB/s (250MB/s-250MB/s), io=10.0GiB (10.7GB), run=42923-42923msec
 
Disk stats (read/write):
  sda: ios=10521/236, merge=0/544, ticks=1399849/35181, in_queue=1435030, util=96.10%

What could be causing these slowdowns? Is the problem on OCI due to a slow disk? The response time grows with the number of connections. The AWS machines are lower-spec, yet they return results very quickly, while on OCI it takes forever. How can I determine whether the problem lies in the Elasticsearch configuration or in the machine?
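
As a rough cross-check of the two fio runs above, the reported averages can be compared with Little's law (mean latency is roughly queue depth divided by IOPS). The sketch below only reuses figures copied from the fio summaries (iodepth=32, IOPS, average and standard deviation of `lat`); the function and variable names are made up for illustration.

```python
# Figures copied from the two fio summaries above; both runs used iodepth=32.
IODEPTH = 32
RUNS = {
    "AWS NVMe": {"iops": 249, "lat_avg_ms": 127.91, "lat_stdev_ms": 23.31},
    "OCI iSCSI": {"iops": 238, "lat_avg_ms": 134.05, "lat_stdev_ms": 99.43},
}


def littles_law_latency_ms(iodepth, iops):
    """Mean latency implied by keeping `iodepth` requests in flight at `iops`."""
    return iodepth / iops * 1000.0


def summarize(runs, iodepth=IODEPTH):
    """Compare reported mean latency with the Little's law estimate per run."""
    rows = {}
    for name, run in runs.items():
        rows[name] = {
            "expected_ms": littles_law_latency_ms(iodepth, run["iops"]),
            "reported_ms": run["lat_avg_ms"],
            # Coefficient of variation: how spread out the latencies are.
            "cv": run["lat_stdev_ms"] / run["lat_avg_ms"],
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(RUNS).items():
        print(
            f"{name}: expected {row['expected_ms']:.1f} ms, "
            f"reported {row['reported_ms']:.1f} ms, cv {row['cv']:.2f}"
        )
```

In both runs the reported mean latency is close to what the queue depth alone implies, so this sequential 1 MiB read test mostly shows that the two disks deliver similar throughput. The visible difference is the spread of latencies, which is much wider on the iSCSI volume. This test may not reflect the I/O pattern of real search traffic.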

On OCI my benchmark runs fine with multiple connections, but when I redirect traffic from the old version on AWS to OCI, the application starts taking a long time to respond until Elasticsearch freezes completely, and it can take up to 10 minutes to return an answer.
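
One way to narrow this down might be to measure how response time grows with the number of simultaneous clients on each cluster. This is a minimal sketch, not a real benchmark: `stub_query` is a hypothetical placeholder that just sleeps, and it would need to be replaced by a call that sends one representative search request to the cluster under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def stub_query():
    """Hypothetical stand-in for one search request; replace with a real call."""
    time.sleep(0.01)


def timed(call):
    """Return the wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def latency_by_concurrency(call, levels, requests_per_level=20):
    """Run `call` with n worker threads for each n in `levels` and collect latencies."""
    results = {}
    for n in levels:
        with ThreadPoolExecutor(max_workers=n) as pool:
            latencies = sorted(
                pool.map(lambda _: timed(call), range(requests_per_level))
            )
        results[n] = {
            "count": len(latencies),
            "median_s": median(latencies),
            "max_s": latencies[-1],
        }
    return results


if __name__ == "__main__":
    for n, stats in latency_by_concurrency(stub_query, [1, 2, 4, 8]).items():
        print(f"{n} clients: median {stats['median_s']:.3f}s, max {stats['max_s']:.3f}s")
```

Running the same request mix against both clusters at increasing concurrency would show at which level latency starts to climb, which could then be compared with iowait on the nodes at that moment.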

This is the result from the Rally benchmark; I don't know whether it is good.
Rally benchmark for Elasticsearch:
https://pastebin.com/vjhDEtR4
