Elasticsearch: filter hits by the share of array entries whose value is below a threshold

Posted on 2025-01-22 06:15:48

I have an index whose documents are structured like this (sample search response):

{
  "took": 301,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4270,
      "relation": "eq"
    },
    "max_score": 2.0,
    "hits": [
      {
        "_index": "asset_revision_structured_data",
        "_type": "_doc",
        "_id": "2931293",
        "_score": 2.0,
        "_source": {
          "doc": {
            "prediction": {
              "drugs": {
                "document_metadata": {},
                "predictions": {
                  "relevant_drugs": [
                    {
                      "confidence_score": 0.9946682341655051
                    }
                  ]
                }
              }
            }
          }
        }
      }
    ]
  }
}

I would like to filter the results so that only hits where 50% or more of the relevant_drugs entries have a confidence_score < 0.6 are returned.

I know that the following query gives me all hits that contain at least one relevant_drugs entry with confidence_score < 0.6:

{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "doc.prediction.drugs"
          }
        },
        {
          "range": {
            "doc.prediction.drugs.predictions.relevant_drugs.confidence_score": {
              "lt": 0.6
            }
          }
        }
      ]
    }
  },
  "_source": ["doc.prediction.drugs"]
}

However, I only want to return hits where that condition holds for half or more of the relevant_drugs entries. How would I do this?
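To make the gap between the two conditions concrete, here is a minimal client-side sketch in Python. It is not an Elasticsearch query; it only mirrors the semantics. The range query above behaves like `matches_any` because a range query on a multi-valued field matches when any value satisfies it, while the requirement in the question corresponds to `matches_at_least_half`. The function names, the `wrap` helper, and the choice to treat an empty or missing array as a non-match are assumptions made for illustration.

```python
from typing import Any, Dict, List

THRESHOLD = 0.6  # cutoff taken from the question


def relevant_scores(source: Dict[str, Any]) -> List[float]:
    """Collect confidence_score values from a hit's _source, tolerating missing keys."""
    drugs = (
        source.get("doc", {})
        .get("prediction", {})
        .get("drugs", {})
        .get("predictions", {})
        .get("relevant_drugs", [])
    )
    return [d["confidence_score"] for d in drugs if "confidence_score" in d]


def matches_any(source: Dict[str, Any]) -> bool:
    """What the range query effectively checks: at least one low score."""
    return any(score < THRESHOLD for score in relevant_scores(source))


def matches_at_least_half(source: Dict[str, Any]) -> bool:
    """What the question asks for: 50% or more of the scores are low."""
    scores = relevant_scores(source)
    if not scores:
        return False  # assumption: no entries means no match
    low = sum(1 for score in scores if score < THRESHOLD)
    return 2 * low >= len(scores)


def wrap(scores: List[float]) -> Dict[str, Any]:
    """Hypothetical helper: build a _source shaped like the sample hit."""
    entries = [{"confidence_score": s} for s in scores]
    return {"doc": {"prediction": {"drugs": {"predictions": {"relevant_drugs": entries}}}}}


if __name__ == "__main__":
    one_low_of_three = wrap([0.1, 0.9, 0.95])
    print(matches_any(one_low_of_three), matches_at_least_half(one_low_of_three))
```

A document with one low score out of three passes the first check but fails the second, which is exactly the case the existing query cannot exclude.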

Thanks


Comments (1)

北凤男飞 2025-01-29 06:15:48

TL;DR

I don't believe Elasticsearch has a dedicated query for this.
But you can use Painless, which allows scripted behaviour in your queries. I also use runtime fields to create, on the fly, a field that a filter can be applied to.

To Reproduce

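Here is the data I used to run my tests. Note that relevant_drugs sits at the top level of these test documents rather than under doc.prediction.drugs.predictions.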

POST /71916396/_doc
{
  "relevant_drugs": [
    {
      "confidence_score": 0.9946682341655051
    },
    {
      "confidence_score": 0.8946682341655051
    }
  ]
}

POST /71916396/_doc
{
  "relevant_drugs": [
    {
      "confidence_score": 0.9946682341655051
    },
    {
      "confidence_score": 0.02
    },
    {
      "confidence_score": 0.1
    }
  ]
}

POST /71916396/_doc
{
  "relevant_drugs": [
    {
      "confidence_score": 0.1
    }
  ]
}

To Solve

Below is the query. A runtime field computes the median of the confidence_score values in each document,
and a range query then keeps only the documents whose median is low.

GET /71916396/_search
{
  "runtime_mappings": {
    "confidence_median": {
      "type": "double",
      "script": {
        "source": """
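        // Read the array from _source; the test documents keep it at the top level.
        def drugs = params['_source']['relevant_drugs'];

        // Emit nothing when the array is missing or empty.
        if (drugs != null && drugs.size() > 0) {
          // Sort the entries by ascending confidence_score.
          def sorted_drugs = drugs.stream().sorted((d1, d2) -> d1.get('confidence_score').compareTo(d2.get('confidence_score'))).collect(Collectors.toList());

          def mid = sorted_drugs.size() / 2;
          def median = -1.0;
          if (sorted_drugs.size() % 2 == 0) {
            // Even count: average the two middle scores.
            median = ((double) sorted_drugs[mid]['confidence_score'] + (double) sorted_drugs[mid - 1]['confidence_score']) / 2;
          } else {
            // Odd count: take the middle score.
            median = (double) sorted_drugs[mid]['confidence_score'];
          }

          emit(median);
        }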
        """
      }
    }
  },
  "query": {
    "range": {
      "confidence_median": {
        "lte": 0.6
      }
    }
  }, 
  "size": 10
}
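One caveat worth checking: a median filter with "lte": 0.6 is close to, but not identical with, "50% or more of the entries are below 0.6". The Python sketch below replays both rules on the three test documents above plus two extra arrays that I made up to show where they diverge. The function names are hypothetical, and this is only an offline comparison of the logic, not something Elasticsearch runs.

```python
from statistics import median
from typing import List

THRESHOLD = 0.6  # cutoff taken from the question


def median_rule(scores: List[float]) -> bool:
    """Mirrors the runtime field plus range query: median <= 0.6."""
    return bool(scores) and median(scores) <= THRESHOLD


def fraction_rule(scores: List[float]) -> bool:
    """Mirrors the question: at least half of the scores are strictly below 0.6."""
    return bool(scores) and 2 * sum(s < THRESHOLD for s in scores) >= len(scores)


# The three test documents from the answer.
test_docs = [
    [0.9946682341655051, 0.8946682341655051],
    [0.9946682341655051, 0.02, 0.1],
    [0.1],
]

# Made-up arrays where the two rules disagree.
edge_cases = [
    [0.5, 0.9],  # half are low, but the averaged median is above 0.6
    [0.6],       # median equals the cutoff, but no score is strictly below it
]

if __name__ == "__main__":
    for scores in test_docs + edge_cases:
        print(scores, median_rule(scores), fraction_rule(scores))
```

On the three test documents both rules agree. They differ for even-length arrays whose two middle values straddle the cutoff, and for scores exactly equal to 0.6. If the exact 50% rule matters, one option is to have the runtime field emit the share of entries below 0.6 instead of the median and filter on that with "gte": 0.5.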