Elasticsearch lookup

Enriching data in Elasticsearch

Posted on 2025-01-22 12:00:08

We will be ingesting data into an index (Index1). However, one of the fields in each document (field1) is an ENUM value, which needs to be converted into a string value using a lookup through a REST API call.
The REST API call returns a JSON response like this, which contains the string values for all the ENUMs:

{
  "values": {
    "ENUMVALUE1": "StringValue1",
    "ENUMVALUE2": "StringValue2"
  }
}

I am thinking of building an index from this response document and using it for the lookup.
The incoming document has field1 set to either ENUMVALUE1 or ENUMVALUE2 (only one of them), and we eventually want to store StringValue1 or StringValue2 under field1 instead of the ENUM value.

I went through the documentation of the enrich processor, but I am not sure whether it is the correct approach for this scenario.
When creating the match enrich policy, I am not sure how match_field and enrich_fields should be configured.

Could you please advise whether this can be done in Elasticsearch and, if so, what other options I have if the approach above is not optimal?

孤独陪着我 2025-01-29 12:00:08

OK, 150-200 enums might not be enough to justify an enrich index, but here is a potential solution.

You first need to build the source index containing all the enum mappings. It would look like this:

POST enums/_bulk
{"index":{}}
{"enum_id": "ENUMVALUE1", "string_value": "StringValue1"}
{"index":{}}
{"enum_id": "ENUMVALUE2", "string_value": "StringValue2"}
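
With 150-200 enums, typing the bulk body by hand gets tedious. As a rough sketch, assuming the REST API response from the question has already been fetched and parsed into a Python dict, the body could be generated like this. The helper name build_bulk_body is hypothetical, and sending the result to Elasticsearch is left out.

```python
import json


def build_bulk_body(response: dict) -> str:
    """Turn the REST API response into an NDJSON body for the bulk request.

    Hypothetical helper: emits one action line and one document line per enum,
    using the same enum_id / string_value field names as the bulk request above.
    """
    lines = []
    for enum_id, string_value in response["values"].items():
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps({"enum_id": enum_id, "string_value": string_value}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample response taken from the question.
    response = {"values": {"ENUMVALUE1": "StringValue1", "ENUMVALUE2": "StringValue2"}}
    print(build_bulk_body(response), end="")
```

The printed text can be pasted under the bulk request line, or sent with whichever HTTP client you already use.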

Then you need to create an enrich policy out of this index:

PUT /_enrich/policy/enum-policy
{
  "match": {
    "indices": "enums",
    "match_field": "enum_id",
    "enrich_fields": [
      "string_value"
    ]
  }
}
POST /_enrich/policy/enum-policy/_execute

Once the policy has been executed (with 200 values it should take a few seconds), you can build your ingest pipeline using an enrich processor:

PUT _ingest/pipeline/enum-pipeline
{
  "description": "Enum enriching pipeline",
  "processors": [
    {
      "enrich" : {
        "policy_name": "enum-policy",
        "field" : "field1",
        "target_field": "tmp"
      }
    },
    {
      "set": {
        "if": "ctx.tmp != null",
        "field": "field1",
        "value": "{{tmp.string_value}}"
      }
    },
    {
      "remove": {
        "if": "ctx.tmp != null",
        "field": "tmp"
      }
    }
  ]
}

Testing this pipeline, we get this:

POST _ingest/pipeline/enum-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "field1": "ENUMVALUE1"
      }
    },
    {
      "_source": {
        "field1": "ENUMVALUE4"
      }
    }
  ]
}

Results =>

{
  "docs" : [
    {
      "doc" : {
        "_source" : {
          "field1" : "StringValue1"        <--- value has been replaced
        }
      }
    },
    {
      "doc" : {
        "_source" : {
          "field1" : "ENUMVALUE4"          <--- value has NOT been replaced
        }
      }
    }
  ]
}

For the sake of completeness, I'm sharing the other solution, which does not use an enrich index, so you can test both and use whichever makes the most sense for you.

In this second option, we're simply going to use an ingest pipeline with a script processor whose parameters contain a map of your enums. field1 will be replaced by whatever value is mapped to the enum value it contains, or will keep its value if there's no corresponding enum value.

PUT _ingest/pipeline/enum-pipeline
{
  "description": "Enum enriching pipeline",
  "processors": [
    {
      "script": {
        "source": """
          ctx.field1 = params.getOrDefault(ctx.field1, ctx.field1);
        """,
        "params": {
          "ENUMVALUE1": "StringValue1",
          "ENUMVALUE2": "StringValue2",
          ... // add all your enums here
        }
      }
    }
  ]
}
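
Since the params map mirrors the REST API response, the pipeline definition could be generated rather than maintained by hand. The following is only a sketch under that assumption: build_script_pipeline is a hypothetical helper, and it only builds the request body, which you would still have to PUT to _ingest/pipeline/enum-pipeline yourself.

```python
import json


def build_script_pipeline(response: dict, field: str = "field1") -> dict:
    """Build the script-processor pipeline body from the REST API response.

    Hypothetical helper: the Painless source is the same one-liner as in the
    pipeline above, and the params map is copied from response["values"].
    """
    source = f"ctx.{field} = params.getOrDefault(ctx.{field}, ctx.{field});"
    return {
        "description": "Enum enriching pipeline",
        "processors": [
            {"script": {"source": source, "params": dict(response["values"])}}
        ],
    }


if __name__ == "__main__":
    # Sample response taken from the question.
    response = {"values": {"ENUMVALUE1": "StringValue1", "ENUMVALUE2": "StringValue2"}}
    print(json.dumps(build_script_pipeline(response), indent=2))
```

Regenerating and re-uploading this body whenever the enums change keeps the pipeline in sync with the REST API.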

Testing this pipeline, we get this:

POST _ingest/pipeline/enum-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "field1": "ENUMVALUE1"
      }
    },
    {
      "_source": {
        "field1": "ENUMVALUE4"
      }
    }
  ]
}

Results =>

{
  "docs" : [
    {
      "doc" : {
        "_source" : {
          "field1" : "StringValue1"        <--- value has been replaced
        }
      }
    },
    {
      "doc" : {
        "_source" : {
          "field1" : "ENUMVALUE4"          <--- value has NOT been replaced
        }
      }
    }
  ]
}

So both solutions would work for your case; you just need to pick the one that fits best. Just know that with the first option, if your enums change, you'll need to update your source index and re-execute the enrich policy, while with the second you only need to modify the parameters map of your pipeline.
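
To make the comparison concrete, here is a small local model of the two pipelines in plain Python. It does not talk to Elasticsearch; it only imitates the processors' logic under the assumption that the enrich processor writes the matched enrich document under tmp and leaves the document untouched when nothing matches. The function names and the ENUMS dict are illustrative only.

```python
# Local stand-in for the enum mappings (the "enums" index or the script params).
ENUMS = {"ENUMVALUE1": "StringValue1", "ENUMVALUE2": "StringValue2"}


def enrich_option(doc: dict) -> dict:
    """Imitate option 1: enrich, then conditional set, then conditional remove."""
    doc = dict(doc)
    match = ENUMS.get(doc.get("field1"))
    if match is not None:
        # enrich processor: on a match, the enrich data lands under "tmp"
        doc["tmp"] = {"enum_id": doc["field1"], "string_value": match}
    if doc.get("tmp") is not None:
        # set processor, guarded by "ctx.tmp != null"
        doc["field1"] = doc["tmp"]["string_value"]
    if doc.get("tmp") is not None:
        # remove processor, guarded by "ctx.tmp != null"
        del doc["tmp"]
    return doc


def script_option(doc: dict) -> dict:
    """Imitate option 2: params.getOrDefault(ctx.field1, ctx.field1).

    Assumes field1 is present in the document.
    """
    doc = dict(doc)
    doc["field1"] = ENUMS.get(doc["field1"], doc["field1"])
    return doc


if __name__ == "__main__":
    for value in ("ENUMVALUE1", "ENUMVALUE4"):
        doc = {"field1": value}
        print(enrich_option(doc), script_option(doc))
```

In this model both functions agree for known and unknown enum values, which matches the two simulate results shown above.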
