How to translate and update Azure Cognitive Search index documents for fields with different language analyzers?

Posted on 2025-01-26 00:54:12


I am configuring an Azure Cognitive Search index that will be queried from websites in different languages. I have created language-specific fields and added the appropriate language analyzers when creating the index.
For example:

{
    "id": "",
    "Description": "some_value",
    "Description_es": null, 
    "Description_fr": null,
    "Region": [ "some_value", "some_value" ],
    "SpecificationData": [
        {
            "name": "some_key1",
            "value": "some_value1",
            "name_es": null,
            "value_es": null,
            "name_fr": null,
            "value_fr": null
        },
        {
            "name": "some_key2",
            "value": "some_value2",
            "name_pt": null,
            "value_pt": null,
            "name_fr": null,
            "value_fr": null
        }
    ]
}

The fields Description, SpecificationData.name and SpecificationData.value are in English and come from Cosmos DB. The fields Description_es, SpecificationData.name_es and SpecificationData.value_es will be queried from the Spanish website and should hold the Spanish translations. The same applies to the French fields.
But since Cosmos DB only has the English fields, language-specific fields such as Description_es, SpecificationData.name_es and SpecificationData.value_es are null by default.
I have tried using skillsets and linking the index to "Azure Cognitive Translate Service", but the skillset translates only one field at a time.
Is there any way to translate multiple fields and save each translation in its own field?

Edit: Adding Index, Skillset and Indexer code that I have tried:

Index (snippet):

{
    "name": "SpecificationData",
    "type": "Collection(Edm.ComplexType)",
    "analyzer": null,
    "synonymMaps": [],
    "fields": [
        {
            "name": "name",
            "type": "Edm.String",
            "facetable": true,
            "filterable": true,
            "key": false,
            "retrievable": true,
            "searchable": true,
            "sortable": false,
            "analyzer": "standard.lucene",
            "indexAnalyzer": null,
            "searchAnalyzer": null,
            "synonymMaps": [],
            "fields": []
        },
        {
            "name": "value",
            "type": "Edm.String",
            "facetable": true,
            "filterable": true,
            "key": false,
            "retrievable": true,
            "searchable": true,
            "sortable": false,
            "analyzer": "standard.lucene",
            "indexAnalyzer": null,
            "searchAnalyzer": null,
            "synonymMaps": [],
            "fields": []
        },
        {
            "name": "name_fr",
            "type": "Edm.String",
            "facetable": true,
            "filterable": true,
            "key": false,
            "retrievable": true,
            "searchable": true,
            "sortable": false,
            "analyzer": "fr.lucene",
            "indexAnalyzer": null,
            "searchAnalyzer": null,
            "synonymMaps": [],
            "fields": []
        },
        {
            "name": "value_fr",
            "type": "Edm.String",
            "facetable": true,
            "filterable": true,
            "key": false,
            "retrievable": true,
            "searchable": true,
            "sortable": false,
            "analyzer": "fr.lucene",
            "indexAnalyzer": null,
            "searchAnalyzer": null,
            "synonymMaps": [],
            "fields": []
        }
    ]
}

Skillset:

{
    "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
    "name": "psd_name_fr",
    "description": null,
    "context": "/document/SpecificationData",
    "defaultFromLanguageCode": null,
    "defaultToLanguageCode": "fr",
    "suggestedFrom": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/*/name"
        }
    ],
    "outputs": [
        {
            "name": "translatedText",
            "targetName": "name_fr"
        }
    ]
}

Indexer:

"outputFieldMappings": [
    {
        "sourceFieldName": "/document/SpecificationData/*/name/name_fr",
        "targetFieldName": "/name_fr"
    }
]

With this mapping I get the error message "Output field mapping specifies target field 'name_fr' that doesn't exist in the index". I have tried the full path /document/SpecificationData/name_fr, but it still gives the same error. It looks for the specified field in the root structure and gives the error if the field is inside a nested array object.


生死何惧 2025-02-02 00:54:12


You could use a text merge skill first to merge all the fields you want to translate, if you wanted one big merged translation field for each language. That probably wouldn't fit your exact scenario, though, since you said you still want separate fields as the output. To keep them separate, I think you'll have to translate them one by one, with one translation skill per field and language. There's no problem with having more than one translation skill in a skillset, so that should work fine; it just may be a little tedious to set up.
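As a rough sketch of "one translation skill per field and language", the skills array of a skillset could hold one entry per target field. The example below uses the top-level Description field from the question and translates it into Spanish and French. The skill names are made up for illustration, and the rest of the skillset definition (name, Cognitive Services key, and so on) is left out:

```json
{
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
      "name": "translate_description_es",
      "context": "/document",
      "defaultToLanguageCode": "es",
      "suggestedFrom": "en",
      "inputs": [
        { "name": "text", "source": "/document/Description" }
      ],
      "outputs": [
        { "name": "translatedText", "targetName": "Description_es" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
      "name": "translate_description_fr",
      "context": "/document",
      "defaultToLanguageCode": "fr",
      "suggestedFrom": "en",
      "inputs": [
        { "name": "text", "source": "/document/Description" }
      ],
      "outputs": [
        { "name": "translatedText", "targetName": "Description_fr" }
      ]
    }
  ]
}
```

Each skill writes its result under its own context, so these two would produce /document/Description_es and /document/Description_fr in the enriched document. Every further field and language pair would need another entry of the same shape.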

UPDATE 5/18/22

OK, so since you're not defining a complex SpecificationData index field, but instead top-level "name_fr" and so on, then yes, output field mappings are fine. Output field mappings map a path in the enriched document to an index field, by name. So targetFieldName should be "name_fr" with no leading slash. sourceFieldName should point to the output of your translation skill, name_fr under the context path, which is /document/SpecificationData, so the full path to your skill's output is /document/SpecificationData/name_fr.
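Put together, the mapping described above would look roughly like this. It assumes the index really has a top-level string field called name_fr, which the posted index snippet does not show, and it keeps the skill context from the question unchanged:

```json
{
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/SpecificationData/name_fr",
      "targetFieldName": "name_fr"
    }
  ]
}
```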

But then there's another issue: the output of the skill is really an array of values, because of the * in the input path (/*/name). That probably won't work, as the index field is a string and not an array.

It seems like your intent is to get a translation for each name of each SpecificationData entry. For that, your context should probably do the enumeration (/document/SpecificationData/*) and have the input path be /document/SpecificationData/*/name. This way, one name_fr will be under each item in the SpecificationData array.
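Applied to the skill from the question, that change might look like the sketch below. Only the context and the input source differ from the original definition; I have not run it against a live service, so treat it as a starting point:

```json
{
  "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
  "name": "psd_name_fr",
  "context": "/document/SpecificationData/*",
  "defaultToLanguageCode": "fr",
  "suggestedFrom": "en",
  "inputs": [
    {
      "name": "text",
      "source": "/document/SpecificationData/*/name"
    }
  ],
  "outputs": [
    {
      "name": "translatedText",
      "targetName": "name_fr"
    }
  ]
}
```

With the context enumerating the array, the skill runs once per SpecificationData item and each translation ends up at /document/SpecificationData/*/name_fr.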

Then you'll need to make those multiple values into a single string for the index, if you keep the index defined that way. The simplest way to do this is by using a text merger skill, probably something like this:

{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "context": "/document",
  "inputs": [
    {
      "name": "itemsToInsert", 
      "source": "/document/SpecificationData/*/name_fr"
    }
  ],
  "outputs": [
    {
      "name": "mergedText", 
      "targetName" : "name_fr"
    }
  ]
}

And then, since the output of this new skill will be /document/name_fr with the space-separated concatenation of all French-translated names, you don't need the output field mapping at all; the value will get automatically mapped to your index.

Finally, to better understand and debug skillsets, you should take a look at debug sessions.
