Stripping HTML tags from blob storage content with an Azure Cognitive Search indexer

Posted on 2025-02-07 12:50:37

I have set up an indexer with a blob storage data source for HTML files, configured with parsingMode="default" and dataToExtract="contentAndMetadata". The value of the content field is always the raw HTML, but I want the content without HTML tags.

The document's file extension is ".html" and its content type is "text/html".

{
  "name" : "my-blob-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-search-index",
  "parameters": {
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
      "base64EncodeKeys": null,
      "configuration": {
          "indexedFileNameExtensions" : ".html",
          "excludedFileNameExtensions" : ".png,.jpeg",
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "default"
      }
  },
  "schedule" : { },
  "fieldMappings" : [ ]
}

I'm using this document as a reference for indexing data from blob storage:
https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage
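
To make the desired result concrete, here is a minimal sketch of the kind of tag stripping being asked for, done outside the indexer with Python's standard library. The helper name `strip_html` and the sample input are hypothetical, and this is not an Azure Cognitive Search feature or setting; it only illustrates what the content field would ideally contain.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text chunks with spaces, then collapse whitespace runs.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw_html: str) -> str:
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello &amp; <b>world</b></p></body></html>"
    print(strip_html(sample))
```

Because chunks are joined with a space, text split by inline tags in the middle of a word would gain a space; a real pipeline may want different joining rules.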
