Strip HTML tags from blob storage content when indexing with Azure Cognitive Search
I have set up an indexer with a blob storage data source for HTML files, using parsingMode="default" and dataToExtract="contentAndMetadata". The content field value is always the raw HTML, but I want the content without HTML tags.
The documents have the file extension ".html" and the content type "text/html". This is the indexer definition:
{
  "name": "my-blob-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-search-index",
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions": ".html",
      "excludedFileNameExtensions": ".png,.jpeg",
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  },
  "schedule": { },
  "fieldMappings": [ ]
}
https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage
I'm using this document as a reference for indexing data from blob storage.
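To make the goal concrete, here is a minimal sketch of the output I am after, written as a standalone Python helper. It is not part of Azure Cognitive Search and does not change how the indexer behaves; it only assumes the HTML can be preprocessed before upload (or cleaned after reading the content field). The name `strip_html` and the class `_TextExtractor` are hypothetical, and only the standard library `html.parser` module is used.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores everything inside <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space and collapse runs of whitespace.
        # Note: a word split by an inline tag ("wor<b>ld</b>") gains a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello &amp; <b>world</b></p></body></html>"
    print(strip_html(sample))
```

The content field should look like the result of `strip_html` rather than the raw markup. Joining chunks with a space is a deliberate simplification so that text from adjacent block elements does not run together.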