MediaWiki + Lucene: how to strip markup?
I have the Lucene search extension (http://www.mediawiki.org/wiki/Extension_talk:Lucene-search) integrated with my MediaWiki installation. It's all working really well; however, Lucene seems to have indexed all the MediaWiki/HTML markup as well, and it is showing up in the results.
For example, searching for "green" returns results containing markup such as style="background:green; color:white
Is there a way to strip all the markup from the search results? I believe Wikipedia uses the same search plugin; how do they do it?
Comments (2)
You will probably have to transform the raw wiki markup before indexing it with Lucene. When dealing with pure XML content, it's possible to just use an XSL transform with
<xsl:value-of select="text()"/>
to extract the text content. I'm afraid that won't work for wiki markup, but maybe you can capture the page after it has been transformed to HTML?
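As a rough illustration of the "capture the page after HTML transformation" idea: once the rendered HTML of a page is available, it could be reduced to plain text before being handed to the indexer. The PHP sketch below is only an assumption about how that step might look. htmlToPlainText() is a hypothetical helper, not part of MediaWiki or the Lucene-search extension, and how the rendered HTML is obtained is deliberately left out.

```php
<?php
// Hypothetical helper (not a MediaWiki or Lucene-search API):
// reduce rendered page HTML to plain text suitable for indexing.
function htmlToPlainText(string $html): string
{
    // Drop <script> and <style> elements together with their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);

    // Put a space in front of every tag so that text from adjacent
    // elements does not run together once the tags are removed.
    $html = preg_replace('/</', ' <', $html);

    // Remove the tags themselves; attributes such as style="..." go with them.
    $text = strip_tags($html);

    // Turn entities such as &amp; back into the characters they stand for.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo htmlToPlainText('<p style="background:green; color:white">Green <b>tea</b></p>'), "\n";
```

With this approach a search for "green" would only hit the visible word, because the style attribute never reaches the index.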
I found a solution to part of the problem. The following change removes HTML markup from the search results. I have not been able to remove wikitext markup yet; any tips on that would be appreciated. Note that I do not use the Lucene search extension.
To fix the problem, simply go into SearchEngine.php and find the method called getTextSnippet(), then add the following line before the "if":
$this->mText = strip_tags( $this->mText );
I found this solution on this random Wiki: http://www.myrandomwiki.com/wiki/MediaWiki_Notes#Strip_HTML_From_Search_Results
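For the wikitext part that strip_tags() leaves behind, one possible direction is a handful of regular expressions applied to the snippet text. The sketch below is a crude approximation, not how MediaWiki or Wikipedia handle it: stripWikitextRoughly() is a hypothetical helper, it only covers simple templates, links, headings, bold/italic quotes and basic table syntax, and nested templates or unusual markup will slip through.

```php
<?php
// Hypothetical helper (not part of MediaWiki): a rough, regex-based
// approximation of wikitext removal for search snippets.
function stripWikitextRoughly(string $text): string
{
    // HTML tags first, as in the strip_tags() fix above.
    $text = strip_tags($text);

    // Simple, non-nested templates: {{stub}}
    $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text);

    // Internal links: [[Target|label]] -> label, [[Target]] -> Target
    $text = preg_replace('/\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]/', '$1', $text);

    // External links with a label: [http://example.org label] -> label
    $text = preg_replace('#\[(?:https?:)?//\S+\s+([^\]]+)\]#', '$1', $text);

    // Headings: == Title == -> Title
    $text = preg_replace('/^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$/m', '$2', $text);

    // Table structure lines: {| ..., |- ..., |}
    $text = preg_replace('/^[ \t]*(?:\{\||\|-|\|\}).*$/m', '', $text);

    // Cell attributes: | style="..." | text -> | text
    $text = preg_replace(
        '/^([|!])(?:[ \t]*[\w-]+[ \t]*=[ \t]*"[^"\n]*")+[ \t]*\|(?!\|)/m',
        '$1',
        $text
    );

    // Leading cell markers (| or !) at the start of a line.
    $text = preg_replace('/^[|!]+[ \t]*/m', '', $text);

    // Bold and italic quote runs: '''bold''', ''italic''
    $text = preg_replace("/'{2,}/", '', $text);

    // Collapse whitespace and trim.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$cell = "{| class=\"wikitable\"\n|-\n| style=\"background:green; color:white\" | '''Green''' tea\n|}";
echo stripWikitextRoughly($cell), "\n";
```

If something like this were used, it would presumably be called in the same place as the strip_tags() line above, but that placement is untested here.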