Dbpedia extraction framework - how to strip out mediawiki formatting markup
I'm playing around with the dbpedia extraction framework. It seems very nice, and I'm happily building ASTs of wikipedia pages and extracting links (using WikiParser). However, although I get a nicely structured tree from the parse, I notice that the text nodes still contain lots of formatting markup (e.g. apostrophes used for italicisation, bolding etc.). For my purposes these are not helpful - I just want the plain text.
I can spend some time writing my own code to strip this out, but I'm presuming that something like this would be useful for dbpedia - and that it exists somewhere in the library. Am I right? And if so - where is the extra functionality to strip down to bare text?
Otherwise - does anyone know of any other (preferably scala) packages to strip out mediawiki markup?
Edit
In response to a request for greater detail. The following markup:
''An italicised '''bit''' of text'', <b>Some markup</b>
Comes through dbpedia as contents of a TextNode but untouched. I would like the ability either to strip it down to:
An italicised bit of text, Some markup
Or possibly to a more structured AST with additional nodes representing each section of raw text, perhaps annotated (on each node) with the type of formatting to be applied (e.g. italics, bold etc).
As is, the end result of a dbpedia parse is still quite full of markup.
Hope that helps.
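To make the second option concrete, here is a minimal Scala sketch of what "text runs annotated with formatting" could look like. It is not part of the dbpedia framework: the names EmphasisSpans and Span are hypothetical, and it only understands apostrophe-based emphasis (two apostrophes toggle italics, three toggle bold, any longer run is treated as toggling both, which simplifies real mediawiki rules).

```scala
object EmphasisSpans {
  // Hypothetical annotated text run: the raw text plus the emphasis in effect.
  final case class Span(text: String, italic: Boolean, bold: Boolean)

  // A run of two or more apostrophes is treated as an emphasis toggle.
  private val Quotes = "'{2,}".r

  def parse(text: String): List[Span] = {
    var italic = false
    var bold = false
    var pos = 0
    val out = List.newBuilder[Span]
    for (m <- Quotes.findAllMatchIn(text)) {
      // Emit the text between the previous toggle and this one.
      if (m.start > pos) out += Span(text.substring(pos, m.start), italic, bold)
      m.matched.length match {
        case 2 => italic = !italic
        case 3 => bold = !bold
        case _ => // simplification: longer runs toggle both flags
          italic = !italic
          bold = !bold
      }
      pos = m.end
    }
    // Emit whatever follows the last toggle.
    if (pos < text.length) out += Span(text.substring(pos), italic, bold)
    out.result()
  }
}
```

Calling EmphasisSpans.parse on the contents of a TextNode would give a flat list of spans; HTML-style tags such as <b> are left untouched by this sketch.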
A quick look at the SimpleWikiParser source code on sourceforge suggests that, as of 1/29/2011, the parser handles the following entities:
Presumably all other wiki content ends up in TextNode objects. Looking at the wiki markup feature set, there would be a non-trivial amount of work to strip out the wiki syntax elements, let alone convert them further into structured elements.
For alternatives or code you can leverage, look at the Alternate Parsers page.
For a self-contained but imperfect solution, you could perform a series of regular expression replacements on node.text.
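As a rough illustration of the regular expression approach, the sketch below strips apostrophe emphasis and simple HTML-style tags from a string. The object name WikiMarkupStripper and its strip method are hypothetical and not part of dbpedia. The two patterns are assumptions that cover only the example in the question; links, templates, tables and other wiki syntax would need additional rules.

```scala
object WikiMarkupStripper {
  // Runs of two or more apostrophes mark italics ('') and bold (''').
  private val Emphasis = "'{2,}".r
  // Simple HTML-style tags such as <b>, </b> or <br/>.
  private val HtmlTag = "</?[A-Za-z][^>]*>".r

  // Remove emphasis markers first, then any remaining simple tags.
  def strip(text: String): String = {
    val noEmphasis = Emphasis.replaceAllIn(text, "")
    HtmlTag.replaceAllIn(noEmphasis, "")
  }
}
```

It would be applied to the text of each TextNode, e.g. WikiMarkupStripper.strip(node.text). For the markup in the question this yields "An italicised bit of text, Some markup". Note that it also removes legitimate doubled apostrophes, which is one reason the approach is imperfect.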
The gwtwiki (bliki) project handles mediawiki formatting -> pdf/html/etc. It is a fairly complete framework for parsing and reformatting mediawiki text.
You can start this process by using WikiUtil.removeWikiEmphasis and adding a few extra rules.
In my case, I map the text to toWikiText and link nodes to their destination name.