DBpedia extraction framework - how to strip MediaWiki formatting markup

Posted 2024-10-20 20:33:15

I'm playing around with the DBpedia extraction framework. It seems very nice, and I'm happily building ASTs of Wikipedia pages and extracting links (using WikiParser). However, although I get a nicely structured tree from the parse, the text nodes still contain lots of formatting markup (e.g. the apostrophes used for italics and bold). For my purposes these are not helpful - I just want the plain text.

I can spend some time writing my own code to strip this out, but I'm presuming that something like this would be useful for dbpedia - and that it exists somewhere in the library. Am I right? And if so - where is the extra functionality to strip down to bare text?

Otherwise - does anyone know of any other (preferably scala) packages to strip out mediawiki markup?

Edit

In response to a request for more detail: the following markup

''An italicised '''bit''' of text'', <b>Some markup</b>

comes through DBpedia untouched, as the contents of a TextNode. I would like to be able either to strip it down to:

 An italicised bit of text, Some markup

or possibly to get a more structured AST, with additional nodes representing each section of raw text, each annotated with the type of formatting to be applied (e.g. italics, bold).

As it stands, the end result of a DBpedia parse is still quite full of markup.

Hope that helps.

溇涏 2024-10-27 20:33:15

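A quick look at the SimpleWikiParser source code suggests that, as of 29 January 2011, the parser handles the following entities:

Comments
References
Code blocks
Internal and external links
Properties
Tables

Presumably all other wiki content ends up in TextNode objects. Looking at the wiki markup feature set, it would take a lot of work to parse the remaining wiki syntax elements, let alone convert them further into structured elements.

For alternatives or code you could reuse, see the Alternative Parsers page.

For a self-contained but imperfect solution, you could run a series of regular expression replacements on node.text.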
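
As a rough illustration of the regex approach, here is a minimal sketch that handles only the two cases from the question: apostrophe emphasis markers and simple inline HTML tags. The object and method names (MarkupStripper, strip) are made up for this example and are not part of the DBpedia framework. It works on a plain String, so it assumes you pass in whatever text you get from the node. It is deliberately imperfect: it would also remove a legitimate run of two or more apostrophes, and it ignores templates, tables and other wiki syntax.

```scala
object MarkupStripper {
  // Italic (''), bold (''') and bold-italic (''''') markers:
  // any run of two or more apostrophes.
  private val Emphasis = "'{2,}".r

  // Simple inline HTML tags such as <b>, </b>, <i> or <br/>.
  private val HtmlTag = "</?[A-Za-z][^>]*>".r

  // Hypothetical helper: removes emphasis markers first, then HTML tags.
  def strip(text: String): String = {
    val noEmphasis = Emphasis.replaceAllIn(text, "")
    HtmlTag.replaceAllIn(noEmphasis, "")
  }
}
```

For example, MarkupStripper.strip applied to the sample markup from the question should yield "An italicised bit of text, Some markup". Further rules can be chained on in the same way as more markup turns up in the text nodes.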


潇烟暮雨 2024-10-27 20:33:15

The gwtwiki (bliki) project converts MediaWiki markup to PDF, HTML and other formats. It is a fairly complete framework for parsing and reformatting MediaWiki text.


春风十里 2024-10-27 20:33:15

You can start this process by using WikiUtil.removeWikiEmphasis and adding a few extra rules.

In my case, I map text nodes to their toWikiText and link nodes to the name of their destination.

// Cases from a pattern match over the parsed nodes (the enclosing match is omitted).
case text: TextNode => text.toWikiText
case link: LinkNode =>
  // Each kind of link is reduced to its destination name.
  link match {
    case external: ExternalLinkNode => external.destination.toString
    case internal: InternalLinkNode => internal.destination.decodedWithNamespace
    case inter: InterWikiLinkNode   => inter.destination.decodedWithNamespace
  }
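
To show the overall shape of that mapping in something that runs on its own, here is a small sketch built on stand-in node types. Text, Link and Container are hypothetical classes invented for this example, not the framework's TextNode or LinkNode, and stripEmphasis is a simple local substitute rather than WikiUtil.removeWikiEmphasis, whose exact behaviour I am not assuming here.

```scala
// Hypothetical stand-in node types; these are NOT the DBpedia framework classes.
sealed trait Node
final case class Text(text: String) extends Node
final case class Link(destination: String) extends Node
final case class Container(children: List[Node]) extends Node

object PlainText {
  // Local substitute for an emphasis-removal rule:
  // drops runs of two or more apostrophes (italic/bold markers).
  private def stripEmphasis(s: String): String = s.replaceAll("'{2,}", "")

  // Recursively flattens a tree to plain text:
  // text nodes are cleaned, links become their destination name.
  def render(node: Node): String = node match {
    case Text(t)             => stripEmphasis(t)
    case Link(dest)          => dest
    case Container(children) => children.map(render).mkString
  }
}
```

With the real framework you would match on its own node classes, as in the snippet above, and plug your extra clean-up rules into the text case.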
