Can I use pywikipedia to get just the text of a page?

Posted on 2024-07-25 06:01:44

Is it possible, using pywikipedia, to get just the text of the page, without any of the internal links or templates & without the pictures etc.?

Comments (4)

木落 2024-08-01 06:01:44

If you mean "I want to get the wikitext only", then look at the wikipedia.Page class, and the get method.

import wikipedia  # legacy pywikipedia ("compat") module

site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')

print(page.get())
# '''Test''', '''TEST''' or '''Tester''' may refer to:
# ==Science and technology==
# * [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...

This way you get the complete, raw wikitext from the article.

If you want to strip out the wiki syntax, that is, transform [[Concept inventory]] into Concept inventory and so on, it is going to be a bit more painful.

The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it really hard to parse and to strip. I currently know of no software that lets you do this accurately. There's the MediaWiki Parser class of course, but it's PHP, a bit hard to grasp, and its purpose is very different.

But if you only want to strip out links, or other very simple wiki constructs, use regexes:

import re

text = re.sub(r'\[\[([^\]|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and then for piped links:

text = re.sub(r'\[\[(?:[^\]|]*)\|([^\]|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

and so on.
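The two substitutions can also be folded into one pass. The following is only a minimal sketch: the names WIKILINK and strip_simple_links are made up for illustration, and the pattern deliberately handles nothing beyond plain and single-pipe links.

```python
import re

# [[target]] or [[target|label]]; when a label is present, it is the visible text.
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]|]*))?\]\]")


def strip_simple_links(wikitext):
    """Replace simple internal links with the text a reader would see."""
    def visible(match):
        target, label = match.group(1), match.group(2)
        return label if label is not None else target
    return WIKILINK.sub(visible, wikitext)


print(strip_simple_links("Lorem [[ipsum]] dolor [[sit|SIT]] amet."))
# Lorem ipsum dolor SIT amet.
```

Links with more than one pipe, such as image links with several options, do not match the pattern and are left untouched, so they still need separate handling.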

But, for example, there is no reliable, easy way to strip out nested templates from a page. The same goes for images that have links in their captions. It's quite hard: you have to recursively remove the innermost link, replace it with a marker, and start over. Have a look at the templateWithParams function in wikipedia.py if you want, but it's not pretty.
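To give a feel for the "innermost first, then start over" idea, here is a rough sketch for templates only. It is not what templateWithParams does: it simply deletes each innermost {{...}} instead of replacing it with a marker, and the names INNERMOST_TEMPLATE and strip_templates are hypothetical. Triple-brace parameters, parser functions and braces inside nowiki sections are not handled.

```python
import re

# A template with no braces inside it, i.e. an innermost one.
INNERMOST_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(wikitext):
    """Delete templates from the inside out until a pass removes nothing."""
    while True:
        wikitext, count = INNERMOST_TEMPLATE.subn("", wikitext)
        if count == 0:
            return wikitext


sample = "Intro {{Infobox|name={{lang|fr|Lestat}}|note={{small|x}}}} body."
print(strip_templates(sample))
# Intro  body.
```

Each pass removes one nesting level, so the loop runs once per level plus a final pass that finds nothing. Unbalanced braces are left in place rather than guessed at.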

烟雨凡馨 2024-08-01 06:01:44

There is a module called mwparserfromhell on GitHub that can get you very close to what you want, depending on what you need. It has a method called strip_code() that strips a lot of the markup.

import pywikibot
import mwparserfromhell

test_wikipedia = pywikibot.Site('en', 'test')
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()

full = mwparserfromhell.parse(text)
stripped = full.strip_code()

print(full)
print('*******************')
print(stripped)

Comparison snippet:

{{db-foreign}}
<!--  Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.   

==Publication history==
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


*******************

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.   

Publication history
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 
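
In the output above, the image options (thumb|right|) and the ''...'' / '''...''' quote markup survive strip_code(). If that matters, a small post-processing step can tidy them up. This is only a sketch based on the sample shown here: the option list is an assumption, not a complete list of MediaWiki image options, and tidy_stripped is a made-up name.

```python
import re

# Leading image options left at the start of a line, e.g. "thumb|right|".
IMAGE_OPTIONS = re.compile(
    r"^(?:(?:thumb|thumbnail|frame|frameless|left|right|center|none|\d+px)\|)+",
    re.MULTILINE,
)
# Runs of two or more apostrophes are italic/bold markup in wikitext.
QUOTE_MARKUP = re.compile(r"'{2,}")


def tidy_stripped(text):
    """Drop leftover image options and bold/italic quote markup."""
    text = IMAGE_OPTIONS.sub("", text)
    return QUOTE_MARKUP.sub("", text)


line = "thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''"
print(tidy_stripped(line))
# Stuart Townsend as Lestat in the film Queen of the Damned
```

Single apostrophes, as in "Anne Rice's", are not touched, because only runs of two or more are treated as markup.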

清风夜微凉 2024-08-01 06:01:44

You can use wikitextparser. For example:

import pywikibot
import wikitextparser

en_wikipedia = pywikibot.Site('en', 'wikipedia')
text = pywikibot.Page(en_wikipedia, 'Bla Bla Bla').get()
print(wikitextparser.parse(text).sections[0].plain_text())

will give you:

"Bla Bla Bla" is a song written and recorded by Italian DJ Gigi D'Agostino. It heavily samples the vocals of "Why did you do it?" by British band Stretch. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. It was sampled in the song "Jump" from Lupe Fiasco's 2017 album Drogas Light.

奢望 2024-08-01 06:01:44

Pywikibot itself is able to remove wikitext markup and HTML tags. There are two functions for this in textlib:

  1. removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:

    Returns the text with HTML tags removed, except for the tags listed in keeptags; the text between the tags is kept. For example:

     from pywikibot import textlib
     text = 'This is <small>small</small> text'
     print(textlib.removeHTMLParts(text, keeptags=[]))
    

    this will print:

     This is small text
    
  2. removeDisabledParts(text: str, tags=None, include=[], site=None) -> str:
    Returns the text without the portions where wiki markup is disabled. Unlike removeHTMLParts, this also removes the text enclosed by the matched markup. For example:

     from pywikibot import textlib
     text = 'This is <small>small</small> text'
     print(textlib.removeDisabledParts(text, tags=['small']))
    

    this will print:

     This is  text
    

    There are many predefined tags that can be removed or kept, such as
    'comment', 'header', 'link' and 'template'.

    The default for the tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight'].

    Some other examples:

    removeDisabledParts('See [[this link]]', tags=['link']) gives 'See '
    removeDisabledParts('<!-- no comments -->', tags=['comment']) gives ''
    removeDisabledParts('{{Infobox}}', tags=['template']) gives '', but this works only with Pywikibot 6.0.0 or higher
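
The difference between the two functions (keep the enclosed text or drop it) can be illustrated with plain regular expressions. This is not Pywikibot's implementation, only a standard-library sketch of the two behaviours; the function names are made up for this example.

```python
import re


def drop_tags_keep_text(text, tag):
    """Remove <tag> and </tag> but keep whatever is between them."""
    pattern = r"</?%s\b[^>]*>" % re.escape(tag)
    return re.sub(pattern, "", text, flags=re.IGNORECASE)


def drop_tags_with_content(text, tag):
    """Remove <tag>...</tag> together with the enclosed text."""
    name = re.escape(tag)
    pattern = r"<%s\b[^>]*>.*?</%s\s*>" % (name, name)
    return re.sub(pattern, "", text, flags=re.IGNORECASE | re.DOTALL)


text = 'This is <small>small</small> text'
print(drop_tags_keep_text(text, 'small'))     # This is small text
print(drop_tags_with_content(text, 'small'))  # This is  text
```

The first result matches what the removeHTMLParts example prints, the second what the removeDisabledParts example prints, double space included.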
