帮助使用 PHP 中的正则表达式(解析维基百科标记)

发布于 2024-08-18 08:58:41 字数 2902 浏览 7 评论 0原文

我想从维基百科获取的页面中删除这段文本。

{{Historical populations|type=USA
| 1698|4937
| 1712|5840
| 1723|7248
| 1737|10664
| 1746|11717
| 1756|13046
| 1771|21863
| 1790|33131
| 1800|60515
| 1810|96373
| 1820|123706
| 1830|202589
| 1840|312710
| 1850|515547
| 1860|813669
| 1870|942292
| 1880|1206299
| 1890|1515301
| 1900|3437202
| 1910|4766883
| 1920|5620048
| 1930|6930446
| 1940|7454995
| 1950|7891957
| 1960|7781984
| 1970|7894862
| 1980|7071639
| 1990|7322564
| 2000|8008288
| 2008*|8363710
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007.
}}

以下部分我也希望保留为纯文本(但不包括用“{{”和“}}”包裹的部分,

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}}

谢谢。

I have this bit of text that I want to remove from a page I am fetching from Wikipedia.

{{Historical populations|type=USA
| 1698|4937
| 1712|5840
| 1723|7248
| 1737|10664
| 1746|11717
| 1756|13046
| 1771|21863
| 1790|33131
| 1800|60515
| 1810|96373
| 1820|123706
| 1830|202589
| 1840|312710
| 1850|515547
| 1860|813669
| 1870|942292
| 1880|1206299
| 1890|1515301
| 1900|3437202
| 1910|4766883
| 1920|5620048
| 1930|6930446
| 1940|7454995
| 1950|7891957
| 1960|7781984
| 1970|7894862
| 1980|7071639
| 1990|7322564
| 2000|8008288
| 2008*|8363710
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null®=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007.
}}

The following part I wish to keep as plain text also (but not including parts wrapped with "{{" and "}}"

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}}

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

━╋う一瞬間旳綻放 2024-08-25 08:58:41

我当前使用的代码如下来清理 Wiki 页面,例如:

http:// en.wikipedia.org/wiki/Tel_Aviv(您可以通过单击“编辑此页面”查看标记,

我得到此返回:

“并以其“不夜城地中海大都市”的声誉而闻名国土报社论 特拉维夫是该国的金融首都和主要的表演艺术和商业中心,是中东第二大城市经济体,在《外交政策》2008 年全球城市指数中排名第 42 位。以色列的生活成本很高,其中特拉维夫是该地区最昂贵的城市,也是世界上第 17 位最昂贵的城市。根据纽约人力资源咨询公司美世的数据。截至 2008 年,特拉维夫是中东最昂贵的城市,在世界上排名第 14 位,在这方面仅次于新加坡和巴黎,领先于悉尼和都柏林。相比之下,纽约市排名第 22。”

这是不正确的,预期结果应该是:

Tel Aviv-Yafo(希伯来语:תֵּל־אָבִйב-йָפוֹ;阿拉伯语:Эֵּל־אָבִйב-йָפוֹ;阿拉伯语:Эֵָפוֹ;阿拉伯语:Эֵּל־אָפוֹ;阿拉伯语:Эֵָפוֹ;阿拉伯语:Эֵּל־אָפוֹ;阿拉伯语:Эֵָפוֹ;阿拉伯语:Эֵָפוֹ通常称为特拉维夫,是以色列第二大城市,人口约 393,900 人,位于以色列地中海海岸线,土地面积 51.8 平方公里(20.0 平方英里),是以色列最大的城市。古什丹 (Gush Dan) 都市区人口最多的城市,截至 2008 年拥有 315 万人口。该城市由罗恩·胡尔代 (Ron Huldai) 领导的特拉维夫-雅法市管辖。

对于此 PHP 代码:

function clean_wiki_text($text)
  {
    // first get rid of UGC HTML tags
    $text = strip_tags($text);

    // keep convert tag
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text);

    // remove large blocks (treat as tags)
    $text = preg_replace("/(<![^>]+>)/", '', $text);
    $text = preg_replace('/\{\{\s?/', '<', $text);
    $text = str_replace('}}', ' />', $text);

    $text = str_replace('<! />', '', $text);

    // more wiki formatting
    $text = preg_replace("/'{2,6}/", '', $text);
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text);
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text);

    // drop page link text
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text);
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text);

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text);
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text);
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text);
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text);
    $text = preg_replace('/\n(\*+\s?)/', '', $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text);
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text);

    $text = preg_replace('/={2,}/', '', $text);
    $text = preg_replace('/{?class="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text);

    $text = trim($text);

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text);
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text);
/*
    $config = array(
      'show-body-only' => true,
      'clean'          => false, 
      'wrap'           => 0, 
      'show-warnings'  => 0,
      'show-errors'    => 0,
      'enclose-block-text'   => false,
      'vertical-space' => true,
      'output-html'    => true
    );

    // Tidy
    $tidy = new tidy;
    $tidy->parseString($text, $config, 'utf8');
    $tidy->cleanRepair();

    $text = $tidy->value;
*/
    $extras = array(
  //  "/\((.*?)\)/is" => "",
      "/\[(.*?)\]/is" => ""
    );
    $text = preg_replace(array_keys($extras), array_values($extras), $text);

    $text = str_replace(" ,", ',', $text);
    $text = str_replace(", ", ',', $text);
    $text = str_replace(",", ', ', $text);
    $text = str_replace("(, ", '(', $text);
    $text = str_replace(";,", ',', $text);

    // lets keep it plain plain plain
    $text = strip_tags($text);
//    $text = preg_replace('/\s\s+/', ' ', $text);

    $text = str_replace("|-", '', $text);
    $text = str_replace("|}", '', $text);
    $text = str_replace("|", '', $text);
    $text = str_replace('()', '', $text);
    $text = str_replace(' ', ' ', $text);

    $text = trim($text);

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $result = "";
    foreach ($text_arr as $paragraph) {
      if ( mb_strlen(trim($paragraph)) > 30 ) {
      $result[] = $paragraph;
      }
    }
    return $result;
  }

The current code I am using is the following to clean a Wiki Page, for example this one:

http://en.wikipedia.org/wiki/Tel_Aviv (you can see the markup by clicking "edit this page"

I get this returned:

"and given way to its reputation as a "Mediterranean metropolis that never sleeps". Haaretz Editorial It is the country's financial capital and a major performing arts and business center. Tel Aviv's urban area is the Middle East's second biggest city economy, and is ranked 42nd among global cities by Foreign Policys 2008 Global Cities Index. It is also the most expensive city in the region, and 17th most expensive city in the world. The cost of living in Israel is high, with Tel Aviv being its most expensive city to live in. According to Mercer, a human resources consulting firm based in New York, as of 2008 Tel Aviv is the most expensive city in the Middle East and the 14th most expensive in the world. It falls just behind Singapore and Paris and just ahead of Sydney and Dublin in this respect. By comparison, New York City is 22nd."

Which isn't correct, expected result should be:

Tel Aviv-Yafo (Hebrew: תֵּל־אָבִיב-יָפוֹ; Arabic: تل أبيب‎, Tall ʼAbīb), usually called Tel Aviv, is the second largest city in Israel, with an estimated population of 393,900. The city is situated on the Israeli Mediterranean coastline, with a land area of 51.8 square kilometres (20.0 sq mi). It is the largest and most populous city in the metropolitan area of Gush Dan, home to 3.15 million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.

For this PHP code:

function clean_wiki_text($text)
  {
    // first get rid of UGC HTML tags
    $text = strip_tags($text);

    // keep convert tag
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text);

    // remove large blocks (treat as tags)
    $text = preg_replace("/(<![^>]+>)/", '', $text);
    $text = preg_replace('/\{\{\s?/', '<', $text);
    $text = str_replace('}}', ' />', $text);

    $text = str_replace('<! />', '', $text);

    // more wiki formatting
    $text = preg_replace("/'{2,6}/", '', $text);
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text);
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text);
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text);

    // drop page link text
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text);
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text);

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text);
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text);
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text);
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text);
    $text = preg_replace('/\n(\*+\s?)/', '', $text);
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text);
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text);

    $text = preg_replace('/={2,}/', '', $text);
    $text = preg_replace('/{?class="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text);
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text);

    $text = trim($text);

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text);
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text);
/*
    $config = array(
      'show-body-only' => true,
      'clean'          => false, 
      'wrap'           => 0, 
      'show-warnings'  => 0,
      'show-errors'    => 0,
      'enclose-block-text'   => false,
      'vertical-space' => true,
      'output-html'    => true
    );

    // Tidy
    $tidy = new tidy;
    $tidy->parseString($text, $config, 'utf8');
    $tidy->cleanRepair();

    $text = $tidy->value;
*/
    $extras = array(
  //  "/\((.*?)\)/is" => "",
      "/\[(.*?)\]/is" => ""
    );
    $text = preg_replace(array_keys($extras), array_values($extras), $text);

    $text = str_replace(" ,", ',', $text);
    $text = str_replace(", ", ',', $text);
    $text = str_replace(",", ', ', $text);
    $text = str_replace("(, ", '(', $text);
    $text = str_replace(";,", ',', $text);

    // lets keep it plain plain plain
    $text = strip_tags($text);
//    $text = preg_replace('/\s\s+/', ' ', $text);

    $text = str_replace("|-", '', $text);
    $text = str_replace("|}", '', $text);
    $text = str_replace("|", '', $text);
    $text = str_replace('()', '', $text);
    $text = str_replace(' ', ' ', $text);

    $text = trim($text);

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $result = "";
    foreach ($text_arr as $paragraph) {
      if ( mb_strlen(trim($paragraph)) > 30 ) {
      $result[] = $paragraph;
      }
    }
    return $result;
  }
[旋木] 2024-08-25 08:58:41

只是在这里猜测,但是使用 Wikipedia 的标记库(与 Mediawiki 捆绑在一起),将其转换为 HTML,然后使用您碰巧熟悉的任何 XML 库解析它,不是更容易、更安全吗?

API 文档可以在 http://svn.wikimedia.org/doc/ 找到(在Parser 模块),看起来并不是很复杂。基本上,您所要做的就是如下所示:

<?php

require_once '/path/to/mediawiki/Parser.php';
// also include whatver classes Parser depends on or use Mediawiki's autoload
// mechanism if it has any

// retrieve the content of your page in $content

$parser = new Parser();
$html   = $parser->parse($content);

$simplexml = simplexml_load_string($html);

现在您有了一个非常方便的 SimpleXML 对象可以使用。当然,这只有在 Mediawiki 的解析器生成有效的 XML 时才有效(我敢打赌它确实如此)。

另外,如果 Mediawiki 包含某种自动加载机制,通过在 Mediawiki 的代码库中查找 __autoload 或 spl_autoload_register 很容易找到它。

希望有帮助!

Just guessing here, but wouldn't it be easier and safer to use Wikipedia's markup library (bundled with Mediawiki), turn that into HTML then parse it using whatever XML library you happen to be comfortable with ?

API documentation can be found at http://svn.wikimedia.org/doc/ (in the Parser module) and it doesn't look very complicated. Basically, all you'd have to do is something like the following:

<?php

require_once '/path/to/mediawiki/Parser.php';
// also include whatver classes Parser depends on or use Mediawiki's autoload
// mechanism if it has any

// retrieve the content of your page in $content

$parser = new Parser();
$html   = $parser->parse($content);

$simplexml = simplexml_load_string($html);

Now you have a very handy SimpleXML object to play with. Of course, this only works if Mediawiki's parser produces valid XML (which I bet it does).

Also, should Mediawiki include some kind of autoload mechanism, it would be easy to find it by looking for __autoload or spl_autoload_register in Mediawiki's codebase.

Hope it helps!

泅渡 2024-08-25 08:58:41

当只提供一个示例时,制作正则表达式确实很困难 - 根据我自己清理维基百科页面的经验,我知道其他页面很可能看起来有点不同。只是为了匹配您的示例:

{{.+?}}\n

仅当要删除的部分后面有换行符并且指定 DOTALLMULTILINE 时,此方法才有效。将所有双大括号对和里面的内容与:

{{[^}]+}}

您可以尝试进行多次运行,每次删除另一个不需要的部分 - 我怀疑在单个正则表达式中匹配您需要的所有内容是否可行。

It's really hard to make a regex when only one example is provided - from my own experience with cleeaning wikipedia pages I know that other pages will very probably look a little different. Just to match your example is simply:

{{.+?}}\n

This only works if there is a newline after the part to be removed and if you specifiy DOTALL and MULTILINE. Match all pairs of double curly brackets and the stuff inside with:

{{[^}]+}}

You may try to do several runs, each removing another unwanted part - I doubt it's well doable to match all you need inside a single regex.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文