wolfram-mathematica

Extracting information from HTML using Mathematica

Posted on 2024-12-25 21:33:49

Is there an easy way to extract data from a specific HTML table using Mathematica? Import seems quite powerful, and Mathematica seems to handle formats such as XML well.

Here is an example: http://en.wikipedia.org/wiki/Unemployment_by_country

究竟谁懂我的在乎 2025-01-01 21:33:49

For general examples of this, see these how-tos:

How to | Clean Up Data Imported from a ZIP File
How to | Clean Up Data Imported from a Website

For this specific example, just import it:

tmp = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]

Cleaning it up is fairly straightforward with this import. The table has 3 columns, and the middle column is numeric, so extract the rows from the rest of the content:

tmp1 = Cases[tmp, {_, _?NumberQ, _}, \[Infinity]]
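To see why this pattern picks out only the data rows, here is a minimal sketch on a hand-made nested list. The list `toy` is a made-up stand-in for what Import returns, not the real page data:

```mathematica
(* toy is a hypothetical stand-in for the nested list returned by Import[..., "Data"] *)
toy = {"Nav",
   {{"Country", "Rate", "Source"},
    {"Afghanistan", 35., "2008 [3]"},
    {"Albania", 13.49, "2010 (Q4) [4]"}},
   {"footer", {"x", "y"}}};

(* keep every 3-element list whose middle element is a number, at any depth *)
Cases[toy, {_, _?NumberQ, _}, Infinity]
(* {{"Afghanistan", 35., "2008 [3]"}, {"Albania", 13.49, "2010 (Q4) [4]"}} *)
```

The header row is skipped because its middle element is a string, which is why the header has to be added back by hand later.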

You will presumably want to remove the square bracket references (??):

tmp1[[All, 3]] = Flatten[If[StringQ[#], 
StringCases[#, x__ ~~ Whitespace ~~ "[" ~~ __ :> x], #] & /@ tmp1[[All, 3]]]

Grid[tmp1, Frame -> All]
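One caveat with the StringCases approach above: a string with no bracketed reference yields an empty list, which Flatten then drops, so the column can end up shorter than the table. A possible alternative, sketched here with a hypothetical helper named `stripRef` and made-up sample values, is to delete the references with StringReplace so that every entry maps to exactly one entry:

```mathematica
(* stripRef is a hypothetical helper, not part of the original answer *)
(* remove bracketed numeric references such as "[3]" and trim leftover whitespace *)
stripRef[s_String] := StringTrim@StringReplace[s, "[" ~~ (DigitCharacter ..) ~~ "]" -> ""];
stripRef[x_] := x; (* leave non-strings, e.g. a numeric year, untouched *)

stripRef /@ {"2008 [3]", "2010 (Q4) [4]", "2010 (Q4)", 2009}
(* {"2008", "2010 (Q4)", "2010 (Q4)", 2009} *)
```

With that helper, the column could be rewritten as `tmp1[[All, 3]] = stripRef /@ tmp1[[All, 3]]`.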

Note also that you can add the header back if you want it in your table, which you probably do:

Grid[Join[{{"Country / Region", "Unemployment rate (%)", 
   "Source / date of information"}}, tmp1], Frame -> All]

Purists might object to the last step, but when you are scraping data you generally just want to get the job done, and each site has to be handled case by case. So some manual inspection and flexibility gets you the fastest overall result.

Edit

If you want the flags, you can also get them from CountryData. Some further cleanup is needed, otherwise a lot of misses will occur. The cleanup involves removing the reference to the "sovereign country" in parentheses, e.g. "Guam ( United States )" -> "Guam".

tmp2 = Flatten[
  If[StringMatchQ[#, __ ~~ "(" ~~ __], 
     StringCases[#, 
      z__ ~~ Shortest["(" ~~ __ ~~ ")" ~~ EndOfString] :> 
       StringTrim@z], StringTrim[#]] & /@ tmp1[[All, 1]]]

This will still produce some output that CountryData does not recognize.

flags = CountryData[#, "Flag"] & /@ tmp2;
Cases[flags, _CountryData]

6 misses out of 190. Remove those misses from the output:

flags = If[Head[#] === CountryData, {""}, {#}] & /@ flags; (*much faster than rule replacement*)
tmp2 = Join[flags, tmp1, 2];
Grid[tmp2, Frame -> All]

Note that this takes a while to render.
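The reason each flag (or empty string) is wrapped in its own list is that Join with a level argument of 2 concatenates the sublists row by row. A minimal sketch with made-up placeholder values instead of real flag images:

```mathematica
(* flagsToy and rowsToy are hypothetical placeholders for flags and tmp1 *)
flagsToy = {{"flagA"}, {""}};
rowsToy = {{"Albania", 13.49, "2010 (Q4)"}, {"Nowhere", 1., "2009"}};

(* joining at level 2 prepends each one-element list to the matching row *)
Join[flagsToy, rowsToy, 2]
(* {{"flagA", "Albania", 13.49, "2010 (Q4)"}, {"", "Nowhere", 1., "2009"}} *)
```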

You can obviously style the Grid as desired using Grid options and also resize the images if needed.

热鲨 2025-01-01 21:33:49

While using Import is probably a better and more robust way, I found that, at least for this particular problem, my own HTML parser (published in this thread) works fine with a small amount of post-processing. If you take the code from there and execute it, augmenting it with this function:

Clear[findAndParseTables];
findAndParseTables[text_String] :=
  Module[{parsed = postProcess@parseText[text]},
    DeleteCases[
      Cases[parsed, _tableContainer, Infinity],
      _attribContainer | _spanContainer, Infinity
    ] //.
    {(supContainer | tdContainer | trContainer | thContainer)[x___] :> {x},
        iContainer[x___] :> x,
        aContainer[x_] :> x,
        "\n" :> Sequence[],
       divContainer[] | ulContainer[] | liContainer[] | aContainer[] :> Sequence[]}];

Then, I think, you get pretty much complete data with this code:

text = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Text"];
myData = First@findAndParseTables[text];

Here is how the result looks:

In[92]:= Short[myData,5]
Out[92]//Short= 
tableContainer[{{Country / Region},{Unemployment rate (%)},{Source / date of information}},
{{Afghanistan},{35.0},{2008,{3}}},{{Albania},{13.49},{2010 (Q4),{4}}},
{{Algeria},{10.0},{2010 (September),{5}}},<<188>>,{{West Bank},{17.2},{2010,{43}}},
{{Yemen},{35.0},{2009 (June),{128}}},{{Zambia},{16.0},{2005,{129}}},{{Zimbabwe},{97.0},{2009}}]

What I like about this approach (as opposed to, say, Import with "XMLObject") is that, since I convert the web page into a Mathematica expression with minimal syntax (unlike e.g. XML objects), it is often very easy to establish a set of replacement rules that does the right post-processing in each given case. A final disclaimer: my parser is not robust and certainly contains a number of bugs, so be warned.
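To illustrate the replacement-rule idea without the parser itself, here is a minimal sketch on a hand-built expression. The container heads are the ones used in findAndParseTables above, but the expression `toy` is made up and is only an assumption about what the parser produces:

```mathematica
(* toy is a hypothetical parsed row; real parser output may differ *)
toy = tableContainer[
   trContainer[
    tdContainer["Afghanistan"],
    tdContainer["35.0"],
    tdContainer["2008", supContainer[aContainer["3"]]]]];

(* repeatedly turn structural containers into lists and unwrap links *)
toy //. {(supContainer | tdContainer | trContainer)[x___] :> {x},
   aContainer[x_] :> x}
(* tableContainer[{{"Afghanistan"}, {"35.0"}, {"2008", {"3"}}}] *)
```

The result has the same shape as the rows in the Short output above, with the footnote number nested inside the last cell.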

悲欢浪云 2025-01-01 21:33:49

Not a direct answer to how to import HTML (which others have explained nicely), but getting data from HTML tables is precisely why I originally made my table paste palette.

If your aim is to just get the data, this is probably going to be easier and faster than trying to parse the page.

Instructions on using the palette

1. Evaluate the expression that creates the palette, go to Palettes -> Install Palette... and save it permanently for later use (if you wish).
2. Select a part of the table on the webpage. If you are working with Firefox, hold down CTRL to select any rectangular section of the table (very useful!), then copy it.
3. If you are using Firefox or Chrome, press the TSV button on the palette to paste the data into the notebook at the current insertion point. I'm not sure whether other browsers also separate items with tabs when copying.

The result will look like this:

{{"Afghanistan", 35.`, "2008[3]"}, {"Albania", 13.49`, 
  "2010 (Q4)[4]"}, {"Algeria", 10.`, 
  "2010 (September)[5]"}, {"American Samoa (United States)", 23.8`, 
  "2010[3]"}, {"Andorra", 2.9`, 2009}}

As you can see, some post-processing is needed to convert the years to a consistent format (string or integer?).
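One way to do that post-processing, sketched here under the assumption that strings are the preferred format, is to convert the third column with ToString. The variable `data` holds two rows copied from the result above:

```mathematica
(* two sample rows from the pasted result; the last column mixes strings and integers *)
data = {{"Afghanistan", 35., "2008[3]"}, {"Andorra", 2.9, 2009}};

(* rebuild each row with the third element converted to a string *)
{#1, #2, ToString[#3]} & @@@ data
(* {{"Afghanistan", 35., "2008[3]"}, {"Andorra", 2.9, "2009"}} *)
```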

This is the old palette code. I realize it's in need of cleanup, but it works as it is, and I haven't had time to fix it up yet. Report any issues in comments below.

CreatePalette@Column@{Button["TSV",
    Module[{data, strip},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     strip[s_String] := 
      StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
     strip[e_] := e;
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@Map[strip, ImportString[data, "TSV"], {2}]]
      ]
     ]
    ],
   Button["CSV",
    Module[{data, strip},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     strip[s_String] := 
      StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
     strip[e_] := e;
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@Map[strip, ImportString[data, "CSV"], {2}]]
      ]
     ]
    ],
   Button["Table",
    Module[{data},
     data = NotebookGet[ClipboardNotebook[]][[1, 1, 1]];
     If[Head[data] === String,
      NotebookWrite[InputNotebook[],
       ToBoxes@ImportString[data, "Table"]]
      ]
     ]
    ]}
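The core of the TSV button can be tried without the clipboard. The following is only a sketch with a made-up tab-separated string standing in for what a browser would copy; it reuses the `strip` definitions from the palette code:

```mathematica
(* clip is a hypothetical stand-in for the clipboard contents *)
clip = "Afghanistan\t35.0\t2008[3]\nAlbania \t13.49\t2010 (Q4)[4]";

(* same whitespace-trimming helper as in the palette *)
strip[s_String] := StringReplace[s, RegularExpression["^\\s*(.*?)\\s*$"] -> "$1"];
strip[e_] := e;

(* parse the text as TSV, then trim every string cell at level 2 *)
Map[strip, ImportString[clip, "TSV"], {2}]
```

This should give a nested list resembling the result shown earlier, with numeric cells imported as numbers and the rest left as strings.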

如何视而不见 2025-01-01 21:33:49

Import[
  "http://en.wikipedia.org/wiki/Unemployment_by_country",
  "Data"]

Of course, the result will frequently need further processing. How do you want to visualize it?

You can list all the Import elements available for this page using:

Import[
  "http://en.wikipedia.org/wiki/Unemployment_by_country",
  "Elements"]

心欲静而疯不止 2025-01-01 21:33:49

If you want to go the Import[ ... , "XMLObject" ] route, here is an outline of what you can do.

First, get the page:

page = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "XMLObject"];

Next, get the table of interest (in this case the big table also happens to be the first of seven tables on this page):

table = Cases[page, XMLElement["table", ___], \[Infinity]][[1]]

Next, get a row from the table; I picked the fourth row, which corresponds to Algeria:

row = Cases[table, XMLElement["tr", ___], \[Infinity]][[4]]

Next, extract the table data elements (<td>) from this row:

data = Cases[row, XMLElement["td", ___], \[Infinity]]

Out of those elements, you can pick for example the country flag thumbnail, like so:

image = Cases[data, XMLElement["img", {___, "src" -> src_, ___}, _] :> src, \[Infinity]]

Finally import that image thumbnail (it needed "http:" prepended for some reason):

Import["http:" <> image]
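The same Cases patterns can be tried offline on a hand-built XMLElement. This is only a sketch: the row below is made up (including the example.org URL) and much simpler than the real page. The src value starts with "//", i.e. it is a protocol-relative URL, which is presumably why "http:" had to be prepended above:

```mathematica
(* toyRow is a hypothetical, simplified table row *)
toyRow = XMLElement["tr", {}, {
    XMLElement["td", {}, {
      XMLElement["img", {"alt" -> "", "src" -> "//example.org/flag.png"}, {}],
      "Algeria"}],
    XMLElement["td", {}, {"10.0"}]}];

(* pull the src attribute out of every img element *)
Cases[toyRow, XMLElement["img", {___, "src" -> src_, ___}, _] :> src, Infinity]
(* {"//example.org/flag.png"} *)

(* pull the trailing text out of every td cell *)
Cases[toyRow, XMLElement["td", _, {___, s_String}] :> s, Infinity]
(* {"Algeria", "10.0"} *)
```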

没有心的人 2025-01-01 21:33:49

For certain values of 'easy', yes. See here: HTML Import documentation for Mathematica 8.

You can import from tables using the "Data" element, e.g. Import["file.html", "Data"]. That's a start, but your link is a whole DOM tree's worth of tables, divs and other things. It's documented, but thinly, and you'd have to experiment. It does work with URLs though.

This actually works. With a bit of cleaning you could use the data here:

Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Data"]

About the author

醉南桥
