Scraping HTML tables from a given URL into CSV

Posted 2024-08-28 20:50:44


I seek a tool that can be run on the command line like so:

tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list.
If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from Wikipedia pages and the like -- that's overkill.
  • An option to asciify any Unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.

What would you use to cobble something like this together?
The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables.
This might also be a pretty short Python script with BeautifulSoup.
Would YQL be a good starting point?
Or, ideally, have you written something similar and have a pointer to it?
(I'm surely not the first person to need this.)
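
For what it's worth, the BeautifulSoup route really can be a pretty short script. Here is a minimal sketch of the interface described above (my own illustration, assuming the requests and beautifulsoup4 packages; it ignores nested tables and skips the asciify/regex extras):

#!/usr/bin/env python3
# tablescrape.py - sketch only: summarize tables, or dump table n as CSV
import csv
import sys

import requests
from bs4 import BeautifulSoup

def cell_text(cell):
    # flatten a <td>/<th> into one whitespace-normalized string
    return " ".join(cell.get_text(" ", strip=True).split())

def main():
    url = sys.argv[1]
    n = int(sys.argv[2]) if len(sys.argv) > 2 else None
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    tables = soup.find_all("table")

    if n is None and len(tables) > 1:
        # no index given: numbered list of "header row, total rows"
        for i, table in enumerate(tables, 1):
            rows = table.find_all("tr")
            header = " | ".join(cell_text(c) for c in rows[0].find_all(["th", "td"])) if rows else ""
            print(f"{i}: [{len(rows):3d} rows] {header[:60]}")
    else:
        # index given (or only one table): emit it as CSV on stdout
        writer = csv.writer(sys.stdout)
        for row in tables[(n or 1) - 1].find_all("tr"):
            writer.writerow([cell_text(c) for c in row.find_all(["th", "td"])])

if __name__ == "__main__":
    main()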



Comments (3)

夏尔 2024-09-04 20:50:44


This is my first attempt:

http://yootles.com/outbox/tablescrape.py

It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics

it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:

1: [  1 cols,   1 rows] Contents 1 Men's rec
2: [  7 cols,  25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [  7 cols,  24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]

Then if you run it again, asking for the 2nd table,

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2

You get a reasonable plaintext data table:

100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]
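
On the "better asciifying" point: one common approach (a sketch of my own, not what tablescrape.py does) is to decompose accented characters with Python's unicodedata and drop whatever will not fit in ASCII:

import unicodedata

def asciify(text: str) -> str:
    # NFKD splits "é" into "e" + combining accent; encoding with
    # errors="ignore" then drops the combining marks and any other non-ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(asciify("Marie-José Pérec"))  # -> "Marie-Jose Perec"
print(asciify("Vebjørn Rodal"))     # -> "Vebjrn Rodal" ("ø" has no decomposition)

Characters like "ø" that have no decomposition simply vanish, so a small translation table for those stragglers would still be needed for lossless output.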
却一份温柔 2024-09-04 20:50:44


Using TestPlan I produced a rough script. Given the complexity of web tables, it will likely need to be tailored for each site.

This first script lists the tables on the page:

# A simple table scraping example. It lists the tables on a page
#
# Cmds.Site = the URL to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
GotoURL %Cmds.Site%

set %Count% 1
foreach %Table% in (response //table)
    Notice Table #%Count%
    # find a suitable name, look back for a header
    set %Check% ./preceding::*[name()='h1' or name()='h2' or name()='h3'][1]
    if checkIn %Table% %Check%
        Notice (selectIn %Table% %Check%)
    end

    set %Count% as binOp %Count% + 1
end

The second script then extracts the data of one table into a CSV file.

# Generic extract of contents of a table in a webpage
# Use list_tables to get the list of tables and their indexes
#
# Cmds.Site = the URL to scan
# Cmds.Index = Table index to scan
default %Cmds.Site% http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
default %Cmds.Index% 2

GotoURL %Cmds.Site%

set %Headers% //table[%Cmds.Index%]/tbody/tr[1]
set %Rows% //table[%Cmds.Index%]/tbody/tr[position()>1]

# Get and clean up the header fields
set %Fields% withvector
end
foreach %Header% in (response %Headers%/*)
    putin %Fields% (trim %Header%)
end
Notice %Fields%

# Create an output CSV
call unit.file.CreateDataFile with
    %Name% %This:Dir%/extract_table.csv
    %Format% csv
    %Fields% %Fields%
end
set %DataFile% %Return:Value%

# Now extract each row
foreach %Row% in (response %Rows%)
    set %Record% withvector
    end
    foreach %Cell% in (selectIn %Row% ./td)
        putin %Record% (trim %Cell%)
    end

    call unit.file.WriteDataFile with
        %DataFile% %DataFile%
        %Record% %Record%
    end
end

call unit.file.CloseDataFile with
    %DataFile% %DataFile%
end

My CSV file looks like the output below. Note that Wikipedia embeds extra information in each cell. There are many ways to get rid of it, but no generic one.

Shot put,22.47 m,"Timmermann, UlfUlf Timmermann",East Germany (GDR),1988 1988 Seoul,"01988-09-23 September 23, 1988",[25]
Discus throw,69.89 m,"Alekna, VirgilijusVirgilijus Alekna",Lithuania (LTU),2004 2004 Athens,"02004-08-23 August 23, 2004",[26]
Hammer throw,84.80 m,"Litvinov, SergeySergey Litvinov",Soviet Union (URS),1988 1988 Seoul,"01988-09-26 September 26, 1988",[27]
Javelin throw,90.57 m,"Thorkildsen, AndreasAndreas Thorkildsen",Norway (NOR),2008 2008 Beijing,"02008-08-23 August 23, 2008",[28]
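
The duplicated dates and years above come from hidden sort keys in Wikipedia's markup. There is no generic fix, but a targeted regex pass (my own sketch, in the spirit of the question's "arbitrary regex substitution" feature) handles the common patterns:

import re

def strip_sort_keys(cell: str) -> str:
    # "01988-09-23 September 23, 1988" -> "September 23, 1988"
    cell = re.sub(r"^0\d{4}-\d{2}-\d{2}\s+", "", cell)
    # "1988 1988 Seoul" -> "1988 Seoul"
    cell = re.sub(r"^(\d{4})\s+\1\b\s*", r"\1 ", cell)
    return cell

print(strip_sort_keys("01988-09-23 September 23, 1988"))  # September 23, 1988
print(strip_sort_keys("1988 1988 Seoul"))                 # 1988 Seoul

The doubled names ("Timmermann, UlfUlf Timmermann") come from the same sort-key trick and would need a similar, column-specific rule.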
2024-09-04 20:50:44


Using jq and pup, and a tip of the hat to this SO answer:

#!/bin/bash
# tablescrape - convert nth HTML table on a page to CSV or tab-delimited
# author: https://stackoverflow.com/users/785213
# source: https://stackoverflow.com/a/77031218
set -u

input=${1:?"Expected a file, URL, or '-' as the first argument."}
nth=${2:-1}
mode=${3:-csv}

(
    if [[ -r $input || $input == - ]]; then
        cat "$input"
    else
        # '--location' means "follow redirects"
        curl --silent --show-error --location "$input"
    fi
) \
  | pup "table.wikitable:nth-of-type($nth) tr json{}" \
  | jq --raw-output '.[]
      | [
          .children[]                            # all .children of <tr>s
            | select(.tag=="td" or .tag=="th")   # that are <td>s or <th>s
            | [ .. | .text? ]                    # recurse, looking for .text
            | map(select(.))                     # filter out empty nodes
            | join(" ")                          # concatenate .texts together
        ]                                        # all <td>s/<th>s as an array
      | @'$mode                                  # join together as CSV or TSV

Both jq and pup are super-useful on their own. It seemed like one of those tools (or else xidel) should be able to extract HTML tables directly to a delimited text file, but I guess it isn't so. Fortunately, pipes, man. They're so good!

Usage

RECORDS='https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics'

# read from a URL
./tablescrape $RECORDS 2

# read from a pipe or redirection
curl -sS $RECORDS | ./tablescrape - 1 tsv

curl -sS $RECORDS > records.html
< records.html ./tablescrape - 1 tsv

# read from a file
./tablescrape records.html 1 tsv

Update: Turns out Xidel (0.9.8) can do it. Compliant CSV would be tricky (escaping delimiters and quoting quotes, oh my), but tab-delimited is pretty straightforward, and could be converted by another tool like Miller, or LibreOffice Calc. An advantage of the tab-delimited format is many other Unix text-processing tools already understand it (cut -d $'\t', sort -t $'\t', awk -F '\t'), and in a pinch, you can write a nearly-foolproof parser yourself, e.g., in shell script.

# NB: uses Bash's "ANSI-C quoting" feature for a literal tab character
xidel -s https://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics \
  -e '//table[2]//tr/string-join(td|th,"\t")' \
  | column -t -s $'\t'

Output:

Event                 Record      Athlete(s)                                                   Nation                Games                Date                Ref(s)
100 metres            10.61       Elaine Thompson Herah                                         Jamaica (JAM)        2020 Tokyo           July 31, 2021       [32]
200 metres            ♦21.34      Florence Griffith Joyner                                      United States (USA)  1988 Seoul           September 29, 1988  [33][34]
400 metres            48.25       Marie-José Pérec                                              France (FRA)         1996 Atlanta         July 29, 1996       [35]
⋮
3,000 m steeplechase  8:58.81     Gulnara Galkina-Samitova                                      Russia (RUS)         2008 Beijing         August 17, 2008     [41]
4×100 m relay         ♦40.82      Tianna MadisonAllyson FelixBianca KnightCarmelita Jeter       United States (USA)  2012 London          August 10, 2012     [42]
⋮

That smooshes the names together in the relay events, and there are some pesky leading spaces on the "Nation" column, but it gets you pretty close.
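
And on "compliant CSV would be tricky": one option is to pipe the tab-delimited xidel output through Python's csv module, which takes care of the quoting. A minimal filter (my own sketch, with a hypothetical tsv2csv.py name):

import csv
import sys

# Read tab-separated rows on stdin, write RFC-4180-style CSV on stdout;
# csv.writer quotes any field containing commas, quotes, or newlines.
reader = csv.reader(sys.stdin, delimiter="\t")
writer = csv.writer(sys.stdout)
for row in reader:
    writer.writerow([cell.strip() for cell in row])  # also trims stray spaces

Used like so, continuing the example above:

xidel -s "$RECORDS" -e '//table[2]//tr/string-join(td|th,"\t")' | python3 tsv2csv.py > records.csv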
