How can I scrape an HTML table to CSV?

Posted 2024-07-07 23:19:23

The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into notepad and save as a CSV. Any better ideas?

The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.

11 Answers

终难遇 2024-07-14 23:19:23
  • Select the HTML table in your tool's UI and copy it into the clipboard (if that's possible)
  • Paste it into Excel.
  • Save as CSV file

However, this is a manual solution not an automated one.

泪冰清 2024-07-14 23:19:23

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from a site like fxquotes.

then...

# Python 2 / old-BeautifulSoup era code: build an OANDA fxhistory URL,
# pull the CSV text out of the page's <pre> block, and save it to disk.
from BeautifulSoup import BeautifulSoup
import urllib

date_s = '&date1=01/01/08'   # start date
date_f = '&date=11/10/08'    # end date
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))   # the quotes sit inside a <pre> tag
data = data.replace('[<pre>', '')
data = data.replace('</pre>]', '')

file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
out = open(file_name, 'w')   # 'out' rather than 'file', to avoid shadowing the builtin
out.write(data)
out.close()

Edit: to get values from a table, an example from palewire:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

table = soup.find("table", border=1)

for row in table.findAll('tr')[1:]:   # skip the header row
    col = row.findAll('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    record = (rank, artist, album, cover_link)
    print "|".join(record)
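
The mechanize and old-BeautifulSoup imports above are Python 2 only. A rough Python 3 equivalent of the table example, assuming beautifulsoup4 is installed (pip install beautifulsoup4):

# Rough Python 3 sketch of the same table scrape, using bs4 and
# urllib.request from the standard library in place of mechanize.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.palewire.com/scrape/albums/2007.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

table = soup.find("table", border="1")

for row in table.find_all("tr")[1:]:   # skip the header row
    col = row.find_all("td")
    record = (col[0].string, col[1].string, col[2].string, col[3].img["src"])
    print("|".join(record))
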
此生挚爱伱 2024-07-14 23:19:23

This is my Python version using the (currently) latest version of BeautifulSoup, which can be obtained using, e.g.,

$ sudo easy_install beautifulsoup4

The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.

#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    # join all text fragments in the cell, whitespace-normalized
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        # grab both <td> and <th> cells
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])   # blank row between tables
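
For example, assuming you saved the script as table2csv.py (the name is arbitrary):

$ python table2csv.py < page.html > tables.csv
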
客…行舟 2024-07-14 23:19:23

Even easier (because it saves it for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a URL prompt. Enter your URL, and it will delimit available tables on the page to import. Voila.

南街女流氓 2024-07-14 23:19:23

Two ways come to mind (especially for those of us that don't have Excel):

  • Google Spreadsheets has an excellent importHTML function:
    • =importHTML("http://example.com/page/with/table", "table", index)
    • Index starts at 1
    • I recommend a copy-and-paste of the values shortly after import
    • File -> Download as -> CSV
  • Python's superb Pandas library has handy read_html and to_csv functions (a minimal sketch follows this list)
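
For instance, a minimal sketch of the pandas route; the URL is a placeholder, and read_html needs lxml or html5lib installed:

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("http://example.com/page/with/table")
tables[0].to_csv("table.csv", index=False)   # write the first table out as CSV
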
淡淡の花香 2024-07-14 23:19:23

Quick and dirty:

Copy out of browser into Excel, save as CSV.

Better solution (for long term use):

Write a bit of code in the language of your choice that will pull the html contents down, and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc.) on top of the data retrieval (see the sketch below). That way, you just have to run your code and you get the actual report that you want.

It all depends on how often you will be performing this particular task.
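
If you do go the code route, a minimal sketch of that idea with pandas; the URL and the "score" column are placeholders for whatever your tool actually returns:

import pandas as pd

# fetch the page, parse the first <table>, then layer the data
# operations (sorting, averaging) on top of the retrieval
df = pd.read_html("http://example.com/report")[0]
df = df.sort_values("score", ascending=False)   # sorting...
print("average:", df["score"].mean())           # ...and averaging
df.to_csv("report.csv", index=False)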

时光与爱终年不遇 2024-07-14 23:19:23

Excel can open an HTTP page.

Eg:

  1. Click File, Open

  2. Under filename, paste the URL of the page containing the table

  3. Click OK

Excel does its best to convert the HTML to a table.

It's not the most elegant solution, but it does work!

七秒鱼° 2024-07-14 23:19:23

Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:

from BeautifulSoup import BeautifulSoup

def table2csv(html_txt):
    csvs = []
    soup = BeautifulSoup(html_txt)
    tables = soup.findAll('table')

    for table in tables:
        csv = ''
        rows = table.findAll('tr')
        row_spans = []       # rowspans still pending from earlier rows
        do_indent = False    # pad the next row for cells spanned from above

        for tr in rows:
            cols = tr.findAll(['th', 'td'])

            for cell in cols:
                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))

                if do_indent:
                    do_indent = False
                    csv += ',' * len(row_spans)

                if rowspan > 1:
                    row_spans.append(rowspan)

                # quote the cell text, then pad out any extra spanned columns
                csv += '"{text}"'.format(text=cell.text) + ',' * colspan

            if row_spans:
                # count down the active rowspans; drop the ones that expired
                for i in xrange(len(row_spans) - 1, -1, -1):
                    row_spans[i] -= 1
                    if row_spans[i] < 1:
                        row_spans.pop()

            do_indent = bool(row_spans)

            csv += '\n'

        csvs.append(csv)

    return '\n\n'.join(csvs)
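
One way to call it, assuming the HTML is saved in a local file (Python 2, like the imports above):

print table2csv(open('page.html').read())
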
黑白记忆 2024-07-14 23:19:23

Here is a tested example that combines grequests and BeautifulSoup to download large quantities of pages from a structured website:

#!/usr/bin/python

from bs4 import BeautifulSoup
import re
import grequests
import time

def cell_text(cell):
    return " ".join(cell.stripped_strings)

def parse_table(body_html):
    # print every table row as a list of cell texts
    soup = BeautifulSoup(body_html)
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = [cell_text(c) for c in row.find_all(re.compile('t[dh]'))]
            print(col)

def process_a_page(response, *args, **kwargs):
    parse_table(response.content)

def download_a_chunk(k):
    chunk_size = 10  # number of html pages per chunk
    x = "http://www.blahblah....com/inclusiones.php?p="
    x2 = "&name=..."
    # pages k*chunk_size through (k+1)*chunk_size - 1
    URLS = [x + str(i) + x2 for i in range(k * chunk_size, (k + 1) * chunk_size)]
    reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
    resp = grequests.map(reqs, size=10)

# download slowly so the server does not block you
for k in range(0, 500):
    print("downloading chunk", k)
    download_a_chunk(k)
    time.sleep(11)
心清如水 2024-07-14 23:19:23

Have you tried opening it with Excel?
If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses.
From a web app I wrote, I spit out this HTML format so the user can export to Excel.

葬心 2024-07-14 23:19:23

If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the HTML along with some scripting to generate a CSV (a rough sketch follows).
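
A rough sketch of that idea, assuming (hypothetically) the table has id="results"; regex parsing of HTML is fragile and only holds up for simple, predictable markup like a tool's fixed report format:

import re
import csv

# pull out just the table with the given id, then split rows and cells
html = open("page.html").read()
table = re.search(r'<table[^>]*id="results".*?</table>', html, re.S).group(0)
rows = re.findall(r"<tr[^>]*>.*?</tr>", table, re.S)

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        cells = re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", row, re.S)
        # strip any inner tags and surrounding whitespace from each cell
        writer.writerow([re.sub(r"<[^>]+>", "", c).strip() for c in cells])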
