How can I scrape an HTML table to CSV?


The Problem

I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.

A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?

My First Idea

Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into notepad and save as a CSV. Any better ideas?

The Solution

Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.

Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.

Comments (11)

终难遇 2024-07-14 23:19:23

  • Select the HTML table in your tool's UI and copy it into the clipboard (if that's possible).
  • Paste it into Excel.
  • Save as a CSV file.

However, this is a manual solution, not an automated one.

泪冰清 2024-07-14 23:19:23

Using Python:

For example, imagine you want to scrape forex quotes in CSV form from a site like fxquotes.

Then...

# Python 2 / BeautifulSoup 3 era code
from BeautifulSoup import BeautifulSoup
import urllib

# Build the OANDA fxhistory URL for a USD/AUD history, asking for CSV output
date_s = '&date1=01/01/08'
date_f = '&date=11/10/08'
fx_url = 'http://www.oanda.com/convert/fxhistory?date_fmt=us'
fx_url_end = '&lang=en&margin_fixed=0&format=CSV&redirected=1'
cur1, cur2 = 'USD', 'AUD'
fx_url = fx_url + date_f + date_s + '&exch=' + cur1 + '&exch2=' + cur1
fx_url = fx_url + '&expr=' + cur2 + '&expr2=' + cur2 + fx_url_end

# Fetch the page and keep only the contents of its single <pre> block
data = urllib.urlopen(fx_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('pre', limit=1))
data = data.replace('[<pre>', '')
data = data.replace('</pre>]', '')

# Write the CSV text to a file
file_location = '/Users/location_edit_this'
file_name = file_location + 'usd_aus.csv'
f = open(file_name, "w")
f.write(data)
f.close()

Edit: to get values from a table (example from palewire):

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()

url = "http://www.palewire.com/scrape/albums/2007.html"
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)

# Find the table, then walk its rows, skipping the header row
table = soup.find("table", border=1)

for row in table.findAll('tr')[1:]:
    col = row.findAll('td')

    rank = col[0].string
    artist = col[1].string
    album = col[2].string
    cover_link = col[3].img['src']

    # Emit one pipe-delimited record per row
    record = (rank, artist, album, cover_link)
    print "|".join(record)
此生挚爱伱 2024-07-14 23:19:23

This is my Python version using the (currently) latest version of BeautifulSoup, which can be obtained using, e.g.,

$ sudo easy_install beautifulsoup4

The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.

#!/usr/bin/python
from bs4 import BeautifulSoup
import sys
import re
import csv

def cell_text(cell):
    return " ".join(cell.stripped_strings)

soup = BeautifulSoup(sys.stdin.read())
output = csv.writer(sys.stdout)

for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        col = map(cell_text, row.find_all(re.compile('t[dh]')))
        output.writerow(col)
    output.writerow([])
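For example, assuming the script above is saved as tables2csv.py (the name is just an illustration), you can feed it a saved page and capture the output:

python tables2csv.py < page.html > tables.csv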
客…行舟 2024-07-14 23:19:23

Even easier (because it saves it for you for next time) ...

In Excel

Data/Import External Data/New Web Query

will take you to a URL prompt. Enter your URL, and it will delimit the available tables on the page to import. Voila.

南街女流氓 2024-07-14 23:19:23

Two ways come to mind (especially for those of us that don't have Excel):

  • Google Spreadsheets has an excellent importHTML function:
    • =importHTML("http://example.com/page/with/table", "table", index)
    • Index starts at 1
    • I recommend a copy and "paste values" shortly after import
    • File -> Download as -> CSV
  • Python's superb Pandas library has handy read_html and to_csv functions; a short sketch follows this list
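A minimal sketch of the Pandas route (assuming pandas plus an HTML parser such as lxml are installed; the URL and file name are placeholders, not from the original):

import pandas as pd

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html("http://example.com/page/with/table")

# Write the first table out as CSV; index=False drops the row numbers
tables[0].to_csv("table.csv", index=False)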
淡淡の花香 2024-07-14 23:19:23

Quick and dirty:

Copy out of browser into Excel, save as CSV.

Better solution (for long term use):

Write a bit of code in the language of your choice that will pull the HTML contents down and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc.) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.

It all depends on how often you will be performing this particular task.
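As a rough sketch of that long-term approach (pandas is one possible choice of tool here; the URL and the Price column are placeholders, not from the original):

import pandas as pd

# Pull the HTML down and parse every table on the page
tables = pd.read_html("http://example.com/report")
df = tables[0]

# Layer the data operations on top of the retrieval step
df = df.sort_values("Price")          # sorting
print(df["Price"].mean())             # averaging
df.to_csv("report.csv", index=False)  # keep a CSV copy as well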

时光与爱终年不遇 2024-07-14 23:19:23

Excel can open an HTTP page.

E.g.:

  1. Click File, Open

  2. Under the file name, paste the URL (e.g., the URL of this question)

  3. Click OK

Excel does its best to convert the HTML to a table.

It's not the most elegant solution, but it does work!

七秒鱼° 2024-07-14 23:19:23

Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:

from BeautifulSoup import BeautifulSoup

def table2csv(html_txt):
    csvs = []
    soup = BeautifulSoup(html_txt)
    tables = soup.findAll('table')

    for table in tables:
        csv = ''
        rows = table.findAll('tr')
        row_spans = []  # remaining row counts for cells spanning multiple rows
        do_indent = False

        for tr in rows:
            cols = tr.findAll(['th', 'td'])

            for cell in cols:
                colspan = int(cell.get('colspan', 1))
                rowspan = int(cell.get('rowspan', 1))

                # At the start of a new row, pad with empty columns for
                # cells still occupied by rowspans from earlier rows
                if do_indent:
                    do_indent = False
                    csv += ',' * (len(row_spans))

                if rowspan > 1: row_spans.append(rowspan)

                # Quote the cell text, padding with one comma per spanned column
                csv += '"{text}"'.format(text=cell.text) + ',' * (colspan)

            # Count down pending rowspans and drop the ones that have ended
            if row_spans:
                for i in xrange(len(row_spans) - 1, -1, -1):
                    row_spans[i] -= 1
                    if row_spans[i] < 1: row_spans.pop()

            do_indent = True if row_spans else False

            csv += '\n'

        csvs.append(csv)

    return '\n\n'.join(csvs)
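A quick way to try it (assuming Python 2 and the old BeautifulSoup 3 package, matching the code above; the HTML string is just an illustration):

html = '<table><tr><th rowspan="2">A</th><td>1</td></tr><tr><td>2</td></tr></table>'
print table2csv(html)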
黑白记忆 2024-07-14 23:19:23

Here is a tested example that combines grequests and BeautifulSoup to download large quantities of pages from a structured website:

#!/usr/bin/python

from bs4 import BeautifulSoup
import re
import grequests
import time

def cell_text(cell):
    return " ".join(cell.stripped_strings)

def parse_table(body_html):
    # Print every table row found in the page as a list of cell texts
    soup = BeautifulSoup(body_html)
    for table in soup.find_all('table'):
        for row in table.find_all('tr'):
            col = list(map(cell_text, row.find_all(re.compile('t[dh]'))))
            print(col)

def process_a_page(response, *args, **kwargs):
    parse_table(response.content)

def download_a_chunk(k):
    chunk_size = 10  # number of html pages per chunk
    x = "http://www.blahblah....com/inclusiones.php?p="
    x2 = "&name=..."
    # Pages k*chunk_size .. (k+1)*chunk_size - 1
    URLS = [x + str(i) + x2 for i in range(k * chunk_size, (k + 1) * chunk_size)]
    reqs = [grequests.get(url, hooks={'response': process_a_page}) for url in URLS]
    resp = grequests.map(reqs, size=10)

# download slowly so the server does not block you
for k in range(0, 500):
    print("downloading chunk ", str(k))
    download_a_chunk(k)
    time.sleep(11)
心清如水 2024-07-14 23:19:23

Have you tried opening it with Excel?
If you save a spreadsheet in Excel as HTML, you'll see the format Excel uses.
From a web app I wrote, I spit out this HTML format so the user can export to Excel.

葬心 2024-07-14 23:19:23

If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the HTML along with some scripting to generate a CSV; a rough sketch follows.
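A minimal sketch of that regex approach (inherently fragile, since regexes don't truly parse HTML; the table ID and file names are placeholders, not from the original):

import re
import csv

html = open("page.html").read()  # placeholder input file

# Isolate the table with the known ID, then pull out its rows and cells
table = re.search(r'<table[^>]*id="results"[^>]*>(.*?)</table>', html, re.S).group(1)
rows = re.findall(r'<tr[^>]*>(.*?)</tr>', table, re.S)

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        cells = re.findall(r'<t[dh][^>]*>(.*?)</t[dh]>', row, re.S)
        # Strip any tags still nested inside each cell before writing
        writer.writerow([re.sub(r'<[^>]+>', '', c).strip() for c in cells])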
