Web scraping with BeautifulSoup or lxml.html

Published 2024-10-27 15:32:38

I have seen some webcasts and need help in trying to do this:
I have been using lxml.html. Yahoo recently changed the web structure.

Target page:

http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true

In Chrome, using the inspector, I see the data in

 //*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table

followed by some more markup.

How do I get this data out into a list?
I want to change to another stock, from "LLY" to "MSFT".
How do I switch between dates, and get all months?

Answers (4)

凉墨 2024-11-03 15:32:38

I know you said you can't use lxml.html. But here is how to do it using that library, because it is a very good library. I provide the code using it for completeness, since I don't use BeautifulSoup anymore -- it's unmaintained, slow and has an ugly API.

The code below parses the page and writes the results to a CSV file:

import csv
import lxml.html

doc = lxml.html.parse('http://finance.yahoo.com/q/os?s=lly&m=2011-04-15')
# find the first table containing any tr with a td with class yfnc_tabledata1
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]

with open('results.csv', 'w', newline='') as f:
    cf = csv.writer(f)
    # find all trs inside that table:
    for tr in table.xpath('./tr'):
        # add the text of all tds inside each tr to a list
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        # write the list to the csv file:
        cf.writerow(row)

That's it! lxml.html is so simple and nice! Too bad you can't use it.

Here are some lines from the generated results.csv file:

LLY110416C00017500,N/A,0.00,17.05,18.45,0,0,17.50,LLY110416P00017500,0.01,0.00,N/A,0.03,0,182
LLY110416C00020000,15.70,0.00,14.55,15.85,0,0,20.00,LLY110416P00020000,0.06,0.00,N/A,0.03,0,439
LLY110416C00022500,N/A,0.00,12.15,12.80,0,0,22.50,LLY110416P00022500,0.01,0.00,N/A,0.03,2,50
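To address the question of getting the data into a list rather than a file, the same extraction can collect rows in memory. A sketch on a stand-in HTML fragment (the real Yahoo markup may differ; the fragment here is made up for illustration):

```python
import lxml.html

# A minimal stand-in for the quote table; only the class name
# yfnc_tabledata1 is taken from the page described above.
html = '''
<table>
  <tr><td class="yfnc_tabledata1">LLY110416C00017500</td><td>N/A</td><td>0.00</td></tr>
  <tr><td class="yfnc_tabledata1">LLY110416C00020000</td><td>15.70</td><td>0.00</td></tr>
</table>
'''
doc = lxml.html.fromstring(html)
# same table-locating XPath as in the answer above
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]
# build a list of rows, one list of cell strings per tr
rows = [[td.text_content().strip() for td in tr.xpath('./td')]
        for tr in table.xpath('./tr')]
print(rows[0])  # ['LLY110416C00017500', 'N/A', '0.00']
```
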
枉心 2024-11-03 15:32:38

Here is a simple example to extract all data from the stock tables:

import urllib.request
import lxml.html

html = urllib.request.urlopen('http://finance.yahoo.com/q/op?s=lly&m=2014-11-15').read()
doc = lxml.html.fromstring(html)
# scrape figures from each stock table
for table in doc.xpath('//table[@class="details-table quote-table Fz-m"]'):
    rows = []
    for tr in table.xpath('./tbody/tr'):
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(row)
    print(rows)

Then, to extract different stocks and dates, you change the URL. Here is MSFT for the previous day:
http://finance.yahoo.com/q/op?s=msft&m=2014-11-14
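The URL construction described above can be wrapped in a small helper. A sketch; the `s` and `m` query parameters are taken from the URLs shown in this answer, and the helper name is hypothetical:

```python
def options_url(symbol, date_str):
    """Build the old-style Yahoo options URL for a ticker and a YYYY-MM-DD date.

    Whether Yahoo still serves this endpoint is not guaranteed; this only
    shows how to parameterize the URL for different stocks and dates.
    """
    return 'http://finance.yahoo.com/q/op?s=%s&m=%s' % (symbol.lower(), date_str)

print(options_url('MSFT', '2014-11-14'))
# http://finance.yahoo.com/q/op?s=msft&m=2014-11-14
```
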

紫罗兰の梦幻 2024-11-03 15:32:38

If you'd like raw JSON, try MSN:

http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/

You can also specify an expiration date with ?date=11/14/2014:

http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/?date=11/14/2014

If you prefer Yahoo's JSON:

http://finance.yahoo.com/q/op?s=LLY

But you have to extract it from the HTML (here resp is the HTTP response for that page):

import json
import re

m = re.search(r'<script>.+({"applet_type":"td-applet-options-table".+);</script>', resp.content)
data = json.loads(m.group(1))
as_dicts = data['models']['applet_model']['data']['optionData']['_options'][0]['straddles']

The expiration dates are here:

data['models']['applet_model']['data']['optionData']['expirationDates']

Convert the ISO dates to unix timestamps, then re-request the other expirations with the unix timestamp:

http://finance.yahoo.com/q/op?s=LLY&date=1414713600
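The ISO-to-unix conversion mentioned above can be done with the standard library alone. A sketch; the date 2014-10-31 is chosen so the result matches the timestamp in the example URL:

```python
import calendar
from datetime import datetime

def iso_to_unix(iso_date):
    # Parse an ISO date like '2014-10-31' and return the unix
    # timestamp of midnight UTC on that date.
    dt = datetime.strptime(iso_date, '%Y-%m-%d')
    return calendar.timegm(dt.utctimetuple())

print(iso_to_unix('2014-10-31'))  # 1414713600
```
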
柠檬色的秋千 2024-11-03 15:32:38

Building on @hoju's answer:

import calendar
import lxml.html
from datetime import datetime

exDate = "2014-11-22"
symbol = "LLY"
dt = datetime.strptime(exDate, '%Y-%m-%d')
ym = calendar.timegm(dt.utctimetuple())

url = 'http://finance.yahoo.com/q/op?s=%s&date=%s' % (symbol, ym)
doc = lxml.html.parse(url)
table = doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr')

rows = []
for tr in table:
    d = [td.text_content().strip().replace(',', '') for td in tr.xpath('./td')]
    rows.append(d)

print(rows)