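Character encoding problem when submitting a form with mechanize

Posted on 2024-11-19 01:20:25

I am trying to scrape http://www.nscb.gov.ph/ggi/database.asp, specifically all the tables you get from selecting the municipalities/provinces. I am using Python with lxml.html and mechanize. My scraper works fine so far, but I get HTTP Error 500: Internal Server Error when submitting municipality [19], "Peñarrubia, Abra". I suspect this is due to the character encoding: my guess is that the eñe character (n with a tilde above) causes the problem. How can I fix this?

A working example of this part of my script is shown below. As I am just starting out in Python (and often use snippets I find on SO), any further comments are greatly appreciated.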

from BeautifulSoup import BeautifulSoup
import mechanize
import lxml.html
import csv



class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

site = "http://www.nscb.gov.ph/ggi/database.asp"

output_mun = csv.writer(open(r'output-municipalities.csv','wb'))
output_prov = csv.writer(open(r'output-provinces.csv','wb'))

br = mechanize.Browser()
br.add_handler(PrettifyHandler())


# gets municipality stats
response = br.open(site)
br.select_form(name="form2")
muns = br.find_control("strMunicipality2", type="select").items
# municipality #19 is not working, those before do
for pos, item in enumerate(muns[19:]): 
    br.select_form(name="form2")
    br["strMunicipality2"] = [item.name]
    print pos, item.name 
    response = br.submit(id="button2", type="submit")
    html = response.read()
    root = lxml.html.fromstring(html)
    table = root.xpath('//table')[1]
    data = [
               [td.text_content().strip() for td in row.findall("td")] 
               for row in table.findall("tr")
           ]
    print data, "\n"
    for row in data[2:]:
        if row: 
            row.append(item.name)
            output_mun.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
    response = br.open(site) #go back button not working

# provinces follow here

Thank you very much!

Edit: to be specific, the error occurs on this line:

response = br.submit(id="button2", type="submit")

醉城メ夜风 2024-11-26 01:20:25

OK, found it. BeautifulSoup converts the page to unicode, and prettify returns UTF-8 by default.
You should use:

response.set_data(soup.prettify(encoding='latin-1'))
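
To see why the encoding matters, here is a small sketch of what the municipality name looks like on the wire in each encoding. It assumes the server expects Latin-1 form values, which is what the fix above suggests but which the thread does not confirm; the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# The option value that triggers the HTTP 500 in the question.
name = u"Pe\u00f1arrubia, Abra"

# The same text as bytes in the two encodings involved.
utf8_bytes = name.encode("utf-8")      # n-tilde becomes two bytes: C3 B1
latin1_bytes = name.encode("latin-1")  # n-tilde becomes one byte: F1

# Percent-encode the bytes the way they would appear in a form submission.
utf8_wire = quote(utf8_bytes)
latin1_wire = quote(latin1_bytes)

print(utf8_wire)
print(latin1_wire)
```

If the prettified page is UTF-8, the value mechanize reads from the select control and sends back is the first form, while a server working in Latin-1 would only recognise the second.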

迷爱 2024-11-26 01:20:25

Quick and dirty hack:

def _pairs(self):
    return [(k, v.decode('utf-8').encode('latin-1')) for (i, k, v, c_i) in self._pairs_and_controls()]

from mechanize import HTMLForm
HTMLForm._pairs = _pairs

Or something less invasive (I think there are no other solutions, because the class Item protects the 'name' field):

item.__dict__['name'] = item.name.decode('utf-8').encode('latin-1')

before

br["strMunicipality2"] = [item.name]
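
The plain `decode('utf-8').encode('latin-1')` chain raises an exception if a value is not valid UTF-8 or contains a character that Latin-1 cannot represent. A more defensive variant might look like the sketch below; `utf8_to_latin1` is a hypothetical helper, not part of mechanize, and falling back to the unchanged value is an assumption about what is acceptable for this form.

```python
def utf8_to_latin1(value):
    """Re-encode UTF-8 bytes as Latin-1.

    Returns the input unchanged when it is not valid UTF-8 (for example,
    because it is already Latin-1) or when it contains characters that
    Latin-1 cannot represent.
    """
    try:
        return value.decode("utf-8").encode("latin-1")
    except (UnicodeDecodeError, UnicodeEncodeError):
        return value


# UTF-8 input is converted to the single-byte Latin-1 form.
converted = utf8_to_latin1(u"Pe\u00f1arrubia, Abra".encode("utf-8"))

# Input that is already Latin-1 passes through untouched.
already_latin1 = utf8_to_latin1(b"Pe\xf1arrubia, Abra")
```

In either hack above, the helper could be called in place of the inline `decode`/`encode` chain, so that plain ASCII names and names that cannot be converted are submitted as they are instead of aborting the loop.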
