我正在尝试使用 feedparser 解析 RSS 提要,并使用 SQLAlchemy 将其插入到 mySQL 表中。我实际上能够很好地运行它,但今天提要的描述中有一个带有省略号字符的项目,我收到以下错误:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in位置 35:序数不在范围内(256)
如果我向引擎添加 Convert_unicode=True 选项,我可以让插入通过,但省略号不会显示这只是一些奇怪的角色。这似乎是有道理的,因为据我所知,latin-1 中没有水平省略号。即使我将编码设置为 utf-8 似乎也没有什么区别。如果我使用 phpmyadmin 进行插入并包含省略号,那么它会顺利进行。
我想我只是不理解字符编码或如何让 SQLAlchemy 使用我指定的编码。有谁知道如何让文本进入而不出现奇怪的字符?
更新
我想我已经弄清楚了这一点,但我不太确定为什么它很重要...
这是代码:
import sys
import feedparser
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table
COMMON_CHANNEL_PROPERTIES = [
('Channel title:','title', None),
('Channel description:', 'description', 100),
('Channel URL:', 'link', None),
]
COMMON_ITEM_PROPERTIES = [
('Item title:', 'title', None),
('Item description:', 'description', 100),
('Item URL:', 'link', None),
]
INDENT = u' '*4
def feedinfo(url, output=sys.stdout):
feed_data = feedparser.parse(url)
channel, items = feed_data.feed, feed_data.entries
#adding charset=utf8 here is what fixed the problem
db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
metadata = MetaData(db)
rssItems = Table('rss_items', metadata,autoload=True)
i = rssItems.insert();
for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
value = channel[prop]
if trunc:
value = value[:trunc] + u'...'
print >> output, label, value
print >> output
print >> output, "Feed items:"
for item in items:
i.execute({'title':item['title'], 'description': item['description'][:100]})
for label, prop, trunc in COMMON_ITEM_PROPERTIES:
value = item[prop]
if trunc:
value = value[:trunc] + u'...'
print >> output, INDENT, label, value
print >> output, INDENT, u'---'
return
if __name__=="__main__":
url = sys.argv[1]
feedinfo(url)
这是在不使用字符集选项的情况下运行代码的输出/回溯:
Channel title: [H]ardOCP News/Article Feed
Channel description: News/Article Feed for [H]ardOCP...
Channel URL: http://www.hardocp.com
Feed items:
Item title: Windows 8 UI is Dropping the 'Start' Button
Item description: After 15 years of occupying a place of honor on the desktop, the "Start" button will disappear from ...
Item URL: http://www.hardocp.com/news/2012/02/05/windows_8_ui_dropping_lsquostartrsquo_button/
---
Item title: Which Crashes More? Apple Apps or Android Apps
Item description: A new study of smartphone apps between Android and Apple conducted over a two month period came up w...
Item URL: http://www.hardocp.com/news/2012/02/05/which_crashes_more63_apple_apps_or_android/
---
Traceback (most recent call last):
File "parse.py", line 47, in <module>
feedinfo(url)
File "parse.py", line 36, in feedinfo
i.execute({'title':item['title'], 'description': item['description'][:100]})
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2758, in execute
return e._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2304, in _execute_clauseelement
return connection._execute_clauseelement(elem, multiparams, params)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1538, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_context
context)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 330, in do_execute
cursor.execute(statement, parameters)
File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 159, in execute
File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 264, in literal
File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 202, in unicode_literal
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
所以它看起来像是将字符集添加到mysql 连接字符串做到了。我想它默认为 latin-1 吗?我曾尝试将 content_engine 上的编码标志设置为 utf8 但没有任何作用。有人知道为什么当表和字段设置为 utf8 unicode 时它会使用 latin-1 吗?我还尝试在发送之前使用 .encode('cp1252') 对 item['description] 进行编码,即使没有将字符集选项添加到连接字符串,效果也很好。这不应该与 latin-1 一起使用,但显然它可以吗?我已经找到了解决方案,但希望得到答案:)
I am trying to parse an RSS feed with feedparser and insert it into a mySQL table using SQLAlchemy. I was actually able to get this running just fine but today the feed had an item with an ellipsis character in the description and I get the following error:
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
If I add the convert_unicode=True option to the engine I am able to get the insert to go through but the ellipsis doesn't show up it's just weird characters. This seems to make sense since to the best of my knowledge there is no horizontal ellipsis in latin-1. Even if I set the encoding to utf-8 it doesn't seem to make a difference. If I do an insert using phpmyadmin and include the ellipsis it goes through fine.
I'm thinking I just don't understand character encodings or how to get SQLAlchemy to use one I specify. Does anyone know how to get the text to go in without weird characters?
UPDATE
I think I have figured this one out but I'm not really sure why it matters...
Here is the code:
import sys
import feedparser
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table
COMMON_CHANNEL_PROPERTIES = [
('Channel title:','title', None),
('Channel description:', 'description', 100),
('Channel URL:', 'link', None),
]
COMMON_ITEM_PROPERTIES = [
('Item title:', 'title', None),
('Item description:', 'description', 100),
('Item URL:', 'link', None),
]
INDENT = u' '*4
def feedinfo(url, output=sys.stdout):
feed_data = feedparser.parse(url)
channel, items = feed_data.feed, feed_data.entries
#adding charset=utf8 here is what fixed the problem
db = create_engine('mysql://user:pass@localhost/db?charset=utf8')
metadata = MetaData(db)
rssItems = Table('rss_items', metadata,autoload=True)
i = rssItems.insert();
for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
value = channel[prop]
if trunc:
value = value[:trunc] + u'...'
print >> output, label, value
print >> output
print >> output, "Feed items:"
for item in items:
i.execute({'title':item['title'], 'description': item['description'][:100]})
for label, prop, trunc in COMMON_ITEM_PROPERTIES:
value = item[prop]
if trunc:
value = value[:trunc] + u'...'
print >> output, INDENT, label, value
print >> output, INDENT, u'---'
return
if __name__=="__main__":
url = sys.argv[1]
feedinfo(url)
Here's the output/traceback from running the code without the charset option:
Channel title: [H]ardOCP News/Article Feed
Channel description: News/Article Feed for [H]ardOCP...
Channel URL: http://www.hardocp.com
Feed items:
Item title: Windows 8 UI is Dropping the 'Start' Button
Item description: After 15 years of occupying a place of honor on the desktop, the "Start" button will disappear from ...
Item URL: http://www.hardocp.com/news/2012/02/05/windows_8_ui_dropping_lsquostartrsquo_button/
---
Item title: Which Crashes More? Apple Apps or Android Apps
Item description: A new study of smartphone apps between Android and Apple conducted over a two month period came up w...
Item URL: http://www.hardocp.com/news/2012/02/05/which_crashes_more63_apple_apps_or_android/
---
Traceback (most recent call last):
File "parse.py", line 47, in <module>
feedinfo(url)
File "parse.py", line 36, in feedinfo
i.execute({'title':item['title'], 'description': item['description'][:100]})
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/expression.py", line 2758, in execute
return e._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2304, in _execute_clauseelement
return connection._execute_clauseelement(elem, multiparams, params)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1538, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_context
context)
File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 330, in do_execute
cursor.execute(statement, parameters)
File "build/bdist.linux-i686/egg/MySQLdb/cursors.py", line 159, in execute
File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 264, in literal
File "build/bdist.linux-i686/egg/MySQLdb/connections.py", line 202, in unicode_literal
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2026' in position 35: ordinal not in range(256)
So it looks like adding the charset to the mysql connect string did it. I suppose it defaults to latin-1? I had tried setting the encoding flag on content_engine to utf8 and that did nothing. Anyone know why it would use latin-1 when the tables and fields are set to utf8 unicode? I also tried encoding item['description] using .encode('cp1252') before sending it off and that worked as well even without adding the charset option to the connection string. That shouldn't have worked with latin-1 but apparently it did? I've got the solution but would love an answer :)
发布评论
评论(2)
该错误消息
似乎表明某些 Python 语言代码正在尝试将字符
\u2026
转换为 Latin-1 (ISO8859-1) 字符串,但失败。毫不奇怪,该字符是U+2026 HORIZONTAL ELLIPSIS
,在 ISO8859-1 中没有单个等效字符。您通过在 SQLAlchemy 连接调用中添加查询
?charset=utf8
解决了该问题:数据库 URL 告诉我们,以
mysql
开头的 URL 表示 MySQL 方言,使用mysql -python
司机。以下部分,自定义 DBAPI connect() 参数,告诉我们查询参数被传递到底层 DBAPI。
那么,
mysql-python
驱动程序如何使用参数{charset : 'utf8'}
?他们的文档的 函数和属性 部分提到了 < code>charset 属性“...如果存在,则连接字符集将更改为该字符集(如果它们不相等)。”要了解连接字符集的含义,我们转向 10.1.4。 MySQL 5.6 参考手册的连接字符集和排序规则。长话短说,MySQL 可以将传入查询解释为不同于数据库字符集的编码,并且不同于返回查询结果的编码。
由于您报告的错误消息看起来像 Python 而不是 SQL 错误消息,因此我推测 SQLAlchemy 或 mysql-python 中的某些内容正在尝试将查询转换为默认连接编码
latin-1
发送之前。这就是触发错误的原因。但是,connect()
调用中的查询字符串?charset=utf8
会更改连接编码,并且U+2026 HORIZONTAL ELLIPSIS
能够才能通过。更新:您还会问,“如果我删除字符集选项,然后使用 .encode('cp1252') 对描述进行编码,它将顺利完成。省略号如何能够通过 cp1252但不是unicode?”
编码
cp1252
在字节值\x85<处有
一个水平省略号字符/代码>。因此,可以将包含
U+2026 HORIZONTAL ELLIPSIS
的 Unicode 字符串编码为 cp1252,而不会出现错误。还要记住,在 Python 中,Unicode 字符串和字节字符串是两种不同的数据类型。可以合理地推测 MySQLdb 可能有一个通过 SQL 连接仅发送字节字符串的策略。因此,它将把作为 Unicode 字符串接收的查询编码为字节字符串,但将单独保留作为字节字符串接收的查询。 (这是猜测,我没有查看源代码。)
在您发布的回溯中,最后两行(最接近错误发生的位置)显示方法名称
literal
,后跟 <代码>unicode_literal。这往往支持这样的理论:MySQLdb 将其接收到的查询作为 Unicode 字符串编码为字节字符串。当您自己对查询字符串进行编码时,您会绕过 MySQLdb 中以不同方式进行此编码的部分。但请注意,如果您对查询字符串的编码与 MySQL 连接字符集要求的不同,那么您将遇到编码不匹配的情况,并且您的文本可能会被错误存储。
The error message
seems to indicate that some Python language code is trying to convert the character
\u2026
into a Latin-1 (ISO8859-1) string, and it is failing. Not surprising, that character isU+2026 HORIZONTAL ELLIPSIS
, which has no single equivalent character in ISO8859-1.You fixed the problem by adding the query
?charset=utf8
in your SQLAlchemy connection call:The section Database Urls of the SQLAlchemy documentation tells us that a URL beginning with
mysql
indicates a MySQL dialect, using themysql-python
driver.The following section, Custom DBAPI connect() arguments, tells us that query arguments are passed to the underlying DBAPI.
So, what does the
mysql-python
driver make of a parameter{charset: 'utf8'}
? Section Functions and attributes of their documentation says of thecharset
attribute "...If present, the connection character set will be changed to this character set, if they are not equal."To find out what the connection character set means, we turn to 10.1.4. Connection Character Sets and Collations of the MySQL 5.6 reference manual. To make a long story short, MySQL can have interpret incoming queries as an encoding different than the database's character set, and different than the encoding of the returned query results.
Since the error message you reported looks like a Python rather than a SQL error message, I'll speculate that something in SQLAlchemy or mysql-python is attempting to convert the query to a default connection encoding of
latin-1
before sending it. This is what triggers the error. However, the query string?charset=utf8
in yourconnect()
call changes the connection encoding, and theU+2026 HORIZONTAL ELLIPSIS
is able to get through.Update: you also ask, "if I remove the charset option and then encode the description using .encode('cp1252') it will go through just fine. How is an ellipsis able to get through with cp1252 but not unicode?"
The encoding
cp1252
has a horizontal ellipsis character at byte value\x85
. Thus it is possible to encode a Unicode string containingU+2026 HORIZONTAL ELLIPSIS
into cp1252 without error.Remember also that in Python, Unicode strings and byte strings are two different data types. It's reasonable to speculate that MySQLdb might have a policy of sending only byte strings over a SQL connection. Thus it would encode a query received as a Unicode string into a byte string, but would leave a query received as a byte string alone. (This is speculation, I haven't looked at the source code.)
In the traceback you posted, the last two lines (closest to where the error occur) show the method names
literal
, followed byunicode_literal
. That tends to support the theory that MySQLdb is encoding the query it receives as a Unicode string into a byte string.When you encode the query string yourself, you bypass the part of MySQLdb that does this encoding differently. Note, however, that if you encode the query string differently than the MySQL connection charset calls for, then you'll have an encoding mismatch, and your text will likely be stored wrong.
在连接字符串中添加
charset=utf8
肯定有帮助,但我在 Python 2.7 中遇到了将convert_unicode=True
添加到create_engine
也是必要的情况。 SQLAlchemy 文档说这只是为了提高性能,但就我而言,它实际上解决了使用错误编码器的问题。Adding
charset=utf8
in the connection string definitely helps, but I encountered situations in Python 2.7 when addingconvert_unicode=True
tocreate_engine
was also necessary. SQLAlchemy documentation says it's only to boost performance, but in my case it actually solved the problem of the wrong encoder being used.