How to handle accented characters in a bulk database import in Python and Postgres

Posted on 2024-10-05 20:08:55

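When running a batch import script in Python (openblock), I get the following error for an accented character: invalid byte sequence for encoding "UTF8": 0xca4e

It shows up as: GRAND-CH?NE, COUR DU

But it is actually "GRAND-CHÊNE, COUR DU"

What is the best way to handle this? Ideally I'd like to keep the accented character. I suspect I need to encode it somehow?

Edit: the ? is actually supposed to be Ê. Also note that the value comes from an ESRI Shapefile. When I try davidcrow's solution, I get "Unicode not supported", presumably because the strings without accented characters are already Unicode strings.

Here's the ESRIImporter code I'm using: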

import re
import sys

from django.contrib.gis.gdal import DataSource

class EsriImporter(object):
    def __init__(self, shapefile, city=None, layer_id=0):
        print >> sys.stderr, 'Opening %s' % shapefile
        ds = DataSource(shapefile)

        self.layer = ds[layer_id]
        self.city = "OTTAWA" #city and city or Metro.objects.get_current().name
        self.fcc_pat = re.compile('^(' + '|'.join(VALID_FCC_PREFIXES) + ')\d$')

    def save(self, verbose=False):
        alt_names_suff = ('',)
        num_created = 0
        for i, feature in enumerate(self.layer):
            #if not self.fcc_pat.search(feature.get('FCC')):
            #    continue
            parent_id = None
            fields = {}
            for esri_fieldname, block_fieldname in FIELD_MAP.items():
                value = feature.get(esri_fieldname)
                #print >> sys.stderr, 'Looking at %s' % esri_fieldname

                if isinstance(value, basestring):
                    value = value.upper()
                elif isinstance(value, int) and value == 0:
                    value = None
                fields[block_fieldname] = value
            if not ((fields['left_from_num'] and fields['left_to_num']) or
                    (fields['right_from_num'] and fields['right_to_num'])):
                continue
            # Sometimes the "from" number is greater than the "to"
            # number in the source data, so we swap them into proper
            # ordering
            for side in ('left', 'right'):
                from_key, to_key = '%s_from_num' % side, '%s_to_num' % side
                if fields[from_key] > fields[to_key]:
                    fields[from_key], fields[to_key] = fields[to_key], fields[from_key]
            if feature.geom.geom_name != 'LINESTRING':
                continue
            for suffix in alt_names_suff:
                name_fields = {}
                for esri_fieldname, block_fieldname in NAME_FIELD_MAP.items():
                    key = esri_fieldname + suffix
                    name_fields[block_fieldname] = feature.get(key).upper()
                    #if block_fieldname == 'postdir':
                        #print >> sys.stderr, 'Postdir block %s' % name_fields[block_fieldname]


                if not name_fields['street']:
                    continue
                # Skip blocks with bare number street names and no suffix / type
                if not name_fields['suffix'] and re.search('^\d+$', name_fields['street']):
                    continue
                fields.update(name_fields)
                block = Block(**fields)
                block.geom = feature.geom.geos
                print repr(fields['street'])
                print >> sys.stderr, 'Looking at block %s' % unicode(fields['street'], errors='replace' )

                street_name, block_name = make_pretty_name(
                    fields['left_from_num'],
                    fields['left_to_num'],
                    fields['right_from_num'],
                    fields['right_to_num'],
                    '',
                    fields['street'],
                    fields['suffix'],
                    fields['postdir']
                )
                block.pretty_name = unicode(block_name)
                #print >> sys.stderr, 'Looking at block pretty name %s' % fields['street']

                block.street_pretty_name = street_name
                block.street_slug = slugify(' '.join((unicode(fields['street'], errors='replace' ), fields['suffix'])))
                block.save()
                if parent_id is None:
                    parent_id = block.id
                else:
                    block.parent_id = parent_id
                    block.save()
                num_created += 1
                if verbose:
                    print >> sys.stderr, 'Created block %s' % block
        return num_created

Output:

'GRAND-CH\xcaNE, COUR DU'
Looking at block GRAND-CH�NE, COUR DU
Traceback (most recent call last):

  File "../blocks_ottawa.py", line 144, in <module>
    sys.exit(main())
  File "../blocks_ottawa.py", line 139, in main
    num_created = esri.save(options.verbose)
  File "../blocks_ottawa.py", line 114, in save
    block.save()
  File "/home/chris/openblock/src/django/django/db/models/base.py", line 434, in save
    self.save_base(using=using, force_insert=force_insert, force_update=force_update)
  File "/home/chris/openblock/src/django/django/db/models/base.py", line 527, in save_base
    result = manager._insert(values, return_id=update_pk, using=using)
  File "/home/chris/openblock/src/django/django/db/models/manager.py", line 195, in _insert
    return insert_query(self.model, values, **kwargs)
  File "/home/chris/openblock/src/django/django/db/models/query.py", line 1479, in insert_query
    return query.get_compiler(using=using).execute_sql(return_id)
  File "/home/chris/openblock/src/django/django/db/models/sql/compiler.py", line 783, in execute_sql
    cursor = super(SQLInsertCompiler, self).execute_sql(None)
  File "/home/chris/openblock/src/django/django/db/models/sql/compiler.py", line 727, in execute_sql
    cursor.execute(sql, params)
  File "/home/chris/openblock/src/django/django/db/backends/util.py", line 15, in execute
    return self.cursor.execute(sql, params)
  File "/home/chris/openblock/src/django/django/db/backends/postgresql_psycopg2/base.py", line 44, in execute
    return self.cursor.execute(query, args)

django.db.utils.DatabaseError: invalid byte sequence for encoding "UTF8": 0xca4e
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".



Comments (3)

白云悠悠 2024-10-12 20:08:55

More information please. What platform - Windows / Linux / ???

What version of Python?

If you are running Windows, your encoding is much more likely to be cp1252 or similar than ISO-8859-1. It's definitely not UTF-8.

You will need to: (1) Find out what your input data is encoded with. Try cp1252; it's the usual suspect. (2) decode your data into unicode (3) encode it into UTF-8.

How are you getting the data out of your ESRI shapefile? Show your code. Show the full traceback and error message. To avoid visual problems (it's E-grave! no, it's E-acute!) print repr(the_suspect_data) and copy/paste the result into an edit of your question. Go easy on the bold type.
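
The three steps above can be sketched roughly as follows. This is only an illustration, not the importer's actual code: the sample bytes are copied from the repr() output in the question, the helper name decode_with_candidates is made up for this example, and the candidate list (UTF-8 first, then cp1252) is an assumption about what the shapefile might contain. It is written for Python 3; on Python 2 the same .decode() and .encode() calls apply to str values.

```python
# Sample bytes copied from the repr() output shown in the question.
RAW_STREET = b'GRAND-CH\xcaNE, COUR DU'


def decode_with_candidates(raw, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    Hypothetical helper: the candidate list is a guess, not something
    read from the shapefile itself.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of %r could decode %r' % (candidates, raw))


# Steps 1 and 2: find an encoding that fits and decode to a text string.
text, used = decode_with_candidates(RAW_STREET)

# Step 3: encode the text as UTF-8 before it goes anywhere that expects UTF-8.
utf8_bytes = text.encode('utf-8')
```

Trying UTF-8 first is a heuristic: accented cp1252 text rarely happens to be valid UTF-8, so a clean UTF-8 decode is a reasonable sign that the data was UTF-8 all along. If the real encoding of the shapefile is known, passing it directly is safer than guessing.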

萌化 2024-10-12 20:08:55

Looks like the data isn't being sent as UTF-8... so check the client_encoding parameter in your DB session matches your data, or translate it to UTF-8/Unicode within Python when reading the file.

You can change the DB session's client encoding using "SET client_encoding = 'ISO-8859-1'" or similar. 0xca isn't E-with-grave in Latin1, though, so I'm not sure which character encoding your file is in?
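
For what it's worth, the reported sequence 0xca4e can be reproduced outside the database. In UTF-8, 0xCA opens a two-byte sequence whose second byte must lie in the range 0x80 to 0xBF, but here it is followed by 0x4E (the letter N). In both ISO-8859-1 and cp1252, 0xCA is Ê (E with circumflex). The sketch below, written for Python 3, checks this on the bytes from the question; the helper name first_invalid_utf8_offset is hypothetical.

```python
# Sample bytes copied from the repr() output shown in the question.
RAW_STREET = b'GRAND-CH\xcaNE, COUR DU'


def first_invalid_utf8_offset(raw):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        return exc.start
    return None


offset = first_invalid_utf8_offset(RAW_STREET)

# The offending byte plus the one after it: the pair named in the error message.
bad_pair = RAW_STREET[offset:offset + 2]

# Both single-byte encodings map 0xCA to the same character.
as_latin1 = RAW_STREET.decode('iso-8859-1')
as_cp1252 = RAW_STREET.decode('cp1252')
```

Since both encodings agree on this byte, this one sample cannot tell them apart; other records in the file (for example ones containing curly quotes or the euro sign) would be needed to decide between them.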

枕梦 2024-10-12 20:08:55

You can try something like:

uString = unicode(item.field, "utf-8")

See http://evanjones.ca/python-utf8.html for more details about Unicode and Python.
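
One caveat, based on the edit to the question: calling unicode() with an encoding on a value that is already a Unicode string raises a TypeError, and "utf-8" will not decode the byte 0xCA in this data anyway. A guarded helper along the following lines avoids both problems. It is a sketch under assumptions: ensure_text and normalise_field are hypothetical names, cp1252 is a guess at the shapefile's encoding, and the code targets Python 3 (on Python 2, test against str and unicode instead of bytes and str).

```python
def ensure_text(value, encoding='cp1252'):
    """Decode byte strings; return text and non-string values unchanged.

    The default encoding is an assumption about the source data.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def normalise_field(value, encoding='cp1252'):
    """Decode first, then upper-case, so accented letters are upper-cased too."""
    value = ensure_text(value, encoding)
    if isinstance(value, str):
        return value.upper()
    return value


# Byte string with a lower-case e-circumflex (0xEA in cp1252).
street = normalise_field(b'Grand-ch\xeane, cour du')

# Values that are already text, or not strings at all, pass through safely.
already_text = normalise_field(u'rue Main')
house_number = normalise_field(42)
```

Decoding before calling upper() matters because upper-casing raw bytes only touches ASCII letters and leaves accented ones as they were.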
