How to fix UTF-8 data that was decoded as ISO-8859-1 in Redshift
I assumed a dataset was ISO-8859-1 encoded, while it was actually encoded in UTF-8.
I wrote a Python script in which I decoded the data as ISO-8859-1 and wrote it into a Redshift SQL database.
The messed-up characters were written into the Redshift table as-is; no further decoding happened on write. (I used Python and pandas with the wrong encoding.)
The data source is no longer available, but the data in the table contains a lot of mangled characters.
E.g. 'Hello Günter' -> 'Hello GĂŒnter'
What is the best way to resolve this issue?
Right now I can only think of collecting a complete list of the mangled characters and their translations, but maybe there is an approach I have not thought of.
So my questions:
First of all, I would like to know whether information was lost when the wrong decoding happened.
I would also like to know whether there is a way to fix such a decoding issue inside Redshift. Finally, I have been searching for a complete list of these character mappings so I do not have to create it myself, but I could not find one.
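For concreteness, this is the kind of reversal I am hoping is possible. Below is a quick local sketch (my assumption: the wrong decode really was plain ISO-8859-1, which maps every byte to a code point, so the round trip should be lossless; if another 8-bit encoding such as cp1252 was involved, the exact garbage and the recoverability could differ):

    # -*- coding: utf-8 -*-
    # Local check: simulate the mistake and see whether it can be undone.
    original = u'Hello G\u00fcnter'   # 'Hello Günter', the text as it should be

    # What my script effectively did: took the UTF-8 bytes and decoded them as ISO-8859-1.
    mangled = original.encode('utf-8').decode('iso-8859-1')

    # The hoped-for repair: re-encode the Latin-1 code points to get the original
    # bytes back, then decode those bytes as UTF-8.
    repaired = mangled.encode('iso-8859-1').decode('utf-8')

    print(repr(mangled))             # the mojibake version of the string
    print(repaired == original)      # True here, i.e. nothing was lost in this round trip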
Thank you
EDIT:
I pulled part of the table and found out that I have to do the following:
"Ð\x97амÑ\x83ж вÑ\x8bÑ\x85оди".encode('iso-8859-1').decode('utf8')
The table has billions of rows; would it be possible to do that in Redshift?
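One idea I have not tested yet: Redshift supports scalar Python UDFs (LANGUAGE plpythonu), so perhaps the same round trip could be applied inside the cluster. This is only a sketch; f_fix_mojibake is just a name I made up, and I am assuming the varchar value arrives in the UDF as a Python unicode string:

    CREATE OR REPLACE FUNCTION f_fix_mojibake (s varchar(max))
    RETURNS varchar(max)
    STABLE
    AS $$
        # Undo the wrong decode: map the Latin-1 code points back to their
        # original bytes, then decode those bytes as UTF-8.
        if s is None:
            return None
        try:
            return s.encode('iso-8859-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Rows that were stored correctly (or do not round-trip cleanly)
            # are returned unchanged.
            return s
    $$ LANGUAGE plpythonu;

Then something like UPDATE my_table SET my_column = f_fix_mojibake(my_column) (table and column names made up), or a deep copy via CREATE TABLE ... AS SELECT, could rewrite the data, but I do not know how well a Python UDF performs over billions of rows, which is really what I am asking.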