Creating a dictionary from a list of special characters

Posted on 2024-11-19 14:27:21

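I'm working on this small script: basically, it maps each character of a string (which includes special characters) to its index to create a dictionary.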

#!/usr/bin/env python
#-*- coding: latin-1 -*-

ln1 = '?0>9<8~7|65"4:3}2{1+_)'
ln2 = "(*&^%$£@!/`'\][=-#¢"

refStr = ln2+ln1

keyDict = {}
for i in range(0,len(refStr)):
    keyDict[refStr[i]] = i


print "-" * 32
print "Originl: ",refStr
print "KeyDict: ", keyDict

# added just to test a few special characters
tsChr = ['£','%','\\','¢']

for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else: print k, "\t", "not in the dic."

It returns a result like this:

Originl:  (*&^%$£@!/`'\][=-#¢?0>9<8~7|65"4:3}2{1+_)
KeyDict:  {'!': 9, '\xa3': 7, '\xa2': 20, '%': 4, '$': 5, "'": 12, '&': 2, ')': 42, '(': 0, '+': 40, '*': 1, '-': 17, '/': 10, '1': 39, '0': 22, '3': 35, '2': 37, '5': 31, '4': 33, '7': 28, '6': 30, '9': 24, '8': 26, ':': 34, '=': 16, '<': 25, '?': 21, '>': 23, '@': 8, '\xc2': 19, '#': 18, '"': 32, '[': 15, ']': 14, '\\': 13, '_': 41, '^': 3, '`': 11, '{': 38, '}': 36, '|': 29, '~': 27}

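That is all good, except that the characters £, ¢ and \ show up as \xa3, \xa2 and \\ respectively. Does anyone know why printing ln1/ln2 works fine but printing the dictionary does not? How can I fix this? Any help is greatly appreciated. Cheers!!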

Update 1

I've added extra special characters, # and ¢, and this is what I get following @Duncan's suggestion:

! 9
? 7
? 20
% 4
$ 5
....
....
8 26
: 34
= 16
< 25
? 21
> 23
@ 8
? 19
....
....

Notice the 7th, 19th and 20th elements, which are not printing correctly at all. The 21st element is the actual ? character. Cheers!!

Update 2

I just added this loop to my original post to test what I actually need:

tsChr = ['£','%','\\','¢']
for k in tsChr:
    if k in keyDict:
        print k, "\t", keyDict[k]
    else: print k, "\t", "not in the dic."

and this is what I get as a result:

£   not in the dic.
%   4
\   13
¢   not in the dic.

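When running the script, it thinks that £ and ¢ are not actually in the dictionary, and that's my problem. Does anyone know how to fix this, or what I am doing wrong and where?

Eventually, I'll be checking whether characters from a file (or a line of text) exist in the dictionary, and there is a chance that the text contains characters like é or £. Cheers!!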



芯好空 2024-11-26 14:27:21

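When you print a dictionary or list that contains strings, Python displays the repr() of the strings. If you print repr(ln2) you'll see that nothing has changed: your dictionary keys are just the latin-1 encoded bytes of '£' and the other characters.

If you do: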

for k in keyDict:
    print k, keyDict[k]

then the characters will display as you expect.
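A minimal sketch of the repr() point, written for Python 3 rather than the Python 2 used in the question. It assumes `bytes` as a stand-in for Python 2's `str`; the names `key`, `key_dict`, `container_view` and `element_view` are illustrative only.

```python
# Python 3 sketch; bytes plays the role of Python 2's byte-oriented str.
key = "£".encode("latin-1")      # the single byte 0xA3

key_dict = {key: 7}

# Printing a container shows the repr() of each key, escapes included.
container_view = repr(key_dict)
print(container_view)

# Decoding an individual key gives back the character itself.
element_view = key.decode("latin-1")
print(element_view)
```

The stored key is the same in both cases; only the way it is displayed differs.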



那请放手 2024-11-26 14:27:21

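In my humble opinion, it would be useful to learn about Unicode in general and its use in Python.

If you are not interested in why things got messy enough that you have to deal with '\xa3' instead of a plain £, then Duncan's answer above is perfect and tells you everything you want to know.

Update (regarding your Update 2)

Please make sure your file is saved with latin-1 encoding and not UTF-8, as it is now, and your test will pass (or change #-*- coding: latin-1 -*- to #-*- coding: utf-8 -*- and use unicode literals, as in the code in the P.S. below).

This is something you can easily work out after reading (and understanding) the material from my links above:

Your file is saved as UTF-8, which means the character £ takes 2 bytes. Because you tell the Python interpreter that the encoding is latin-1, it treats each of those 2 bytes as a separate character, and each one becomes its own key.

In fact, I count 19 characters in ln2, but len(ln2) returns 21.

When you test '£' in keyDict.keys() you are looking for a 2-character string, while each of those 2 characters has its own key in the dictionary. That is why the lookup fails.

You can also check len(keyDict) and find that it is larger than you expect.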
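The byte counts above can be reproduced with a short sketch. It is written for Python 3 and assumes that indexing the UTF-8 bytes one at a time mirrors what the Python 2 script does with its `str` literals; `as_saved`, `key_dict` and `pound` are illustrative names.

```python
# Python 3 sketch of the mismatch described above.
ln2 = "(*&^%$£@!/`'\\][=-#¢"

as_saved = ln2.encode("utf-8")   # the bytes actually stored in the file

print(len(ln2), len(as_saved))   # 19 characters, 21 bytes

# Map every single byte to its index, as the original loop does.
key_dict = {as_saved[i:i + 1]: i for i in range(len(as_saved))}

pound = "£".encode("utf-8")      # two bytes: 0xC2 0xA3
print(len(pound))                # 2
print(pound in key_dict)         # False: no key is two bytes long

# Each half of the pair has its own entry instead.
print(key_dict[b"\xa3"], key_dict[b"\xc2"])   # 7 19
```

The value 19 for `b"\xc2"` comes from ¢, whose first byte is also 0xC2 and overwrites the entry created by £ at index 6. That matches the `'\xc2': 19` entry in the question's output.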

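I guess this explains everything. Not all of the story is easy to explain in a single web page, but the links above are, in my humble opinion, a nice starting point, mixing some background with some coding examples.

Cheers

P.S.: I'm using this code, saved as UTF-8, and it works flawlessly: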

#!/usr/bin/env python
#-*- coding: utf-8 -*-

ln1 = u'?0>9<8~7|65"4:3}2{1+_)'
ln2 = u"(*&^%$£@!/`'\][=-#¢"

refStr = u"%s%s" % (ln2, ln1)

keyDict = {}
for idx, chr_ in enumerate(refStr):
    print chr_,
    keyDict[chr_] = idx

print u"-" * 32
print u"Originl: ", refStr
print u"KeyDict: ", keyDict

tsChr = [u'£', u'%', u'\\', u'¢']
for k in tsChr:
    if k in keyDict.keys():
        print k, "\t", keyDict[k]
    else: print k, repr(k), "\t", "not in the dic."
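For readers on Python 3, where `str` is already Unicode, the same idea might look like the sketch below. It also covers the stated goal of checking the characters of a line of text against the dictionary. The helper `lookup` and the `sample` line are hypothetical and not part of the original answer.

```python
# Python 3 sketch: every character of a str is one dictionary key.
ln1 = '?0>9<8~7|65"4:3}2{1+_)'
ln2 = "(*&^%$£@!/`'\\][=-#¢"
ref_str = ln2 + ln1

key_dict = {ch: idx for idx, ch in enumerate(ref_str)}


def lookup(text):
    """Return (character, index or None) for each character in text."""
    return [(ch, key_dict.get(ch)) for ch in text]


# Hypothetical input line. When reading a real file, pass its actual
# encoding to open(), for example open(path, encoding="utf-8").
sample = "£5 é"
for ch, idx in lookup(sample):
    print(repr(ch), idx if idx is not None else "not in the dict")
```

Using `dict.get` keeps the membership test and the lookup in one step and returns `None` for characters such as é that are not in the reference string.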

