Python 2.x: how to automate enforcing unicode instead of string?

Posted 2024-09-29 19:34:55 · 338 words · 11 views · 0 comments

How can I automate a test to enforce that a body of Python 2.x code contains no string instances (only unicode instances)?

Eg.

Can I do it from within the code?

Is there a static analysis tool that has this feature?

Edit:

I wanted this for an application in Python 2.5, but it turns out this is not really possible because:

  1. 2.5 doesn't support unicode_literals
  2. kwargs dictionary keys can't be unicode objects, only strings

So I'm accepting the answer that says it's not possible, even though it's for different reasons :)

Comments (3)

柠檬 2024-10-06 19:34:55

You can't enforce that all strings are Unicode; even with from __future__ import unicode_literals in a module, byte strings can be written as b'...', as they can in Python 3.

There was an option that could be used to get the same effect as unicode_literals globally: the command-line option -U. However it was abandoned early in the 2.x series because it basically broke every script.

What is your purpose for this? It is not desirable to abolish byte strings. They are not “bad” and Unicode strings are not universally “better”; they are two separate animals and you will need both of them. Byte strings will certainly be needed to talk to binary files and network services.

If you want to be prepared for the transition to Python 3, the best tack is to write b'...' for all the strings you really mean to be bytes, and u'...' for the strings that are inherently Unicode. The default '...' string form can be used for everything else: places where you don't care which type you get, or whether Python 3 changes the default string type.
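
As a rough illustration of that advice (not code from the answer), the sketch below marks binary data with b'...' and text with u'...', and converts between them only at the I/O boundary. The names `to_wire` and `from_wire` and the UTF-8 default are assumptions made for the example.

```python
# Explicit literal prefixes: valid in Python 2.6+ (b'...') and Python 3.3+ (u'...').
payload = b'\x00\x01GET'   # really bytes: binary/wire data
greeting = u'caf\xe9'      # really text: "cafe" with an acute accent on the e


def to_wire(text, encoding='utf-8'):
    """Encode text to bytes just before it leaves the program (hypothetical helper)."""
    return text.encode(encoding)


def from_wire(data, encoding='utf-8'):
    """Decode bytes to text as soon as they enter the program (hypothetical helper)."""
    return data.decode(encoding)


encoded = to_wire(greeting)
print(repr(encoded))
print(from_wire(encoded) == greeting)
```

Keeping the two kinds of literal visibly distinct is what makes a later automated check meaningful: anything left unprefixed is then a deliberate "don't care".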

烟柳画桥 2024-10-06 19:34:55

It seems to me like you really need to parse the code with an honest-to-goodness Python parser. Then you will need to dig through the parse tree it produces to see if it contains any string literals that are not unicode.

It looks like Python comes with a parser out of the box. From this documentation I got this code sample working:

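import parser
from token import tok_name

def checkForNonUnicode(codeString):
    # parser.suite() only parses the source text; it never executes it.
    return checkForNonUnicodeHelper(parser.suite(codeString).tolist())

def checkForNonUnicodeHelper(lst):
    # Walk the nested-list parse tree. Returns False if any string literal
    # lacks a u/U prefix, True otherwise.
    returnValue = True
    nodeType = lst[0]
    if tok_name.get(nodeType) == 'STRING':
        stringValue = lst[1]
        # Prefix check on the literal's source text. Still kind of hacky:
        # it also flags b'...' literals and ignores unicode_literals.
        if stringValue[0] not in "uU":
            print "%s is not unicode!" % stringValue
            returnValue = False

    else:
        for subNode in lst[1:]:
            # Always recurse, so every offending literal is reported,
            # not only the first one found.
            if isinstance(subNode, list) and not checkForNonUnicodeHelper(subNode):
                returnValue = False

    return returnValue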

print checkForNonUnicode("""
def foo():
    a = 'This should blow up!'
""")
print checkForNonUnicode("""
def bar():
    b = u'although this is ok.'
""")

which prints out

'This should blow up!' is not unicode!
False
True

Now docstrings aren't unicode but should be allowed, so you might have to do something more complicated, like from symbol import sym_name, where you can look up which node types are for class and function definitions. Then the first sub-node that's simply a string, i.e. not part of an assignment or whatever, should be allowed to not be unicode.
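
One way the docstring exception might be sketched is shown below. Note that it swaps in the standard `tokenize` module instead of `parser` and `symbol`, and it uses a heuristic rather than real node types: a string literal that makes up a whole statement by itself is treated as a docstring and skipped. The function name `find_non_unicode_literals` is made up for the example, and the block is written for a current Python interpreter that only reads the checked source as text.

```python
import io
import tokenize

# Tokens that say nothing about statement boundaries.
_SKIP = (tokenize.COMMENT, tokenize.NL)
# Tokens after which a new logical line begins.
_LINE_START = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def find_non_unicode_literals(source):
    """Return (line, literal) pairs for string literals without a u/U prefix.

    Heuristic: a string that is a complete statement on its own
    (docstring-style) is allowed and therefore not reported.
    """
    readline = io.StringIO(source).readline
    tokens = [t for t in tokenize.generate_tokens(readline) if t[0] not in _SKIP]
    hits = []
    for i, tok in enumerate(tokens):
        tok_type, text, start = tok[0], tok[1], tok[2]
        if tok_type != tokenize.STRING or text[0] in 'uU':
            continue
        prev_type = tokens[i - 1][0] if i > 0 else tokenize.NEWLINE
        next_type = tokens[i + 1][0] if i + 1 < len(tokens) else tokenize.NEWLINE
        if prev_type in _LINE_START and next_type == tokenize.NEWLINE:
            continue  # bare string statement, treated as a docstring
        hits.append((start[0], text))
    return hits


sample = (
    "def foo():\n"
    "    \"\"\"Docstrings are allowed.\"\"\"\n"
    "    a = 'This should blow up!'\n"
    "    b = u'although this is ok.'\n"
)
print(find_non_unicode_literals(sample))  # [(3, "'This should blow up!'")]
```

The heuristic also lets through any other bare string statement, and it reports a docstring written as several adjacent literals, so it is looser than a check based on the actual class and function nodes.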

Good question!

Edit

Just a follow-up comment. Conveniently for your purposes, parser.suite does not actually evaluate your Python code. This means that you can run this parser over your Python files without worrying about naming or import errors. For example, let's say you have myObscureUtilityFile.py that contains

from ..obscure.relative.path import whatever

You can

checkForNonUnicode(open('/whoah/softlink/myObscureUtilityFile.py').read())
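
Because nothing is imported or executed, the same idea can be pointed at a whole source tree from a test. The following is a minimal sketch under stated assumptions: it uses `tokenize` rather than the `parser` module, the names `non_unicode_literals_in_file` and `check_tree` are hypothetical, files are assumed to be UTF-8, and docstrings are not exempted here.

```python
import io
import os
import tokenize


def non_unicode_literals_in_file(path):
    """Tokenize one file, without importing it, and list unprefixed strings."""
    # encoding='utf-8' is an assumption; real files may declare another coding.
    with io.open(path, encoding='utf-8') as handle:
        tokens = tokenize.generate_tokens(handle.readline)
        return [(tok[2][0], tok[1]) for tok in tokens
                if tok[0] == tokenize.STRING and tok[1][0] not in 'uU']


def check_tree(root):
    """Map each .py file under root to its offending (line, literal) pairs."""
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.py'):
                path = os.path.join(dirpath, name)
                hits = non_unicode_literals_in_file(path)
                if hits:
                    report[path] = hits
    return report
```

A test could then assert that `check_tree('path/to/project')` returns an empty dict. Source that uses Python 2-only syntax may still trip a newer interpreter's tokenizer, so running the check under the same Python version as the code being checked is the safer choice.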

多孤肩上扛 2024-10-06 19:34:55

Our SD Source Code Search Engine (SCSE) can provide this result directly.

The SCSE provides a way to search extremely quickly across large sets of files using some of the language structure to enable precise queries and minimize false positives. It handles a wide array of languages, even at the same time, including Python. A GUI shows search hits and a page of actual text from the file containing a selected hit.

It uses lexical information from the source languages as the basis for queries, which are composed of various language keywords and pattern tokens that match language elements with varying content. SCSE knows the types of lexemes available in the language. One can search for a generic identifier (using query token I) or an identifier matching some regular expression. Similarly, one can search for a generic string (using query token "S" for "any kind of string literal") or for a specific type of string (for Python including "UnicodeStrings", non-unicode strings, etc., which collectively make up the set of Python things comprising "S").

So a search:

 'for' ... I=ij*

finds the keyword 'for' near ("...") an identifier whose prefix is "ij" and shows you all the hits. (Language-specific whitespace, including line breaks and comments, is ignored.)

A trivial search:

  S

finds all string literals. This is often a pretty big set :-}

A search

 UnicodeStrings

finds all string literals that are lexically defined as Unicode strings (u"...").

What you want are all strings that aren't UnicodeStrings. The SCSE provides a "subtract" operator that subtracts hits of one kind that overlap hits of another. So your question, "what strings aren't unicode" is expressed concisely as:

  S-UnicodeStrings

All hits shown will be the strings that aren't unicode strings, which is precisely your question.

The SCSE provides logging facilities so that you can record hits. You can run SCSE from a command line, enabling a scripted query for your answer. Putting this into a command script would provide a tool that gives your answer directly.
