How to properly escape single and double quotes

Posted on 2024-12-10 14:40:44

I have an lxml etree HTMLParser object that I'm using to build XPath expressions that assert on a tag, its attributes, and its text. I ran into a problem when the text of the tag contains single quotes (') or double quotes ("), and I've exhausted all my options.

Here's a sample object I created

from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO('''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''), parser)

Here is the relevant snippet of code, followed by the value being read into self.text:

   def getXpath(self):
     # excerpt: xpath and attrsCount are set earlier in the method
     xpath += 'starts-with(., \'' + self.text + '\') and '
     xpath += ('count(@*)=' + str(attrsCount) if self.exactMatch else "1=1") + ']'

self.text is basically the expected text of the tag, in this case: Here is my 'test' "string"

This fails when I call the xpath method on the parsed tree:

tree.xpath(self.getXpath())

The reason is that the XPath it ends up with is this: '/html/body/p[starts-with(.,'Here is my 'test' "string"') and 1=1]'
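
For reference, the failure can be reproduced with a minimal sketch like the one below. It assumes lxml is installed; the variable names are illustrative, and the exact error message may differ between lxml versions.

```python
from io import StringIO
from lxml import etree

html = '''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''
tree = etree.parse(StringIO(html), etree.HTMLParser())

text = '''Here is my 'test' "string"'''

# Naive concatenation: the first ' inside `text` closes the XPath string
# literal early, so the rest of the expression no longer parses.
xpath = "/html/body/p[starts-with(., '" + text + "') and 1=1]"

try:
    tree.xpath(xpath)
    failed = False
except etree.XPathError as exc:
    failed = True
    print("XPath rejected:", exc)
```

Backslashes do not help here because an XPath 1.0 string literal has no escape character: a literal delimited by ' simply cannot contain ', and likewise for ".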

How can I properly escape the single and double quotes in the self.text variable? I've tried triple quoting, wrapping self.text in repr(), and using re.sub or str.replace to escape ' and " as \' and \".

Comments (3)

∞琼窗梦回ˉ 2024-12-17 14:40:44

According to what we can see on Wikipedia and w3 school, you should not have ' and " in node content, even though only < and & are said to be strictly illegal. They should be replaced by the corresponding "predefined entity references", which are &apos; and &quot;.

By the way, the Python parsers I use will take care of this transparently: when writing, they are replaced; when reading, they are converted.

After a second reading of your question, I tested a few things with ' and " in the Python interpreter, and it escapes everything for you:

>>> 'text {0}'.format('blabla "some" bla')
'text blabla "some" bla'
>>> 'ntsnts {0}'.format("ontsi'tns")
"ntsnts ontsi'tns"
>>> 'ntsnts {0}'.format("ontsi'tn' \"ntsis")
'ntsnts ontsi\'tn\' "ntsis'

So we can see that Python escapes things correctly. Could you then copy-paste the error message you get (if any)?

债姬 2024-12-17 14:40:44

There are more options to choose from; the """ and ''' forms in particular might be what you want.

s = "a string with a single ' quote"
s = 'a string with a double " quote'
s = """a string with a single ' and a double " quote"""
s = '''another string with those " quotes '.'''
s = r"raw strings leave \ as a literal backslash"
s = r'''and can be added \ to " any ' of """ those things'''
s = """The three-quote-forms
       may contain
       newlines."""
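
As a small sketch of what these forms do (the variable names are illustrative, not from the question): the quoting style only changes how the literal is written in Python source, so every form below produces the same string value. That value still contains both quote characters when it is later pasted into an XPath expression.

```python
# Three ways of writing the same text as a Python literal.
text_a = "Here is my 'test' \"string\""
text_b = 'Here is my \'test\' "string"'
text_c = '''Here is my 'test' "string"'''

# All three literals evaluate to the identical str value.
print(text_a == text_b == text_c)

# The backslashes belong to the source syntax only; none end up in the value.
print("\\" in text_a)
```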

方觉久 2024-12-17 14:40:44

This solution applies if you are using Python lxml.
It is better to leave the escaping to lxml. We can do this by using XPath variables in lxml.
Suppose we have an XPath like this:

//tagname[text()='some_text']

If some_text contains both single and double quotes, it causes an "Invalid predicate" error.
Neither escaping nor triple quotes worked for me, because XPath does not accept triple quotes.

The solution that worked for me is lxml variables.

We rewrite the XPath as follows:

//tagname[text() = $var]

Then compile it:

find = etree.XPath(xpath)

Then evaluate it, passing the variable's value as a keyword argument:

elements = find(root, var=text)
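
Applied to the document from the question, the variable approach might look like the sketch below. It assumes lxml is installed; the variable name `text` and the simplified predicate are illustrative choices, not the asker's full getXpath logic.

```python
from io import StringIO
from lxml import etree

html = '''<html><body><p align="center">Here is my 'test' "string"</p></body></html>'''
tree = etree.parse(StringIO(html), etree.HTMLParser())

text = '''Here is my 'test' "string"'''

# $text is an XPath variable. lxml binds it from the keyword argument,
# so the value never has to be written as a quoted XPath literal.
find = etree.XPath("/html/body/p[starts-with(., $text)]")
elements = find(tree, text=text)

print([el.text for el in elements])

# The xpath() method accepts the same variables as keyword arguments.
same = tree.xpath("/html/body/p[starts-with(., $text)]", text=text)
```

Because the text travels as a value rather than as part of the expression string, it can contain any mix of ' and " without changing how the XPath is parsed.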