Inconsistent escape characters in string contractions

Posted 2025-01-10 08:14:07 · 1532 characters · 3 views · 0 comments

I have text that I'm trying to process. Here are 2 examples:

Example 1: <p>An alternative way with <code>*</code>:</p>

<pre><code>puts ["Toronto", "Maple Leafs"] * ', '
#Toronto, Maple Leafs
#=> nil
</code></pre>

<p>But I don't think anyone uses this notation, so as recommended in another answer use <code>join</code>.</p>


Example 2: the thing is that I don't know what's the best way to solve it.

I am using BeautifulSoup and repr to process the text. The cleaned output is:

Example 1:An alternative way with <code>*</code>:\n<code>puts ["Toronto", "Maple Leafs"] * \', \'\n#Toronto, Maple Leafs\n#=> nil\n</code>\nBut I don\'t think anyone uses this notation, so as recommended in another answer use <code>join</code>.\n

Example 2: the thing is that I don't know what's the best way to solve it.

My issue is with the escape character before the '. Why is the don't in example 1 being processed as don\'t, while the don't in example 2 is processed as don't without the escape character? How would I get them to be processed the same way?

Here is my code for processing the text:

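from bs4 import BeautifulSoup
import html

def text_preprocessing(post):
    soup = BeautifulSoup(post, 'lxml')
    # Unwrap every tag except <code>, keeping only its contents.
    for e in soup.find_all():
        if e.name not in ['code']:
            e.unwrap()
    # Serialize the tree, then decode remaining HTML entities such as &gt;.
    post = html.unescape(str(soup))
    # repr() makes newlines visible as \n; the slice drops the surrounding quotes.
    returnString = repr(post)
    returnString = returnString[1:-1]
    return returnString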
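
A likely cause of the difference is how repr chooses its quoting. When a string contains only single quotes, repr wraps it in double quotes and leaves the apostrophes alone. When it contains both single and double quotes (as Example 1 does, because of "Toronto"), repr wraps it in single quotes and escapes each apostrophe. The sketch below shows that behavior and one possible way to get consistent output without repr. The helper name escape_whitespace is hypothetical, not part of the original code, and it assumes only backslashes, newlines, carriage returns and tabs need to be made visible.

```python
def escape_whitespace(text):
    """Hypothetical helper: make backslashes and common whitespace
    characters visible while leaving both kinds of quotes untouched."""
    return (
        text.replace("\\", "\\\\")  # escape backslashes first
            .replace("\n", "\\n")
            .replace("\r", "\\r")
            .replace("\t", "\\t")
    )


# Shortened stand-ins for the two examples in the question.
only_single = "I don't know"
both_quotes = 'puts ["Toronto"] * \', \' but I don\'t'

# repr() wraps the first string in double quotes, so the apostrophe
# is not escaped; the second is wrapped in single quotes, so it is.
print(repr(only_single))
print(repr(both_quotes))

# The helper treats both strings the same way: quotes are never escaped.
print(escape_whitespace(only_single + "\n"))
print(escape_whitespace(both_quotes + "\n"))
```

Under these assumptions, replacing the repr call and the [1:-1] slice in text_preprocessing with a call to such a helper would leave don't unescaped in both examples. Note that repr also escapes other control characters, which this sketch does not handle.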
