Inconsistent escape characters in string contractions

Posted on 2025-01-10 08:14:07, 1532 characters, 3 views, 0 comments


I have text that I'm trying to process. Here are 2 examples:

Example 1: <p>An alternative way with <code>*</code>:</p>

<pre><code>puts ["Toronto", "Maple Leafs"] * ', '
#Toronto, Maple Leafs
#=> nil
</code></pre>

<p>But I don't think anyone uses this notation, so as recommended in another answer use <code>join</code>.</p>


Example 2: the thing is that I don't know what's the best way to solve it.

I am using BeautifulSoup and repr to process the text. The two examples come out as:

Example 1:An alternative way with <code>*</code>:\n<code>puts ["Toronto", "Maple Leafs"] * \', \'\n#Toronto, Maple Leafs\n#=> nil\n</code>\nBut I don\'t think anyone uses this notation, so as recommended in another answer use <code>join</code>.\n

Example 2: the thing is that I don't know what's the best way to solve it.

My issue is with the escape character before the '. Why is the don't in example 1 processed as don\'t, while the don't in example 2 is processed as don't without the escape character? How would I get them to be processed the same way?
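For context, here is a minimal sketch of what appears to be going on, using shortened stand-ins for the two examples (the variable names are illustrative only). Python's repr() normally wraps a string in single quotes, but it switches to double quotes when the string contains a single quote and no double quote. Example 2 has only single quotes, so nothing needs escaping. Example 1 contains both kinds, so repr() keeps the single-quote wrapper and escapes every ' inside it.

```python
# Stand-in for example 2: only single quotes in the text.
only_single = "I don't know"
# Stand-in for example 1: both double and single quotes in the text.
both_kinds = 'puts ["Toronto"] * \', \' and I don\'t think'

repr_single = repr(only_single)  # wrapped in double quotes, nothing escaped
repr_both = repr(both_kinds)     # wrapped in single quotes, each ' escaped

print(repr_single)
print(repr_both)

# Slicing [1:-1] drops the wrapping quotes but keeps any backslashes
# that repr() added, which is where the inconsistency comes from.
print(repr_single[1:-1])
print(repr_both[1:-1])
```

The backslashes are therefore a side effect of how repr() chooses its quoting, not something BeautifulSoup adds.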

Here is my code for processing the text:

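from bs4 import BeautifulSoup
import html

def text_preprocessing(post):
    soup = BeautifulSoup(post, 'lxml')
    # Unwrap every tag except <code>, keeping the text inside it
    for e in soup.find_all():
        if e.name not in ['code']:
            e.unwrap()

    # Serialize the tree, then decode entities such as &gt; back to characters
    returnString = str(soup)
    post = html.unescape(returnString)
    # repr() makes newlines visible as \n; [1:-1] strips the surrounding quotes
    returnString = repr(post)
    returnString = returnString[1:-1]
    return returnString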
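One possible way to get both examples treated the same is to stop relying on repr() and escape only the characters that should be made visible. The sketch below assumes the goal is to show newlines, tabs and carriage returns as \n, \t and \r while leaving quotes untouched. The helper name escape_whitespace is hypothetical and not part of the original code.

```python
def escape_whitespace(text: str) -> str:
    """Make newlines, tabs and carriage returns visible; leave quotes alone.

    Hypothetical helper: a stand-in for the repr(...)[1:-1] step.
    """
    table = str.maketrans({
        "\\": "\\\\",  # escape literal backslashes first-class, like repr() does
        "\n": "\\n",
        "\r": "\\r",
        "\t": "\\t",
    })
    return text.translate(table)


# Both kinds of input keep their apostrophes unescaped.
print(escape_whitespace("I don't know\n"))
print(escape_whitespace('puts ["Toronto"] * \', \'\nBut I don\'t think\n'))
```

In the function above, the repr(post) and [1:-1] steps could be replaced by a single call such as escape_whitespace(post). Unlike repr(), this sketch leaves other control characters as they are, so it is only a drop-in replacement if those do not occur in the input.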
