Inconsistent escape character in string repr
I have text that I'm trying to process. Here are two examples:
Example 1: <p>An alternative way with <code>*</code>:</p>

<pre><code>puts ["Toronto", "Maple Leafs"] * ', '
#Toronto, Maple Leafs
#=> nil
</code></pre>

<p>But I don't think anyone uses this notation, so as recommended in another answer use <code>join</code>.</p>

Example 2: the thing is that I don't know what's the best way to solve it.
I am using BeautifulSoup and repr to process the text. They are being cleaned as:
Example 1:An alternative way with <code>*</code>:\n<code>puts ["Toronto", "Maple Leafs"] * \', \'\n#Toronto, Maple Leafs\n#=> nil\n</code>\nBut I don\'t think anyone uses this notation, so as recommended in another answer use <code>join</code>.\n
Example 2: the thing is that I don't know what's the best way to solve it.
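My issue is with the escape character before the '. Why is the don't in Example 1 processed as don\'t, while the don't in Example 2 is processed as don't, without the escape character? How would I get them to be processed the same way?
Here is my code for processing the text: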
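from bs4 import BeautifulSoup
import html

def text_preprocessing(post):
    soup = BeautifulSoup(post, 'lxml')
    # Unwrap every tag except <code>, keeping only its contents
    for e in soup.find_all():
        if e.name not in ['code']:
            e.unwrap()
    returnString = str(soup)
    # Decode HTML entities (e.g. &gt;) back into plain characters
    post = html.unescape(returnString)
    # repr() turns newlines into literal \n sequences
    returnString = repr(post)
    # Drop the surrounding quotes that repr() adds
    returnString = returnString[1:-1]
    return returnString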
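For reference, the difference between the two examples can be reproduced without BeautifulSoup. Python's repr wraps a string in single quotes by default, but switches to double quotes when the string contains a single quote and no double quote. Example 2 only contains ', so repr uses double quotes and nothing needs escaping. Example 1 contains both ' and ", so repr stays with single quotes and escapes every '. The sketch below shows that behaviour on shortened stand-in strings, followed by one possible way to escape newlines without touching quotes. The helper name escape_control_chars is hypothetical, and using the unicode_escape codec is an assumption about what output is wanted (it also escapes backslashes and non-ASCII characters).

```python
def escape_control_chars(text):
    """Escape newlines, tabs, backslashes and non-ASCII characters.

    Quotes are left untouched, so the result does not depend on which
    quote characters the text happens to contain.
    """
    return text.encode("unicode_escape").decode("ascii")


# Shortened stand-ins for the two examples in the question.
only_single = "I don't know"                          # contains ' only
both_quotes = 'puts ["Toronto"]\nBut I don\'t think'  # contains ' and "

# repr() chooses the quote style per string, which causes the mismatch.
print(repr(only_single))  # "I don't know"
print(repr(both_quotes))  # 'puts ["Toronto"]\nBut I don\'t think'

# The helper treats both strings the same way: no backslash before '.
print(escape_control_chars(only_single))
print(escape_control_chars(both_quotes))
```

In text_preprocessing, this would mean replacing the repr call and the [1:-1] slice with a single call to the helper. If only newlines matter, post.replace('\n', '\\n') is a narrower alternative.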