I got a dataframe as follows:
Title          | Content
補水法????      | Skin Care
???? 現貨 ????  | รีบจัดด่วน‼️ ราคาเฉพาะรอบนี???? Test
I tried to use the regex:
df1['Post Title'] = df1['Post Title'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
df1['Post Detail'] = df1['Post Detail'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
It successfully removes the emojis, but it only keeps the English text, not the Chinese (the regex strips every non-ASCII character, and Chinese characters are non-ASCII). I would like to remove all emojis and any language that is neither Chinese nor English.
Expected result:
Title  | Content
補水法  | Skin Care
現貨    | Test
This can be done using the demoji library in Python. To use it, first do a pip install of demoji.
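Then something along these lines should work. This is a minimal sketch using demoji.replace(text, repl), which substitutes every emoji demoji knows about with repl (older demoji releases also need a one-time demoji.download_codes() call to fetch the emoji data):

import demoji
import pandas as pd

# Example frame mirroring the question's data.
df1 = pd.DataFrame({
    'Post Title': ['補水法🧖', '🔥 現貨 🔥'],
    'Post Detail': ['Skin Care', 'รีบจัดด่วน‼️ Test'],
})

# demoji.replace swaps each recognized emoji for the replacement
# string (here, the empty string) and leaves everything else alone.
for col in ('Post Title', 'Post Detail'):
    df1[col] = df1[col].apply(lambda s: demoji.replace(s, ''))

Note that demoji only strips the emojis; the Thai text is untouched, so this covers the first half of the question only.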
In my opinion, the solution contains two parts. The first part is to remove the emojis; the key is to get a complete emoji list, such as the one shipped with the emoji package. For example, with emoji we can write the following code. Tools like demoji and cleantext implement their methods in a similar way.
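A sketch of that first part, assuming emoji >= 2.0, where the replace_emoji helper is available:

import emoji

text = '🔥 現貨 🔥'
# replace_emoji swaps every emoji sequence listed in the package's
# data files for the replacement string (here, the empty string).
print(emoji.replace_emoji(text, replace=''))  # prints ' 現貨 '

For a dataframe column, wrap the call in .apply as usual.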
The second part is to remove the other languages. You can detect the language with tools like langid, langdetect and so on. For example, with the code below, the content in Thai will be removed.

If you want to achieve a more fine-grained result, say, you desire the expected result shown in the question, I think you also need to tokenize the content. For Chinese tokenization, you can refer to https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py, which also includes Unicode details of Hanzi.
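A rough sketch of the language filter with langid, whose classify(text) returns a (language_code, score) pair:

import langid
import pandas as pd

df1 = pd.DataFrame({'Post Detail': ['Skin Care',
                                    'รีบจัดด่วน ราคาเฉพาะรอบนี Test']})

def keep_en_zh(text):
    # Drop any string whose detected language is neither English
    # nor Chinese.
    lang, _score = langid.classify(text)
    return text if lang in ('en', 'zh') else ''

df1['Post Detail'] = df1['Post Detail'].apply(keep_en_zh)

Since classification happens per string, the mixed Thai-and-English row is dropped as a whole; keeping only its "Test" part is exactly the fine-grained case that needs the tokenization step mentioned above.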
You can extract all the English and Chinese words and join them with a space:
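A sketch of one way to do this with the third-party regex module, which understands Unicode properties such as \p{Emoji}, \p{Han} and \p{Latin}. The char_class helper below is an assumption of this sketch: it enumerates each property into a literal character class so that pandas' stdlib-re engine can use it.

import re
import sys

import pandas as pd
import regex  # third-party module with Unicode property support

def char_class(prop):
    # Collect every code point matching the property pattern and escape
    # it, yielding a literal class the stdlib re engine can consume.
    pat = regex.compile(prop, flags=regex.V1)
    return ''.join(re.escape(chr(c))
                   for c in range(sys.maxunicode + 1) if pat.match(chr(c)))

pEmojiEx = char_class(r'[\p{Emoji}--[#*]]')  # Emoji property minus * and #
pHan     = char_class(r'\p{Han}')            # Chinese characters
pLatin   = char_class(r'\p{Latin}')          # English letters
pPunct   = char_class(r'\p{P}')              # punctuation

df = pd.DataFrame({
    'Title': ['補水法🧖', '🔥 現貨 🔥'],
    'Content': ['Skin Care', 'รีบจัดด่วน‼️ ราคาเฉพาะรอบนี🎀 Test'],
})

# 1. Delete every emoji-property character.
df = df.replace(fr'[{pEmojiEx}]+', '', regex=True)
# 2. Keep only runs of Han, Latin or punctuation characters.
df = df.apply(lambda col: col.str.findall(fr'[{pHan}{pLatin}{pPunct}]+')
                             .str.join(' '))
print(df)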
Output:

  Title    Content
0   補水法  Skin Care
1    現貨       Test
What it does is:

df.replace(fr'[{pEmojiEx}]+', '', regex=True) removes all chars that carry the Emoji Unicode property (these include the digits, but I removed * and # from the class).

.str.findall(fr'[{pHan}{pLatin}{pPunct}]+').str.join(" ") extracts chunks of one or more Latin, Han or punctuation chars and joins them with a space.