Replacing abbreviations in an ndarray with pandas

Posted on 2025-02-11 18:41:06

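I have a numpy.ndarray and I want to replace all abbreviations in it using the dictionary below. How can I do this so that the output has the same format as the input? This is what I am currently doing: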

import numpy as np
import pandas as pd

X_trying = np.array([[" And my account number His Okay It is Arrow My name with a K Last name Is and another phone numbers That's okay it's just number Yes <unk> at Gmail Dot com that is a lower "],
       ["Hi Amber I'm relocating so I need a insurance card for my car First name <unk> last name is D key No for brand new isn't"]],
      dtype='<U97064')
X_trying

#notice double quotes
array([[" And my account number His Okay It is Arrow My name with a K Last name Is and another phone numbers That's okay it's just number Yes <unk> at Gmail Dot com that is a lower "],
       ["Hi Amber I'm relocating so I need a insurance card for my car First name <unk> last name is D key No for brand new isn't"]],
      dtype='<U97064')



# Convert to a DataFrame
df_for_abbreviations = pd.DataFrame(X_trying, columns=['text'])
# Lowercase the text so that it matches the dictionary keys
df_for_abbreviations['text_lower'] = df_for_abbreviations['text'].str.lower()
df_for_abbreviations['unabbreviated_text'] = df_for_abbreviations['text_lower'].replace(abbreviations_master, regex=True)
# When I convert back to an ndarray the format gets messed up: the quotes change from double to single, which causes problems in downstream code
x = df_for_abbreviations['unabbreviated_text'].to_numpy(dtype='<U97064').reshape(df_for_abbreviations.shape[0], 1)
x

#notice that the quotes change to single quotes

array([[' and my account number his okay it is arrow my name with a k last name is and another phone numbers that is okay it is just number yes <unk> at gmail dot com that is a lower '],
       ['hi amber i am relocating so i need a insurance card for my car first name <unk> last name is d key no for brand new is not']],
      dtype='<U97064')

The single quotes affect the downstream output.

I have a dictionary of words that I would like to replace, as below:

abbreviations_master={}
abbreviations_master["i'm"]="i am"
abbreviations_master["it's"]="it is"
abbreviations_master["that's"]="that is"
abbreviations_master["don't"]="do not"
abbreviations_master["i'll"]="i will"
abbreviations_master["i've"]="i have"
abbreviations_master["we're"]="we are"
abbreviations_master["didn't"]="did not"
abbreviations_master["ma'am"]="madam"
abbreviations_master["you're"]="you are"
abbreviations_master["there's"]="there is"
abbreviations_master["let's"]="let us"
abbreviations_master["they're"]="they are"
abbreviations_master["can't"]="can not"
abbreviations_master["he's"]="he is"
abbreviations_master["doesn't"]="does not"
abbreviations_master["she's"]="she is"
abbreviations_master["what's"]="what is"
abbreviations_master["i'd"]="i would"
abbreviations_master["haven't"]="have not"
abbreviations_master["wasn't"]="was not"
abbreviations_master["we'll"]="we will"
abbreviations_master["won't"]="will not"
abbreviations_master["it'll"]="it will"
abbreviations_master["we've"]="we have"
abbreviations_master["wouldn't"]="would not"
abbreviations_master["that'd"]="that would"
abbreviations_master["you've"]="you have"
abbreviations_master["couldn't"]="could not"
abbreviations_master["that'll"]="that will"
abbreviations_master["y'all"]="you all"
abbreviations_master["isn't"]="is not"
abbreviations_master["it'd"]="it would"
abbreviations_master["would've"]="would have"
abbreviations_master["'cause"]="because"
abbreviations_master["hasn't"]="has not"
abbreviations_master["they've"]="they have"
abbreviations_master["you'll"]="you will"
abbreviations_master["here's"]="here is"
abbreviations_master["name's"]="name is"
abbreviations_master["shouldn't"]="should not"
abbreviations_master["wife's"]="?"
abbreviations_master["driver's"]="?"
abbreviations_master["they'll"]="they will"
abbreviations_master["everything's"]="?"
abbreviations_master["husband's"]="?"
abbreviations_master["there'll"]="there will"
abbreviations_master["should've"]="should have"
abbreviations_master["we'd"]="we would"
abbreviations_master["'bout"]="about"
abbreviations_master["she'll"]="she will"
abbreviations_master["he'll"]="he will"
abbreviations_master["you'd"]="you would"
abbreviations_master["one's"]="?"
abbreviations_master["who's"]="who has"
abbreviations_master["weren't"]="were not"
abbreviations_master["aren't"]="are not"
abbreviations_master["how's"]="how is"
abbreviations_master["how're"]="how are"
abbreviations_master["hadn't"]="had not"

最冷一天 2025-02-18 18:41:06

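You can use re.split to break the input into words while keeping the separators (since some of your examples start with a space), then check whether each word is in your dictionary; if it is not, just keep the word. The code below is not very elegant because your input is an np.array. If you can make it a simple list of strings, the code can be simplified.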

import re
import numpy as np

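output_array = []
for input_line in X_trying:
    # Split on spaces but keep them (capturing group), so joining restores the spacing
    words = re.split('( )', str(input_line[0]).lower())
    # Replace each word found in the dictionary; leave everything else unchanged
    output_array.append([''.join(abbreviations_master.get(word, word) for word in words)])
output_array = np.array(output_array, dtype='<U97064')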

The output format is similar to the input:

array([[' and my account number his okay it is arrow my name with a k last name is and another phone numbers that is okay it is just number yes <unk> at gmail dot com that is a lower '],
       ['hi amber i am relocating so i need a insurance card for my car first name <unk> last name is d key no for brand new is not']],
      dtype='<U97064')
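
A side note on the quote characters, offered as a small sketch rather than as part of the original answer: when NumPy displays a string array it formats each element the way Python's repr() does, and repr() chooses double quotes only when the string contains an apostrophe. The quotes are not stored in the array. The two sample strings below are made up for illustration.

```python
import numpy as np

# Two illustrative 2-D string arrays, shaped like the question's data.
with_apostrophe = np.array([["that's okay"]])
without_apostrophe = np.array([["that is okay"]])

# repr() of a plain Python string picks the quote character from the content.
quoted_with = repr(str(with_apostrophe[0, 0]))
quoted_without = repr(str(without_apostrophe[0, 0]))

print(quoted_with)      # starts and ends with a double quote
print(quoted_without)   # starts and ends with a single quote

# The arrays themselves have the same kind of dtype and the same shape.
print(with_apostrophe.dtype.kind, without_apostrophe.dtype.kind)
print(with_apostrophe.shape, without_apostrophe.shape)
```

If the downstream code behaves differently after the replacement, the cause is more likely the changed text (lowercasing, expanded words) than the quote style shown in the printout.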

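Note that the () in the split pattern are important: the capturing group makes re.split return the separators as well, so joining the pieces restores the original spacing. If you have more separators, you can add them to the group, e.g. re.split('( |\.|,)', text). Your examples didn't have any other punctuation, so I didn't add it.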
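
As a minimal sketch of both points above (a plain list of strings as input, and extra separators inside the capturing group), something like the following should work. The function name expand_abbreviations, the three-entry dictionary, and the sample sentence are assumptions made for illustration; they are not from the question.

```python
import re

# Hypothetical subset of the question's dictionary, for illustration only.
abbreviations = {"i'm": "i am", "it's": "it is", "isn't": "is not"}

# The capturing group makes re.split return the separators too,
# so joining the pieces reproduces the original spacing and punctuation.
SEPARATORS = re.compile(r"( |\.|,)")

def expand_abbreviations(lines, mapping):
    """Lowercase each line and replace every token found in mapping."""
    expanded = []
    for line in lines:
        tokens = SEPARATORS.split(line.lower())
        expanded.append("".join(mapping.get(token, token) for token in tokens))
    return expanded

lines = ["Hi, I'm here. It's fine, isn't it"]
result = expand_abbreviations(lines, abbreviations)
print(result)  # ['hi, i am here. it is fine, is not it']
```

To get back to the (n, 1) layout used in the question, the returned list can be wrapped with np.array(result).reshape(-1, 1).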
