How to extract specific substrings and separate text from numbers in a pandas dataframe?

Posted on 2025-01-20 06:29:16 · 390 characters · 4 views · 0 comments

I have some data in a dataframe in the format below. Please see the image link below.

Current Output

The problem I'm trying to solve is two-fold:

  1. For the salary column, I want to separate the text from the numbers and extract the value. Wherever there is a range, I want to take the average.
  2. Depending on whether the salary is hourly/weekly/yearly etc., I want to add a column for the salary type, based on which substring is present ('year', 'month', 'week', 'hour', etc.).

The final output should look like what is in the image below

Expected Output

Thanks!


Comments (2)

夜光 2025-01-27 06:29:16

This can work for you:

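for i in range(len(df)):
    # e.g. "$8,500 - $10,500 a month" -> ["$8,500", "-", "$10,500", "a", "month"]
    splitted_value = df["salary"].iloc[i].split()
    # the last token is the period: "month" -> "Monthly"
    salary_type = (splitted_value[-1] + "ly").title()
    if "-" in splitted_value:
        # range: average every "$"-prefixed token
        ranged_salary = [int(x.replace("$", "").replace(",", "")) for x in splitted_value if "$" in x]
        salary = sum(ranged_salary) / len(ranged_salary)
    else:
        # single value: the amount is the third token from the end
        salary = int(splitted_value[-3].replace("$", "").replace(",", ""))
    # df.index[i] keeps the label-based write in step with the positional read above
    df.loc[df.index[i], "salary_value"] = salary
    df.loc[df.index[i], "salary_type"] = salary_type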
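
The same idea can also be written without an explicit row loop by using the pandas string methods `Series.str.extractall` and `Series.str.extract`. This is only a sketch under some assumptions: the column is named `salary`, every amount is prefixed with `$`, only the four period keywords shown are recognised, and the sample rows are borrowed from the other answer for illustration.

```python
import pandas as pd

# Sample rows for illustration; the column name "salary" is an assumption.
df = pd.DataFrame({
    "salary": [
        "Up to $80,000 a year",
        "$8,500 - $10,500 a month",
        "$25 - $40 an hour",
        "$1,546 a week",
    ]
})

# Pull out every "$"-prefixed amount. The result has one row per match,
# indexed by (original row label, match number).
amounts = (
    df["salary"]
    .str.extractall(r"\$(\d[\d,]*)")[0]
    .str.replace(",", "", regex=False)
    .astype(float)
)

# Average the matches of each original row; a single amount is its own mean.
# Rows without any "$" amount end up as NaN after index alignment.
df["salary_value"] = amounts.groupby(level=0).mean()

# The first period keyword found becomes the type, e.g. "month" -> "Monthly".
unit = df["salary"].str.extract(r"(hour|week|month|year)", expand=False)
df["salary_type"] = (unit + "ly").str.title()

print(df)
```

Rows that contain none of the keywords get a missing `salary_type` rather than a guessed one, which makes unexpected formats easier to spot.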

你与昨日 2025-01-27 06:29:16

This is an interesting question, but next time please provide the input data as something we can copy/paste.

What you need is a function that converts the string for the salary data into the value and the salary type.

You can iterate over the characters in the string to collect the digits, and flip a boolean switch when you encounter the - (dash) character, so that the digits after it go into a second number in case you need to calculate an average.

lst = [
    "Up to $80,000 a year",
    "$8,500 - $10,500 a month",
    "$25 - $40 an hour",
    "$1,546 a week"
]


def convert(salary_data: str):
    value = ""
    value_max = ""
    need_average = False
    # iterate over the characters in the string
    for c in salary_data:
        if c.isdigit():
            if need_average:
                value_max += c
            else:
                value += c
        elif c == "-":
            # switch to adding to value_max after finding the dash
            need_average = True
    if not need_average:
        # slight cheating for the f-string below
        value_max = value
    value = f"{(int(value) + int(value_max)) / 2:.2f}"
    if "hour" in salary_data:
        salary_type = "hourly"
    elif "week" in salary_data:
        salary_type = "weekly"
    elif "month" in salary_data:
        salary_type = "monthly"
    else:
        # use this as fallback
        salary_type = "yearly"
    return value, salary_type


for element in lst:
    value, salary_type = convert(element)
    print(value, salary_type)

output

80000.00 yearly
9500.00 monthly
32.50 hourly
1546.00 weekly
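
To get from a list demo like this to the two dataframe columns the question asks for, one option is to map a parsing function over the salary column. The sketch below is not the code above: the hypothetical `parse_salary` uses a regular expression instead of the character loop, returns a float instead of a formatted string, and returns `None` for an unrecognised period instead of falling back to "yearly". The column names `salary`, `salary_value` and `salary_type` are assumptions.

```python
import re

import pandas as pd

# Sample rows for illustration; the column name "salary" is an assumption.
df = pd.DataFrame({
    "salary": [
        "Up to $80,000 a year",
        "$8,500 - $10,500 a month",
        "$25 - $40 an hour",
        "$1,546 a week",
    ]
})

# Period keyword -> label for the salary_type column (extend as needed).
UNIT_LABELS = {"hour": "hourly", "week": "weekly", "month": "monthly", "year": "yearly"}


def parse_salary(text: str):
    """Return (mean of all "$" amounts in text, salary type or None)."""
    # Find every "$"-prefixed amount, e.g. "$8,500" -> "8,500".
    found = re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    amounts = [float(a.replace(",", "")) for a in found]
    # A single amount is its own mean; no amount at all gives NaN.
    value = sum(amounts) / len(amounts) if amounts else float("nan")
    # First keyword contained in the text wins; None if nothing matches.
    salary_type = next((label for unit, label in UNIT_LABELS.items() if unit in text), None)
    return value, salary_type


# Each row yields a (value, type) tuple; split the tuples into two columns.
parsed = [parse_salary(s) for s in df["salary"]]
df["salary_value"] = [value for value, _ in parsed]
df["salary_type"] = [salary_type for _, salary_type in parsed]

print(df)
```

Because only `$`-prefixed numbers are matched, a hyphen elsewhere in the text (for example in a job title) does not affect the result.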