Extract one of several patterns from a string column
I have a string column in a very large dataframe and I need to extract parts of the string based on several patterns. At this step, a single match is enough and I'm not looking to find all matching cases. This is an improvement request over a previous version that used the regexp_extract method for single-pattern matching. The following code works, but it is not very efficient considering the scale of the data:
from pyspark.sql import functions as F

# pattern1, pattern2, pattern3 are regex strings defined elsewhere,
# each with one capture group for the part to extract.
sample_df = spark.createDataFrame(
    [
        ("file pattern1",),
        ("file pattern2",),
        ("file pattern3",),
    ],
    ['textCol'])

test = (sample_df
    .withColumn("p1", F.regexp_extract(F.col('textCol'), pattern1, 1))
    .withColumn("p2", F.regexp_extract(F.col('textCol'), pattern2, 1))
    .withColumn("p3", F.regexp_extract(F.col('textCol'), pattern3, 1))
    .withColumn("file",
        F.when(F.col("p1") != "", F.col("p1"))
        .otherwise(F.when(F.col("p2") != "", F.col("p2"))
                   .otherwise(F.when(F.col("p3") != "", F.col("p3"))
                              .otherwise(""))))
)
Another approach that works is a pandas_udf. The following function does the job, but I would prefer to stay at the native Spark level for performance reasons:
@F.pandas_udf(returnType="string")
def get_file_dir(lines):
res = []
for l in lines:
for r in reg_list:
found=""
m = re.search(r, l)
if m:
found=m.group(1)
break
res.append(found)
return pd.Series(res)
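For reference, the UDF would be applied as a single column expression, along these lines (a sketch reusing sample_df and textCol from above):

test = sample_df.withColumn("file", get_file_dir(F.col("textCol")))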
I'm looking for any code optimization recommendations here that might help to reduce the runtime with my current cluster configurations.
1 Answer
You can combine all the patterns into a single regex, separated with a pipe |, and call regexp_extract only once. The changed version avoids the several withColumn calls and the nested when; it creates the result column in just one go.
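A minimal sketch of what the combined version could look like. The patterns below are placeholders, not the ones from the question, and it assumes the alternation of all patterns can be wrapped in a single capture group holding the part to extract:

from pyspark.sql import functions as F

# Placeholder patterns -- substitute the real expressions from reg_list.
patterns = ["pattern1", "pattern2", "pattern3"]

# One capture group around the whole alternation: group 1 holds whichever
# alternative matched first, and rows with no match yield "".
combined = "(" + "|".join(patterns) + ")"

test = sample_df.withColumn("file",
                            F.regexp_extract(F.col("textCol"), combined, 1))

This keeps the extraction to a single regexp_extract call per row instead of three extractions plus the nested when.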