Extract one of several patterns from a string column
I have a string column in a very large dataframe and I need to extract parts of the string based on several patterns. At this step, a single match is enough and I'm not looking to find all matching cases. This is an improvement request over a previous version that used the regexp_extract method for single-pattern matching. The following code works, but it is not very efficient considering the scale of the data:
from pyspark.sql import functions as F

# pattern1, pattern2, pattern3 are regex strings defined elsewhere,
# each with one capture group for the part to extract.
sample_df = spark.createDataFrame(
    [
        ("file pattern1",),
        ("file pattern2",),
        ("file pattern3",),
    ],
    ['textCol'])

test = (sample_df
    .withColumn("p1", F.regexp_extract(F.col('textCol'), pattern1, 1))
    .withColumn("p2", F.regexp_extract(F.col('textCol'), pattern2, 1))
    .withColumn("p3", F.regexp_extract(F.col('textCol'), pattern3, 1))
    .withColumn("file",
        F.when(F.col("p1") != "", F.col("p1"))
        .otherwise(F.when(F.col("p2") != "", F.col("p2"))
                   .otherwise(F.when(F.col("p3") != "", F.col("p3"))
                              .otherwise(""))))
)
Another approach that works is a pandas_udf. The following function does the job, but I would prefer to stay at the native Spark level for performance reasons:
@F.pandas_udf(returnType="string")
def get_file_dir(lines):
res = []
for l in lines:
for r in reg_list:
found=""
m = re.search(r, l)
if m:
found=m.group(1)
break
res.append(found)
return pd.Series(res)
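For reference, the UDF would be applied as a single column expression, along these lines (a sketch reusing sample_df and textCol from above):

test = sample_df.withColumn("file", get_file_dir(F.col("textCol")))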
I'm looking for any code optimization recommendations here that might help to reduce the runtime with my current cluster configurations.
1 Answer
You can combine all the patterns into a single regex, separated with a pipe |, and call regexp_extract only once. The changed version avoids the several withColumn calls and the nested when; it creates the result column in just one go.
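A minimal sketch of what the combined version could look like. The patterns below are placeholders, not the ones from the question, and it assumes the alternation of all patterns can be wrapped in a single capture group holding the part to extract:

from pyspark.sql import functions as F

# Placeholder patterns -- substitute the real expressions from reg_list.
patterns = ["pattern1", "pattern2", "pattern3"]

# One capture group around the whole alternation: group 1 holds whichever
# alternative matched first, and rows with no match yield "".
combined = "(" + "|".join(patterns) + ")"

test = sample_df.withColumn("file",
                            F.regexp_extract(F.col("textCol"), combined, 1))

This keeps the extraction to a single regexp_extract call per row instead of three extractions plus the nested when.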