Alternatives to nested regex replacement in a PostgreSQL function?

Posted on 2025-01-21 04:08:05

Right now, I have a view with a mess of common, conditional string replacements and substitutions for an open text field; in this example, regional classification.

(Please ignore the accuracy of the geography; I'm just working with historical standard assignments. Also, I know I could speed things up with REPLACE or even by cleaning up the regex statements for lookback. I'm just asking about the variable/nesting here.)

    CREATE OR REPLACE FUNCTION public.region_cleanup(record_region text)
     RETURNS text
     LANGUAGE sql
     STRICT
    AS $function$
    -- Substitutions apply innermost first: the USA rule runs first, the Turkey rule last.
    SELECT  REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(
            REGEXP_REPLACE(record_region,'(NORTH AMERICA\s\-\sUSA\s\-\sUSA)','USA')
            ,'Rest\sof\sthe\sWorld\s\-\s','')
            ,'NORTH\sAMERICA\s\-\sCANADA','NORTH AMERICA - Canada')
            ,'\&amp\;','&')
            ,'Georgia\s\-\sGeorgia','MIDDLE EAST - Georgia')
            ,'EUROPE - Turkey','MIDDLE EAST - Turkey');
    $function$;

A sample output using this function would look like this in my dataset, pulling out records impacted (some are already in the correct format):

record_region_input	record_region_output
NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro	USA - NORTHEAST - Massachusetts - Boston Metro
NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia	USA - MIDATLANTIC - Virginia
Rest of the World - ASIA - Thailand	ASIA - Thailand
Rest of the World - EUROPE - Portugal	EUROPE - Portugal
Rest of the World - ASIA - China - Shanghai Metro	ASIA - China - Shanghai Metro
Georgia - Georgia	MIDDLE EAST - Georgia

This is... fine. Regex is needed since there's tons of variability on what may come before or after these strings, and I have a proper validation list elsewhere. This is just a bulk scrub of common historical naming issues.

The problem comes when I get hundreds of these kinds of "known substitutions" (100+) for things like company naming or cross-department standards. Having dozens and dozens of nested REGEXP_REPLACE( calls makes editing/adding/dropping anything a maddening game of counting.

I'm trying to clean data within Postgres exclusively, since my current pipeline doesn't always allow for standardization prior to upload. I know how I'd tackle this cleanly outside of pure SQL, but in a 'vanilla' PostgreSQL instance (v12+) is there a better method for transforming strings for a view?

Updated with a sample input/output table using the example function.

讽刺将军 2025-01-28 04:08:05

If you split each string into its individual regions, replacing regions may become easier. For example:

with tb as (
    select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
    union all 
    select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
    union all 
    select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro' 
)
select * from (
    select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb 
    order by tb.id 
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World');

-- Result: 
1   Boston Metro
1   Massachusetts
1   NORTHEAST
1   USA
2   MIDATLANTIC
2   USA
2   Virginia
3   ASIA
3   China
3   Shanghai Metro

After that, you can use DISTINCT to drop duplicate regions, NOT IN to drop unnecessary regions, LIKE '%ASIA%' to find all regions that contain ASIA, and so on. Once all processing is done, you can merge the corrected segments back into a single string. Example:

with tb as (
    select 1 as id, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro' as record_region_input
    union all 
    select 2 as id, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'
    union all 
    select 3 as id, 'Rest of the World - ASIA - China - Shanghai Metro' 
)
select a1.id, string_agg(a1.region, ' - ')  from (
    select distinct tb.id, unnest(string_to_array(record_region_input, ' - ')) as region from tb 
    order by tb.id 
) a1 where a1.region not in ('NORTH AMERICA', 'Rest of the World')
group by a1.id;

-- Return: 
1   Boston Metro - Massachusetts - NORTHEAST - USA
2   MIDATLANTIC - USA - Virginia
3   ASIA - China - Shanghai Metro
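
In the output shown above, the segments come back alphabetically rather than in their original hierarchy, because neither DISTINCT nor string_agg without an ORDER BY guarantees any particular order. A minimal sketch of an order-preserving variant follows, assuming `unnest ... WITH ORDINALITY` is available (PostgreSQL 9.4 or later). The names `parts`, `pos`, and `prev_region`, and the rule "drop a segment only when it repeats the segment immediately before it", are assumptions made for illustration and are not part of the original answer.

```sql
-- Sketch: same split / filter / rejoin idea, but keeping the original segment order.
WITH tb(id, record_region_input) AS (
    VALUES
        (1, 'NORTH AMERICA - USA - USA - NORTHEAST - Massachusetts - Boston Metro'),
        (2, 'NORTH AMERICA - USA - USA - MIDATLANTIC - Virginia'),
        (3, 'Rest of the World - ASIA - China - Shanghai Metro')
),
parts AS (
    -- One row per segment, with its position and the segment that precedes it.
    SELECT tb.id,
           s.region,
           s.pos,
           lag(s.region) OVER (PARTITION BY tb.id ORDER BY s.pos) AS prev_region
    FROM tb
    CROSS JOIN LATERAL unnest(string_to_array(tb.record_region_input, ' - '))
         WITH ORDINALITY AS s(region, pos)
)
SELECT id,
       string_agg(region, ' - ' ORDER BY pos) AS record_region_output
FROM parts
WHERE region NOT IN ('NORTH AMERICA', 'Rest of the World')  -- unwanted prefixes
  AND region IS DISTINCT FROM prev_region                   -- adjacent duplicates only
GROUP BY id
ORDER BY id;
```

On the three sample rows this should give `USA - NORTHEAST - Massachusetts - Boston Metro`, `USA - MIDATLANTIC - Virginia`, and `ASIA - China - Shanghai Metro`, which lines up with the expected output in the question. Note that the adjacent-duplicate rule would also collapse `Georgia - Georgia` to `Georgia`, so a case like that would still need its own handling.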

This is a simple idea; maybe it helps you replace regions.
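
For the nesting problem itself, one possible alternative (not something the answer above proposes) is to move the known substitutions into a lookup table and apply them in a loop, so that adding or dropping a rule becomes an INSERT or DELETE instead of a re-count of parentheses. The sketch below assumes a hypothetical table `region_substitution` and a hypothetical function `region_cleanup_tbl`; the six rules are copied from the question's function and run in the same order. Like the original, it replaces only the first match of each pattern, since no `'g'` flag is passed.

```sql
-- Hypothetical lookup table: one row per known substitution.
CREATE TABLE region_substitution (
    apply_order  integer PRIMARY KEY,   -- rules run in ascending order
    pattern      text NOT NULL,         -- regex passed to regexp_replace
    replacement  text NOT NULL
);

INSERT INTO region_substitution (apply_order, pattern, replacement) VALUES
    (10, 'NORTH AMERICA\s-\sUSA\s-\sUSA', 'USA'),
    (20, 'Rest\sof\sthe\sWorld\s-\s',     ''),
    (30, 'NORTH\sAMERICA\s-\sCANADA',     'NORTH AMERICA - Canada'),
    (40, '&amp;',                         '&'),
    (50, 'Georgia\s-\sGeorgia',           'MIDDLE EAST - Georgia'),
    (60, 'EUROPE - Turkey',               'MIDDLE EAST - Turkey');

-- Hypothetical replacement for the nested function: applies every rule in order.
CREATE OR REPLACE FUNCTION region_cleanup_tbl(record_region text)
 RETURNS text
 LANGUAGE plpgsql
 STABLE STRICT                          -- STABLE because it reads a table
AS $function$
DECLARE
    rule   record;
    result text := record_region;
BEGIN
    FOR rule IN
        SELECT pattern, replacement
        FROM region_substitution
        ORDER BY apply_order
    LOOP
        result := regexp_replace(result, rule.pattern, rule.replacement);
    END LOOP;
    RETURN result;
END;
$function$;

-- Usage: can be called from a view just like the original function.
SELECT region_cleanup_tbl('Rest of the World - ASIA - Thailand');
```

The trade-off is that the rules now live in data rather than in the function body, so they can be edited without redeploying the function, but a view using it will read the table for every row it cleans. Whether that cost matters depends on the size of the data set, which the question does not state.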
