Extracting a portion of text using a regular expression

Posted 2024-07-27 01:04:46

I would like to extract a portion of a text using a regular expression. For example, I have an address and want to return just the number and street, excluding the rest:

2222 Main at King Edward Vancouver BC CA

But the address format varies most of the time. I tried a regex with a lookahead and came up with this expression:

.*?(?=\w* \w* \w{2}$)

The expression handles the example above nicely, but it gets far too messy as soon as commas appear in the text, or postal codes, which can be a single 6-character string or two 3-character strings with a space in the middle, and so on.
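To make that behaviour concrete, here is a minimal Python sketch. The pattern is the one quoted above (its `(?=...)` construct is a lookahead assertion); the helper name `leading_part` and the comma-separated variant of the address are illustrative assumptions, not part of the original post.

```python
import re

# Pattern quoted from the question: lazily consume text until the remainder
# looks like "word word XX" (two spaces, ending in a two-character token).
PATTERN = re.compile(r".*?(?=\w* \w* \w{2}$)")


def leading_part(address):
    """Return the text before the trailing 'city state country' part, or None."""
    match = PATTERN.match(address)
    return match.group(0) if match else None


# The example address from the question.
print(repr(leading_part("2222 Main at King Edward Vancouver BC CA")))

# Hypothetical variant with commas: the lookahead only allows word characters
# and single spaces, so nothing matches and the helper returns None.
print(repr(leading_part("2222 Main at King Edward Vancouver, BC, CA")))
```

Note that the first result keeps the trailing space before the city, so a caller would probably want to strip it.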

Is there a more elegant way of extracting a portion of text than a lookahead regex?

Any suggestion or pointer in another direction is greatly appreciated.

Thanks!


Comments (3)

明月夜 2024-08-03 01:04:46

Regular expressions are for data that is REGULAR, that is, data that follows a pattern. So if your data is completely random, then no, there's no elegant way to do this with a regex.

On the other hand, if you know what values you want, you can probably write a few simple regexes, and then just test them all on each string.

For example:
regex1 = street number grabber, regex2 = street type grabber, regex3 = name grabber.

Attempt a match on string1 with regex1, then regex2, and finally regex3. Then move on to the next string.
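A minimal Python sketch of that idea follows. The three patterns, the list of street types, the function name `extract_fields`, and the second sample address are all assumptions made up for illustration; real data would need patterns tuned to it.

```python
import re

# Hypothetical "grabber" patterns, one per field of interest.
PATTERNS = {
    "number": re.compile(r"^\s*(\d+)\b"),
    "street_type": re.compile(r"\b(Street|St|Avenue|Ave|Road|Rd)\b", re.IGNORECASE),
    "name": re.compile(r"^\s*\d+\s+([A-Za-z]+)"),
}


def extract_fields(address):
    """Try every pattern on one string; a field is None when its pattern fails."""
    fields = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(address)
        fields[label] = match.group(1) if match else None
    return fields


addresses = [
    "2222 Main at King Edward Vancouver BC CA",  # example from the question
    "555 Oak Street Springfield CA US",          # made-up example
]
for address in addresses:
    print(address, "->", extract_fields(address))
```

Keeping each pattern small means a string that defeats one grabber can still yield the other fields.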

烂人 2024-08-03 01:04:46

Well, I thought I'd throw my hat into the ring:

.*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)

You might want ^ or \d+ at the front for good measure.
I didn't bother specifying lengths for the postal codes; this one accepts any number of digits and hyphens.

So far it works for these inputs, and for variations in the commas within the city/state/country area:

  • 2222 Main at King Edward Vancouver, BC, CA, 333-333
  • 555 road and street place CA US 95000
  • 2222 Main at King Edward Vancouver BC CA 333
  • 555 road and street place CA US

It counts on there being three words at the end for the city, state and country. Other than that, it's like ryansstack said: if the data is random, it won't work. If the city is two words, like New York, it won't work. Yeah... regex isn't the tool for this one.

BTW: tested on regexhero.net
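For anyone who wants to try the pattern outside regexhero.net, here is a small Python sketch. The helper name `street_part` is hypothetical, and appending a newline is my own assumption: the pattern requires whitespace after each of the three trailing words, so an input that ends right after the country would otherwise be cut one word too early. The "New York" address is a made-up input that illustrates the two-word-city limitation mentioned above.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r".*(?=,? ([a-zA-Z]+,?\s){3}([\d-]*\s)?)")


def street_part(address):
    """Return the text before the assumed three-word city/state/country tail."""
    # Assumption: add a trailing newline so the last word is followed by
    # whitespace, which the pattern's \s needs.
    match = PATTERN.match(address + "\n")
    return match.group(0) if match else None


samples = [
    "2222 Main at King Edward Vancouver, BC, CA, 333-333",
    "555 road and street place CA US 95000",
    "2222 Main at King Edward Vancouver BC CA 333",
    "555 road and street place CA US",
    "100 Broadway New York NY US",  # hypothetical two-word city
]
for sample in samples:
    print(repr(street_part(sample)))
```

In the last sample the pattern treats "York" as the city, so "New" ends up in the street portion.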

柠檬 2024-08-03 01:04:46

I can think of two ways you can do this:

1) If you know that "the rest" of your data after the address is exactly 2 fields, i.e. BC and CA, you can split your string using a space as the delimiter and remove the last 2 items.

2) Split on the delimiter /[A-Z][A-Z]/ and store the result in an array. Then print out the array (this assumes the address itself doesn't contain 2 or more consecutive capital letters).
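Both ideas can be sketched in a few lines of Python. The function names `drop_trailing_fields` and `split_on_two_capitals` are hypothetical, and the sample is the address from the question.

```python
import re


def drop_trailing_fields(text, count=2):
    """Option 1: split on whitespace and drop the last `count` items."""
    words = text.split()
    return " ".join(words[:len(words) - count])


def split_on_two_capitals(text):
    """Option 2: split wherever two consecutive capital letters appear."""
    return re.split(r"[A-Z][A-Z]", text)


address = "2222 Main at King Edward Vancouver BC CA"
print(drop_trailing_fields(address))
print(split_on_two_capitals(address))
```

With this input, both options leave the city attached to the street portion, so a further step would be needed if the city also has to go.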
