Regular expression to find phone numbers
Possible Duplicates:
A comprehensive regex for phone number validation
grep with regex for phone number
Hello Everyone,
I am new to Stack Overflow and I have a quick question. Let's assume we are given a large number of HTML files (large as in theoretically infinite). How can I use regular expressions to extract the list of phone numbers from all those files?
An explanation along with the expression would be really appreciated. The phone numbers can be in any of the following formats:
- (123) 456 7899
- (123).456.7899
- (123)-456-7899
- 123-456-7899
- 123 456 7899
- 1234567899
Thanks a lot for all your help and have a good one!
Comments (4)
/^[-.() ]*([0-9]{3})[-.() ]*([0-9]{3})[-.() ]*([0-9]{4})$/
This should accomplish what you are trying to do.
The first part,
^
means "start of the line"; together with the closing $ it forces the pattern to account for the whole string.
The
[-.() ]*
classes mean "any period, hyphen, parenthesis, or space, appearing 0 or more times". The hyphen comes first inside the brackets so that it is read as a literal character rather than as a range operator.
The
([0-9]{3})
clusters each match a group of 3 digits (the last one is set to match 4).
Hope that helps!
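As a quick check, the expression can be run against the six formats from the question. This is a minimal sketch in Python (the answer does not name a language); the names PHONE, SAMPLES, and split_number are made up for the illustration. The separator class is written as [-.() ] with the hyphen first, because Python's re module rejects a class in which the hyphen forms a reversed range such as \.-).

```python
import re

# The answer's expression, with the hyphen placed first in each class so it is literal.
PHONE = re.compile(r"^[-.() ]*([0-9]{3})[-.() ]*([0-9]{3})[-.() ]*([0-9]{4})$")

# The six formats listed in the question.
SAMPLES = [
    "(123) 456 7899",
    "(123).456.7899",
    "(123)-456-7899",
    "123-456-7899",
    "123 456 7899",
    "1234567899",
]


def split_number(text):
    """Return the (area, prefix, line) digit groups, or None if text does not match."""
    match = PHONE.match(text)
    return match.groups() if match else None


for sample in SAMPLES:
    print(sample, "->", split_number(sample))
```

Because the separator classes are independent of each other, this pattern is permissive: it also accepts mixed or unbalanced punctuation such as (123-456 7899.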
Without knowing what language you're using I am unsure whether or not the syntax is correct.
This should match all of your groups with very few false positives:
The groups you will be interested in after the match are groups 1, 3, and 4. Group 2 exists only to make sure that the first and second separator characters (a space, ".", or "-") are the same.
For example, a sed command to strip the characters and leave phone numbers in the form 1234567899:
Here are the false positives of my expression:
Breaking up the expression into two parts, one that matches numbers with parentheses and one that matches numbers without them, will eliminate all of these false positives except for the first one:
Groups 1, 3, and 4 or 5, 7, and 8 would matter in this case.
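The expressions from this answer did not survive on this page, so the following is only a hypothetical reconstruction of the idea, written in Python. It assumes a two-branch pattern whose group numbering lines up with the description (groups 1, 3, 4 for the parenthesized branch and 5, 7, 8 for the other) and uses a backreference to force both separators to be the same. PHONE and normalize are illustrative names, and the separator is assumed to be optional so that 1234567899 is covered too.

```python
import re

# Hypothetical reconstruction, not the answer's original expression.
# Branch 1 (groups 1-4): area code wrapped in parentheses.
# Branch 2 (groups 5-8): no parentheses at all.
# Groups 2 and 6 capture the first separator; \2 and \6 require the second
# separator to be identical to it (both may be empty).
PHONE = re.compile(
    r"\((\d{3})\)([ .-]?)(\d{3})\2(\d{4})"
    r"|(\d{3})([ .-]?)(\d{3})\6(\d{4})"
)


def normalize(text):
    """Return the ten digits of a phone number, or None if text does not match."""
    m = PHONE.fullmatch(text)
    if m is None:
        return None
    if m.group(1) is not None:  # the parenthesized branch matched
        return m.group(1) + m.group(3) + m.group(4)
    return m.group(5) + m.group(7) + m.group(8)


print(normalize("(123).456.7899"))
print(normalize("123-456 7899"))
```

Under these assumptions, mixed separators and unbalanced parentheses are rejected, at the cost of a longer expression.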
This will help you catch the ones with an area code in parentheses:
The others are:
I separated the first one from the second because putting them together without backtracking could lead you to accept
(123 456 7890
or
123) 456 7890
Note also that on my terminal, using grep, I had to escape the { } used for the repetition. You may not have to, or you may have to escape other characters, depending on where you intend to use this.
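The reason for keeping the two cases separate can be seen in a small Python sketch. All three patterns below are assumptions written for this illustration (the answer's own expressions are not shown on this page): COMBINED makes each parenthesis independently optional, while WITH_PARENS and WITHOUT_PARENS treat the two cases separately.

```python
import re

# Illustrative patterns only; none of them is taken from the answer.
COMBINED = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")   # each parenthesis optional on its own
WITH_PARENS = re.compile(r"\(\d{3}\)[ .-]?\d{3}[ .-]?\d{4}")  # both parentheses required
WITHOUT_PARENS = re.compile(r"\d{3}[ .-]?\d{3}[ .-]?\d{4}")   # no parentheses allowed


def accepts_combined(text):
    """True if the single combined pattern matches the whole text."""
    return COMBINED.fullmatch(text) is not None


def accepts_separate(text):
    """True if either of the two separate patterns matches the whole text."""
    return (WITH_PARENS.fullmatch(text) is not None
            or WITHOUT_PARENS.fullmatch(text) is not None)


for text in ["(123) 456 7890", "123 456 7890", "(123 456 7890", "123) 456 7890"]:
    print(f"{text!r}: combined={accepts_combined(text)}, separate={accepts_separate(text)}")
```

The combined form accepts both unbalanced inputs, whereas the separate patterns reject them.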
^(\(?\d{3}\)?)([ .-])(\d{3})([ .-])(\d{4})$
This should match all except the last pattern.
For the last one you could use a separate pattern:
^\d{10}$
There is also an error: it will match
(123 456 7899
If we break this expression down:
1. ^(\(?\d{3}\)?) : the first character (^) matches the beginning of the text. \(? and \)? each accept the parenthesis or its absence. This is where the error above comes from: you would have to check whether there was an opening parenthesis and, if there was, require the closing one as well, and I don't know whether that is possible using regex only. \d{3} matches three digits.
2. ([ .-]) matches any one of those characters, exactly once.
3. (\d{3}) matches three digits.
4. Same as 2.
5. (\d{4})$ matches four digits followed by the end of the text ($).
Since you want to extract from an HTML page, you would have to drop ^ and $ so that the pattern can match any part of the text, and set the global flag; in JavaScript, /exp/g.
You can test the regex here.
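The extraction step might look like the following in Python, where re.finditer plays the role of the JavaScript g flag. This is a sketch under assumptions: the lookarounds (?<!\d) and (?!\d) are not part of the answer and are added so that a match cannot begin or end inside a longer run of digits, and PHONE, extract_numbers, and the sample page are made up for the example. Only the separated formats are covered; the plain ten-digit form would need its own pattern, as noted above.

```python
import re

# The answer's pattern without ^ and $. The lookarounds are an added assumption
# that keeps a match from starting or ending in the middle of a digit run.
PHONE = re.compile(r"(?<!\d)(\(?\d{3}\)?)([ .-])(\d{3})([ .-])(\d{4})(?!\d)")


def extract_numbers(html):
    """Return every substring of html that matches PHONE, in document order."""
    return [m.group(0) for m in PHONE.finditer(html)]


# Hypothetical page content used only for this demonstration.
page = "<p>Call (123) 456 7899 or 123-456-7899.</p><p>Order 12345678901234</p>"
print(extract_numbers(page))
```

For a very large set of files, the same function can be applied to one file at a time so that only a single page is held in memory.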