Django or Python: manipulating email addresses and inferring the domain
I want to be able to parse email addresses to isolate the domain part, and test if an email address is part of a given domain.
The `email` module doesn't, as far as I can tell, do that. Is there anything worth using to do this other than the usual string handling and regex routines?
Note: I know how to deal with python strings. I don't need basic recipes, although awesome recipes are welcome.
The problem here is essentially that email addresses have the format (schematically) `userpart@sub\.domain\.[sld]+\.tld`.
Stripping the part before the @ is easy; the hard part is parsing the domain to work out which parts are subdomains on a larger organisation's domain, rather than generic second-level (or, I guess even higher order) public domains.
Imagine parsing [email protected] to find that the organisation's domain name is `organisation.co.uk`, and so be able to match both `mail.organisation.co.uk` and `finance.organisation.co.uk` as subdomains of `organisation.co.uk`.
There are basically two possible (non-DNS-based) approaches: build a finite automaton that knows about all generic slds and their relation to the tld (including popular 'fake' slds like `uk.com`), or try to guess, based on the knowledge that there must be a tld, and assuming that if there are three (or more) elements, the second-level domain is generic if it has fewer than three or four characters. The relative drawbacks of each approach should be obvious.
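To make the first approach concrete, here is a minimal sketch of a suffix-table lookup. The names `PUBLIC_SUFFIXES`, `registrable_domain` and `same_organisation` are hypothetical, and the table is a tiny hand-picked sample for illustration only; a usable version would need a complete, maintained list of public suffixes.

```python
# Tiny illustrative sample only; a real table would be far larger
# and would have to be kept up to date.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk", "ac.uk", "uk.com"}


def registrable_domain(email):
    """Return the organisation-level domain of an email address, or None."""
    host = email.rsplit("@", 1)[-1].lower().rstrip(".")
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that 'co.uk' is preferred over plain 'uk'.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            # Keep exactly one label in front of the public suffix.
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


def same_organisation(email, org_domain):
    """True if the address sits on org_domain or one of its subdomains."""
    return registrable_domain(email) == org_domain.lower()
```

With this sample table, `bob@mail.organisation.co.uk` and `ann@finance.organisation.co.uk` both resolve to `organisation.co.uk`. The obvious drawback is the one noted above: the result is only as good as the table.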
The alternative is to look through DNS entries to work out what is a registered domain, which has its own drawbacks.
In any case, I would rather piggyback on the work of others.
Comments (2)
As per @dm03514's comment, there is a Python library that does exactly this: tldextract.
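A hedged sketch of how tldextract might be applied to the question follows. The wrapper names `organisation_domain` and `belongs_to`, and the optional `extract` parameter, are hypothetical and not part of the library. The sketch assumes only that `tldextract.extract()` returns an object with `domain` and `suffix` attributes, and that the package is installed (`pip install tldextract`).

```python
def organisation_domain(email, extract=None):
    """Return e.g. 'organisation.co.uk' for 'bob@mail.organisation.co.uk'.

    `extract` is a hypothetical hook for swapping in another extractor;
    by default the third-party tldextract package is used.
    """
    if extract is None:
        import tldextract  # third-party dependency, imported lazily
        extract = tldextract.extract
    host = email.rsplit("@", 1)[-1].lower()
    parts = extract(host)
    if not parts.domain or not parts.suffix:
        return None  # no recognisable registered domain
    return f"{parts.domain}.{parts.suffix}"


def belongs_to(email, org_domain, extract=None):
    """True if the address is on org_domain or any of its subdomains."""
    return organisation_domain(email, extract) == org_domain.lower()
```

For example, `belongs_to("bob@mail.organisation.co.uk", "organisation.co.uk")` would be expected to hold, since `co.uk` is treated as the public suffix. Note that tldextract may try to download a fresh suffix list on first use and cache it, so behaviour can depend on network access.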
With this simple script, we replace `@` with `@.` so that our domain is terminated and the `endswith` won't match a domain that merely ends with the same text.
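A minimal sketch of the trick described here, using only the standard library. The function name `email_in_domain` is hypothetical and the details are assumptions, since only the description of the script is given.

```python
def email_in_domain(email, domain):
    """True if the address is on `domain` or on one of its subdomains."""
    # 'bob@example.com' becomes 'bob@.example.com', so the bare domain and
    # every subdomain both end with '.example.com', while
    # 'bob@notexample.com' does not.
    terminated = email.lower().replace("@", "@.")
    return terminated.endswith("." + domain.lower())
```

This only answers "is this address under a domain I already know?". It does not work out what the organisation's domain is, so the public-suffix problem from the question still needs a separate solution.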