Django or Python: manipulating email addresses and inferring the domain

Posted on 2024-12-09 01:42:31

I want to be able to parse email addresses to isolate the domain part, and test if an email address is part of a given domain.

The email module doesn't, as far as I can tell, do that. Is there anything worth using to do this other than the usual string handling and regex routines?

Note: I know how to deal with Python strings. I don't need basic recipes, although awesome recipes are welcome.

The problem here is essentially that email addresses have the format (schematically) userpart@sub\.domain\.[sld]+\.tld.

Stripping the part before the @ is easy; the hard part is parsing the domain to work out which parts are subdomains on a larger organisation's domain, rather than generic second-level (or, I guess even higher order) public domains.

Imagine parsing [email protected] to find that the organisation's domain name is organisation.co.uk and so be able to match both mail.organisation.co.uk and finance.organisation.co.uk as subdomains of organisation.co.uk.

There are basically two possible (non-DNS-based) approaches. One is to build a finite automaton that knows about all generic SLDs and their relation to the TLD (including popular 'fake' SLDs like uk.com). The other is to guess: given that there must be a TLD, assume that when there are three or more elements, the second-level domain is generic if it has fewer than three or four characters. The relative drawbacks of each approach should be obvious.
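
To make the first approach concrete, here is a minimal sketch of a suffix-table lookup. The `PUBLIC_SUFFIXES` set is a tiny hypothetical stand-in (a real table would be loaded from a maintained list), and the function names `registered_domain` and `same_organisation` are invented for illustration, not taken from any library.

```python
# Hypothetical, deliberately tiny suffix table. A real one would be loaded
# from a maintained source rather than hard-coded like this.
PUBLIC_SUFFIXES = {"com", "org", "uk", "co.uk", "org.uk", "uk.com"}


def registered_domain(address):
    """Return the organisation-level domain of an email address, or None."""
    # Everything after the last "@", lower-cased, without a trailing dot.
    host = address.rpartition("@")[2].lower().rstrip(".")
    labels = host.split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            # i == 0 means the whole host is itself a public suffix.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None  # no known suffix matched


def same_organisation(address, org_domain):
    """True if the address sits on org_domain or one of its subdomains."""
    return registered_domain(address) == org_domain.lower()
```

With this table, both `user@mail.organisation.co.uk` and `user@finance.organisation.co.uk` resolve to `organisation.co.uk`. The obvious drawback is that the result is only as good as the table.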

The alternative is to look through DNS entries to work out what is a registered domain, which has its own drawbacks.

In any case, I would rather piggyback on the work of others.

Comments (2)

旧竹 2024-12-16 01:42:31

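As per @dm03514's comment, there is a Python library that does exactly this, tldextract (the output below is from an older release; newer releases name the last field suffix instead of tld):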

>>> import tldextract
>>> tldextract.extract('[email protected]')
ExtractResult(subdomain='bar', domain='baz', tld='org.uk')

我不在是我 2024-12-16 01:42:31

With this simple script, we replace @ with @. so that the domain always starts at a label boundary, and endswith won't match a different domain that merely ends with the same text.

def address_in_domain(address, domain):
    return address.replace('@', '@.').endswith('.' + domain)

if __name__ == '__main__':
    addresses = [
        '[email protected]',
        '[email protected]',
        '[email protected]',
    ]
    print([a for a in addresses if address_in_domain(a, 'domain.com')])
    # Prints: ['[email protected]', '[email protected]']
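
A possible variant of the same idea, sketched here under the assumption that matching should ignore letter case and a trailing dot on the host. It compares the host part directly instead of rewriting the address. The name `address_in_domain_normalized` is hypothetical, chosen only to keep it apart from the function above.

```python
def address_in_domain_normalized(address, domain):
    """True if the address's host is `domain` or one of its subdomains."""
    # Host part: everything after the last "@", normalised for comparison.
    host = address.rpartition("@")[2].strip().lower().rstrip(".")
    domain = domain.strip().lower().strip(".")
    # Exact match, or a match on a label boundary (note the leading dot).
    return host == domain or host.endswith("." + domain)
```

The leading dot in the `endswith` check plays the same role as the `@.` replacement: a host such as `notdomain.com` is not treated as part of `domain.com`.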