Processing a large text file in Python
I have a very large file (3.8G) that is an extract of users from a system at my school. I need to reprocess that file so that it just contains their ID and email address, comma separated.
I have very little experience with this and would like to use it as a learning exercise for Python.
The file has entries that look like this:
dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: [email protected]
dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: [email protected]
I am trying to get a file that looks like:
0099886,[email protected]
0083156,[email protected]
Any tips or code?
Comments (4)
That actually looks like an LDIF file to me. The python-ldap library has a pure-Python LDIF handling library that could help if your file possesses some of the nasty gotchas possible in LDIF, e.g. Base64-encoded values, entry folding, etc.
You could use it like so:
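A minimal sketch of that approach, assuming python-ldap 3.x is installed and the extract is valid LDIF. `LDIFParser`, its `handle(dn, entry)` hook, and `parse()` come from the library's `ldif` module; `row_from_entry`, `ldif_to_csv`, and the file paths are made-up names for illustration. The parser calls `handle` once per entry, so the 3.8G file is never loaded whole. The import sits inside the function only so the small helper can be tried on a machine without the library.

```python
import csv


def row_from_entry(entry, id_attr="LoginId", mail_attr="mail"):
    """Return (login_id, email) for one parsed entry, or None if either is missing.

    `entry` maps attribute names to lists of values, the shape that
    LDIFParser.handle() receives; values are bytes under Python 3.
    """
    def first(attr):
        values = entry.get(attr) or []
        if not values:
            return None
        value = values[0]
        return value.decode("utf-8") if isinstance(value, bytes) else value

    login_id, email = first(id_attr), first(mail_attr)
    if login_id is None or email is None:
        return None
    return login_id, email


def ldif_to_csv(ldif_path, csv_path):
    """Stream an LDIF file into a two-column CSV (illustrative helper)."""
    from ldif import LDIFParser  # shipped with python-ldap

    with open(ldif_path, "rb") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)

        class Extractor(LDIFParser):
            def handle(self, dn, entry):
                # parse() calls this once per entry, after the library has
                # unfolded continuation lines and decoded Base64 values.
                row = row_from_entry(entry)
                if row is not None:
                    writer.writerow(row)

        Extractor(src).parse()
```

Calling `ldif_to_csv("users.ldif", "users.csv")` would then write one `LoginId,mail` line per entry that has both attributes.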
Edit
So to extract from a live LDAP directory, using the python-ldap library you would want to do something like this:
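A rough sketch of such a query, assuming python-ldap is installed and a simple bind is allowed. `ldap.initialize`, `simple_bind_s`, `search_s`, and `unbind_s` are python-ldap calls; the server URI, bind DN, and password are placeholders you would have to supply, and `rows_from_results` and `fetch_students` are names invented here. The filter is left at the match-everything `(objectClass=*)`.

```python
def rows_from_results(results, id_attr="LoginId", mail_attr="mail"):
    """Turn search results [(dn, {attr: [bytes, ...]}), ...] into (id, email) rows."""
    rows = []
    for dn, attrs in results:
        if dn is None:
            # Some servers return referral tuples with dn=None; skip them.
            continue
        ids = attrs.get(id_attr, [])
        mails = attrs.get(mail_attr, [])
        if ids and mails:
            rows.append((ids[0].decode("utf-8"), mails[0].decode("utf-8")))
    return rows


def fetch_students(uri, bind_dn, password, base_dn):
    """Query a live directory and return (LoginId, mail) rows (illustrative)."""
    import ldap  # python-ldap, imported lazily so the helper above works without it

    conn = ldap.initialize(uri)
    conn.simple_bind_s(bind_dn, password)
    try:
        results = conn.search_s(
            base_dn,
            ldap.SCOPE_SUBTREE,
            "(objectClass=*)",       # no real filter yet
            ["LoginId", "mail"],     # only fetch the two attributes we need
        )
    finally:
        conn.unbind_s()
    return rows_from_results(results)
```

With the base DN from the question this might be called as `fetch_students("ldap://ldap.example.edu", bind_dn, password, "ou=Students,o=system.edu,o=system")`, where the host name is hypothetical.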
It's probably worthwhile reading through the documentation for the ldap module (http://www.python-ldap.org/doc/html/ldap.html), especially the example.
Note that in the example above, I completely skipped supplying a filter, which you would probably want to do in production. A filter in LDAP is similar to the `WHERE` clause in a SQL statement; it restricts what objects are returned. Microsoft actually has a good guide on LDAP filters. The canonical reference for LDAP filters is RFC 4515.
Similarly, if there are potentially several thousand entries even after applying an appropriate filter, you may need to look into the LDAP paging control, though using that would, again, make the example more complex. Hopefully that's enough to get you started, but if anything comes up, feel free to ask or open a new question.
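To make the filter point a bit more concrete, here is a small standard-library sketch that builds RFC 4515 filter strings. The helper names are invented for this example; the escape sequences are the ones the RFC defines for `*`, `(`, `)`, `\` and NUL. python-ldap also ships `ldap.filter.escape_filter_chars` for the escaping step if you would rather not write it yourself.

```python
# Characters that must be escaped inside a filter value (RFC 4515).
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def escape_filter_value(value):
    """Escape a literal value so it cannot change the filter's structure."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def equals(attr, value):
    """(attr=value): match entries whose attribute equals the value."""
    return "({}={})".format(attr, escape_filter_value(value))


def present(attr):
    """(attr=*): match entries that have the attribute at all."""
    return "({}=*)".format(attr)


def all_of(*clauses):
    """(&...): match entries that satisfy every clause, like AND in SQL."""
    return "(&{})".format("".join(clauses))


# Only entries that actually have both attributes we want to export.
student_filter = all_of(present("LoginId"), present("mail"))
```

The resulting string can be passed as the filter argument of a search in place of a match-everything filter.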
Good luck.
Assuming that the structure of each entry will always be the same, just do something like this:
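A minimal sketch of what that could look like, assuming every entry is a `dn:` line followed by a `LoginId:` line and then a `mail:` line, with no folded or Base64-encoded values. `extract_pairs` is a name made up for this example. It reads one line at a time, so the size of the file does not matter.

```python
def extract_pairs(lines, out):
    """Write 'LoginId,mail' lines to `out` for each entry found in `lines`."""
    login_id = None
    for line in lines:
        line = line.strip()
        if line.startswith("dn:"):
            # New entry: forget any LoginId left over from the previous one.
            login_id = None
        elif line.startswith("LoginId:"):
            login_id = line.split(":", 1)[1].strip()
        elif line.startswith("mail:") and login_id is not None:
            email = line.split(":", 1)[1].strip()
            out.write("{},{}\n".format(login_id, email))
            login_id = None
```

Used as `with open("users.txt") as src, open("users.csv", "w") as dst: extract_pairs(src, dst)`, where both file names are placeholders.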
Again assuming your file is well-formed:
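One way this might look, as a sketch: a generator that pulls the two fields out with a regular expression and yields a pair as soon as it has both, so the order of `LoginId` and `mail` inside an entry does not matter. The names `FIELD` and `iter_pairs` are invented here, and plain unfolded `attribute: value` lines are assumed.

```python
import re

# Matches the two attributes we care about and captures the trimmed value.
FIELD = re.compile(r"^(LoginId|mail):\s*(.*?)\s*$")


def iter_pairs(lines):
    """Yield (login_id, email) tuples lazily from an iterable of lines."""
    current = {}
    for line in lines:
        if line.startswith("dn:"):
            current = {}  # a new entry starts; drop any partial data
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = match.group(2)
            if len(current) == 2:
                yield current["LoginId"], current["mail"]
                current = {}
```

Because it is a generator, something like `for login_id, email in iter_pairs(open_file): ...` only ever holds one entry in memory.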
Essentially equivalent to the other methods.
To open the file you'll want to use something like the `with` keyword to ensure it closes properly even if something goes wrong.
As for actually parsing out that information, I'd recommend building a dictionary of ID-email pairs. You'll also need a variable for the uid and the email.
To actually parse through the file (the stuff run while your file is open) you can do something like this:
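A sketch of how those pieces might fit together: the file is opened in a `with` block, `uid` and `email` hold the values for the entry being read, and each finished pair goes into a dictionary. `parse_pairs` is a made-up name, and UTF-8 text with one `attribute: value` per line is assumed.

```python
def parse_pairs(path):
    """Read the extract at `path` and return a {LoginId: email} dictionary."""
    pairs = {}
    uid = None
    with open(path, encoding="utf-8") as source:  # closed even if an error occurs
        for line in source:
            key, _, value = line.partition(":")
            value = value.strip()
            if key == "dn":
                uid = None            # start of a new entry
            elif key == "LoginId":
                uid = value
            elif key == "mail" and uid is not None:
                email = value
                pairs[uid] = email
                uid = None
    return pairs
```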
Using the CSV writer (remember to import csv at the beginning of the file) we can output like this:
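For example, something along these lines, assuming the pairs are already in a dictionary keyed by ID. `write_pairs` and the header names are choices made for this sketch; the question's target file has no header row, so that line can simply be dropped.

```python
import csv


def write_pairs(pairs, out_path):
    """Write a {LoginId: email} dictionary to `out_path` as CSV."""
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["LoginId", "mail"])   # optional header row
        writer.writerows(pairs.items())        # one (id, email) row per pair
```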
Another option is to open the writer before the file, write the header, then read the lines from the file at the same time as writing to the CSV. This avoids dumping the information into memory, which might be highly desirable. So putting it all together we get
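A sketch of that streaming version, under the same assumptions as before (UTF-8 text, one unfolded `attribute: value` per line). `convert` is an invented name, and returning the row count is an addition of this example to make the result easy to check.

```python
import csv


def convert(src_path, dst_path):
    """Stream src_path into dst_path as CSV; return the number of rows written."""
    rows = 0
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["LoginId", "mail"])  # header; drop if not wanted
        with open(src_path, encoding="utf-8") as src:
            uid = None
            for line in src:
                key, _, value = line.partition(":")
                value = value.strip()
                if key == "dn":
                    uid = None                 # start of a new entry
                elif key == "LoginId":
                    uid = value
                elif key == "mail" and uid is not None:
                    writer.writerow([uid, value])  # written immediately, not stored
                    uid = None
                    rows += 1
    return rows
```

Only the current line and the current `uid` are held at any time, which is what makes this workable for a 3.8G input.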