Processing large text files in Python

I have a very large file (3.8G) that is an extract of users from a system at my school. I need to reprocess that file so that it just contains their ID and email address, comma separated.

I have very little experience with this and would like to use it as a learning exercise for Python.

The file has entries that look like this:

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: [email protected]

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: [email protected]

I am trying to get a file that looks like:

0099886,[email protected]
0083156,[email protected]

Any tips or code?


Comments (4)

风铃鹿 2024-12-03 13:11:17


That actually looks like an LDIF file to me. The python-ldap library has a pure-Python LDIF handling library that could help if your file possesses some of the nasty gotchas possible in LDIF, e.g. Base64-encoded values, entry folding, etc.

You could use it like so:

import csv
import ldif

class ParseRecords(ldif.LDIFParser):
    def __init__(self, input_file, csv_writer):
        # The base LDIFParser needs the file object it will read from.
        ldif.LDIFParser.__init__(self, input_file)
        self.csv_writer = csv_writer

    def handle(self, dn, entry):
        # Attribute values arrive as lists, so take the first value of each
        # (on recent python-ldap versions they may be bytes and need .decode()).
        self.csv_writer.writerow([entry['LoginId'][0], entry['mail'][0]])

with open('/path/to/large_file') as input_file, open('output_file', 'w', newline='') as output:
    csv_writer = csv.writer(output)
    csv_writer.writerow(['LoginId', 'Mail'])
    ParseRecords(input_file, csv_writer).parse()

Edit

So to extract from a live LDAP directory, using the python-ldap library you would want to do something like this:

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if your LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'w', newline='') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        for dn, attrs in con.search_s('ou=Students,o=system.edu,o=system',
                                      ldap.SCOPE_SUBTREE,
                                      attrlist=['LoginId', 'mail']):
            # Attribute values come back as lists; take the first of each
            # (they may be bytes on python-ldap 3.x and need .decode()).
            csv_writer.writerow([attrs['LoginId'][0], attrs['mail'][0]])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

It's probably worthwhile reading through the documentation for the ldap module (http://www.python-ldap.org/doc/html/ldap.html), especially the example.

Note that in the example above, I completely skipped supplying a filter, which you would probably want to do in production. A filter in LDAP is similar to the WHERE clause in a SQL statement; it restricts what objects are returned. Microsoft actually has a good guide on LDAP filters. The canonical reference for LDAP filters is RFC 4515.
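
For example, a filter that only returns entries carrying both of the attributes you care about could look like the sketch below. The attribute names come from your extract; everything else (the object names, the idea that presence filters are what you want) is an assumption, so adjust it to match your directory's schema:

import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')

# Hypothetical filter: only match entries that have both a LoginId and a mail
# attribute. '&' means AND, and 'attr=*' means "the attribute is present".
filterstr = '(&(LoginId=*)(mail=*))'

results = con.search_s('ou=Students,o=system.edu,o=system',
                       ldap.SCOPE_SUBTREE,
                       filterstr,
                       attrlist=['LoginId', 'mail'])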

Similarly, if there are potentially several thousand entries even after applying an appropriate filter, you may need to look into the LDAP paging control, though using that would, again, make the example more complex. Hopefully that's enough to get you started, but if anything comes up, feel free to ask or open a new question.
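
If you do end up needing paging, here is a minimal sketch of what it could look like with python-ldap's SimplePagedResultsControl. The page size is arbitrary and the control handling details can differ slightly between python-ldap versions, so treat it as a starting point rather than drop-in code:

import ldap
from ldap.controls import SimplePagedResultsControl

PAGE_SIZE = 500  # arbitrary page size chosen for illustration

con = ldap.initialize('ldap://server.fqdn.system.edu')
page_ctrl = SimplePagedResultsControl(True, size=PAGE_SIZE, cookie='')

while True:
    msgid = con.search_ext('ou=Students,o=system.edu,o=system',
                           ldap.SCOPE_SUBTREE,
                           attrlist=['LoginId', 'mail'],
                           serverctrls=[page_ctrl])
    rtype, rdata, rmsgid, serverctrls = con.result3(msgid)

    for dn, attrs in rdata:
        pass  # process each entry here, e.g. write its CSV row

    # The server echoes the paging control back with a cookie; an empty
    # cookie means the last page has been delivered.
    resp_ctrls = [c for c in serverctrls
                  if c.controlType == SimplePagedResultsControl.controlType]
    if not resp_ctrls or not resp_ctrls[0].cookie:
        break
    page_ctrl.cookie = resp_ctrls[0].cookie

con.unbind_s()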

Good luck.

冷清清 2024-12-03 13:11:17


Assuming that the structure of each entry will always be the same, just do something like this:

import csv

# Open the file
f = open("/path/to/large.file", "r")
# Create an output file (newline="" avoids extra blank rows from the csv module on Windows)
output_file = open("/desired/path/to/final/file", "w", newline="")

# Use the CSV module to make use of existing functionality.
final_file = csv.writer(output_file)

# Write the header row - can be skipped if headers not needed.
final_file.writerow(["LoginID","EmailAddress"])

# Set up our temporary cache for a user
current_user = []

# Iterate over the large file
# Note that we are avoiding loading the entire file into memory
for line in f:
    if line.startswith("LoginID"):
        current_user.append(line[9:].strip())
    # If more information is desired, simply add it to the conditions here
    # (additional elif's should do)
    # and add it to the current user.

    elif line.startswith("mail"):
        current_user.append(line[6:].strip())
        # Once you know you have reached the end of a user entry
        # write the row to the final file
        # and clear your temporary list.
        final_file.writerow(current_user)
        current_user = []

    # Skip lines that aren't interesting.
    else:
        continue

# Close both files so the output is flushed to disk.
f.close()
output_file.close()

天涯离梦残月幽梦 2024-12-03 13:11:17


Again assuming your file is well-formed:

with open(inputfilename) as inputfile, open(outputfilename, 'w') as outputfile:
    mail = loginid = ''
    for line in inputfile:
        fields = line.split(':')
        if fields[0] not in ('LoginId', 'mail'):
            continue
        if fields[0] == 'LoginId':
            loginid = fields[1].strip()
        if fields[0] == 'mail':
            mail = fields[1].strip()
        if mail and loginid:
            outputfile.write(loginid + ',' + mail + '\n')
            mail = loginid = ''

Essentially equivalent to the other methods.

瘫痪情歌 2024-12-03 13:11:17

要打开文件,您需要使用类似 with 关键字的内容,以确保即使出现问题也能正确关闭:

with open(<your_file>, "r") as f:
   # Do stuff

至于实际解析该信息,我建议构建一个 ID 字典电子邮件对。您还需要一个用于 uid 和电子邮件的变量。

data = {}
uid = 0
email = ""

要实际解析文件(文件打开时运行的内容),您可以执行以下操作:

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

使用 CSV 编写器(记住在文件开头导入 csv),我们可以这样输出:

writer = csv.writer(<filename>)
writer.writerow("User, Email")
for id, mail in data.iteritems:
    writer.writerow(id + "," + mail)

另一个选项是打开作者在文件之前写入标题,然后在写入 CSV 的同时从文件中读取行。这可以避免将信息转储到内存中,这可能是非常理想的。所以把它们放在一起我们得到

writer = csv.writer(<filename>)
writer.writerow("User, Email")
with open(<your_file>, "r") as f:
    for line in f:
        if "uid=" in line:
            # Parse the user id out by grabbing the substring between the first = and ,
            uid = line[line.find("=")+1:line.find(",")]
        elif "mail:" in line:
            # Parse the email out by grabbing everything from the : to the end (removing the newline character)
            email = line[line.find(": ")+2:-1]
            # Given the formatting you've provided, this comes second so we can make an entry into the dict here
            writer.writerow(iid + "," + email)

To open the file you'll want to use something like the with keyword to ensure it closes properly even if something goes wrong:

with open(<your_file>, "r") as f:
    # Do stuff

As for actually parsing out that information, I'd recommend building a dictionary of ID email pairs. You'll also need a variable for the uid and the email.

data = {}
uid = 0
email = ""

To actually parse through the file (the stuff run while your file is open) you can do something like this:

for line in f:
    if "uid=" in line:
        # Parse the user id out by grabbing the substring between the first = and ,
        uid = line[line.find("=")+1:line.find(",")]
    elif "mail:" in line:
        # Parse the email out by grabbing everything from the : to the end (removing the newline character)
        email = line[line.find(": ")+2:-1]
        # Given the formatting you've provided, this comes second so we can make an entry into the dict here
        data[uid] = email

Using the CSV writer (remember to import csv at the beginning of the file) we can output like this:

# csv.writer needs an open file object rather than a filename.
writer = csv.writer(open(<filename>, "w", newline=""))
writer.writerow(["User", "Email"])
for uid, mail in data.items():
    writer.writerow([uid, mail])

Another option is to open the writer before the file, write the header, then read the lines from the file at the same time as writing to the CSV. This avoids dumping the information into memory, which might be highly desirable. So putting it all together we get

# csv.writer needs an open file object rather than a filename.
writer = csv.writer(open(<filename>, "w", newline=""))
writer.writerow(["User", "Email"])
with open(<your_file>, "r") as f:
    for line in f:
        if "uid=" in line:
            # Parse the user id out by grabbing the substring between the first = and ,
            uid = line[line.find("=")+1:line.find(",")]
        elif "mail:" in line:
            # Parse the email out by grabbing everything from the : to the end (removing the newline character)
            email = line[line.find(": ")+2:-1]
            # Given the formatting you've provided, this comes second, so we can write the row straight away
            writer.writerow([uid, email])