Python - Splitting many small txt files into a data structure to feed into MySQL
I have a few hundred thousand txt files that are a pretty standard shape (they all have common elements - ID, Date, To, From, Subject, Body).
These are not in a structured format such as a multipart email message.
I want to strip these into their constituent parts and feed the whole lot into a db. There are lots of them, so I wanted to make sure the approach would work.
There are a number of key issues I am pondering (and I am no coder - this is learning / hobby stuff).
1) Is there a structured data type I can use to keep the bits together in a sensible way? I was thinking it would be logical to have a file.ID, file.Date, etc. type deal that holds the whole file in a structured way so it can later be ingested into the db. Is this pythony, or a hangover from my tinkering with Matlab?
2) The body section can be several kB large or a single sentence. (1) Is this better as a blob? I would lose the searching, which is kind of the point of doing this. (2) How do I make sure I can construct a field large enough in my MySQL database come ingest time? I won't know the longest size of each element unless I run some kind of counter in the message splitter that tracks the max value seen per message.
3) I figure that I would start with a walk, get the file list from the walk, then pull each file, line by line. I'll use line position to infer some known locations (ID, Date) and then some RegEx or patterns based on features to split the rest. Once I have split the files up, I plan to ingest them. However, I wonder if it would be more logical to connect to the db at the end of each message and dump the parts into its own record one by one.
Time is no drama, it can run for a week for all that matters. I have about 8 GB of RAM on an i7, so again I'm not specifically short on resources, and I'm happy to let it grind its way through.
Does this sound logical? Have I missed a core step?
Thanks.
Ad 1)
I think the most 'pythonic' way to store this structured data would be to use a `dict`. The other solution would be to declare a `class`, but since you don't plan to do further processing (i.e. you wouldn't need any methods for your datatype), you should stick with the simplest possible solution (imo). Just use a `dict` named `data` to store the data from each file.
Ad 2)
On the Python side you can just use strings (i.e. `str`, or `unicode` if you're on < 3.0). Strings in Python have no size limit (besides your architecture limit, but on a 64-bit machine, that's not really a problem ...). On the MySQL side, I would use `TEXT` as the datatype for the body section. You could also use a `VARCHAR`, but you would need to give a maximal length.
Ad 3)
I would recommend processing each file independently, i.e. parsing it and writing it to the db immediately afterwards. Imo there is no reason not to do so. There's no need to fill memory with all the data (or risk a crash just before the last file is read without anything having been written to the db). I would probably use some mechanism to mark processed files (move them to another tree, rename them): if I need to restart the program for some reason, this would prevent processing the same file twice.
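To make the three points concrete, here is a minimal sketch rather than a definitive implementation. Several things in it are assumptions, since the question doesn't show a sample file: the header layout (`Field: value` lines, a blank line, then the body), the helper names `parse_message` and `ingest_tree`, and the `messages` table with its column names are all hypothetical. It also uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server.

```python
import os
import re
import shutil
import sqlite3

# ASSUMPTION: each file starts with "Field: value" header lines, followed by
# a blank line and then the body. Adjust the pattern to the real layout.
HEADER_RE = re.compile(r"^(ID|Date|To|From|Subject):\s*(.*)$")


def parse_message(text):
    """Split one message into a plain dict (point 1)."""
    data = {}
    lines = text.splitlines()
    body_start = len(lines)  # if no blank line is found, the body is empty
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if match:
            data[match.group(1).lower()] = match.group(2).strip()
        elif not line.strip():
            body_start = i + 1  # the first blank line ends the header block
            break
    data["body"] = "\n".join(lines[body_start:]).strip()
    return data


def ingest_tree(src_root, done_root, conn):
    """Parse, store and then move every .txt file under src_root (point 3).

    done_root must lie outside src_root so the walk never revisits
    files that have already been moved.
    """
    # Hypothetical table; the body is a TEXT column (point 2).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id TEXT, date TEXT, recipient TEXT, sender TEXT, "
        "subject TEXT, body TEXT)"
    )
    count = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                data = parse_message(handle.read())
            # Write this one message straight away instead of collecting
            # everything in memory first.
            conn.execute(
                "INSERT INTO messages "
                "(id, date, recipient, sender, subject, body) "
                "VALUES (?, ?, ?, ?, ?, ?)",
                (data.get("id"), data.get("date"), data.get("to"),
                 data.get("from"), data.get("subject"), data["body"]),
            )
            conn.commit()
            # Mark the file as processed by moving it to a parallel tree,
            # so a restart skips it.
            target = os.path.join(done_root, os.path.relpath(path, src_root))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.move(path, target)
            count += 1
    return count
```

A call would look like `ingest_tree("inbox", "processed", sqlite3.connect("mail.db"))`, where the directory and file names are placeholders. With a MySQL driver the connection would come from that driver, statements would normally go through a cursor, and the placeholders are typically `%s` rather than `?`. Note that a MySQL `TEXT` column holds up to 65,535 bytes, so `MEDIUMTEXT` may be the safer choice if some bodies turn out to be larger than that.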