Irregular string parsing in Python
I'm new to Python/Django and I am trying to get more useful information out of my scraper. Currently, the scraper takes a list of comic book titles and correctly divides each one into a three-part CSV record (Published Date, Original Date, and Title). I then pass the current date and title through to different parts of my database, which I do in my Loader script (convert mm/dd/yy into yyyy-mm-dd, save to the "pub_date" column; the title goes to the "title" column).
A common string can look like this:
10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)
I am successfully grabbing the date, but the title is trickier. In this instance, I'd ideally like to fill three different columns with the information after the second "|". The title should go to "title", a CharField. The number 12 (after the "#") should go into the DecimalField "issue_num", and everything between the parentheses should go into the "Special" CharField. I am not sure how to do this kind of rigorous parsing.
Sometimes there are multiple #'s (one comic in particular is described as a bundle, "Containing issues #90-#95"), and several have multiple "()" groups (such as "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)").
What would be a good way to start cracking this problem? My knowledge of if/else statements quickly fell apart on the more complicated lines. How can I efficiently and (if possible) Pythonically parse these lines and subdivide them so I can later slot them into the correct place in my database?
Comments (3)
Use the regular expression module, re. For example, if you have the third |-delimited field of your sample record in a variable s, you can pull each piece out of it with a pattern. You'll get an IndexError on the lookup for any missing field. Adapt the RE until it parses everything you want.
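A minimal sketch of this kind of lookup, offered as an illustration and not as the answer's own code. The function name parse_fields and the three patterns are assumptions chosen to fit the sample record; each lookup indexes into the list returned by re.findall, which is why a missing field surfaces as an IndexError.

```python
import re

def parse_fields(s):
    """Split a title field into (title, issue_num, special).

    Hypothetical helper: each lookup takes the first findall() hit,
    so a missing field raises IndexError.
    """
    # Everything before the first '#' or '(' is treated as the title.
    title = re.findall(r"^[^#(]+", s)[0].strip()
    # First run of digits that follows a '#'.
    issue_num = re.findall(r"#(\d+)", s)[0]
    # Contents of the first parenthesized group.
    special = re.findall(r"\(([^)]*)\)", s)[0]
    return title, issue_num, special

s = "Stan Lee's Traveler #12 (10 Copy Incentive Cover)"
print(parse_fields(s))
```

Wrapping each lookup in a try/except IndexError, or checking the list before indexing, is one way to let optional fields default to an empty value.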
Parsing the title is the hard part; it sounds like you can handle the dates etc. yourself. The problem is that no single rule can parse every title. There are many rules, and you can only guess which one works on a particular title.
I usually handle this by creating a list of rules, ordered from most specific to most general, and trying them one by one until one matches.
To write such rules you can use the re module or even pyparsing.
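A rough sketch of the rule-list idea, under assumptions: the three patterns, the RULES list, and the parse_with_rules name are all hypothetical and cover only a few title shapes. Rules are tried in order and the first match wins.

```python
import re

# Hypothetical rules, ordered from most specific to most general.
RULES = [
    # Title, issue number, and exactly one parenthesized note.
    re.compile(r"^(?P<title>[^#]+?)\s*#(?P<issue_num>\d+)\s*\((?P<special>[^)]*)\)$"),
    # Title and issue number only.
    re.compile(r"^(?P<title>[^#]+?)\s*#(?P<issue_num>\d+)$"),
    # Catch-all: treat the whole string as the title.
    re.compile(r"^(?P<title>.+)$"),
]

def parse_with_rules(text):
    """Return the named groups of the first rule that matches, else None."""
    text = text.strip()
    for rule in RULES:
        match = rule.match(text)
        if match:
            return match.groupdict()
    return None

for sample in (
    "Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
    "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)",
):
    print(parse_with_rules(sample))
```

With these particular rules, the Stan Lee title is split into title, issue_num, and special, while the Planet Of The Apes title matches only the catch-all and comes back as one undivided title. Covering it would take an additional, more specific rule placed ahead of the others.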
As you can see, the first title is parsed correctly, but the second is not. You'll have to write a bunch of rules that account for every possible title format in your data.
Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format (PEP 3101) to a regular expression. In your case, you would describe the title with a format string and let the parser match it.
The output in this case is an (ordered) dictionary. This will work for any simple case, and you might tweak it to catch multiple issues or multiple () groups.
One more thing: notice that in the current version you need to manually escape regex characters (i.e. if you want to find |, you need to type \|). I am planning to change this soon.
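To illustrate the general idea of turning a PEP 3101 format string into a regular expression, here is a standard-library sketch. It is not stringparser's implementation or API: the format_to_regex helper is hypothetical, it assumes every field is named, and it handles only the "d" format spec. Unlike the behaviour described above, this sketch escapes literal text automatically with re.escape.

```python
import re
from string import Formatter

def format_to_regex(fmt):
    """Build a compiled regex from a format string with named fields.

    Hypothetical helper: '{name:d}' becomes a digits-only group,
    any other '{name}' becomes a lazy catch-all group.
    """
    parts = []
    for literal, field, spec, _conversion in Formatter().parse(fmt):
        # Literal text between fields is matched verbatim.
        parts.append(re.escape(literal))
        if field is not None:
            sub_pattern = r"\d+" if spec == "d" else r".+?"
            parts.append(f"(?P<{field}>{sub_pattern})")
    return re.compile("^" + "".join(parts) + "$")

pattern = format_to_regex("{title} #{issue_num:d} ({special})")
match = pattern.match("Stan Lee's Traveler #12 (10 Copy Incentive Cover)")
print(match.groupdict())
```

The result is a plain dictionary keyed by field name, so the values can be passed on to the corresponding database columns.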