How do I extract fields from lines of text that have no constant separator?
What is the best way to extract each field from each line when there is no clear separator (delimiter) between the fields?
Here is a sample of the lines whose fields I need to extract:
3/3/2010 11:00:46 AM BASEMENT-IN
3/3/2010 11:04:04 AM 2, YaserAlNaqeb BASEMENT-OUT
3/3/2010 11:04:06 AM BASEMENT-IN
3/3/2010 11:04:18 AM BASEMENT-IN
3/3/2010 11:14:32 AM 4, Dhileep BASEMENT-OUT
3/3/2010 11:14:34 AM BASEMENT-IN
3/3/2010 11:14:41 AM BASEMENT-IN
3/3/2010 11:15:33 AM 4, Dhileep BASEMENT-IN
3/3/2010 11:15:42 AM BASEMENT-IN
3/3/2010 11:15:42 AM BASEMENT-IN
3/3/2010 11:30:22 AM 34, KumarRaju BASEMENT-IN
3/3/2010 11:31:28 AM 39, Eldrin BASEMENT-OUT
3/3/2010 11:31:31 AM BASEMENT-IN
3/3/2010 11:31:39 AM BASEMENT-IN
3/3/2010 11:32:38 AM 39, Eldrin BASEMENT-IN
3/3/2010 11:32:47 AM BASEMENT-IN
3/3/2010 11:32:47 AM BASEMENT-IN
3/3/2010 11:33:26 AM 34, KumarRaju BASEMENT-OUT
3/3/2010 11:33:28 AM BASEMENT-IN
There are 6 fields in each line and some of them can be empty. What is the best way to approach this problem?
- I'm using Java
Edit 01
- Field 5 can be empty (however its existence should be recognized in all cases)
- Amount of spaces can change
- Last word can change
Comments (6)
Well, you can strip off the date and the BASEMENT-FOO data by column number, since they always appear at the same point in the line. Then you can split the remainder on commas. Whether you need to handle escaped commas (\,) or commas inside quotes ("foo, bar") is up to you and your business requirements.
To me there seem to be 3 meta-fields:
MF1:
3/3/2010 11:32:38 AM
MF2:
39, Eldrin
MF3:
BASEMENT-IN
of which MF2 is optional. My delimiters then would be:
MF1 up to and including [AM|PM]
MF2 number,anything except BASEMENT-*
MF3 BASEMENT-*
I'm not all that good at regexes but I would extract those 3 groups as something like
where the ? means optional group.
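A regex along the lines this answer describes might look like the sketch below. It is only an illustration, not the answerer's original pattern: the class name `LogLineRegex` and the method `parse` are made up for this example, the name is assumed to contain no spaces, and the last field is matched as any single word (rather than only BASEMENT-*) because the question says the last word can change.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineRegex {
    // Groups: 1 = date, 2 = time, 3 = AM/PM, 4 = number, 5 = name, 6 = last word.
    // The non-capturing group (?: ... )? makes "number, name" optional as a whole.
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d{1,2}:\\d{2}:\\d{2})\\s+(AM|PM)"
        + "(?:\\s+(\\d+)\\s*,\\s*(\\S*))?\\s+(\\S+)\\s*$");

    /** Returns the six fields (missing ones as ""), or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[6];
        for (int i = 0; i < fields.length; i++) {
            String g = m.group(i + 1);      // null when the optional group did not take part
            fields[i] = (g == null) ? "" : g;
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("3/3/2010 11:00:46 AM BASEMENT-IN")));
        System.out.println(Arrays.toString(parse("3/3/2010 11:04:04 AM 2, YaserAlNaqeb BASEMENT-OUT")));
    }
}
```

Because every line yields an array of exactly six entries, field 5 is always present in the result, even when it is empty.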
You can do:
date, time and AM/PM.
then read next piece as name
date,time,int as needed.
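One way to read this suggestion is to split each line on runs of whitespace and assign the pieces by position. The sketch below assumes that reading; the class `WhitespaceFieldSplitter` and its `split` method are hypothetical names, and it further assumes that the number is always followed by ", " (comma, then at least one space) and that the last token is the location.

```java
import java.util.Arrays;

class WhitespaceFieldSplitter {
    /** Returns {date, time, AM/PM, number, name, lastWord}; absent fields are "". */
    static String[] split(String line) {
        String[] t = line.trim().split("\\s+");   // any amount of whitespace separates tokens
        if (t.length < 4) {
            throw new IllegalArgumentException("Too few tokens: " + line);
        }
        // The first three tokens and the last token are always present.
        String[] f = {t[0], t[1], t[2], "", "", t[t.length - 1]};
        if (t.length > 4) {
            f[3] = t[3].replaceAll(",$", "");      // drop the trailing comma after the number
        }
        if (t.length > 5) {
            // Everything between the number and the last word is treated as the name.
            f[4] = String.join(" ", Arrays.copyOfRange(t, 4, t.length - 1));
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("3/3/2010 11:14:32 AM 4, Dhileep BASEMENT-OUT")));
        System.out.println(Arrays.toString(split("3/3/2010 11:14:34 AM BASEMENT-IN")));
    }
}
```

The strings can then be converted as needed, for example with `Integer.parseInt` for the number.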
Find the columns in each line where blank characters are adjacent to non-blank ones, then do a statistical analysis on those numbers: those which occur in every line or almost every line are very probably the field boundaries.
Similarly for punctuation adjacent to letters, but in general it is impossible to guess whether a - or a , is meant to delimit a field or not. If it occurs in the same position in every line, it might be a delimiter, but in lists of things such as D-FL R-TX D-NY it probably isn't. So there can be no fully automatic solution for arbitrary data.
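The counting step could be sketched as follows. This is a simplified illustration, not a complete solution: it only counts blank-to-non-blank transitions (field starts), ignores punctuation, and the names `BoundaryColumns`, `likelyFieldStarts` and `minFraction` are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoundaryColumns {
    /**
     * Returns the column indexes at which a non-blank character follows a blank
     * (or starts the line) in at least minFraction of the given lines.
     */
    static List<Integer> likelyFieldStarts(List<String> lines, double minFraction) {
        int width = 0;
        for (String s : lines) {
            width = Math.max(width, s.length());
        }
        int[] counts = new int[width];             // counts[c] = lines with a field start at column c
        for (String s : lines) {
            for (int c = 0; c < s.length(); c++) {
                boolean blankBefore = c == 0 || Character.isWhitespace(s.charAt(c - 1));
                if (blankBefore && !Character.isWhitespace(s.charAt(c))) {
                    counts[c]++;
                }
            }
        }
        List<Integer> result = new ArrayList<>();
        for (int c = 0; c < width; c++) {
            if (counts[c] >= minFraction * lines.size()) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "3/3/2010 11:00:46 AM BASEMENT-IN",
            "3/3/2010 11:04:04 AM 2, YaserAlNaqeb BASEMENT-OUT",
            "3/3/2010 11:04:06 AM BASEMENT-IN");
        System.out.println(likelyFieldStarts(sample, 1.0));
    }
}
```

For the three sample lines in `main`, with `minFraction` set to 1.0, the columns found are 0, 9, 18 and 21: the starts of the date, the time, AM/PM, and whatever follows. The columns after that differ from line to line, which is exactly where this approach stops being reliable.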
Since each field is very distinct (at least in the example you pasted above), you can do this:
You can use StrTokenizer from Commons Lang and specify multiple delimiters to split on:
There are a number of built-in types that it supports via StrMatcher.
e.g.
will give (from the example above):