How to extract fields from lines of text that have no constant delimiter?

Posted on 2024-08-23 17:07:47, 1520 characters, 5 views, 0 comments

What is the best way to extract each field from each line when there is no clear separator (delimiter) between the fields?

Here is a sample of the lines whose fields I need to extract:

3/3/2010 11:00:46 AM                      BASEMENT-IN          
3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT         
3/3/2010 11:04:06 AM                      BASEMENT-IN          
3/3/2010 11:04:18 AM                      BASEMENT-IN          
3/3/2010 11:14:32 AM 4, Dhileep              BASEMENT-OUT         
3/3/2010 11:14:34 AM                      BASEMENT-IN          
3/3/2010 11:14:41 AM                      BASEMENT-IN          
3/3/2010 11:15:33 AM 4, Dhileep           BASEMENT-IN          
3/3/2010 11:15:42 AM                      BASEMENT-IN          
3/3/2010 11:15:42 AM                      BASEMENT-IN          
3/3/2010 11:30:22 AM 34, KumarRaju        BASEMENT-IN          
3/3/2010 11:31:28 AM 39, Eldrin           BASEMENT-OUT         
3/3/2010 11:31:31 AM                      BASEMENT-IN          
3/3/2010 11:31:39 AM                      BASEMENT-IN          
3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN          
3/3/2010 11:33:26 AM 34, KumarRaju        BASEMENT-OUT         
3/3/2010 11:33:28 AM                      BASEMENT-IN    

There are 6 fields in each line and some of them can be empty. What is the best way to approach this problem?

  • I'm using Java

Edition 01

  • Field 5 can be empty (however its existence should be recognized in all cases)
  • Amount of spaces can change
  • Last word can change

Comments (6)

找个人就嫁了吧 2024-08-30 17:07:47

Well you can strip off the date and the BASEMENT-FOO data by column number, since they always appear at the same point in the line. Then you can split the remainder based on commas. Whether you need to handle escaped commas \, or commas in quotes "foo, bar" is up to you and your business requirements.
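
A minimal sketch of this idea, assuming the timestamp and the BASEMENT-* value really do sit at fixed offsets. The class name FixedColumnParser and the two offset parameters are hypothetical, and the question's edit notes that the amount of spacing can change, so this only applies to truly fixed-width input:

```java
import java.util.Arrays;

final class FixedColumnParser {
    /**
     * Cuts the line at two caller-supplied offsets (assumed fixed for every line),
     * then splits the middle part on the first comma into id and name.
     * Returns {timestamp, id, name, location}; id and name are "" when absent.
     */
    static String[] parse(String line, int middleStart, int locationStart) {
        String timestamp = line.substring(0, middleStart).trim();
        String middle = line.substring(middleStart, locationStart).trim();
        String location = line.substring(locationStart).trim();

        String id = "";
        String name = "";
        if (!middle.isEmpty()) {
            String[] parts = middle.split(",", 2); // at most two parts: id and name
            id = parts[0].trim();
            if (parts.length > 1) {
                name = parts[1].trim();
            }
        }
        return new String[] {timestamp, id, name, location};
    }

    public static void main(String[] args) {
        String head = "3/3/2010 11:04:04 AM";
        String middle = " 2, YaserAlNaqeb      ";
        String line = head + middle + "BASEMENT-OUT";
        // Offsets are derived from the sample itself rather than hard-coded.
        String[] fields = parse(line, head.length(), head.length() + middle.length());
        System.out.println(Arrays.toString(fields));
    }
}
```

Escaped or quoted commas are not handled here; as the answer says, whether they matter depends on the data.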

岁月染过的梦 2024-08-30 17:07:47

To me there seem to be 3 meta-fields:

3/3/2010 11:32:38 AM 39, Eldrin           BASEMENT-IN          
3/3/2010 11:32:47 AM                      BASEMENT-IN 

MF1: 3/3/2010 11:32:38 AM

MF2: 39, Eldrin

MF3: BASEMENT-IN

of which MF2 is optional. My delimiters then would be:

MF1 up to and including [AM|PM]

MF2 number,anything except BASEMENT-*

MF3 BASEMENT-*

I'm not all that good at regexes but I would extract those 3 groups as something like

(anything)(AM|PM)(number,anything)?(BASEMENT-anything)

where the ? means optional group.
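
One way to turn that outline into a concrete Java pattern is sketched below. It splits the line into the six fields from the question rather than three meta-fields. The class name LogLineRegex is made up for illustration, and the pattern assumes the name contains no whitespace and that the last field is a single word (not necessarily starting with BASEMENT-, since the question says the last word can change):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLineRegex {
    // Groups: 1 = date, 2 = time, 3 = AM/PM, 4 = id (optional),
    //         5 = name (optional, may be empty), 6 = last word.
    private static final Pattern LINE = Pattern.compile(
            "^\\s*(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d{1,2}:\\d{2}:\\d{2})\\s+(AM|PM)"
            + "(?:\\s+(\\d+)\\s*,\\s*(\\S*))?"
            + "\\s+(\\S+)\\s*$");

    /** Returns the six fields, with "" for missing ones, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[6];
        for (int i = 0; i < fields.length; i++) {
            String group = m.group(i + 1); // null when the optional group did not take part
            fields[i] = (group == null) ? "" : group;
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:04:04 AM 2, YaserAlNaqeb      BASEMENT-OUT         ")));
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:00:46 AM                      BASEMENT-IN          ")));
    }
}
```

Because the array always has six slots, an empty field 5 is still reported as present, which is what the question's edit asks for.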

多情癖 2024-08-30 17:07:47

You can do:

  • Read an entire line as a string.
  • Split the line on whitespace (\s+). You should get 4 to 6 pieces, depending on whether the number and the name are present.
  • piece0, piece1 and piece2 will be date, time and AM/PM.
  • Check whether piece3 is a number: if yes, read the next piece as the name.
  • The last piece is the BASEMENT-* value.
  • Convert the pieces from strings to, say, date, time and int as needed.
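
The steps above can be sketched roughly as follows. The class name WhitespaceSplitParser is hypothetical, and the sketch assumes the number is followed by a comma and a space (as in the sample) and that the name is a single word:

```java
import java.util.Arrays;

final class WhitespaceSplitParser {
    /** Returns {date, time, AM/PM, id, name, lastWord}; id and name are "" when absent. */
    static String[] parse(String line) {
        String[] pieces = line.trim().split("\\s+");
        if (pieces.length < 4) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String[] fields = new String[6];
        Arrays.fill(fields, "");
        fields[0] = pieces[0];                  // date
        fields[1] = pieces[1];                  // time
        fields[2] = pieces[2];                  // AM/PM
        fields[5] = pieces[pieces.length - 1];  // last word, e.g. BASEMENT-IN

        // piece3 is the id only if it looks like "34," or "34" and is not the last piece.
        if (pieces.length >= 5 && pieces[3].matches("\\d+,?")) {
            fields[3] = pieces[3].replace(",", "");
            if (pieces.length >= 6) {
                fields[4] = pieces[4];          // name
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                parse("3/3/2010 11:30:22 AM 34, KumarRaju        BASEMENT-IN          ")));
    }
}
```

Converting the id with Integer.parseInt, or the date and time with a formatter, can then be done on the individual array entries.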
乞讨 2024-08-30 17:07:47

Find the columns in each line where blank characters are adjacent to non-blank ones, then do a statistical analysis on those numbers: those which occur in every line or almost every line are very probably the field boundaries.

Similarly for punctuation adjacent to letters, but in general it is impossible to guess whether a - or a , is meant to delimit a field or not. If it occurs in the same position in every line, it might be a delimiter, but in lists of things such as D-FL R-TX D-NY it probably isn't. So there can be no fully automatic solution for arbitrary data.
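
As a rough illustration of the counting step, the sketch below tallies, for each column, how many lines start a non-blank run there, and keeps the columns that reach a chosen share of the lines. The class name ColumnBoundaryFinder and the minShare threshold are assumptions for this example, not part of the answer, and only the space character is treated as blank:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnBoundaryFinder {
    /**
     * Returns the 0-based columns at which a non-blank run begins
     * in at least minShare (0.0 to 1.0) of the given lines.
     */
    static List<Integer> likelyFieldStarts(List<String> lines, double minShare) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        int[] counts = new int[width];
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                boolean startsRun = line.charAt(i) != ' '
                        && (i == 0 || line.charAt(i - 1) == ' ');
                if (startsRun) {
                    counts[i]++;
                }
            }
        }
        List<Integer> starts = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            if (counts[col] >= minShare * lines.size()) {
                starts.add(col);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a  bb  c", "a      c", "aa bb  c");
        System.out.println(likelyFieldStarts(lines, 0.9));
    }
}
```

This only finds candidate boundaries; as noted above, deciding whether a - or a , separates fields still needs human judgment.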

如果没有 2024-08-30 17:07:47

Since each field is very distinct (at least in the example you pasted above), you can do this:

  1. Split the string into tokens.
  2. Run each element of the tokenized array through a Regex Pattern.
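
A small sketch of those two steps, assuming whitespace tokenization and the token shapes seen in the sample. The class name TokenClassifier, the Kind enum and the individual patterns are illustrative choices:

```java
import java.util.regex.Pattern;

final class TokenClassifier {
    enum Kind { DATE, TIME, MERIDIEM, ID, TEXT }

    private static final Pattern DATE = Pattern.compile("\\d{1,2}/\\d{1,2}/\\d{4}");
    private static final Pattern TIME = Pattern.compile("\\d{1,2}:\\d{2}:\\d{2}");
    private static final Pattern MERIDIEM = Pattern.compile("AM|PM");
    private static final Pattern ID = Pattern.compile("\\d+,?"); // e.g. "34," or "34"

    /** Decides what kind of field a single token looks like. */
    static Kind classify(String token) {
        if (DATE.matcher(token).matches()) return Kind.DATE;
        if (TIME.matcher(token).matches()) return Kind.TIME;
        if (MERIDIEM.matcher(token).matches()) return Kind.MERIDIEM;
        if (ID.matcher(token).matches()) return Kind.ID;
        return Kind.TEXT; // a name or the trailing BASEMENT-* word
    }

    public static void main(String[] args) {
        String line = "3/3/2010 11:31:28 AM 39, Eldrin           BASEMENT-OUT         ";
        for (String token : line.trim().split("\\s+")) {
            System.out.println(token + " -> " + classify(token));
        }
    }
}
```

Names and the last word both come out as TEXT here, so telling them apart still relies on position: the last token of the line is the location.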
夏末 2024-08-30 17:07:47

You can use StrTokenizer from Commons Lang and specify multiple delimiters to split on:

There are a number of built-in matcher types that it supports via StrMatcher.

StrTokenizer(char[] input, StrMatcher delim) 

e.g.

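import org.apache.commons.lang3.text.StrMatcher;
import org.apache.commons.lang3.text.StrTokenizer;

// 'match' is whatever object holds the text to split; any String works here.
// Space, comma and newline are all treated as delimiters.
StrMatcher delims = StrMatcher.charSetMatcher(new char[] {' ', ',', '\n'});
StrTokenizer str = new StrTokenizer(match.toString(), delims);
while (str.hasNext()) {
    System.out.println("Token:[" + str.nextToken() + "]");
}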

will give (from the example above):

Token:[3/3/2010]
Token:[11:00:46]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:04]
Token:[AM]
Token:[2]
Token:[YaserAlNaqeb]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:04:06]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:04:18]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:32]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:14:34]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:14:41]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:33]
Token:[AM]
Token:[4]
Token:[Dhileep]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:15:42]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:30:22]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:28]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:31:31]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:31:39]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:38]
Token:[AM]
Token:[39]
Token:[Eldrin]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:32:47]
Token:[AM]
Token:[BASEMENT-IN]
Token:[3/3/2010]
Token:[11:33:26]
Token:[AM]
Token:[34]
Token:[KumarRaju]
Token:[BASEMENT-OUT]
Token:[3/3/2010]
Token:[11:33:28]
Token:[AM]
Token:[BASEMENT-IN]