如何动态识别数据文件中的未知分隔符？

发布于 2024-09-27 23:50:17 字数 755 浏览 9 评论 0原文

我有三个输入数据文件。每个对其中包含的数据使用不同的分隔符。数据文件一看起来像这样：

apples | bananas | oranges | grapes

数据文件二看起来像这样：

quarter, dime, nickel, penny

数据文件三看起来像这样：

horse cow pig chicken goat

（列数的变化也是有意的）

我的想法是计算非字母字符的数量，并假设最高计数是分隔符。但是，具有非空格分隔符的文件在分隔符前后也有空格，因此空格在所有三个文件上获胜。这是我的代码：

def count_chars(s):
    valid_seps=[' ','|',',',';','\t']
    cnt = {}
    for c in s:
        if c in valid_seps: cnt[c] = cnt.get(c,0) + 1
    return cnt

infile = 'pipe.txt' #or 'comma.txt' or 'space.txt'
records = open(infile,'r').read()
print count_chars(records)

它将打印一本字典，其中包含所有可接受字符的计数。在每种情况下，空格总是获胜，所以我不能依赖它来告诉我分隔符是什么。

但我想不出更好的方法来做到这一点。

有什么建议吗？

原文

I have three input data files. Each uses a different delimiter for the data contained therein. Data file one looks like this:

apples | bananas | oranges | grapes

data file two looks like this:

quarter, dime, nickel, penny

data file three looks like this:

horse cow pig chicken goat

(the change in the number of columns is also intentional)

The thought I had was to count the number of non-alpha characters, and presume that the highest count was the separator character. However, the files with non-space separators also have spaces before and after the separators, so the spaces win on all three files. Here's my code:

def count_chars(s):
    valid_seps=[' ','|',',',';','\t']
    cnt = {}
    for c in s:
        if c in valid_seps: cnt[c] = cnt.get(c,0) + 1
    return cnt

infile = 'pipe.txt' #or 'comma.txt' or 'space.txt'
records = open(infile,'r').read()
print count_chars(records)

It will print a dictionary with the counts of all the acceptable characters. In each case, the space always wins, so I can't rely on that to tell me what the separator is.

But I can't think of a better way to do this.

Any suggestions?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梨涡少年 2024-10-04 23:50:17

尝试Python CSV的标准怎么样： http://docs.python.org/library/ csv.html#csv.Sniffer

import csv

sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print dialect.delimiter
# returns ','

How about trying Python CSV's standard: http://docs.python.org/library/csv.html#csv.Sniffer

import csv

sniffer = csv.Sniffer()
dialect = sniffer.sniff('quarter, dime, nickel, penny')
print dialect.delimiter
# returns ','

回复收藏 0 原文

国际总奸 2024-10-04 23:50:17

如果你使用 python，我建议只调用 re.split 包含所有有效的预期分隔符的行：

>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

唯一的问题是其中一个文件是否使用分隔符作为数据的一部分。

如果必须识别分隔符，最好的办法是计算除空格之外的所有内容。如果几乎没有出现，那么它可能是空格，否则，它是映射字符的最大值。

不幸的是，确实没有办法确定。您可能有用逗号填充的空格分隔数据，或者您可能有 |分隔的数据用分号填充。它可能并不总是有效。

If you're using python, I'd suggest just calling re.split on the line with all valid expected separators:

>>> l = "big long list of space separated words"
>>> re.split(r'[ ,|;"]+', l)
['big', 'long', 'list', 'of', 'space', 'separated', 'words']

The only issue would be if one of the files used a separator as part of the data.

If you must identify the separator, your best bet is to count everything excluding spaces. If there are almost no occurrences, then it's probably space, otherwise, it's the max of the mapped characters.

Unfortunately, there's really no way to be sure. You may have space separated data filled with commas, or you may have | separated data filled with semicolons. It may not always work.

回复收藏 0 原文

苦妄 2024-10-04 23:50:17

对于 Python 3，使用标准库“csv”：

import csv


def get_delimiter(file_path: str) -> str:
    with open(file_path, 'r') as csvfile:
        delimiter = str(csv.Sniffer().sniff(csvfile.read()).delimiter)
        return delimiter

With Python 3, using standard library 'csv':

import csv


def get_delimiter(file_path: str) -> str:
    with open(file_path, 'r') as csvfile:
        delimiter = str(csv.Sniffer().sniff(csvfile.read()).delimiter)
        return delimiter

回复收藏 0 原文

萝莉病 2024-10-04 23:50:17

大多数情况下，我们可以根据一些先验信息（例如公共分隔符列表）和频率计数来确定分隔符，所有行给出相同数量的分隔符

def head(filename: str, n: int):
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines


def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    common_delimiters= [',',';','\t',' ','|',':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
                return d
    return ','

通常n=2行就足够了，检查更多行更强有力的答案。当然，有些情况（通常是人为的）会导致错误检测，但在实践中不太可能发生。

这里我使用 head 函数的高效 python 实现，它只读取文件的第 n-第一行。请参阅我的答案如何阅读第 N-文件的行

We can determine the delimiter right most of the time based on some prior information (such as list of common delimiter) and frequency counting that all the lines give the same number of delimiter

def head(filename: str, n: int):
    try:
        with open(filename) as f:
            head_lines = [next(f).rstrip() for x in range(n)]
    except StopIteration:
        with open(filename) as f:
            head_lines = f.read().splitlines()
    return head_lines


def detect_delimiter(filename: str, n=2):
    sample_lines = head(filename, n)
    common_delimiters= [',',';','\t',' ','|',':']
    for d in common_delimiters:
        ref = sample_lines[0].count(d)
        if ref > 0:
            if all([ ref == sample_lines[i].count(d) for i in range(1,n)]):
                return d
    return ','

Often n=2 lines should be enough, check more lines for a more robust answers. Of course there are cases (often artificial ones) those lead to a false detection but it is unlikely happened in practice.

Here I use an efficient python implementation of head function that only read n-first line of a file. See my answer on How to read first N-lines of a file

回复收藏 0 原文

满天都是小星星 2024-10-04 23:50:17

由于空格问题，我最终选择了正则表达式。这是我完成的代码，以防有人感兴趣，或者可以使用其中的其他内容。顺便说一句，找到一种动态识别列顺序的方法会很不错，但我意识到这有点棘手。与此同时，我又开始使用老办法来解决这个问题。

for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
            #couldn't quite figure out a way to make this a single block 
            #(rather than three separate if/elifs. But you can see the split is
            #generalized already, so if anyone can come up with a better way,
            #I'm all ears!! :)
            for row in open(infile,'r').readlines():
                if infile.find('comma') > -1: 
                    datefmt = "%m/%d/%Y"
                    last, first, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                elif infile.find('space') > -1: 
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, dobraw, color = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]

                elif infile.find('pipe') > -1:
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                    #There is also a way to do this with csv.Sniffer, but the 
                    #spaces around the pipe delimiter also confuse sniffer, so
                    #I couldn't use it.
                else: raise ValueError(infile + "is not an acceptable input file.")

I ended up going with the regex, because of the problem of spaces. Here's my finished code, in case anyone's interested, or could use anything else in it. On a tangential note, it would be neat to find a way to dynamically identify column order, but I realize that's a bit more tricky. In the meantime, I'm falling back on old tricks to sort that out.

for infile in glob.glob(os.path.join(self._input_dir, self._file_mask)):
            #couldn't quite figure out a way to make this a single block 
            #(rather than three separate if/elifs. But you can see the split is
            #generalized already, so if anyone can come up with a better way,
            #I'm all ears!! :)
            for row in open(infile,'r').readlines():
                if infile.find('comma') > -1: 
                    datefmt = "%m/%d/%Y"
                    last, first, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                elif infile.find('space') > -1: 
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, dobraw, color = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]

                elif infile.find('pipe') > -1:
                    datefmt = "%m-%d-%Y"
                    last, first, unused, gender, color, dobraw = \
                            [x.strip() for x in re.split(r'[ ,|;"\t]+', row)]
                    #There is also a way to do this with csv.Sniffer, but the 
                    #spaces around the pipe delimiter also confuse sniffer, so
                    #I couldn't use it.
                else: raise ValueError(infile + "is not an acceptable input file.")

回复收藏 0 原文

~没有更多了~