如何以有意义的方式读取字典键？

发布于 2024-10-17 17:37:40 字数 521 浏览 3 评论 0原文

我有大约一千个文件，它们以半合理的方式命名，如下所示：

aaa.ba.ca.01
aaa.ba.ca.02
aaa.ba.ca.03

aaa.ba.da.01
aaa.ba.da.02
aaa.ba.da.03

等等。假设每个文件包含 2 列数字，我需要将其读入字典：波长、通量。部分阅读对我来说很容易，困难的部分是我需要加载这些字典，以便它们存储如下信息：

wavelength['aaa.ba.ca.01']（这是该文件的波长）

['aaa.ba.ca']（这是所有子文件的波长，即...ca.01，...ca.02和...ca.03 - 按顺序）

波长 ba']（其中还包括所有“子文件”的所有波长——同样按顺序）。

等等。文件名表现良好（各部分用句点分隔，分组层次结构始终相同方向等），但文件长度可以在 4 个部分或 8 个部分之间。

我的问题：是否有一些明智的方法可以让 python glob 文件名并通过解析字符串或其他一些魔法将数据放入这些字典中？我碰壁了。

原文

I have about a thousand files that are named in a semi-sensible way like the following:

aaa.ba.ca.01
aaa.ba.ca.02
aaa.ba.ca.03

aaa.ba.da.01
aaa.ba.da.02
aaa.ba.da.03

and so on. Let's say each file contains 2 columns of numbers which I need to read in to the dictionaries: wavelength, flux. The reading in part is easy for me, the hard part is that I need to load these dictionaries so that they store the information like:

wavelength['aaa.ba.ca.01'] (which is the wavelengths of that one file)

wavelength['aaa.ba.ca'] (which is the wavelengths of all subfiles ie ...ca.01, ...ca.02, and ...ca.03 -- in order)

wavelength['aaa.ba'] (which also includes all wavelengths of all "subfiles" as well -- again in order).

and so on. The filenames are well-behaved (the sections are separated by periods, the grouping hierarchy is always the same direction, etc.) but the files can be between 4 sections, or 8 sections long.

My question: is there some sensible way to have python glob the names of the files and by parsing strings or some other magic get the data into these dictionaries? I've hit a brick wall.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

浪推晚风 2024-10-24 17:37:40

一种简单但效率不高的方法是对 Python 字典进行子类化，这样当给定一个不完整的键时，它会按字母顺序连接所有匹配键的内容。

可能会有更有效的设计：这是一个主要缺点，它将根据部分键请求对所有现有字典键进行排序和验证。否则，实现起来非常简单，值得一试：

class MultiDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        result = []
        for complete_key in sorted(self.keys()):
            if complete_key.startswith(key):
                result.extend(self[complete_key])
        return result

#example 
a = MultiDict()
a["a0"] = [1]
a["a1"] = [2]
print  a["a"]
[1, 2]

至于获取字典中的数据，只需使用 glob 或 os.listdir 迭代所有文件，并将所需的内容作为列表读取到 MultiDict 中使用文件名作为键的项目。

A simple, but not efficient, way to do so is to subclass Pythons dictionary, so that when given one non-complete key, it concatenates the contents of all matching keys, in alphabetical order.

There could be more efficient designs: this one major drawback being it will sort and verify all existing dictionary keys on a partial key request. Otherwise, it is so simple to implement that it is worth a try:

class MultiDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        result = []
        for complete_key in sorted(self.keys()):
            if complete_key.startswith(key):
                result.extend(self[complete_key])
        return result

#example 
a = MultiDict()
a["a0"] = [1]
a["a1"] = [2]
print  a["a"]
[1, 2]

As for getting teh data in the dictionary, just iterate over all files, with glob or os.listdir, and read the desired contents, as a list, into a MultiDict item using the filename as a key.

回复收藏 0 原文

幻梦 2024-10-24 17:37:40

你想要的听起来根本不像一本字典。从很多方面来说，我认为这是一种类似于树的数据结构。因此，您不想使用字典，而是想要创建一个数据结构，其中您有第一个节点：

                                Root
     'ba'             'ca'               'cd'             'fg'
   /   |   \         /    \             /    \              |
  /    |    \       /      \           /      \             |
'aa' 'di'  '30'    '34'   '45'       'ac'     'ty'        '01'

然后执行深度优先搜索，其中您指示了按名称搜索的叶子数量：“ba”。 aa' 只会返回 'ba'->'aa' 叶子中的内容，而 'ba' 将返回 'ba'->'aa'、'ba'->'di' 和 'ba'- >“30”。

如果您愿意，我会将每个“级别”嵌套到它自己的字典中。这样你就可以快速映射到波长等。

What you want does not sound like a dictionary at all. In many ways, I'd say that this is a data structure comparable to a tree. So instead of using a dictionary you're going to want to make a data structure wherein you've got a first node:

                                Root
     'ba'             'ca'               'cd'             'fg'
   /   |   \         /    \             /    \              |
  /    |    \       /      \           /      \             |
'aa' 'di'  '30'    '34'   '45'       'ac'     'ty'        '01'

and then perform a depth first search wherein you've indicated the number of leafs searched by the name: 'ba.aa' would only return things from the 'ba'->'aa' leaf, while 'ba' would return 'ba'->'aa', 'ba'->'di', and 'ba'->'30'.

If you want, I'd make each "level" of nesting into it's own dictionary. That way you could map quickly to the wavelengths and such.

回复收藏 0 原文

愁以何悠 2024-10-24 17:37:40

如果您只有 1000 个文件，则使用线性搜索来查找它们可能就可以了。在我的机器上，一次查找花费了 250 美元。然后您可以使用 itertools.chain 组合多个文件中的数据。

class DataGlob(object):

def __init__(self):
    self.files = []
    self.wavedata = dict()
    self.fluxdata = dict()

def add(self, filename):
    wlist = []
    flist = []
    for l in open(filename):
        (wlen, flux) = map(float, l.split())
        wlist.append(wlen)
        flist.append(flux)
    self.wavedata[filename] = wlist
    self.fluxdata[filename] = flist

def find_keys(self, prefix):
    return [f for f in self.files if f.startswith(prefix)]

def wavelength(self,fileprefix):
    validkeys = find_keys(prefix)
    return itertools.chain.from_iterable(self.wavedata[k] for k in validkeys)

def flux(self, fileprefix):
    validkeys = self.find_keys(fileprefix)
    return itertools.chain.from_iterable(self.fluxdata[k] for k in validkeys)

If you only have 1000 files a linear search to look them up is probably fine. On my machine it took 250 us to do one look up. Then you can use itertools.chain to combine data from multiple files.

class DataGlob(object):

def __init__(self):
    self.files = []
    self.wavedata = dict()
    self.fluxdata = dict()

def add(self, filename):
    wlist = []
    flist = []
    for l in open(filename):
        (wlen, flux) = map(float, l.split())
        wlist.append(wlen)
        flist.append(flux)
    self.wavedata[filename] = wlist
    self.fluxdata[filename] = flist

def find_keys(self, prefix):
    return [f for f in self.files if f.startswith(prefix)]

def wavelength(self,fileprefix):
    validkeys = find_keys(prefix)
    return itertools.chain.from_iterable(self.wavedata[k] for k in validkeys)

def flux(self, fileprefix):
    validkeys = self.find_keys(fileprefix)
    return itertools.chain.from_iterable(self.fluxdata[k] for k in validkeys)

回复收藏 0 原文

~没有更多了~