Creating a hierarchy tree from a dictionary of page contents

Posted 2024-08-12 21:48:18 · 1038 words · 9 views · 0 comments

The following key:value pairs are 'page' and 'page contents'.

{
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

For any given 'item' how could I find the path(s) to said item? With my very limited knowledge of data structures in most cases, I'm assuming this would be a hierarchy tree. Please correct me if I'm wrong!

UPDATE: My apologies, I should have been more clear about the data and my expected outcome.

Assuming 'page-a' is an index, each 'page' is literally a page appearing on a website, whereas each 'item' is something like a product page that would appear on Amazon, Newegg, etc.

Thus, my expected output for 'item-d' would be a path (or paths) to that item.
For example (delimiter is arbitrary, for illustration here):
item-d has the following paths:

page-a > page-b > page-e > item-d
page-a > page-c > item-d

UPDATE2: Updated my original dict to provide more accurate and real data. '.html' added for clarification.


Comments (3)

神仙妹妹 2024-08-19 21:48:18

Here's a simple approach. It's O(N squared), so not all that highly scalable, but it will serve you well for a reasonable book size (if you have, say, millions of pages, you need to be thinking about a very different and less simple approach ;-).

First, make a more usable dict, mapping page to set of contents: e.g., if the original dict is d, make another dict mud as:

mud = dict((p, set(d[p]['contents'].split())) for p in d)

Then, make the dict mapping each page to its parent pages:

parent = dict((p, [k for k in mud if p in mud[k]]) for p in mud)

Here, I'm using lists of parent pages (sets would be fine too), but that's OK for pages with 0 or 1 parents as in your example, too -- you'll just be using an empty list to mean "no parent", else a list with the parent as the one and only item. This should be an acyclic directed graph (if you're in doubt, you can check, of course, but I'm skipping that check).
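To make the two steps above concrete, here is a minimal sketch run against the question's data. It assumes the question's dict is bound to `d`, as in the text; `parent_fast` is a name introduced here for illustration only, showing one way to build the same mapping in a single pass over the links instead of the quadratic scan.

```python
from collections import defaultdict

d = {
    'section-a.html': {'contents': 'section-b.html section-c.html section-d.html'},
    'section-b.html': {'contents': 'section-d.html section-e.html'},
    'section-c.html': {'contents': 'product-a.html product-b.html product-c.html product-d.html'},
    'section-d.html': {'contents': 'product-a.html product-c.html'},
    'section-e.html': {'contents': 'product-b.html product-d.html'},
    'product-a.html': {'contents': ''},
    'product-b.html': {'contents': ''},
    'product-c.html': {'contents': ''},
    'product-d.html': {'contents': ''},
}

# page -> set of pages it links to
mud = {p: set(d[p]['contents'].split()) for p in d}

# page -> list of pages that link to it (the quadratic scan from the text)
parent = {p: [k for k in mud if p in mud[k]] for p in mud}

# Same mapping, built in one pass over the links (illustrative alternative)
parent_fast = defaultdict(list)
for page, children in mud.items():
    for child in children:
        parent_fast[child].append(page)

print(sorted(parent['product-d.html']))  # ['section-c.html', 'section-e.html']
print(parent['section-a.html'])          # [] (a root page)
```

Note that `parent_fast` only has entries for pages that are linked from somewhere, so a root page is looked up with `parent_fast.get(page, [])`.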

Now, given a page, finding the paths up its parent(s) to a parentless-parent ("root page") simply requires "walking" the parent dict. E.g., in the 0/1 parent case:

path = [page]
while parent[path[-1]]:
  path.append(parent[path[-1]][0])

If you can clarify your specs better (ranges of number of pages per book, number of parents per page, and so on), this code can no doubt be refined, but as a start I hope it can help.

Edit: as the OP clarified that cases with > 1 parent (and so, multiple paths) are indeed of interest, let me show how to deal with that:

partial_paths = [ [page] ]
while partial_paths:
  path = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      partial_paths.append(path + [p])
  else:
    # we've reached a root (parentless node)
    print(path)

Of course, instead of printing, you can yield each path when it reaches a root (making the function whose body this is into a generator), or otherwise treat it in whatever way you require.
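A sketch of that generator variant follows. The function name `paths_to_roots` is made up for this example, the `parent` mapping is written out by hand for the question's data so the block runs on its own, and each path is reversed before being yielded so that it reads root-first like the output the question asks for.

```python
def paths_to_roots(parent, page):
    """Yield every path from a root page down to `page`, root first."""
    partial_paths = [[page]]
    while partial_paths:
        path = partial_paths.pop()
        parents = parent.get(path[-1], [])
        if parents:
            # one new partial path per parent of the last node
            for p in parents:
                partial_paths.append(path + [p])
        else:
            # reached a root (parentless node); reverse to read root-first
            yield path[::-1]

# page -> pages that link to it, for the question's data
parent = {
    'section-a.html': [],
    'section-b.html': ['section-a.html'],
    'section-c.html': ['section-a.html'],
    'section-d.html': ['section-a.html', 'section-b.html'],
    'section-e.html': ['section-b.html'],
    'product-a.html': ['section-c.html', 'section-d.html'],
    'product-b.html': ['section-c.html', 'section-e.html'],
    'product-c.html': ['section-c.html', 'section-d.html'],
    'product-d.html': ['section-c.html', 'section-e.html'],
}

for path in paths_to_roots(parent, 'product-d.html'):
    print(' > '.join(path))
# section-a.html > section-b.html > section-e.html > product-d.html
# section-a.html > section-c.html > product-d.html
```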

Edit again: a commenter is worried about cycles in the graph. If that worry's warranted, it's not hard to keep track of nodes already seen in a path and detect and warn about any cycles. Fastest is to keep a set alongside each list representing a partial path (we need the list for ordering, but checking for membership is O(1) in sets vs O(N) in lists):

partial_paths = [ ([page], set([page])) ]
while partial_paths:
  path, pset = partial_paths.pop()
  if parent[path[-1]]:
    # add as many partial paths as open from here
    for p in parent[path[-1]]:
      if p in pset:
        print('Cycle: %s (%s)' % (path, p))
        continue
      partial_paths.append((path + [p], pset.union([p])))
  else:
    # we've reached a root (parentless node)
    print('Path: %s' % (path,))

It's probably worthwhile, for clarity, packing the list and set representing a partial path into a small utility class Path with suitable methods.
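One possible shape for such a class, as a rough sketch rather than a finished design: the method names (`last`, `extended`), the `find_paths` helper, and the tiny cyclic `parent` mapping are all invented for this illustration.

```python
class Path:
    """A partial path: a list for ordering plus a set for O(1) membership tests."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.seen = set(self.nodes)

    @property
    def last(self):
        return self.nodes[-1]

    def __contains__(self, node):
        return node in self.seen

    def extended(self, node):
        """Return a new Path with `node` appended; this Path is left unchanged."""
        return Path(self.nodes + [node])


def find_paths(parent, page):
    """Return (paths, cycles): paths that reach a root, and (path, node) pairs
    where following `node` would have revisited a page already on the path."""
    paths, cycles = [], []
    stack = [Path([page])]
    while stack:
        path = stack.pop()
        parents = parent.get(path.last, [])
        if not parents:
            paths.append(path.nodes)  # reached a root (parentless node)
            continue
        for p in parents:
            if p in path:
                cycles.append((path.nodes, p))
            else:
                stack.append(path.extended(p))
    return paths, cycles


# Made-up data containing a cycle: r links to x, x links to y, y links back to x
parent = {'r': [], 'x': ['r', 'y'], 'y': ['x']}
paths, cycles = find_paths(parent, 'y')
print(paths)   # [['y', 'x', 'r']]
print(cycles)  # [(['y', 'x'], 'y')]
```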

碍人泪离人颜 2024-08-19 21:48:18

Here's an illustration for your question. It is easier to reason about graphs when you have a picture.

First, abbreviate the data:

#!/usr/bin/perl -pe
s/section-([a-e])\.html/uc$1/eg; s/product-([a-e])\.html/$1/g

Result:

# graph as adj list
data = {
  'A':{'contents':'B C D'},
  'B':{'contents':'D E'},
  'C':{'contents':'a b c d'},
  'D':{'contents':'a c'},
  'E':{'contents':'b d'},
  'a':{'contents':''},
  'b':{'contents':''},
  'c':{'contents':''},
  'd':{'contents':''}
}

Convert to graphviz's format:

with open('data.dot', 'w') as f:
    print('digraph {', file=f)
    for node, v in data.items():
        # one line per node: all of its outgoing edges
        for child in v['contents'].split():
            print('%s -> %s;' % (node, child), end=' ', file=f)
        if v['contents']: # don't print empty lines
            print(file=f)
    print('}', file=f)

Result:

digraph {
A -> C; A -> B; A -> D;
C -> a; C -> b; C -> c; C -> d;
B -> E; B -> D;
E -> b; E -> d;
D -> a; D -> c;
}

Plot the graph:

$ dot -Tpng -O data.dot

[Figure: the graph rendered from data.dot]

極樂鬼 2024-08-19 21:48:18

EDIT: With the question explained a bit better, I think the following might be what you need, or could at least provide something of a starting point.

data = {
  'section-a.html':{'contents':'section-b.html section-c.html section-d.html'},
  'section-b.html':{'contents':'section-d.html section-e.html'},
  'section-c.html':{'contents':\
                    'product-a.html product-b.html product-c.html product-d.html'},
  'section-d.html':{'contents':'product-a.html product-c.html'},
  'section-e.html':{'contents':'product-b.html product-d.html'},
  'product-a.html':{'contents':''},
  'product-b.html':{'contents':''},
  'product-c.html':{'contents':''},
  'product-d.html':{'contents':''}
}

def findSingleItemInData(item):
    # pair the item with every page whose contents mention it
    return [(item, key) for key in data if item in data[key]['contents']]

def trace(text):
    searchResult = findSingleItemInData(text)
    if not searchResult:
        # nothing links to this page, so it is a root
        return text

    retval = []
    for item in searchResult:
        # item[-1] is a parent page; recurse upwards from it
        retval.append([text, trace(item[-1])])

    return retval

print(trace('product-d.html'))
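The value `trace` returns is a nested list, which is awkward to read as paths. A small helper can flatten it; this is only a sketch, `flatten_trace` is a name made up here, and the `traced` literal below is the nested shape `trace('product-d.html')` produces for the data above, written out by hand so the block runs on its own.

```python
def flatten_trace(traced):
    """Turn the nested result of trace() into a list of root-first paths."""
    if isinstance(traced, str):
        # a bare string is a root page
        return [[traced]]
    paths = []
    for node, rest in traced:
        # `rest` describes the ways up from one parent of `node`
        for tail in flatten_trace(rest):
            paths.append(tail + [node])
    return paths

# Nested shape returned by trace('product-d.html') for the data above
traced = [
    ['product-d.html', [['section-c.html', 'section-a.html']]],
    ['product-d.html', [['section-e.html', [['section-b.html', 'section-a.html']]]]],
]

for path in flatten_trace(traced):
    print(' > '.join(path))
# section-a.html > section-c.html > product-d.html
# section-a.html > section-b.html > section-e.html > product-d.html
```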

OLD

I don't really know what you expect to see, but maybe something like
this will work.

data = {
   'page-a':{'contents':'page-b page-c'},
   'page-b':{'contents':'page-d page-e'},
   'page-c':{'contents':'item-a item-b item-c item-d'},
   'page-d':{'contents':'item-a item-c'},
   'page-e':{'contents':'item-b item-d'}
}

itemToFind = 'item-c'

for key in data:
  for index, item in enumerate(data[key]['contents'].split()):
    if item == itemToFind:
      print(key, 'at position', index)

It would be easier, and I think more correct, if you'd use a slightly
different data structure:

data = {
   'page-a':{'contents':['page-b', 'page-c']},
   'page-b':{'contents':['page-d', 'page-e']},
   'page-c':{'contents':['item-a', 'item-b', 'item-c', 'item-d']},
   'page-d':{'contents':['item-a', 'item-c']},
   'page-e':{'contents':['item-b', 'item-d']}
}

Then you wouldn't need to split.
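If the data already exists in the space-separated string form, a one-off conversion gets it into the list form. A minimal sketch, where `as_lists` is a name introduced only for this example:

```python
data = {
    'page-a': {'contents': 'page-b page-c'},
    'page-b': {'contents': 'page-d page-e'},
    'page-c': {'contents': 'item-a item-b item-c item-d'},
    'page-d': {'contents': 'item-a item-c'},
    'page-e': {'contents': 'item-b item-d'},
}

# Split each contents string once, up front, instead of at every lookup
as_lists = {page: {'contents': info['contents'].split()}
            for page, info in data.items()}

print(as_lists['page-c']['contents'])
# ['item-a', 'item-b', 'item-c', 'item-d']
```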

Given that last case, it can even be expressed a bit shorter:

for key in data:
    print([ (key, index, value) for index, value in \
             enumerate(data[key]['contents']) if value == 'item-c' ])

And even shorter, with the empty lists removed:

print(list(filter(None, [[ (key, index, value) for index, value in \
       enumerate(data[key]['contents']) if value == 'item-c' ] for key in data])))

That should be a single line, but I used \ as line break indicator so it can be read
without scrollbars.
