Getting rid of memory errors for a join algorithm
I have a dataset, sitting in a .txt file, consisting of 10 million rows of RDF triples, such as:
wsdbm:User0 wsdbm:follows wsdbm:User300 .
wsdbm:User6 wsdbm:likes wsdbm:Product92 .
wsdbm:Product0 rev:hasReview wsdbm:Review478 .
wsdbm:User2 wsdbm:friendOf wsdbm:User119 .
....
Since these are RDF triples, in our case we have
Subjects: User0, User6, Product0, User2
Predicates: follows, likes, hasReview, friendOf
Objects: User300, Product92, Review478, User119
My goal is to run a query that, in SQL form, looks like this:
SELECT follows.subject, follows.object, friendOf.object,
likes.object, hasReview.object
FROM follows, friendOf, likes, hasReview
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
So far, I have created a class called PropertyTables, which has a method that iterates over the initial file and converts each subject, predicate, and object into an integer, to improve computation time on the join and to save memory:
from collections import defaultdict

class PropertyTables():
    """
    This class holds all 4 Property Tables necessary for the required query.
    Each Property Table is an instance of the class 'PropertyTable'.
    """

    def __init__(self):
        self.property_tables = defaultdict()  # no default factory, so this behaves like a plain dict
        self.hash_map = HashDict()

    def parse_file(self, file_path, remove_prefix=False):
        data = open(file_path, 'r')
        for line in data:
            # Strip the trailing ' .' and newline, then split on tabs; the
            # object can carry trailing whitespace, hence the extra rstrip.
            subj, prop, *obj = line.rstrip('\n.').split('\t')
            obj = obj[0].rstrip()
            if remove_prefix:
                subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
            if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
                self.hash_and_store(subj, prop, obj)
        data.close()
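The helper hash_and_store is another method on PropertyTables; its source also appears in the profiler output in the edit below. It interns the subject and object as integers and stores the resulting pair:

def hash_and_store(self, subj, prop, obj):
    # Intern subject and object as integers, then append the pair to the
    # property's table, creating that table on first encounter.
    hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
    if prop not in self.property_tables:
        self.property_tables[prop] = PropertyTable()
    self.property_tables[prop].insert(hashed_subj, hashed_obj)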
The class PropertyTable, mentioned in the docstring above:
class PropertyTable():
    """
    This class represents a single Property Table, i.e. it holds every Subject and Object.
    """

    def __init__(self):
        self.table = []

    def insert(self, r, s):
        # If r is already a tuple (an existing row being extended during a
        # join), r and s are concatenated and appended to the Property Table.
        # Otherwise we build a new (r, s) tuple first, which is the case when
        # creating the Property Tables while reading the data.
        if type(r) == tuple:
            self.table.append(r + s)
        else:
            self.table.append((r, s))
The class HashDict is a simple dictionary that hashes values, so we can retrieve the original values again after the join.
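For reference, a minimal version consistent with how it is used would look something like the following; the attribute names are just illustrative, only hash_values appears in the profiled code below:

class HashDict:
    """Interning dictionary: assigns each string a stable integer ID and
    keeps the reverse mapping, so originals can be recovered after the join."""

    def __init__(self):
        self.value_to_id = {}
        self.id_to_value = {}

    def hash_values(self, *values):
        # Return the integer ID for each value, assigning new IDs on first sight.
        ids = []
        for value in values:
            if value not in self.value_to_id:
                new_id = len(self.value_to_id)
                self.value_to_id[value] = new_id
                self.id_to_value[new_id] = value
            ids.append(self.value_to_id[value])
        return tuple(ids)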
To not go too far with one post, I currently have a single hash join algorithm:
def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
    ht = defaultdict(list)

    # Create the hash table for the first table, keyed on its join column.
    for s in property_1.table:
        ht[s[index_0]].append(s)

    # Join the tables: probe the hash table with each row of the second table.
    joined_table = PropertyTable()
    for r in property_2.table:
        for s in ht[r[index_1]]:
            joined_table.insert(s, r)
    return joined_table
I use this function to sequentially join each table, given the requirements from before (the integer arguments are the positions of the join columns inside the growing result tuples).
WHERE follows.object = friendOf.subject
AND friendOf.object = likes.subject
AND likes.object = hasReview.subject
join_follows_friendOf = hash_join(pt.property_tables['follows'], 1, pt.property_tables['friendOf'], 0)
join_friendOf_likes = hash_join(join_follows_friendOf, 3, pt.property_tables['likes'], 0)
join_likes_hasReview = hash_join(join_friendOf_likes, 5, pt.property_tables['hasReview'], 0)
The result is correct for small tables, but 10 million rows simply result in an out-of-memory error, and I am looking for ways to avoid this. I am sorry for this very extensive post, but I guess some details are necessary in order to get some advice!
Edit: here is memory_profiler output for the relevant methods:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
53 68.0 MiB 68.0 MiB 1 @profile
54 def hash_and_store(self, subj, prop, obj):
55
56 68.0 MiB 0.0 MiB 1 hashed_subj, hashed_obj = self.hash_map.hash_values(subj, obj)
57
58 68.0 MiB 0.0 MiB 1 if prop not in self.property_tables:
59 self.property_tables[prop] = PropertyTable()
60 68.0 MiB 0.0 MiB 1 self.property_tables[prop].insert(hashed_subj, hashed_obj)
Line # Mem usage Increment Occurrences Line Contents
=============================================================
32 68.1 MiB 68.1 MiB 1 @profile
33 def parse_file(self, file_path, remove_prefix = False):
34
35 68.1 MiB 0.0 MiB 1 data = open(file_path, 'r')
36
37
38
39
40
41 80.7 MiB 0.3 MiB 109311 for line in data:
42 80.7 MiB 0.0 MiB 109310 subj, prop, *obj = line.rstrip('\n.').split('\t')
43 80.7 MiB 0.5 MiB 109310 obj = obj[0].rstrip()
44
45 80.7 MiB 0.0 MiB 109310 if remove_prefix:
46 80.7 MiB 9.0 MiB 655860 subj, prop, obj = [self.remove_prefix(s) for s in (subj, prop, obj)]
47
48 80.7 MiB 0.0 MiB 109310 if prop in ['follows', 'friendOf', 'likes', 'hasReview']:
49 80.7 MiB 2.8 MiB 80084 self.hash_and_store(subj, prop, obj)
50
51 80.7 MiB 0.0 MiB 1 data.close()
Line # Mem usage Increment Occurrences Line Contents
=============================================================
38 80.7 MiB 80.7 MiB 1 @profile
39 def hash_join(self, property_1: PropertyTable, index_0, property_2: PropertyTable, index_1):
40
41 80.7 MiB 0.0 MiB 1 ht = defaultdict(list)
42
43 # Create Hash Table for table1
44
45 81.2 MiB 0.0 MiB 31888 for s in property_1.table:
46 81.2 MiB 0.5 MiB 31887 ht[s[index_0]].append(s)
47
48 # Join Tables
49
50 81.2 MiB 0.0 MiB 1 joined_table = PropertyTable()
51
52 203.8 MiB 0.0 MiB 45713 for r in property_2.table:
53 203.8 MiB 0.0 MiB 1453580 for s in ht[r[index_1]]:
54 203.8 MiB 122.6 MiB 1407868 joined_table.insert(s, r)
55
56 203.8 MiB 0.0 MiB 1 return joined_table
1 Answer:
The core of your question is this:

"The result is correct for small tables, but 10 million rows simply result in an Out of Memory Error and I am looking for ways to avoid this."
Following your top-level problem statement but with a less generic structure, we can do something like this:
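A sketch of what that can look like, assuming each input table is stored as a dictionary mapping subject to a set of objects (run_query and all variable names here are illustrative):

from collections import defaultdict

def run_query(follows, friendOf, likes, hasReview):
    # Reverse index for follows: object -> set of subjects that follow it.
    object_of_follows = defaultdict(set)
    for subj, objs in follows.items():
        for obj in objs:
            object_of_follows[obj].add(subj)

    # friendOf reversed, keeping only subjects that appear as an object in
    # follows (join condition: follows.object = friendOf.subject).
    object_of_friendOf = defaultdict(set)
    for subj, objs in friendOf.items():
        if subj in object_of_follows:
            for obj in objs:
                object_of_friendOf[obj].add(subj)

    # likes reversed, keeping only subjects that survived the previous level
    # (join condition: friendOf.object = likes.subject).
    object_of_likes = defaultdict(set)
    for subj, objs in likes.items():
        if subj in object_of_friendOf:
            for obj in objs:
                object_of_likes[obj].add(subj)

    # hasReview reversed (join condition: likes.object = hasReview.subject).
    object_of_hasReview = defaultdict(set)
    for subj, objs in hasReview.items():
        if subj in object_of_likes:
            for obj in objs:
                object_of_hasReview[obj].add(subj)

    # Walk the chain backwards and expand it into result rows
    # (follows.subject, follows.object, friendOf.object, likes.object,
    # hasReview.object), yielding lazily so no joined table is materialized.
    for review, likes_objects in object_of_hasReview.items():
        for likes_object in likes_objects:                            # = likes.object
            for friendOf_object in object_of_likes[likes_object]:     # = friendOf.object
                for follows_object in object_of_friendOf[friendOf_object]:  # = follows.object
                    for follows_subject in object_of_follows[follows_object]:
                        yield (follows_subject, follows_object,
                               friendOf_object, likes_object, review)

Because rows are yielded one at a time, peak memory is bounded by the four reverse dictionaries rather than by a materialized join result.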
Explanation:

- Store each of the four input tables (follows, friendOf, likes, hasReview), each a dictionary mapping subject to a tuple of objects.
- Build reverse-lookup dictionaries (object_of_follows, object_of_friendOf, object_of_likes, object_of_hasReview); for example:
  - object_of_follows is a dict that maps each user that is an object in follows to a set of users, each of which is a subject in follows that follows the object.
  - object_of_friendOf is a dict that maps each object (user) in friendOf to a set of users, each of which is a subject (user) associated with the object in friendOf and is in object_of_follows (in other words, is an object for one or more subjects in follows).
  - object_of_likes and object_of_hasReview are built the same way, each keeping only subjects that survived the previous level.
- Walk object_of_hasReview and expand each surviving chain into multiple result rows containing each unique result follows.subject, follows.object, friendOf.object, likes.object, hasReview.object, as specified in the query.

Test code for 10 million lines:
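A sketch of such a harness, using run_query from above; the table sizes and the random generation scheme are assumptions, chosen only so the row counts total roughly 10 million:

import random
import time
from collections import defaultdict

def random_table(num_subjects, num_objects, num_rows, subj_prefix, obj_prefix):
    # Build a subject -> set-of-objects dict with uniformly random pairs.
    table = defaultdict(set)
    for _ in range(num_rows):
        subj = subj_prefix + str(random.randrange(num_subjects))
        obj = obj_prefix + str(random.randrange(num_objects))
        table[subj].add(obj)
    return table

follows   = random_table(1_000_000, 1_000_000, 4_000_000, 'User', 'User')
friendOf  = random_table(1_000_000, 1_000_000, 4_000_000, 'User', 'User')
likes     = random_table(1_000_000,   100_000, 1_000_000, 'User', 'Product')
hasReview = random_table(  100_000, 1_000_000, 1_000_000, 'Product', 'Review')

start = time.perf_counter()
# Consume the generator lazily; only the count is kept in memory.
num_rows = sum(1 for _ in run_query(follows, friendOf, likes, hasReview))
print(num_rows, 'result rows in', round(time.perf_counter() - start, 1), 'seconds')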