
The DBpedia dataset

Published 2025-01-01 12:38:42

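Let's start with the power method, which finds a single eigenvector. What good is just one eigenvector, you may wonder? It is in fact the basis of PageRank (read The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google for more information).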
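To make the idea concrete, here is a minimal sketch of power iteration on a tiny dense matrix. The function name `power_method`, the uniform starting vector, and the fixed iteration count are assumptions chosen for illustration, not the implementation used later in the lesson.

```python
import numpy as np

def power_method(A, n_iter=100):
    """Approximate the dominant eigenvector of A by repeated multiplication."""
    # Start from a uniform vector (an arbitrary choice for this sketch).
    x = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(n_iter):
        x = A @ x
        x = x / np.linalg.norm(x)  # renormalize so the entries do not blow up
    return x

# Toy symmetric matrix; its largest eigenvalue is (5 + sqrt(5)) / 2.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
v = power_method(A)
eigenvalue = v @ A @ v  # Rayleigh quotient, valid because v has unit norm
```

Each multiplication amplifies the component along the dominant eigenvector more than the others, which is why the iterate settles on that one direction.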

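Rather than trying to rank the importance of every site on the internet, we will use a dataset of Wikipedia links from DBpedia. DBpedia provides structured Wikipedia data in 125 languages.

"The full DBpedia data set features 38 million labels and abstracts in 125 different languages, 25.2 million links to images and 29.8 million links to external web pages; 80.9 million links to Wikipedia categories, and 41.2 million links to YAGO categories" - About DBpedia

Today's lesson is inspired by this scikit-learn example.

Imports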

import os, numpy as np, pickle
from bz2 import BZ2File
from datetime import datetime
from pprint import pprint
from time import time
from tqdm import tqdm_notebook
from scipy import sparse

from sklearn.decomposition import randomized_svd
from joblib import Memory  # sklearn.externals.joblib was removed from recent scikit-learn releases
from urllib.request import urlopen

Downloading the data

The data we have are:

Redirects: URLs that redirect to other URLs
Links: which pages link to which other pages

Note: this takes a while.

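import shutil

PATH = 'data/dbpedia/'
URL_BASE = 'http://downloads.dbpedia.org/3.5.1/en/'
filenames = ["redirects_en.nt.bz2", "page_links_en.nt.bz2"]

os.makedirs(PATH, exist_ok=True)  # make sure the target directory exists

for filename in filenames:
    if not os.path.exists(PATH+filename):
        print("Downloading '%s', please wait..." % filename)
        # Stream to disk in chunks instead of reading the whole file into memory
        with urlopen(URL_BASE+filename) as resp, open(PATH+filename, 'wb') as f:
            shutil.copyfileobj(resp, f)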

redirects_filename = PATH+filenames[0]
page_links_filename = PATH+filenames[1]

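The adjacency matrix of a graph

We will construct the adjacency matrix of a graph, recording which pages point to which pages.

Source: PageRank and HyperLink Induced Topic Search

The power $A^2$ tells you how many ways there are to get from one page to another in exactly two steps. You can see a more detailed example, applied to airline travel, in these notes.

We want to keep track of which pages point to which pages. We will store this in a square matrix, where a 1 at position (r, c) means that the topic in row r points to the topic in column c.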
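The two-step claim can be checked on a small example. The three-page graph below is hypothetical and uses the (r, c) convention just described, with row r pointing to column c.

```python
import numpy as np

# Hypothetical 3-page graph: page 0 links to pages 1 and 2,
# page 1 links to page 2, and page 2 links back to page 0.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]])

# Entry (r, c) of A @ A counts the two-step paths from page r to page c.
A2 = A @ A
```

For instance, `A2[0, 2]` is 1 because the only two-step route from page 0 to page 2 goes through page 1.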

You can learn more about graphs here.

Data format

A line in the file looks like this:

<http://dbpedia.org/resource/AfghanistanHistory> <http://dbpedia.org/property/redirect> <http://dbpedia.org/resource/History_of_Afghanistan> .

In the slice below, the +1 and -1 strip the enclosing < and >.

DBPEDIA_RESOURCE_PREFIX_LEN = len("http://dbpedia.org/resource/")
SLICE = slice(DBPEDIA_RESOURCE_PREFIX_LEN + 1, -1)

def get_lines(filename): return (line.split() for line in BZ2File(filename))
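As a quick check of the slicing, the sketch below applies the same slice to the sample line shown earlier. It is written as a bytes literal because BZ2File yields bytes; the names `PREFIX_LEN`, `short_src`, and `short_targ` are introduced here only for illustration.

```python
PREFIX_LEN = len("http://dbpedia.org/resource/")
SLICE = slice(PREFIX_LEN + 1, -1)  # +1 skips '<', -1 drops '>'

# The sample redirect line from the data file, as bytes.
line = (b"<http://dbpedia.org/resource/AfghanistanHistory> "
        b"<http://dbpedia.org/property/redirect> "
        b"<http://dbpedia.org/resource/History_of_Afghanistan> .")

# Each line splits into subject, predicate, object, and a trailing dot.
src, _, targ, _ = line.split()
short_src, short_targ = src[SLICE], targ[SLICE]
```

This is also why the page names printed later carry a `b'...'` prefix: they are never decoded to `str`.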

Iterate over the redirects and build a dictionary from source to destination.

def get_redirect(targ, redirects):
    seen = set()
    while True:
        transitive_targ = targ
        targ = redirects.get(targ)
        if targ is None or targ in seen: break
        seen.add(targ)
    return transitive_targ

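def get_redirects(redirects_filename):
    # Note: `redirects` is still empty when it is passed to get_redirect below,
    # so each target is stored as-is and redirect chains are not followed.
    redirects={}
    lines = get_lines(redirects_filename)
    return {src[SLICE]:get_redirect(targ[SLICE], redirects) 
                for src,_,targ,_ in tqdm_notebook(lines, leave=False)}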

redirects = get_redirects(redirects_filename)
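The loop in get_redirect follows a chain of redirects until it ends or starts repeating. The sketch below shows that behaviour on a toy dictionary with made-up page names. It assumes a two-pass approach (build the direct map first, then resolve every target), which is one possible way to obtain fully resolved targets.

```python
def resolve(targ, redirects):
    """Follow redirects from targ until the chain ends or repeats."""
    seen = set()
    while True:
        final = targ
        targ = redirects.get(targ)
        if targ is None or targ in seen:
            break
        seen.add(targ)
    return final

# Made-up page names: A -> B -> C is a chain, X <-> Y is a loop.
direct = {"A": "B", "B": "C", "X": "Y", "Y": "X"}

# Second pass: map every source to the end of its chain.
resolved = {src: resolve(targ, direct) for src, targ in direct.items()}
```

The `seen` set is what keeps the loop between X and Y from running forever.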

mem_usage()  # helper that is not defined in this section

# 13.766303744

def add_item(lst, redirects, index_map, item):
    k = item[SLICE]
    lst.append(index_map.setdefault(redirects.get(k, k), len(index_map)))
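The `setdefault` call in add_item does two jobs at once: it returns the existing ID for a page already seen, and otherwise assigns the next free integer. A small sketch with toy data (the page name `Kabul` and the helper `page_id` are hypothetical):

```python
index_map = {}
# Toy redirect table with a single entry.
redirects = {b"AfghanistanHistory": b"History_of_Afghanistan"}

def page_id(name):
    canonical = redirects.get(name, name)  # use the redirect target if there is one
    # len(index_map) is evaluated before insertion, so new pages get 0, 1, 2, ...
    return index_map.setdefault(canonical, len(index_map))

ids = [page_id(n) for n in
       [b"Kabul", b"AfghanistanHistory", b"History_of_Afghanistan", b"Kabul"]]
```

A redirecting page and its target end up sharing one ID, so they occupy a single row and column of the matrix.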

limit=119077682 #5000000

# Compute the integer index map
index_map = dict() # links->IDs
lines = get_lines(page_links_filename)
source, destination, data = [],[],[]
for l, split in tqdm_notebook(enumerate(lines), total=limit):
    if l >= limit: break
    add_item(source, redirects, index_map, split[0])
    add_item(destination, redirects, index_map, split[2])
    data.append(1)

n=len(data); n

# 119077682

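Looking at our data

The steps below are only meant to illustrate what information is in our data and how it is structured. They are not efficient.

Let's see what kind of items are in index_map: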

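# Peek at the most recently inserted item without removing it
# (popitem() would delete the entry from index_map; reversed() on dict views needs Python 3.8+)
next(reversed(index_map.items()))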

# (b'1940_Cincinnati_Reds_Team_Issue', 9991173)

Let's take a look at one item in the index map:

1940_Cincinnati_Reds_Team_Issue has index 9991173. It shows up only once in the source list:

[i for i,x in enumerate(source) if x == 9991173]

# [119077649]

source[119077649], destination[119077649]

# (9991173, 9991050)

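Now we want to check which page it links to, that is, the destination (index 9991050). Note: in general you should not access a dictionary by searching for its values. This is inefficient and is not how dictionaries are meant to be used.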

for page_name, index in index_map.items():
    if index == 9991050:
        print(page_name)

# b'W711-2'
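If many such lookups are needed, one common alternative is to invert the dictionary once and then look IDs up directly. A sketch on toy data (the three entries and their IDs are made up):

```python
index_map = {b"Baseball": 0, b"Ohio": 1, b"W711-2": 2}  # toy data

# One pass over the dictionary builds the reverse map;
# every lookup by ID afterwards is a constant-time dictionary access.
names = {i: name for name, i in index_map.items()}
page = names[2]
```

This works here because every page name maps to a distinct integer, so no entries collide when keys and values are swapped.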

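On Wikipedia, we can see that 1940 Cincinnati Reds Team Issue redirects to W711-2. Next, we collect every link whose source is W711-2: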

test_inds = [i for i,x in enumerate(source) if x == 9991050]

len(test_inds)

# 47

test_inds[:5]

# [119076756, 119076757, 119076758, 119076759, 119076760]

test_dests = [destination[i] for i in test_inds]

Now we want to check which pages W711-2 links to, by looking up the names of the indices in test_dests:

for page_name, index in index_map.items():
    if index in test_dests:
        print(page_name)

'''
b'Baseball'
b'Ohio'
b'Cincinnati'
b'Flash_Thompson'
b'1940'
b'1938'
b'Lonny_Frey'
b'Cincinnati_Reds'
b'Ernie_Lombardi'
b'Baseball_card'
b'James_Wilson'
b'Trading_card'
b'Detroit_Tigers'
b'Baseball_culture'
b'Frank_McCormick'
b'Bucky_Walters'
b'1940_World_Series'
b'Billy_Werber'
b'Ival_Goodman'
b'Harry_Craft'
b'Paul_Derringer'
b'Johnny_Vander_Meer'
b'Cigarette_card'
b'Eddie_Joost'
b'Myron_McCormick'
b'Beckett_Media'
b'Icarus_affair'
b'Ephemera'
b'Sports_card'
b'James_Turner'
b'Jimmy_Ripple'
b'Lewis_Riggs'
b'The_American_Card_Catalog'
b'Rookie_card'
b'Willard_Hershberger'
b'Elmer_Riddle'
b'Joseph_Beggs'
b'Witt_Guise'
b'Milburn_Shoffner'
'''

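We can see that the items in this list appear on the Wikipedia page.

Creating the matrix

Below we create a sparse matrix using SciPy's COO format and then convert it to CSR.

Question: what are COO and CSR? Why would we create the matrix in COO and then immediately convert it?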

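# Rows are destinations, columns are sources: X[r, c] counts links from page c to page r.
# n is the number of links, not pages, so rows/columns past len(index_map) stay empty.
X = sparse.coo_matrix((data, (destination,source)), shape=(n,n), dtype=np.float32)
X = X.tocsr()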

del(data,destination, source)

X

'''
<119077682x119077682 sparse matrix of type '<class 'numpy.float32'>'
    with 93985520 stored elements in Compressed Sparse Row format>
'''
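One way to explore the COO and CSR question above is to repeat the construction on a tiny, made-up link list. COO simply stores (row, column, value) triples, which makes it easy to build from parallel lists. Converting to CSR sums duplicate entries and gives a layout suited to fast row access and matrix-vector products.

```python
import numpy as np
from scipy import sparse

# Toy link list in which the link 0 -> 1 appears twice.
source      = [0, 0, 1, 2]
destination = [1, 1, 2, 0]
data        = [1, 1, 1, 1]

coo = sparse.coo_matrix((data, (destination, source)), shape=(3, 3), dtype=np.float32)
csr = coo.tocsr()  # duplicate (row, column) pairs are summed during conversion

n_coo, n_csr = coo.nnz, csr.nnz
```

Here `n_coo` is 4 while `n_csr` is 3, which may also explain why the large matrix above reports fewer stored elements than there are links.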

names = {i: name for name, i in index_map.items()}

mem_usage()

# 12.903882752

Save the matrix so that we do not have to recompute it

with open(PATH+'X.pkl', 'wb') as f: pickle.dump(X, f)
with open(PATH+'index_map.pkl', 'wb') as f: pickle.dump(index_map, f)

with open(PATH+'X.pkl', 'rb') as f: X = pickle.load(f)
with open(PATH+'index_map.pkl', 'rb') as f: index_map = pickle.load(f)

names = {i: name for name, i in index_map.items()}

X

'''
<119077682x119077682 sparse matrix of type '<class 'numpy.float32'>'
    with 93985520 stored elements in Compressed Sparse Row format>
'''
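As an alternative to pickling, SciPy provides `save_npz` and `load_npz` for sparse matrices. The sketch below round-trips a tiny made-up matrix through a temporary directory. Whether this is preferable to pickle for a matrix of this size is not something the lesson states, so treat it as an option to try rather than a recommendation.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the large link matrix.
X_small = sparse.csr_matrix(np.array([[0, 1], [2, 0]], dtype=np.float32))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "X_small.npz")
    sparse.save_npz(path, X_small)    # writes the sparse matrix to an .npz file
    X_loaded = sparse.load_npz(path)  # restores it in the same sparse format
```

The index map is a plain dictionary, so it would still need pickle or another serializer.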
