General techniques for working with huge amounts of data on a non-super computer

Posted on 2024-11-30 05:48:41

I'm taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions.

I have tried entering several competitions to improve my machine learning skills, but have been unable to find a good way to access the data in my code. Kaggle provides one large data file, 50-200 MB, per competition in CSV format.

What is the best way to load and use these tables in my code? My first instinct was to use databases, so I tried loading the CSV into a single SQLite database, but this put a tremendous load on my computer, and it was common for my computer to crash during the commits. Next, I tried using a MySQL server on a shared host, but queries on it took forever, which made my analysis code really slow. Plus, I am afraid I will exceed my bandwidth.

In my classes so far, my instructors usually clean up the data and give us manageable datasets that can be completely loaded into RAM. Obviously this is not possible for my current interests. Please suggest how I should proceed. I am currently using a 4-year-old MacBook with 4 GB of RAM and a dual-core 2.1 GHz CPU.

By the way, I am hoping to do the bulk of my analysis in Python, as I know this language the best. I'd like a solution that allows me to do all or nearly all coding in this language.

Comments (5)

°如果伤别离去 2024-12-07 05:48:42

200 megabytes isn't a particularly large file for loading into a database. You might want to try to split the input file into smaller files.

split -l 50000 your-input-filename

The split utility will split your input file into multiple files of whatever size you like. I used 50000 lines per file above. It's a common Unix and Linux command-line utility, and it ships with macOS as well.

Local installs of PostgreSQL or even MySQL might be a better choice than SQLite for what you're doing.

If you don't want to load the data into a database, you can take subsets of it using command-line utilities like grep, awk, and sed (or scripting languages like Python, Ruby, and Perl). Pipe the subsets into your program.
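Since the question asks for Python, the same subsetting idea can be sketched with the standard `csv` module, which reads one row at a time and so never holds the whole file in memory. This is only an illustration: the file contents, the column names, and the `iter_matching_rows` helper are invented here and are not part of the answer above.

```python
import csv
import os
import tempfile


def iter_matching_rows(path, column, wanted):
    """Yield rows (as dicts) whose `column` equals `wanted`, one at a time."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row[column] == wanted:
                yield row


# Build a tiny stand-in for a large competition file (hypothetical columns).
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
with tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "label", "score"])
    writer.writerows([[1, "a", 0.5], [2, "b", 0.1], [3, "a", 0.9]])

# Only the matching rows are ever kept in memory.
subset = list(iter_matching_rows(tmp.name, "label", "a"))
os.remove(tmp.name)
print([row["id"] for row in subset])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting before analysis.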

不如归去 2024-12-07 05:48:42

You need pandas for this. You have probably worked that out by now, but the answer may still help anyone else facing the same problem.
You don't have to load the data into any RDBMS: keep it in a file and load it into a pandas DataFrame. Here is the link to the pandas library:
http://pandas.pydata.org/
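As a rough sketch of what that looks like: a file of 50-200 MB will usually load with a single `pd.read_csv(path)` call on a 4 GB machine, and when it does not, the `chunksize` argument of `read_csv` returns an iterator of smaller DataFrames instead. The file and column names below are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Write a small stand-in CSV; a real competition file would already be on disk.
path = os.path.join(tempfile.mkdtemp(), "train.csv")
pd.DataFrame({"label": ["a", "b", "a", "b", "a"],
              "score": [1.0, 2.0, 3.0, 4.0, 5.0]}).to_csv(path, index=False)

# Read two rows at a time and fold each chunk into running per-label totals.
totals = None
for chunk in pd.read_csv(path, chunksize=2):
    part = chunk.groupby("label")["score"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.to_dict())
```

Only one chunk is in memory at any time, so the chunk size can be tuned to whatever the machine tolerates.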

If the data is too large for one machine, you need a cluster. Apache Spark or Mahout can run on the Amazon EC2 cloud; rent some capacity there and it will be easy to use. Spark also has an API for Python.

半衾梦 2024-12-07 05:48:42

I loaded a 2 GB Kaggle dataset in R using H2O.
H2O is built on Java and runs its own in-memory environment, so the data stays in memory and access is much faster.
You will just have to get used to the syntax of H2O.
It has many well-built ML algorithms and provides good support for distributed computing.
You can also use all your CPU cores easily. Check out h2o.ai for how to use it. It can handle 200 MB easily, even though you only have 4 GB of RAM. You should still upgrade to 8 GB or 16 GB.

破晓 2024-12-07 05:48:42

The general technique is to "divide and conquer". If you can split your data into parts and process them separately, then it can be handled by one machine. Some tasks can be solved that way (PageRank, Naive Bayes, HMM, etc.) and some cannot, because they require global optimization (logistic regression, CRF, many dimensionality reduction techniques).
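To illustrate why Naive Bayes fits this pattern, here is a minimal sketch, assuming the model only needs counts: each chunk is counted independently and the partial counts are merged by addition. The toy chunks, the feature names, and the `count_chunk` helper are hypothetical.

```python
from collections import Counter


def count_chunk(rows):
    """Count class labels and (label, feature) pairs in one chunk of rows."""
    class_counts, pair_counts = Counter(), Counter()
    for label, features in rows:
        class_counts[label] += 1
        for feature in features:
            pair_counts[(label, feature)] += 1
    return class_counts, pair_counts


# Two chunks of a hypothetical data set: (label, list of observed features).
chunks = [
    [("spam", ["offer", "win"]), ("ham", ["meeting"])],
    [("spam", ["win"]), ("ham", ["offer", "meeting"])],
]

total_classes, total_pairs = Counter(), Counter()
for chunk in chunks:                      # each chunk is processed on its own
    classes, pairs = count_chunk(chunk)
    total_classes.update(classes)         # merging partial results is addition
    total_pairs.update(pairs)

# Relative-frequency estimate of P(feature | label); smoothing is omitted.
p_win_given_spam = total_pairs[("spam", "win")] / total_classes["spam"]
print(total_classes, p_win_given_spam)
```

Because the merge step is just addition, the chunks could equally be files produced by `split` and processed one after another.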

病女 2024-12-07 05:48:41

Prototype: that's the most important thing when working with big data. Sensibly carve it up so that you can load it in memory and access it with an interpreter (e.g., Python or R). That's the best way to create and refine your analytics process flow at scale.

In other words, trim your multi-GB-sized data files so that they are small enough to perform command-line analytics.

Here's the workflow I use to do that. It is surely not the best way to do it, but it is one way, and it works:

I. Use lazy loading methods (hopefully) available in your language of choice to read in large data files, particularly those exceeding about 1 GB. I would then recommend processing this data stream according to the techniques I discuss below, and finally storing the fully pre-processed data in a Data Mart, or intermediate staging container.

One example using Python to lazy load a large data file:

# 'some_filename' is the full path name of a data file whose size
# exceeds the memory of the machine it resides on.

# A file object is a lazy iterator: it reads one line per request
# instead of loading the whole file.
data_reader = open(some_filename, 'r')
first_line = next(data_reader)   # returns a single line from the large data file.
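Reading strictly one line at a time can be slow to work with, so a common variation is to pull the lines in fixed-size batches. The sketch below assumes nothing beyond the standard library; the `iter_batches` helper is a name invented here, and `io.StringIO` stands in for a real open file.

```python
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of at most `batch_size` lines from any line iterator."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# io.StringIO stands in for an open file; both iterate line by line.
fake_file = io.StringIO("r1\nr2\nr3\nr4\nr5\n")
sizes = [len(batch) for batch in iter_batches(fake_file, 2)]
print(sizes)
```

Each batch can then be cleaned and written to the staging store before the next one is read.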

II. Whiten and Recast:

  • Recast your columns storing categorical variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain a look-up table (the same hash as you used for this conversion, except that the keys and values are swapped) to convert these integers back to human-readable string labels as the last step in your analytic workflow;

  • Whiten your data, i.e., "normalize" the columns that hold continuous data. Both of these steps will substantially reduce the size of your data set without introducing any noise. A concomitant benefit of whitening is the prevention of analytics errors caused by over-weighting.
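Both steps can be sketched in a few lines of plain Python. The answer uses "whiten" in the sense of normalizing each column, so the example rescales a column to zero mean and unit standard deviation; the column names and values are made up.

```python
from statistics import mean, pstdev

sex = ["Male", "Female", "Female", "Male"]       # hypothetical categorical column
age = [20.0, 30.0, 40.0, 50.0]                   # hypothetical continuous column

# Recast: map each label to an integer and keep the inverse look-up table.
encode = {"Male": -1, "Female": 1}
decode = {code: label for label, code in encode.items()}
sex_codes = [encode[value] for value in sex]

# Normalize: subtract the mean and divide by the standard deviation.
mu, sigma = mean(age), pstdev(age)
age_scaled = [(value - mu) / sigma for value in age]

# Last step of the workflow: turn the codes back into readable labels.
sex_labels = [decode[code] for code in sex_codes]
print(sex_codes, [round(v, 3) for v in age_scaled], sex_labels)
```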

III. Sampling: Trim your data length-wise.
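The answer does not say how to sample. One option that suits files too large for memory is reservoir sampling, which keeps a uniform random sample of k rows while reading the data once. The sketch below is that swapped-in technique, not the author's own method, and `reservoir_sample` is a name chosen for the example.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            sample.append(row)               # fill the reservoir first
        else:
            slot = rng.randint(0, index)     # inclusive on both ends
            if slot < k:
                sample[slot] = row           # replace with probability k / (index + 1)
    return sample


rows = (f"row-{i}" for i in range(10_000))   # stands in for lines of a big file
picked = reservoir_sample(rows, k=5, seed=0)
print(picked)
```

Passing an open file object as `rows` samples its lines without ever loading the file.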

IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the 'outcomes' or response variable) and eliminate them from your working data cube.

Principal Component Analysis (PCA) is a simple and reliable technique to do this:

import numpy as NP
from scipy import linalg as LA

D = NP.random.randn(8, 5)       # a simulated data set: 8 observations, 5 variables
# calculate the correlation matrix of the 5 columns (5 x 5):
R = NP.corrcoef(D, rowvar=False)
# calculate the eigenvalues and eigenvectors of this symmetric matrix:
eigval, eigvec = LA.eigh(R)
# sort them in descending order of eigenvalue:
order = NP.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]
# make a table of eigenvalues and cumulative variance proportions:
cs = NP.cumsum(eigval) / NP.sum(eigval)
print("{0}\t{1}".format('eigenvalue', 'var proportion'))
for i in range(len(eigval)):
    print("{0:.2f}\t\t{1:.2f}".format(eigval[i], cs[i]))

  eigenvalue    var proportion
    2.22        0.44
    1.81        0.81
    0.67        0.94
    0.23        0.99
    0.06        1.00

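So as you can see, the first three eigenvalues account for 94% of the variance observed in the original data. Depending on your purpose, you can often reduce the data matrix, D, from five columns to three. Because each eigenvalue belongs to a principal component rather than to an original column, this is done by projecting D onto the eigenvectors of the three largest eigenvalues (standardize the columns first for real data, since the eigenvectors come from the correlation matrix), not by deleting the last two raw columns:

top3 = NP.argsort(eigval)[::-1][:3]    # indices of the three largest eigenvalues
D = NP.dot(D, eigvec[:, top3])         # project onto those three components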

V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes, a 'staging area' that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R's 'data frame' or 'data table' (from the CRAN package of the same name) are good candidates. I also strongly recommend redis: blazing fast reads, terse semantics, and zero configuration make it an excellent choice for this use case. redis will easily handle datasets of the size you mentioned in your question. Using the hash data structure in redis, for instance, you can have the same structure and the same relational flexibility as MySQL or SQLite without the tedious configuration. Another advantage: unlike SQLite, redis is in fact a database server. I am actually a big fan of SQLite, but I believe redis just works better here for the reasons I just gave.

from redis import Redis
r0 = Redis(db=0)
# store one user record as a redis hash, keyed by user id
r0.hset("user_id:100143321", mapping={"sex": "M", "status": "registered_user",
        "traffic_source": "affiliate", "page_views_per_session": 17,
        "total_purchases": 28.15})