Introduction to Spark concepts with a data manipulation example

Adapted from the Scala version in Chapter 2, "Introduction to Data Analysis with Scala and Spark", of Advanced Analytics with Spark (O'Reilly, 2015)

import os

# download the UCI record-linkage data set if it is not already present
if not os.path.exists('documentation'):
    ! curl -o documentation https://archive.ics.uci.edu/ml/machine-learning-databases/00210/documentation
if not os.path.exists('donation.zip'):
    ! curl -o donation.zip https://archive.ics.uci.edu/ml/machine-learning-databases/00210/donation.zip
# unpack the outer archive and the per-block archives, then collect the CSVs under linkage/
! unzip -n -q donation.zip
! unzip -n -q 'block_*.zip'
if not os.path.exists('linkage'):
    ! mkdir linkage
! mv block_*.csv linkage
! rm block_*.zip
10 archives were successfully processed.

Info about the data set

Please see the documentation file.

If we are running Spark on Hadoop, we need to transfer files to HDFS

! hadoop fs -mkdir linkage
! hadoop fs -put block_*.csv linkage
# textFile returns an RDD with one element per line of the input files
rdd = sc.textFile('linkage')

Actions trigger execution and return a non-RDD result

rdd.first()
u'"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"'
rdd.take(10)
[u'"id_1","id_2","cmp_fname_c1","cmp_fname_c2","cmp_lname_c1","cmp_lname_c2","cmp_sex","cmp_bd","cmp_bm","cmp_by","cmp_plz","is_match"',
 u'37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE',
 u'39086,47614,1,?,1,?,1,1,1,1,1,TRUE',
 u'70031,70237,1,?,1,?,1,1,1,1,1,TRUE',
 u'84795,97439,1,?,1,?,1,1,1,1,1,TRUE',
 u'36950,42116,1,?,1,1,1,1,1,1,1,TRUE',
 u'42413,48491,1,?,1,?,1,1,1,1,1,TRUE',
 u'25965,64753,1,?,1,?,1,1,1,1,1,TRUE',
 u'49451,90407,1,?,1,?,1,1,1,1,0,TRUE',
 u'39932,40902,1,?,1,?,1,1,1,1,1,TRUE']
def is_header(line):
    return "id_1" in line

Transformations return an RDD and are lazy

vals = rdd.filter(lambda x: not is_header(x))
vals
PythonRDD[4] at RDD at PythonRDD.scala:42
vals.count()
5749132

Now it is evaluated

vals.take(10)
[u'37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE',
 u'39086,47614,1,?,1,?,1,1,1,1,1,TRUE',
 u'70031,70237,1,?,1,?,1,1,1,1,1,TRUE',
 u'84795,97439,1,?,1,?,1,1,1,1,1,TRUE',
 u'36950,42116,1,?,1,1,1,1,1,1,1,TRUE',
 u'42413,48491,1,?,1,?,1,1,1,1,1,TRUE',
 u'25965,64753,1,?,1,?,1,1,1,1,1,TRUE',
 u'49451,90407,1,?,1,?,1,1,1,1,0,TRUE',
 u'39932,40902,1,?,1,?,1,1,1,1,1,TRUE',
 u'46626,47940,1,?,1,?,1,1,1,1,1,TRUE']

Each time we run an action on vals, it is recomputed from the original source data

Spark maintains a DAG (the lineage) describing how each RDD was constructed, so a data set can always be rebuilt if it is lost; hence the name resilient distributed datasets. Recomputing from the source on every access is inefficient, however.
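
As a side note (not in the original notebook), Spark can print the lineage it tracks for an RDD; the exact output format varies by Spark version

# print the chain of parent RDDs (the lineage) Spark would use to rebuild vals
print vals.toDebugString()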

# vals is reconstructed again
vals.first()
u'37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE'

Spark allows us to persist RDDs that we will be re-using

vals.cache()
PythonRDD[4] at RDD at PythonRDD.scala:42
# cache() is lazy: the next action computes vals and stores it in memory; later actions read the cached copy
vals.take(10)
[u'37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE',
 u'39086,47614,1,?,1,?,1,1,1,1,1,TRUE',
 u'70031,70237,1,?,1,?,1,1,1,1,1,TRUE',
 u'84795,97439,1,?,1,?,1,1,1,1,1,TRUE',
 u'36950,42116,1,?,1,1,1,1,1,1,1,TRUE',
 u'42413,48491,1,?,1,?,1,1,1,1,1,TRUE',
 u'25965,64753,1,?,1,?,1,1,1,1,1,TRUE',
 u'49451,90407,1,?,1,?,1,1,1,1,0,TRUE',
 u'39932,40902,1,?,1,?,1,1,1,1,1,TRUE',
 u'46626,47940,1,?,1,?,1,1,1,1,1,TRUE']
# the second call reads the already-cached partitions from memory
vals.take(10)
[u'37291,53113,0.833333333333333,?,1,?,1,1,1,1,0,TRUE',
 u'39086,47614,1,?,1,?,1,1,1,1,1,TRUE',
 u'70031,70237,1,?,1,?,1,1,1,1,1,TRUE',
 u'84795,97439,1,?,1,?,1,1,1,1,1,TRUE',
 u'36950,42116,1,?,1,1,1,1,1,1,1,TRUE',
 u'42413,48491,1,?,1,?,1,1,1,1,1,TRUE',
 u'25965,64753,1,?,1,?,1,1,1,1,1,TRUE',
 u'49451,90407,1,?,1,?,1,1,1,1,0,TRUE',
 u'39932,40902,1,?,1,?,1,1,1,1,1,TRUE',
 u'46626,47940,1,?,1,?,1,1,1,1,1,TRUE']
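
cache() is shorthand for persist() with the default MEMORY_ONLY storage level. If the cached data will not fit in memory, a different storage level can be chosen instead; a minimal sketch, not run here

from pyspark import StorageLevel

# same idea as cache(), but partitions that do not fit in memory are
# spilled to local disk instead of being recomputed on access
vals.unpersist()
vals.persist(StorageLevel.MEMORY_AND_DISK)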

Parse lines and work on them

import numpy as np

def parse(line):
    pieces = line.strip().split(',')
    id1, id2 = map(int, pieces[:2])
    # missing comparison scores are encoded as '?'; represent them as NaN
    scores = [np.nan if p == '?' else float(p) for p in pieces[2:11]]
    matched = pieces[11] == 'TRUE'
    return [id1, id2, scores, matched]
mds = vals.map(lambda x: parse(x))
mds.cache()
PythonRDD[10] at RDD at PythonRDD.scala:42
match_counts = mds.map(lambda x: x[-1]).countByValue()
for cls in match_counts:
    print cls, match_counts[cls]
False 5728201
True 20931
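
As a quick sanity check (a small addition, not in the original), matches make up only a small fraction of the candidate pairs

total = sum(match_counts.values())
# roughly 0.36% of all candidate pairs are labelled as matches
print match_counts[True] / float(total)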

Summary statistics

mds.map(lambda x: x[2][0]).stats()
(count: 5749132, mean: nan, stdev: nan, max: nan, min: nan)
The NaN values used for missing scores propagate through the summary statistics, so we filter them out first

mds.filter(lambda x: np.isfinite(x[2][0])).map(lambda x: x[2][0]).stats()
(count: 5748125, mean: 0.712902470443, stdev: 0.3887583258, max: 1.0, min: 0.0)

This takes too long on a laptop, so we skip it here

# one filter/map/stats job per score column (only the first three columns here)
stats = [mds.filter(lambda x: np.isfinite(x[2][i])).map(lambda x: x[2][i]).stats() for i in range(3)]

for stat in stats:
    print stat
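
A more economical alternative, sketched here and not part of the original notebook, is to aggregate one StatCounter per score column in a single pass over the data instead of launching one job per column; seq_op and comb_op are our own helper names

from pyspark.statcounter import StatCounter

def seq_op(counters, record):
    # fold one record's scores into the per-column counters, skipping missing values
    for counter, score in zip(counters, record[2]):
        if np.isfinite(score):
            counter.merge(score)
    return counters

def comb_op(a, b):
    # merge the per-partition counters column by column
    return [x.mergeStats(y) for x, y in zip(a, b)]

all_stats = mds.aggregate([StatCounter() for _ in range(9)], seq_op, comb_op)
for i, stat in enumerate(all_stats):
    print i, stat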
