- Introduction to Python
- Getting started with Python and the IPython notebook
- Functions are first class objects
- Data science is OSEMN
- Working with text
- Preprocessing text data
- Working with structured data
- Using SQLite3
- Using HDF5
- Using numpy
- Using Pandas
- Computational problems in statistics
- Computer numbers and mathematics
- Algorithmic complexity
- Linear Algebra and Linear Systems
- Linear Algebra and Matrix Decompositions
- Change of Basis
- Optimization and Non-linear Methods
- Practical Optimization Routines
- Finding roots
- Optimization Primer
- Using scipy.optimize
- Gradient descent
- Newton’s method and variants
- Constrained optimization
- Curve fitting
- Finding parameters for ODE models
- Optimization of graph node placement
- Optimization of standard statistical models
- Fitting ODEs with the Levenberg–Marquardt algorithm
- 1D example
- 2D example
- Algorithms for Optimization and Root Finding for Multivariate Problems
- Expectation Maximization (EM) Algorithm
- Monte Carlo Methods
- Resampling methods
- Resampling
- Simulations
- Setting the random seed
- Sampling with and without replacement
- Calculation of Cook’s distance
- Permutation resampling
- Design of simulation experiments
- Example: Simulations to estimate power
- Check with R
- Estimating the CDF
- Estimating the PDF
- Kernel density estimation
- Multivariate kernel density estimation
- Markov Chain Monte Carlo (MCMC)
- Using PyMC2
- Using PyMC3
- Using PyStan
- C Crash Course
- Code Optimization
- Using C code in Python
- Using functions from various compiled languages in Python
- Julia and Python
- Converting Python Code to C for speed
- Optimization bake-off
- Writing Parallel Code
- Massively parallel programming with GPUs
- Writing CUDA in C
- Distributed computing for Big Data
- Hadoop MapReduce on AWS EMR with mrjob
- Spark on a local machine using 4 nodes
- Modules and Packaging
- Tour of the Jupyter (IPython3) notebook
- Polyglot programming
- What you should know and learn more about
- Wrapping R libraries with Rpy
CUDA C program - an Outline
The following are the minimal ingredients for a CUDA C program:
- The kernel. This is the function that will be executed in parallel on the GPU.
- Main C program
- allocates memory on the GPU
- copies data in CPU memory to GPU memory
- ‘launches’ the kernel (just a function call with some extra arguments; see the launch-syntax sketch after this list)
- copies data from GPU memory back to CPU memory
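The “extra arguments” are the execution configuration, written between triple angle brackets. A minimal sketch of the syntax (the kernel name, arguments, and grid sizes here are placeholders, not part of the example that follows):

// Launch a grid of nblocks blocks, each containing nthreads threads
some_kernel<<<nblocks, nthreads>>>(arg1, arg2);
cudaDeviceSynchronize();  // optional: block the CPU until the kernel completes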
Kernel Code
%%file kernel.hold
#include <stdio.h>  // needed for printf once this fragment is combined with main.hold

__global__ void square_kernel(float *d_out, float *d_in){
    int i = threadIdx.x;  // The unique identifier of this thread within the block
    float f = d_in[i];    // Read the input from global memory into a register
    d_out[i] = f*f;       // d_out is what we will copy back to the host memory
}
Overwriting kernel.hold
CPU Code
%%file main.hold
int main(int argc, char **argv){
    const int ARRAY_SIZE = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // Generate the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++){
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];

    // These are device memory pointers
    float *d_in;
    float *d_out;

    // Allocate memory on the GPU
    cudaMalloc((void **) &d_in, ARRAY_BYTES);
    cudaMalloc((void **) &d_out, ARRAY_BYTES);

    // Copy the input data from CPU memory to GPU memory
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // Launch the kernel: 1 block of ARRAY_SIZE threads
    square_kernel<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // Copy the results from GPU memory back to CPU memory
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    for (int i = 0; i < ARRAY_SIZE; i++){
        printf("%f", h_out[i]);
        printf(((i % 4) != 3 ? "\t" : "\n"));
    }

    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}
Overwriting main.hold
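The two .hold fragments are presumably meant to be concatenated into a single .cu source file before compilation. A sketch of how that might be done and run from the notebook (the file name square.cu is an assumption, not given in the original):

%%bash
# Hypothetical assembly step: kernel.hold must come first so the kernel
# is declared before main.hold calls it
cat kernel.hold main.hold > square.cu
nvcc -o square square.cu
./square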
Shared Memory
Lifted from: https://www.cac.cornell.edu/vw/gpu/shared_mem_exec.aspx
Shared memory is fast on-chip memory visible to all threads in a block. The example below stages each array element through a shared buffer, multiplies it by 10, and writes the result back to global memory.
%%file shared_mem_ex.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 1024*1024
#define BLOCKSIZE 1024

__global__ void share_ary_oper(int *ary, int *ary_out)
{
    // Thread index
    int tx = threadIdx.x;
    int idx = blockDim.x*blockIdx.x + threadIdx.x;

    // Stage the data through fast on-chip shared memory
    __shared__ int part_ary[BLOCKSIZE];

    part_ary[tx] = ary[idx];
    part_ary[tx] = part_ary[tx]*10;
    ary_out[idx] = part_ary[tx];
    __syncthreads();  // not strictly needed here, since each thread only touches its own element
}

int main(){
    int *device_array, *device_array_out;
    int *host_array, *host_array_out;
    int i, nblk;
    float k;
    size_t size = N*sizeof(int);

    // Device memory
    cudaMalloc((void **)&device_array, size);
    cudaMalloc((void **)&device_array_out, size);

    // Host memory
    // cudaMallocHost() produces pinned memory on the host
    cudaMallocHost((void **)&host_array, size);
    cudaMallocHost((void **)&host_array_out, size);

    for (i=0; i<N; i++)
    {
        host_array[i] = i;
        host_array_out[i] = 0;
    }

    cudaMemcpy(device_array, host_array, size, cudaMemcpyHostToDevice);
    cudaMemcpy(device_array_out, host_array_out, size, cudaMemcpyHostToDevice);

    nblk = N/BLOCKSIZE;
    share_ary_oper<<<nblk, BLOCKSIZE>>>(device_array, device_array_out);

    cudaMemcpy(host_array, device_array, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(host_array_out, device_array_out, size, cudaMemcpyDeviceToHost);

    // Verify: each output element should equal 10 times its index
    for (i=0; i<N; i++)
    {
        k = host_array_out[i] - i*10;
        if (fabs(k) > 0.1)
            printf("Incorrect IX %d=%.1f\n", i, k);
    }

    printf("Printing elements 10-15 of output array\n");
    for (i=10; i<15; i++)
        printf("host_array_out[%d]=%d\n", i, host_array_out[i]);

    cudaFree(device_array);
    cudaFree(device_array_out);
    // Pinned host memory must be released with cudaFreeHost(), not cudaFree()
    cudaFreeHost(host_array);
    cudaFreeHost(host_array_out);
    cudaDeviceReset();
    return EXIT_SUCCESS;
}
Overwriting shared_mem_ex.cu
Makefile
%%file Makefile
CC=nvcc
# nvcc does not accept -Wall directly; pass it through to the host compiler
CFLAGS=-Xcompiler -Wall

shared_mem_ex.o: shared_mem_ex.cu
	$(CC) $(CFLAGS) -c shared_mem_ex.cu

clean:
	rm -f *.o
Overwriting Makefile
Compile
! make
nvcc -Xcompiler -Wall -c shared_mem_ex.cu
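The Makefile only compiles to an object file. To actually link and run the example, something like the following would be needed (a sketch, assuming nvcc is on the path and a CUDA-capable device is present):

! nvcc -o shared_mem_ex shared_mem_ex.cu
! ./shared_mem_ex

If the kernel ran correctly, the verification loop stays silent and only elements 10-15 of the output array are printed.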