# 1y33/100Days

## Project Progress and Tasks

Bro in CUDA 📗 : https://github.com/a-hamdi/cuda

Mentor 🚀 : https://github.com/hkproj | https://github.com/hkproj/100-days-of-gpu

## Mandatory and Optional Tasks

| Day | Type | Task |
|-----|------|------|
| D15 | Mandatory | FA2-Forward: implement the forward pass for FA2 (FlashAttention-2), e.g., as a custom neural-network layer. |
| D20 | Mandatory | FA2-Backwards: implement the backward pass for FA2 (gradient computation). |
| D20 | Optional | Fused Chunked CE Loss + Backwards: fused implementation of chunked cross-entropy loss with its backward pass. Liger Kernel can be used as a reference implementation. |

## Project Progress by Day

| Day | File | Summary |
|-----|------|---------|
| day1 | `printAdd.cu` | Print global indices for a 1D vector (index calculation). |
|      | `addition.cu` | GPU vector addition; basics of memory allocation and host-device transfer. |
| day2 | `function.cu` | Use a `__device__` function in a kernel; per-thread calculations. |
| day3 | `addMatrix.cu` | 2D matrix addition; map row/column indices to threads. |
|      | `anotherMatrix.cu` | Transform matrices with a custom function; 2D index operations. |
| day4 | `layerNorm.cu` | Layer normalization using shared memory; mean/variance computation. |
| day5 | `vectorSumTricks.cu` | Parallel vector sum via reduction; shared-memory optimizations. |
| day6 | `SMBlocks.cu` | Retrieve the SM ID per thread via inline PTX. |
|      | `SoftMax.cu` | Shared-memory softmax; split exponentiation and normalization steps. |
|      | `TransposeMatrix.cu` | Matrix transpose via index swapping. |
|      | `ImportingToPython/rollcall.cu` | Python-CUDA integration. |
|      | `AdditionKernel/additionKernel.cu` | Modify PyTorch tensors in CUDA. |
| day7 | `naive.cu` | Naive matrix multiplication. |
|      | `matmul.cu` | Tiled matmul with shared memory. |
|      | `conv1d.cu` | 1D convolution with shared memory. |
|      | `pythontest.py` | Validate the custom convolution against PyTorch. |
| day8 | `pmpbook/chapter3matvecmul.cu` | Matrix-vector multiplication. |
|      | `pmpbook/chapter3ex.cu` | Benchmarks different matrix-addition kernels. |
|      | `pmpbook/deviceinfo.cu` | Prints device properties. |
|      | `pmpbook/color2gray.cu` | Convert RGB to grayscale. |
|      | `pmpbook/vecaddition.cu` | Another vector-addition example. |
|      | `pmpbook/imageblur.cu` | Simple image blur. |
|      | `selfAttention/selfAttention.cu` | Self-attention kernel with online softmax. |
| day9 | `flashAttentionFromTut.cu` | Minimal Flash Attention kernel with shared-memory tiling. |
|      | `bind.cpp` | Torch C++ extension bindings for Flash Attention. |
|      | `test.py` | Tests the minimal Flash Attention kernel against a manual softmax-based attention. |
| day10 | `ppmbook/matrixmul.cu` | Matrix multiplication using CUDA. |
|       | `setup.py` | Torch extension build script for the CUDA code (Flash Attention). |
|       | `FlashAttention.cu` | Example Flash Attention CUDA kernel. |
|       | `FlashAttention.cpp` | Torch bindings for the Flash Attention kernel. |
|       | `test.py` | Manual vs. CUDA-based attention test. |
|       | `linking/test.py` | Builds a simple CUDA kernel to test linking. |
|       | `linking/simpleKernel.cpp` | Torch extension binding for a simple CUDA kernel. |
|       | `linking/simpleKernel.cu` | Simple CUDA kernel that increments a tensor. |
| day11 | `FlashTestPytorch/` | Custom Flash Attention in PyTorch, with tests and benchmarks. |
|       | `testbackward.py` | Gradient comparison between the custom CUDA kernels and PyTorch. |
| day12 | `softMax.cu` | Additional softmax kernel with shared-memory optimization. |
|       | `NN/kernels.cu` | Tiled kernel implementation and layer initialization. |
|       | `tileMatrix.cu` | Demonstrates tile-based matrix operations. |
| day13 | `RMS.cu` | RMS kernel (V1) with a naive sum-of-squares approach. |
|       | `RMSBetter.cu` | RMS kernel (V2) using warp-reduce optimization (sketched after this table), `float4`, and other optimizations. |
|       | `binding.cpp` | Torch bindings for the RMS kernels. |
|       | `test.py` | Tests and benchmarks the RMS kernels against PyTorch. |
| day14 | `FA2/flash.cu` & `kernels.cu` | Second iteration of Flash Attention with partial forward/backward logic. |
|       | `helper.cuh` | Utility functions and warp-reduce helpers. |
|       | `conv.cu` | Basic 2D convolution with shared memory. |
| day15 | `Attention.cu` | Single-headed attention kernel vs. a CPU reference. |
|       | `dotproduct.cu` | Batched/tiled dot-product kernel for vectors or matrices. |
|       | `SMM.cu` | Sparse matrix multiplication in CSR format. |
| day16 | `attentionbwkd.cu` | Extends attention with gradient computation; forward and backward passes. |
| day17 | `cublas1.cu`, `cublas2.cu`, `cublas3.cu` | Various cuBLAS examples: dot products, axpy, max/min, and other BLAS operations. |
| day18 | `wrap.cu` | Warp-based reduction and max-finding with inline PTX. |
|       | `atomic1.cu`, `atomic2.cu` | Implement and test custom atomic increment operations. |
| day19 | `cublasMM.cu` | Matrix multiplication with cuBLAS plus a simple self-attention example. |
| day20 | `rope.cu` | RoPE (rotary positional embedding) kernel and its PyTorch extension. |
|       | `test_rope.py` | Benchmarks for the RoPE kernel. |
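
Several entries above (day5's reduction, day13's `RMSBetter.cu`, day14's `helper.cuh`, day18's `wrap.cu`) lean on the same warp-reduce idea. The snippet below is a generic sketch of that pattern using shuffle intrinsics, not the repo's actual helper code (day18's version uses inline PTX instead):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp using shuffle intrinsics.
__inline__ __device__ float warpReduceSum(float val) {
    // Each step folds the upper half of the active lanes onto the lower half.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the full warp sum
}

__global__ void warpSumDemo(const float* in, float* out) {
    float sum = warpReduceSum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = sum;  // only lane 0 writes the result
}

int main() {
    float h_in[32], h_out, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;  // expected sum: 32
    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpSumDemo<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);  // prints 32.0
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

A full 32-lane warp is summed in five shuffle steps without touching shared memory, which is why the warp-reduce shows up repeatedly in the reduction, RMS, and attention kernels.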

## How to load into PyTorch

- (optional) create a templated kernel
- create a kernel-forward (launcher) function where you set up the grid/block dimensions and any other launch calculations
- create a `.cpp` file
- include the header that declares the launcher
- create a wrapper so that you can pass tensors
- use `PYBIND11_MODULE` to create a Torch extension
- in the `.py` file, call `torch.utils.cpp_extension.load()` to load the files; it compiles them automatically (see the sketch after this list)
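
Below is a minimal, self-contained sketch of these steps. All names here (`increment_kernel.cu`, `binding.cpp`, `increment`, `increment_forward_launcher`) are illustrative placeholders, not the files used in this repo:

```cuda
// ---------------------------------------------------------------------------
// increment_kernel.cu  (illustrative file name)
// Steps 1-2: the kernel itself plus a forward launcher that sets up the grid.
// ---------------------------------------------------------------------------
#include <cuda_runtime.h>

__global__ void increment_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < n) data[idx] += 1.0f;
}

void increment_forward_launcher(float* data, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;          // grid set-up
    increment_kernel<<<blocks, threads>>>(data, n);
}

// ---------------------------------------------------------------------------
// binding.cpp  (illustrative file name)
// Steps 3-6: declare the launcher, wrap it so it accepts tensors, and expose
// it with PYBIND11_MODULE.
// ---------------------------------------------------------------------------
#include <torch/extension.h>

void increment_forward_launcher(float* data, int n);   // from increment_kernel.cu

torch::Tensor increment(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.dtype() == torch::kFloat32, "input must be float32");
    auto x = input.contiguous();
    increment_forward_launcher(x.data_ptr<float>(), static_cast<int>(x.numel()));
    return x;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("increment", &increment, "Add 1 to every element (CUDA)");
}

// ---------------------------------------------------------------------------
// test.py  (step 7), shown here as comments:
//   import torch
//   from torch.utils.cpp_extension import load
//   ext = load(name="increment_ext",
//              sources=["binding.cpp", "increment_kernel.cu"], verbose=True)
//   print(ext.increment(torch.zeros(8, device="cuda")))   # tensor of ones
// ---------------------------------------------------------------------------
```

`torch.utils.cpp_extension.load()` caches the build, so the first call compiles with nvcc and later calls reuse the compiled shared library as long as the sources are unchanged.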