Bro in CUDA 📗 : https://github.com/a-hamdi/cuda
Mentor 🚀 : https://github.com/hkproj | https://github.com/hkproj/100-days-of-gpu

Day | Task Description |
---|---|
D15 | Mandatory FA2 forward: implement the FlashAttention-2 forward pass (e.g., as a custom neural-network layer). |
D20 | Mandatory FA2 backward: implement the FlashAttention-2 backward pass (gradient computation). |
D20 | Optional fused chunked CE loss + backward: fused implementation of chunked cross-entropy loss with its backward pass. The Liger Kernel can be used as a reference implementation. |

Day | Files & Summaries |
---|---|
day1 | printAdd.cu: Print global indices for 1D vector (index calculation). addition.cu: GPU vector addition; basics of memory allocation/host-device transfer. |
day2 | function.cu: Use a `__device__` function in a kernel; per-thread calculations. |
day3 | addMatrix.cu: 2D matrix addition; map row/column indices to threads. anotherMatrix.cu: Transform matrices with custom function; 2D index operations. |
day4 | layerNorm.cu: Layer normalization using shared memory; mean/variance computation. |
day5 | vectorSumTricks.cu: Parallel vector sum via reduction; shared memory optimizations. |
day6 | SMBlocks.cu: Retrieve SM ID per thread via inline PTX. SoftMax.cu: Shared-memory softmax; split exponent/normalization steps. TransposeMatrix.cu: Matrix transpose via index swapping. ImportingToPython/rollcall.cu: Python-CUDA integration. AdditionKernel/additionKernel.cu: Modify PyTorch tensors in CUDA. |
day7 | naive.cu: Naive matrix multiplication. matmul.cu: Tiled matmul with shared memory (see the tiled-matmul sketch after this table). conv1d.cu: 1D convolution with shared memory. pythontest.py: Validate custom convolution against PyTorch. |
day8 | pmpbook/chapter3matvecmul.cu: Matrix-vector multiplication. pmpbook/chapter3ex.cu: Benchmarks different matrix add kernels. pmpbook/deviceinfo.cu: Prints device properties. pmpbook/color2gray.cu: Convert RGB to grayscale. pmpbook/vecaddition.cu: Another vector addition example. pmpbook/imageblur.cu: Simple image blur. selfAttention/selfAttention.cu: Self-attention kernel with online softmax (see the online-softmax sketch after this table). |
day9 | flashAttentionFromTut.cu: Minimal Flash Attention kernel with shared memory tiling. bind.cpp: Torch C++ extension bindings for Flash Attention. test.py: Tests the minimal Flash Attention kernel against a manual softmax-based attention for comparison. |
day10 | ppmbook/matrixmul.cu: Matrix multiplication using CUDA. setup.py: Torch extension build script for CUDA code (FlashAttention). FlashAttention.cu: Example Flash Attention CUDA kernel. FlashAttention.cpp: Torch bindings for the Flash Attention kernel. test.py: Manual vs. CUDA-based attention test. linking/test.py: Builds simple CUDA kernel for testing linking. linking/simpleKernel.cpp: Torch extension binding for a simple CUDA kernel. linking/simpleKernel.cu: Simple CUDA kernel that increments a tensor. |
day11 | FlashTestPytorch/: Custom Flash Attention in PyTorch, tests and benchmarks. testbackward.py: Gradient comparison between custom CUDA kernels and PyTorch. |
day12 | softMax.cu: Additional softmax kernel with shared memory optimization. NN/kernels.cu: Tiled kernel implementation and layer initialization. tileMatrix.cu: Demonstrates tile-based matrix operations. |
day13 | RMS.cu: RMS kernel (V1) with a naive sum-of-squares approach. RMSBetter.cu: RMS kernel (V2) using warp-reduce, float4 vectorization, and other optimizations (see the warp-reduce sketch after this table). binding.cpp: Torch bindings for the RMS kernels. test.py: Tests and benchmarks the RMS kernels vs. PyTorch. |
day14 | FA2/flash.cu & kernels.cu: Second iteration of Flash Attention featuring partial forward/backward logic. helper.cuh: Utility functions and warp-reduce helpers. conv.cu: Basic 2D convolution with shared memory. |
day15 | Attention.cu: Single-headed attention kernel vs. CPU reference. dotproduct.cu: Batched/tiled dot-product kernel for vectors or matrices. SMM.cu: Sparse matrix multiplication in CSR format. |
day16 | attentionbwkd.cu: Extends attention with gradient computation; forward & backward passes. |
day17 | cublas1.cu, cublas2.cu, cublas3.cu: Various cuBLAS examples for dot products, axpy, max/min, and other BLAS operations. |
day18 | wrap.cu: Warp-based reduction and max-finding with inline PTX. atomic1.cu, atomic2.cu: Implement and test custom atomic increment operations. |
day19 | cublasMM.cu: Matrix multiplication with cuBLAS plus a simple self-attention example. |
day20 | rope.cu: RoPE (rotary position embedding) kernel and its PyTorch extension. test_rope.py: Benchmarks for the RoPE kernel. |
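
Below are a few hedged sketches of recurring patterns from the table above. First, the shared-memory tiling idea behind day7's matmul.cu; this is not the repo's code: the kernel name is a placeholder, and it assumes square matrices with `N` divisible by `TILE`.

```cuda
#define TILE 16

// Tiled matmul: each block computes one TILE x TILE patch of C, staging tiles
// of A and B in shared memory to reduce global-memory traffic.
__global__ void tiledMatMul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the current A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```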
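
Second, the online-softmax recurrence used in day8's selfAttention.cu and underlying the Flash Attention kernels: keep a running max and a running normalizer so a row of scores is consumed in a single pass. A deliberately serial, single-row sketch with hypothetical names:

```cuda
#include <cfloat>

// Online (one-pass) softmax: whenever a new max appears, rescale the running sum.
__global__ void onlineSoftmaxRow(const float* x, float* y, int n) {
    if (threadIdx.x != 0 || blockIdx.x != 0) return;  // one thread, for clarity only

    float m = -FLT_MAX;  // running max
    float d = 0.0f;      // running sum of exp(x[i] - m)
    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, x[i]);
        d = d * expf(m - m_new) + expf(x[i] - m_new);  // rescale old terms, add the new one
        m = m_new;
    }
    for (int i = 0; i < n; ++i)
        y[i] = expf(x[i] - m) / d;  // final normalization
}
```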
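
Finally, the warp-reduce pattern behind day13's RMSBetter.cu and day18's wrap.cu, applied here to an RMS-style sum of squares. It assumes one warp (32 threads) per row; the epsilon and the names are placeholders, not the repo's.

```cuda
// Warp-shuffle reduction: after the loop, lane 0 holds the sum over all 32 lanes.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Inverse RMS of a length-n row, computed by a single warp.
__global__ void rmsInverse(const float* x, float* out, int n) {
    float local = 0.0f;
    for (int i = threadIdx.x; i < n; i += 32)   // strided sum of squares
        local += x[i] * x[i];

    float sum = warpReduceSum(local);
    if (threadIdx.x == 0)
        out[0] = rsqrtf(sum / n + 1e-6f);       // 1 / RMS with a small epsilon
}
```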
- (optional) create a template kernel
- create a kernel "forward" launcher where you set up the grid dimensions and other launch calculations
- create a `.cpp` file and include the kernel's header
- write a wrapper so that you can pass PyTorch tensors in and out
- use `PYBIND11_MODULE` to create a Torch extension
- in the `.py` file, call `torch.utils.cpp_extension.load()` to load the source files; it compiles them on the fly (see the sketch below)
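
A compilable sketch of that workflow, with the launcher and the `PYBIND11_MODULE` wrapper kept in one `.cu` file for brevity; every name here (scale_kernel, scale_forward, the module contents) is a hypothetical placeholder, not code from this repo.

```cuda
// scale.cu: template kernel + "forward" launcher + Torch extension binding.
// (If you split the binding into a separate .cpp file, include a header there
//  that declares scale_forward.)
#include <torch/extension.h>

template <typename scalar_t>
__global__ void scale_kernel(const scalar_t* in, scalar_t* out, scalar_t alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * in[i];
}

// Launcher: validates the tensor, sets up grid/block sizes, launches the kernel.
torch::Tensor scale_forward(torch::Tensor input, double alpha) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
    auto output = torch::empty_like(input);
    int n = static_cast<int>(input.numel());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "scale_forward", ([&] {
        scale_kernel<scalar_t><<<blocks, threads>>>(
            input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(),
            static_cast<scalar_t>(alpha), n);
    }));
    return output;
}

// The wrapper that exposes the tensor-taking launcher to Python.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("scale_forward", &scale_forward, "Scale a tensor on the GPU (forward)");
}
```

In the `.py` file, `torch.utils.cpp_extension.load(name="scale_ext", sources=["scale.cu"])` compiles this on the fly and returns a module whose `scale_forward` can be called on CUDA tensors.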