Hi there, I'm Darshan

A curious mind exploring the intersection of
Computation and Machine Intelligence.

I am a student, a thinker, and an engineer. I craft elegant systems and learn new things (most of the time).

Who I Am

I am a student of everything around me. My roots are humble. I come from a Tier 3 city in Gujarat and a college far from the spotlight, where the path wasn’t already paved for me. I realized early on that knowledge would not be handed to me; I would have to take it. Everything I know—from the fundamentals of systems to the complexities of AI—is the result of relentless self-learning.

My strength is not loud; it lives in my curiosity, in my willingness to keep going when things get complex. I believe that the best work comes from a place of genuine curiosity. While I have a deep background in technical problem-solving, my true passion lies in simplifying complexity.

Currently, I am refining this craft as a Master’s student in Data Science and a Machine Learning Engineer in training. Because I had to teach myself the foundations, I don't just use tools—I deconstruct them. Whether I am optimizing algorithms or architecting systems, I am driven by a single goal: to understand the "why" behind the "how."

I am a student: of code, of math, of systems, of life. Work occupies most of my time, not out of pressure but out of passion. And when I’m not working, I consume ideas: podcasts, philosophy, blogs. I keep learning because it keeps me alive.

I take inspiration from legendary figures like Elon Musk, Srinivasa Ramanujan, Isaac Newton, Albert Einstein, Vikram Sarabhai, and many more :)

Latest Thoughts

I write to clear my mind and share what I learn.

CUDA

The Global GEMM — Putting It All Together

Writing a complete three-level tiled GEMM kernel from scratch using CuTe's TiledCopy, TiledMMA, and swizzled shared memory.

CUDA

Hello, MMA — Your First Tensor Core Instruction

How to use CuTe's TiledMMA to execute a matrix multiply-accumulate on NVIDIA Tensor Cores.

CUDA

Swizzling — Avoiding Shared Memory Bank Conflicts

How CuTe's Swizzle XORs address bits to eliminate shared memory bank conflicts with a single line of code.

CUDA

The Parallel Copy — Orchestrating Threads with TiledCopy

How TiledCopy bundles thread layout, copy atoms, and value layout into one declarative object for coordinated, vectorized parallel copies.

CUDA

The Naive Copy — Scalar vs. Vectorized Memory Movement

Why scalar copies leave 75% of memory bandwidth on the table, and how CuTe's auto-vectorization fixes it.

CUDA

The Art of Slicing — Partitioning Data Across Blocks and Threads

How CuTe's local_tile and local_partition replace manual index math to slice matrices across CTAs and threads.

CUDA

Hello, Layout! — Visualizing Memory in CuTe

Understanding CuTe Layouts: how shape and stride turn flat memory into multidimensional grids.

CUDA

Beating PyTorch: Writing a Faster Softmax Kernel in CUDA

How to write a Softmax kernel in CUDA that outperforms PyTorch's built-in implementation.

Machine Learning

Stable Diffusion 1.5: How I Optimized It

A detailed worklog on optimizing Stable Diffusion 1.5 for performance.

Logic

Propositional Logic

A deep dive into the fundamental building blocks of mathematical logic.

Machine Learning

Raw Dawgging Linear Regression

Understanding Linear Regression by building it from the ground up.