I’m a graduate student in the College of Computing at the Georgia Institute of Technology, advised by Dr. Rich Vuduc. I currently work at NVIDIA full time while finishing my PhD. At NVIDIA, I collaborate closely with NVIDIA Research, the compiler and GPU architecture teams, and our customers to lead the design of next-generation linear algebra libraries, namely CUTLASS 3.0, a project I have worked on since its inception.
Throughout my academic and professional career, my research has focused on the intersection of programming models for accelerated computing and performance engineering: in short, making GPUs go brrrr. My work helps HPC developers write speed-of-light applications and kernels that take advantage of modern hardware capabilities, without making them want to pull their hair out.
I was very lucky to get involved at the inception of CUTLASS 3.x, working closely with [Cris Cecka][ccekca] for two years, and I have championed its design and adoption ever since. These days I still work on the CUTLASS C++ project, but in a much broader capacity. Here are some of the things I get to do:
- CUTLASS 3.x and beyond
- Tensor core architecture and PTX / CUDA C++ programming model co-design
- MLIR dialects and compilers for GPU code generation
- Design of new hardware features for future generations of GPUs
- Direct customer assistance for using CUTLASS and targeting tensor cores
- Collaborations with internal and external research teams on publications (FlashAttention-3, fVDB, etc.)
- Staffing, project prioritization and planning, and recruiting
I have also worked at Arm and Cerebras Systems on hardware modeling, high-performance kernels, and library design. The good folks at Oak Ridge National Laboratory’s OLCF were my collaborators on application-level projects during my time at Georgia Tech, and I had the honor of sharing two Gordon Bell Award nominations with them.
News
After interning at NVIDIA for nearly a year and a half, I have decided to join full time as a compute architect in the DL architecture group, on the fast kernels team!
I will continue collaborating with Cris Cecka from NVR PSA to lead the design of the next-generation linear algebra library, CUTLASS 3.0, on which I will also base much of my PhD thesis.
I am happy to announce that I will be joining NVIDIA Research for an extended internship starting January 2022!
Publications
[DBLP]
Conference Papers
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.
fVDB: A Deep-Learning Framework for Sparse, Large Scale, and High Performance Spatial Intelligence.
Exaflops Biomedical Knowledge Graph Analytics.
Supercomputing 2022. ACM Gordon Bell award finalist.
Scalable all-pairs shortest paths for huge graphs on multi-GPU clusters.
Scalable knowledge graph analytics at 136 petaflop/s.
Supercomputing 2020. ACM Gordon Bell award finalist.
Conditioning deep generative raw audio models for structured automatic music.