Research Internships at Microsoft provide a dynamic environment for research careers, connecting interns with a network of world-class research labs led by globally recognized scientists and engineers. These teams pursue innovation across a range of scientific and technical disciplines to help solve complex challenges in diverse fields, including computing, healthcare, economics, and the environment.
If you are excited about investigating and implementing cutting-edge large language model (LLM) inference techniques and optimizations, such as quantized KV caches, flash/paged/radix attention, speculative decoding, and advanced collective communication on graphics processing units (GPUs), come join the AIFX team at Microsoft Azure and contribute to a production-focused, planetary-scale LLM serving stack that is being built on top of excellent open-source efforts like vLLM, SGLang, and HuggingFace. The work includes investigating state-of-the-art approaches such as "You Only Cache Once (YOCO)" and leveraging them to save memory and compute when serving LLMs at scale. You will have the chance to explore, implement, optimize, and publish your research ideas in collaboration with Microsoft teams working on real-world production workloads at unprecedented scale.