Seattle, WA, US
1 day ago
Sr. SDE (L6), ML Ops
The AWS Infrastructure Services (AIS) team is the backbone of AWS, managing the design, planning, delivery, and operation of our global infrastructure. Essentially, we’re the ones who keep the cloud running. Within AIS, the Science team takes on the exciting challenge of using big data and machine learning to optimize power and cooling, the most critical resources in our data centers. In short, we ensure maximum efficiency while preventing overheating and power outages. Our work helps shape future data center designs and drives exceptional cost savings for AWS customers.

As a Software Engineer on the AIS Science team, you will collaborate with scientists, program managers, and data engineers to build, operationalize, and scale machine learning workflows and platform services. Your work will directly impact how server demand is placed by modeling power and cooling load across AWS's global data centers.

You will play a critical role in building infrastructure that supports every phase of the ML model lifecycle, from R&D to production, including model retraining and iteration. Our team tackles complex challenges in data processing, model hosting, and metric monitoring. As our responsibilities grow and the number of models we manage increases, we’re seeking an innovative senior engineer with a passion for data, machine learning, and MLOps to join our mission-driven team!

If you're passionate about machine learning and model operations, enjoy working in a collaborative and dynamic team that values work-life balance, and want to make a lasting impact on AWS infrastructure worldwide, this is your opportunity. Come join us on this exciting journey!


Key job responsibilities
In this role, you will draw on your engineering background and ML expertise to lead the development of platforms for deploying, productionizing, and scaling machine learning models, with a focus on variant retraining and ongoing model monitoring.

A day in the life
- Lead the design and implementation of stable, efficient training and inference infrastructure that scales to support a variety of machine learning models.
- Collaborate with tenured applied scientists and data engineers to develop improved training and inference infrastructure that accelerates innovation and promotes best practices in model scoring and model monitoring.
- Quickly learn the ins and outs of the distributed workflows behind AWS infrastructure’s rack planning and forecasting, and engineer solutions to make these systems more robust, fault-tolerant, and efficient across input and output orgs.