Santa Clara, CA, USA
2 days ago
Site Reliability Engineering (SRE) Manager

We are seeking a seasoned Site Reliability Engineering (SRE) Manager to lead a team of SREs supporting a Network Automation team in a follow-the-sun support model. The team manages critical applications and infrastructure, both cloud and on-premises, for datacenter deployment and automated operations. This role is pivotal not only in maintaining resilient, scalable systems, but also in owning the general architecture of distributed systems, including a new DNS architecture and a distributed source-of-truth sync.

This role goes beyond an understanding of standard best practices and operations. We’re combining technical knowledge with development chops to come up with solutions at cloud scale. This is a new team operating in an exciting, groundbreaking environment, and we want you to help us shape it.

What you will be doing:
Team Leadership: Manage and mentor a team of Automation SREs, fostering a culture of collaboration, innovation, and excellence in execution.
Technical Guidance: Own technical decisions for the team, ensuring alignment with developers and employing industry-standard methodologies.
Operational Excellence: Implement and maintain robust operational practices, including incident management, monitoring, alerting, and capacity planning.
Shift Scheduling: Coordinate follow-the-sun support across global time zones, ensuring 24/7 coverage and efficient handovers.
Project Management: Lead initiatives related to the design, deployment, and maintenance of critical infrastructure components.
Release Management: Oversee release processes and ensure smooth deployments, minimizing downtime and impact on users.
Root Cause Analysis: Conduct thorough post-incident reviews, identifying root causes and implementing preventive measures.

What we need to see:
8+ years of industry experience focused on Site Reliability Engineering, with a strong background in cloud service providers, ISPs, or similar service-oriented networking companies.
Technical Skills: Proficiency in managing distributed web infrastructures, designing scalable and resilient systems, and implementing network automation.
Leadership: Proven track record of managing technical teams, including performance management, career development, and hiring, with 2+ years of management experience.
Problem Solving: Demonstrated ability to conduct detailed root cause analysis and drive improvements based on findings.
Communication: Excellent verbal and written communication skills, with experience presenting technical information to diverse audiences.
Education: Bachelor’s degree in Computer Science, Engineering, or a related technical field, or relevant industry experience.

If you are a strategic problem solver with a passion for leading high-performance teams in a dynamic and technically challenging environment, we encourage you to apply. Join us in shaping the future of our distributed systems and network automation infrastructure.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people on the planet working for us. If you are creative and autonomous, we want to hear from you!

The base salary range is 164,000 USD - 258,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
