Site Reliability Manager
Clearwater Analytics
Clearwater Analytics is seeking a passionate and detail-oriented SRE Manager to lead our dedicated team in maintaining the reliability and performance of our cloud-based platforms. The SRE Manager will oversee a talented team focused on achieving excellence in cloud operations and observability, fostering a culture of collaboration, innovation, and accountability while upholding the highest standards in service reliability.
Key Responsibilities:
Lead the SRE team to ensure the reliability and performance of our cloud-based monitoring services.
Design, implement, and maintain scalable, secure, and resilient cloud infrastructure solutions in AWS, Azure, and GCP.
Collaborate with cross-functional teams to define cloud architecture strategies that align with business objectives and drive innovation.
Drive automation efforts for cloud deployment, configuration management, and monitoring to enhance operational efficiency.
Develop and enforce best practices for Infrastructure as Code (IaC) using tools such as Terraform, Ansible, or CloudFormation.
Manage cloud costs and optimize infrastructure utilization for maximum efficiency.
Ensure compliance with security standards and best practices in cloud service deployment and configuration.
Conduct regular audits and assessments of cloud resources and services to ensure optimal performance and security.
Lead initiatives related to SLIs, SLOs, and error budgets in collaboration with the R&D team to proactively manage platform stability.
Enhance system observability through effective monitoring, alerting, and metrics reporting.
Implement observability solutions (logs, metrics, traces) for cloud foundational platforms and promote best practices in reliability engineering.
Mentor and build a high-performing team to achieve both personal and organizational goals.
Requirements:
Bachelor’s or Master’s degree in Computer Science or a related field.
Over 12 years of experience managing services in large-scale environments, with at least 3 years in a leadership role.
5+ years of SRE experience focusing on telemetry, observability, self-healing solutions, and platform automation.
Proficiency in several programming languages, including Java, Python, and JavaScript (5+ years of experience).
Hands-on experience with build and release tools such as Jenkins, Sonar, Artifactory, JIRA, and GitLab, along with a strong CI/CD understanding.
Familiarity with public cloud environments like AWS, Azure, and GCP (5+ years).
Experience with observability tools and frameworks, such as Dynatrace, Prometheus, Grafana, and AWS CloudWatch.
Strong incident response and management skills, demonstrating a proactive and strategic approach to system reliability.
Hands-on experience with Infrastructure as Code (IaC) and configuration management tools (e.g., Terraform, Puppet).
Demonstrated integrity, strong ownership, and excellent communication and collaboration skills.