Staff Site Reliability Engineer
Datavant
What We’re Looking For
Cloud Engineering - Services is dedicated to managing and supporting third-party tools and platforms, both managed and unmanaged. These include our Kubernetes clusters, our continuous integration and continuous deployment infrastructure such as Jenkins and GitHub Actions, and shared unmanaged infrastructure such as Kafka and Cassandra.
What You Will Do
- Increase our cloud efficiency: Working across engineering, security, platform, and other teams, you will seek out and eliminate cloud waste
- Ship: Deliver on the Cloud Engineering - Services charter, daily
- Lead: Actively collaborate with the team of your peers, keep your pod focused and engaged, and contribute to engineering-wide decisions on technical strategy, product strategy, and organizational strategy
What You Need to Succeed
- Platform Management: Expertise in managing Kubernetes (EKS), CI/CD tools (e.g., ArgoCD, GitHub Actions), and observability platforms (e.g., Datadog).
- Automation and IaC: Proficiency in automating platform deployment and maintenance tasks (e.g., cluster upgrades, CI/CD workflows).
- Third-Party Tools: Familiarity with integrating tools like Terraform, Elasticsearch, Kafka, Cassandra, and Databricks into the broader platform.
- Reliability Engineering: Knowledge of scaling, failover, and platform reliability best practices.
- Cross-Team Collaboration: Ability to work with Embedded Teams to meet workload-specific needs.
As one of our SREs you will be capable of doing many of the following:
- Analyze and improve the efficiency, scalability, and reliability of our backend systems
- Build and mature automation tools for robust continuous integration and deployment pipelines
- Build scalable, secure, and measurable infrastructure with code
- Facilitate capacity planning
- Champion code health, rigorous testing, and maintainability standards
- Create automation of engineering deployments
- Create scalable and reliable monitoring and alerting that works
- Create actionable documentation and playbooks, and when possible automation, to resolve recurring issues and proactively address issues before impact is felt
- Design, build, and upkeep tools, systems, and self-service options to elevate engineering team productivity and reduce toil
- Maintain a stable, scalable, and secure development environment while keeping abreast of the latest DevOps innovations
- Support disaster recovery design, implementation, and testing
- Support engineering teams in implementing system reliability
- When things go bad, perform advanced troubleshooting of our systems