Manager - Site Reliability Engineering (SRE)
Zycus
Zycus is looking for a Site Reliability Engineer (SRE) with deep expertise in Kubernetes, automation, and Linux systems. The ideal candidate will have hands-on experience deploying, administering, and optimizing large-scale production systems, with a strong focus on microservices architecture and on automation, performance, and reliability across our SaaS platform.
Roles and Responsibilities:
System Reliability & Uptime: Ensure high availability, performance, and reliability of applications and infrastructure.
Kubernetes & Cluster Management: Deploy, administer, and maintain Kubernetes clusters, managing scaling, upgrades, and troubleshooting.
Microservices Management: Handle the deployment, monitoring, and scaling of microservices in distributed environments.
Incident Management: Respond to production incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence.
Automation & Infrastructure as Code (IaC): Automate repetitive tasks, infrastructure provisioning, and deployment workflows using tools like Ansible and Terraform.
Monitoring & Observability: Implement and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system health and application performance.
Performance Optimization: Analyze system performance, identify bottlenecks, and optimize resources for better efficiency.
Disaster Recovery & Backup: Design and implement backup and disaster recovery (DR) strategies for business continuity.
Capacity Planning: Forecast infrastructure needs based on performance trends and business growth to ensure scalability.
Security & Compliance: Ensure infrastructure and applications meet security standards and compliance requirements.
Collaboration with Dev & Ops Teams: Work closely with development and operations teams to improve deployment pipelines, release processes, and system reliability.
Documentation: Maintain clear and detailed documentation of systems, processes, and incident reports for knowledge sharing and compliance.
Continuous Improvement: Identify opportunities for improving system architecture, deployment strategies, and automation workflows.
Cloud Infrastructure Management: Manage cloud services (AWS, GCP, Azure) for resource optimization, cost management, and automation.
On-Call Support: Participate in on-call rotations to handle urgent production issues and ensure rapid recovery.