Kansas City, MO, US
Department Manager, Observability Operations - Corporate IT (Kansas City)

Burns & McDonnell is seeking an Observability Leader to join the IT Infrastructure Operations team. The ideal candidate will enhance the visibility of our infrastructure's performance and availability by leading efforts in advanced monitoring, logging, and telemetry. You will develop and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure optimal performance metrics. Collaborating closely with our team in India, you will generate comprehensive insights into infrastructure performance, ensuring services are consistently operational and visible. The ideal candidate possesses a deep understanding of observability principles, monitoring tools, and site reliability engineering (SRE) practices, and has a proven track record of developing and executing successful observability strategies.

We are seeking a highly motivated and experienced Observability & Monitoring Lead to build and lead a small but impactful team responsible for ensuring the performance, health, and availability of our cloud-based services and network infrastructure. This role is critical to maintaining a world-class customer experience and supporting our continued growth.

Responsibilities:

Operations & Integrations: Build and integrate monitoring tools with ticket systems and PagerDuty, applying needed principles to develop and maintain tools for infrastructure performance visibility.
Leadership & Team Development: Lead, mentor, and develop a small team of observability and monitoring engineers. Foster a collaborative and high-performing team environment.
Strategy & Roadmap Development: Define and execute a comprehensive observability and monitoring strategy that aligns with business objectives and supports proactive identification and resolution of performance issues. Create and maintain a strategic roadmap for evolving our observability capabilities.
Tooling & Implementation: Evaluate, select, and implement appropriate monitoring and observability tools to provide comprehensive visibility into our cloud services, network infrastructure, and applications. This includes experience with network monitoring tools, cloud monitoring platforms (e.g., AWS CloudWatch, Azure Monitor, Google Cloud Monitoring), and APM solutions.
Performance Monitoring & Alerting: Establish robust monitoring and alerting systems to proactively detect and diagnose performance bottlenecks, outages, and other critical issues. Define meaningful metrics and thresholds. This includes creating and maintaining real-time dashboards and leveraging operations center expertise to ensure seamless monitoring and troubleshooting of infrastructure issues.
Incident Response & Root Cause Analysis: Collaborate with engineering and operations teams to investigate and resolve incidents, perform root cause analysis, and implement preventative measures.
SRE Advocacy & Implementation: Champion SRE principles and practices within the organization. Contribute to the development and implementation of SLOs, SLAs, and error budgets.
Automation & Optimization: Identify opportunities to automate monitoring tasks, improve alerting accuracy, and optimize the performance of our systems.
Reporting & Communication: Develop and deliver regular reports on system performance, availability, and key metrics to stakeholders. Communicate effectively with technical and non-technical audiences.
Vendor Management: Manage relationships with vendors of monitoring and observability tools.