Westlake, TX, US
Principal Site Reliability Engineer

Delivers services at high scale and high availability, with resilience, by using automation and Infrastructure as Code. Builds reliability into the ecosystem by applying best practices in Resiliency Engineering, Automation, Observability, and Chaos Testing. Manages systems using infrastructure as code tools (IAM, ARM, Terraform, and Chef). Utilizes modern monitoring tools (Datadog, Prometheus, and Splunk). Automates with various scripting languages (Python and Shell). Helps teams scale through production insights, operational automation, developer guidance, and real-time metrics.

Primary Responsibilities:

• Instruments distributed systems at scale, applying systems skills to build and operate monitoring, logging, and alerting services.

• Maintains scalability and resiliency in complex environments.

• Implements advanced observability practices and techniques at scale.

• Triages and executes root cause analysis.

• Manages and interprets large datasets using query languages and visualization tools.

• Communicates with both technical and non-technical audiences.

• Presents new software, methods, and practices to developers.

• Works with a variety of individuals and groups in a constructive and collaborative manner, and builds and maintains effective relationships.

• Applies Cloud Computing and DevOps concepts including continuous integration and continuous delivery (CI/CD) pipelines in system and infrastructure maintenance.

Education and Experience:

Bachelor’s degree (or foreign education equivalent) in Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and five (5) years of experience as a Principal Site Reliability Engineer (or closely related occupation) designing, building, deploying, and maintaining infrastructure and applications in Cloud providers – Amazon Web Services (AWS) and Azure.

Or, alternatively, Master’s degree (or foreign education equivalent) in Computer Science, Engineering, Information Technology, Information Systems, Mathematics, Physics, or a closely related field and three (3) years of experience as a Principal Site Reliability Engineer (or closely related occupation) designing, building, deploying, and maintaining infrastructure and applications in Cloud providers – Amazon Web Services (AWS) and Azure.

Skills and Knowledge:

Candidate must also possess:

• Demonstrated Expertise (“DE”) designing, building, and deploying the open-source Envoy Gateway Infrastructure as an Edge and Internal Gateway to a Kubernetes platform running on-premises and in the public Cloud, ensuring robust and scalable solutions for diverse deployment scenarios.

• DE implementing end-to-end CI/CD pipelines in automated testing and deployment processes, including building efficient and reliable software delivery pipelines that incorporate industry-leading security best practices throughout the development and deployment lifecycle.

• DE automating infrastructure provisioning and configuration through Terraform for diverse Cloud platforms, including AWS, AWS GovCloud, and OCI; and applying Infrastructure as Code (IaC) principles using tools (Ansible, Chef, and Puppet) to enhance efficiency and maintainability.

• DE setting up comprehensive monitoring and logging solutions using observability tools (Datadog, Splunk, and ELK) to track system performance and proactively identify issues through log analysis and troubleshooting; implementing tracing mechanisms and providing a detailed view of API request flows for effective debugging, performance optimization, and API communication.
