Plano, TX, USA
2 days ago
Lead Site Reliability Engineer

DESCRIPTION:

Duties: Engage in and improve the lifecycle of AWS cloud services, from inception through design, deployment, and operation. Automate repeated manual tasks and develop tools and automation to improve the efficiency of the platform and infrastructure. Analyze defects, propose improvements, and drive efficiencies in systems and processes. Help develop new cloud engineering strategies and implementations for the firm. Develop observability and telemetry tools. Author technical engineering documentation and improve its quality. Debug and solve issues in a production environment.


QUALIFICATIONS:

Minimum education and experience required: Bachelor’s degree in Computer Science, Computer Engineering, Information Technology, or related field of study plus 7 years of experience in the job offered or as Lead Site Reliability Engineer, Software Engineer, Application Developer, Software Application Architect, IT Consultant, or related occupation.

Skills Required: Requires experience in the following: distributed systems design and development using application architecture patterns including AWS cloud, middleware technologies, and containers; application monitoring and observability, including logging, tracing, alerting, dashboards, and visualizations using Grafana and Dynatrace; automation and continuous delivery (CI/CD strategy) using Jenkins; software application design including exception handling, retries, timeouts, chaos testing, configuration as code, and infrastructure as code; data latency and processing, including relational and non-relational databases, recovery, backup, indexing, and sharding; software engineering fundamentals including the Java or Python programming language, unit, functional, and non-functional testing, identity and access management (AWS IAM), data structures, and algorithms (sorts and searches); production management including learning from production, using data to drive decisions, production incident management and postmortems, and acting as a culture carrier for blameless postmortems and continuous improvement; reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices; observability including white-box and black-box monitoring, service level objective alerting, and telemetry collection using tools including Grafana, Dynatrace, Prometheus, Datadog, and Splunk; continuous integration and continuous delivery tools including Jenkins, GitLab, and Terraform; containers and container orchestration including ECS, Kubernetes, and Docker; troubleshooting common networking technologies, including AWS Route 53.

Job Location: 8181 Communications Pkwy, Plano, TX 75024.
