SRE Practice Architect II
TEKsystems
Description
As an organization, Global Services provides a continuum of services ranging from Engagement Management to full functional Outsourcing, including Offshore Centers in Canada and India. Our model provides the ability to engage customers beyond staffing when asked for more ownership, capabilities, or methodology while enhancing client/consultant support.
We partner with progressive leaders to create opportunities, accelerate business transformation, and help build the enterprises of tomorrow. We work with 80% of the Fortune 500 to address their technology, strategy, and talent needs. We innovate so industries stay ahead of what’s next. As a full-stack technology and talent services provider, we partner with our customers across the globe to own change. Join us.
At TEKsystems Global Services, we live in the tech world. We’re out in front of the trends and tools that shape the industry and create fresh opportunities. All-in, fully engaged, high-energy partnership is how we approach everything – our commitments and our people. Our people are at the center, fueling our high performance and our inclusive culture.
We’re doers, looking for doers who do the right thing. Roll-up-your-sleeves thought leaders focused on creating possible. Team champions who declare success only when everyone achieves their ambitions. Sound like the career experience you’ve been searching for? We’re looking for an SRE Architect II to join our team: a practitioner who accelerates outcomes, effects positive change, and moves business forward.
Let’s partner. Together, we can accomplish amazing things.
Here’s what the opportunity supported through our TGS Talent Acquisition Team requires:
We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) Architect to join our dynamic team. The ideal candidate will play a critical role in ensuring the reliability, scalability, and performance of our systems and applications.
The qualified candidate must demonstrate strong communication skills to collaborate with and influence many stakeholders across the organization and possess a deep technical background across technology stacks, including applications, data and messaging frameworks, and infrastructure components.
This individual should also demonstrate exceptional leadership skills in leading a technical team, recruiting team members, and growing the organization.
Responsibilities:
• Design, implement, and maintain scalable monitoring and APM solutions (Dynatrace, Datadog, or New Relic) to ensure the reliability and performance of our systems.
• Express a bias for action by identifying inefficiencies and proposing solutions, working independently and collaboratively with the team.
• Develop and maintain automated alerting and incident response processes to proactively identify and address potential issues.
• Collaborate with cross-functional teams to define and implement best practices for monitoring, logging, observability, and incident management.
• Drive continuous improvement initiatives to enhance system reliability, scalability, and performance.
• Automate infrastructure provisioning, configuration, and deployment processes using Terraform and other infrastructure-as-code tools.
• Work closely with development teams to integrate monitoring and observability into the CI/CD pipeline.
• Provide guidance and mentorship to junior team members, fostering a culture of continuous learning and growth.
• Contribute to the SRE group in various technology domains and SRE practices, such as observability framework, resiliency, DevSecOps, etc.
• Set up and enhance SRE best practices in the areas of observability, automation, resiliency, etc.
• Be accountable for the delivery and performance of SRE Teams.
• Evangelize the SRE practice across organizational boundaries.
• Recommend relevant and implementable technologies that not only represent state-of-the-art SRE practices and trends but also benefit the overall application modernization journey.
• Architect new platforms/libraries/toolchains/APIs to enable broad-scope SRE adoption.
• Drive the broader developer community to adopt the SRE best practices.
• Develop playbooks on various SRE, DevOps, and related topics.
Required Skills & Qualifications:
• Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent professional experience; a master’s degree is preferred.
• 10+ years of professional Site Reliability and DevOps career experience.
• 10+ years of total IT experience building complex systems.
• Ability to strategize and execute a vision around the KPIs and metrics defined for the organization.
• In-depth familiarity with SRE terminologies, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, incident management, postmortem analysis, Recovery Time Objectives (RTOs), and Recovery Point Objectives (RPOs).
• Ability to identify organization-wide gaps in the SRE domain and propose implementable solutions that contribute to the transformation of the organization.
• Ability to build and lead high-performance SRE teams to consistently achieve business results.
• Expertise with monitoring, APM, and alerting tools like Splunk, Dynatrace, Grafana, Datadog, New Relic, etc.
• Experience with one or more high-level languages such as Python, Go, Java, JavaScript, C#, Ruby, and PHP.
• Experience with CI/CD pipelines like Jenkins, ADO, GitHub Actions, GitLab, etc.
• Demonstrated expertise in cloud platforms like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP).
• Proficiency in containerization and orchestration technologies such as Docker, Kubernetes, or OpenShift.
• Experience with Infrastructure as Code (IaC) and configuration management tools like Terraform, Ansible, Chef, Puppet, or Salt.
• Exposure to OpenTelemetry (OTel) and other open-source telemetry frameworks and tools.
• Experience with tools like ServiceNow, PagerDuty, xMatters, etc.
• Strong communication skills and ability to partner across organizations.
• Ability to create technical documents on SRE best practices and processes.
Certifications:
• At least one certification in a public cloud (AWS, Azure, or GCP) is mandatory.
• Certification in one of the tools like Splunk, Dynatrace, or Datadog is preferred.
• Certification in automation tools like Terraform, Ansible, etc. is preferred.
Pay and Benefits
The pay range for this position is $125,000.00 - $190,000.00/yr.
We reserve the right to pay above or below the posted wage based on factors unrelated to sex, race, or any other protected classification.
Additional earnings may be available through incentive programs like annual bonuses, profit sharing, etc.
Our full-time, internal employment benefits include the following:
• Medical, dental & vision
• Critical Illness, Accident, and Hospital
• 401(k) Retirement Plan – Pre-tax and Roth post-tax contributions available
• Life Insurance (Voluntary Life & AD&D for the employee and dependents)
• Short and long-term disability
• Health Spending Account (HSA)
• Transportation benefits
• Employee Assistance Program
• Time Off/Leave (PTO, Vacation or Sick Leave)
Workplace Type
This is a fully remote position.
Application Deadline
This position is anticipated to close on Feb 12, 2025.
About TEKsystems:
We're partners in transformation. We help clients activate ideas and solutions to take advantage of a new world of opportunity. We are a team of 80,000 strong, working with over 6,000 clients, including 80% of the Fortune 500, across North America, Europe and Asia. As an industry leader in Full-Stack Technology Services, Talent Services, and real-world application, we work with progressive leaders to drive change. That's the power of true partnership. TEKsystems is an Allegis Group company.
The company is an equal opportunity employer and will consider all applications without regard to race, sex, age, color, religion, national origin, veteran status, disability, sexual orientation, gender identity, genetic information or any characteristic protected by law.