San Francisco, CA, USA
3 days ago
SRE Engineer

ClearScale (headquartered in San Francisco, California, USA) is an AWS Premier Consulting Partner that has been offering a full range of professional cloud computing services for over 10 years. These include architecture design, DevOps automation, refactoring and cloud-native application development, integration, migration, security services (from a basic security check to preventing cyber-attacks) and 24/7 technical support.

Our customer list is diverse: from government organizations (ClearScale is an official cloud partner of the State of California) and educational institutions (University of California, San Francisco) to well-known global brands (IBM, Samsung, GoPro, HP, Conde Nast, Carl Zeiss, etc.). We have well over 850 satisfied customers, some of whom are featured in the Case Studies section of the company's website.


We were the third company to gain a new AWS competency: Applied AI and Machine Learning Operations (MLOps). Fewer than 15 partners have this competency!

We have confirmed status in the Database Freedom program (fewer than 20 companies in total).

We have proven expertise as a Managed Services Provider (fewer than 16 companies in total). This means that ClearScale can deliver full-cycle consulting and services: from audits and system or software development to 24/7 support. You can read more about us on the company page, Managed Services.


Since the company was founded, we have worked 100% remotely from various cities and countries.


Job Responsibilities:

Execute on the observability strategy
Define and document standards for logging, tracing and SLO definitions for engineering teams to follow
Propose effective ways to manage dashboards, traces, monitors, metrics and logs in Datadog
Integrate Datadog with incident management tools and Slack
Establish comprehensive monitoring using Datadog
Centralize logging and develop mechanisms for efficient debugging
Implement systems for distributed tracing visualization
Adopt OpenTelemetry standards across microservices
Roll out observability to development and production environments in close collaboration with engineering and operations teams
Define training practices for engineering teams to adopt observability standards and operational practices for healthy and sustainable incident management processes
Implement POCs and demonstrate them to engineering teams
Introduce engineering practices for healthy alerting mechanisms, dashboard definitions and blind-spot elimination, with a focus on eliminating alert fatigue
Establish near-real-time reporting to minimize MTTA and MTTR and improve developer experience


Requirements

Extensive experience with AWS infrastructure at scale
Experience working in SRE, DevOps or Developer Experience teams in engineering organizations is a must
Deep knowledge of observability tooling (Datadog, Grafana, Splunk, OTEL) and hands-on experience developing, extending and operating these tools across different environments, including high-load production systems
Expert knowledge of Terraform
Ability to propose solutions that scale across engineering teams and balance speed of response with cognitive load
Experience leading incident responses using operational tools including logging, tracing, SLO patterns and synthetics
Experience establishing technical roadmaps from operational strategies for SRE, DevOps or Developer Experience teams in mid-sized to large organizations, and ability to drive their adoption in the engineering teams
Experience applying analytical practices to define SLAs in close coordination with engineering teams and stakeholders
Deep understanding of SRE best practices and standards, and experience advocating for and rolling them out to engineering teams
Mindset of "minimal tooling for maximum impact"
Experience with on-call rotations, creating and executing scalable practices in engineering teams
Experience integrating observability tooling with Teams and Slack
Leadership skills to drive alignment between different departments and get buy-in from different stakeholders
Exemplary oral and written communication skills for technical and non-technical stakeholders
AWS certifications are a plus


Additional Information (TL;DR):

Type of Position:

SRE or DevOps Engineers with deep knowledge of observability and reliability practices.

Skills:

Datadog-specific knowledge
Proficiency in Python, Terraform, Git, and pipeline setup
Focus on implementing observability for engineering teams
Demonstrated ability to apply analytical skills to cloud compute environments


We offer

100% remote position
Full-time position
Annual rate review
USD salary


Professional Development

Work with innovative Silicon Valley companies and traditional American companies at the cutting edge of digital transformation
We work with the newest technologies in AWS cloud and tools like Jira, Confluence, Lucidchart, Slack, etc.
We operate in an honest and competitive environment, and we are one of AWS's top 10 key partners
A team that is willing to share its experience
Paid AWS certifications: we provide training materials, paid time off and the examination itself
Horizontal and vertical career growth: we keep growing, and our people keep growing with us

