New York, NY, US
67 days ago
Senior Technical Program Manager, Chaos Engineering Red Team
As a Sr. TPM on the Central Reliability and Response Engineering (CRRE) Red Team, you will spearhead cross-functional programs, products, and problem spaces that impact hundreds of teams, thousands of engineers, and millions of customers. The primary charter of the team is chaos engineering, which means you will be involved in the design and execution of production-based experiments that validate our software resilience and fault tolerance. During peak sales events, the Amazon retail website handles tens of thousands of customer orders and tens of millions of requests per minute! What if we didn’t have to wait for these events to see how the thousands of microservices behind the website would behave? What if we knew how these services would act if an availability zone went down? What if we could create misbehaving clients on multi-tenant platforms to ensure tenant isolation works? These are not hypotheticals: our team is designing and executing chaos experiments to simulate these types of scenarios and more! You’ll get the chance to experience chaos engineering at an unrivaled scale that only a few companies in the world deal with. We’re fundamentally reshaping how Amazon maintains an always-ready resilience posture by helping to expose issues before our customers do!

You will join a growing team of 20+ engineers who are building tooling that allows these experiments to be run safely, repeatably, and at scale. As a Sr. TPM, you will be involved in technical experiment design, client outreach, data analysis, system architecture, risk mitigation, and evangelizing chaos engineering culture across Amazon. You will influence hundreds of teams as well as key decision makers like org-wide directors and VPs.

The problems we are solving involve a fair amount of ambiguity and technical complexity, and you will help simplify them and distill them into incremental, achievable deliverables. We’re looking for someone who can help lead our teams in identifying problems proactively, direct software-based initiatives, and turn any manual work we’re doing into repeatable processes that our software engineers can effectively automate. You will be getting into the technical weeds with our engineers; as a reliability engineering team, it’s important that we can dive deep into reliability issues that pose risks to the global retail websites.

This role requires strong writing skills to complement Amazon’s doc-focused culture. You will be expected to draft documents that draw on data-driven analysis and an understanding of system architectures, and that present arguments from multiple viewpoints to help stakeholders understand our problem space and influence key decision makers. You will frequently be discussing and evangelizing these concepts across hundreds of Amazon teams. Given this, we are interested in candidates who can demonstrate experience working across large organizational structures and effectively managing expectations across varied team cultures.

We often don’t know exactly what we’ll find when we run these chaos experiments, so we place a heavy emphasis on rapid prototyping and agility. We deal with high levels of ambiguity and convert head-tilting ideas into streamlined workflows that are repeatable and safe. We’re building multiple systems from the ground up that focus on chaos engineering at scale and on safety mechanisms that help us detect and mitigate real customer impact in production environments. Our work is intentionally provocative and centered around challenging the status quo inside Amazon. We find this work exciting, rewarding, and fulfilling, and we hope you will too! Our leaders are invested in this work and our teams, and we’re excited to explore your passions as an individual and as a technologist!