New York, NY, US
1 day ago
Software Development Engineer, Engine
As a Software Development Engineer on the Central Reliability and Response Engineering (CRRE) Red Team, you will build software to automate stress and chaos experiment suites at all levels of granularity, measuring the resilience of software applications and localizing externally facing problems before our customers do it for us. The learnings from these software experiments will inform thousands of services at Amazon of their resilience posture every day and ahead of high-velocity events (such as Prime Day and Black Friday/Cyber Monday).

The primary charter of the team is Stress and Chaos Engineering, which means you will be involved in the design and execution of production-based experiments that validate our software resilience and fault tolerance. During peak sales events, the Amazon retail website handles tens of thousands of customer orders and tens of millions of requests... per minute! What if we didn’t have to wait for these events to see how the thousands of microservices behind the website would behave? What if we knew how these services might act if an availability zone went down? What if we could create misbehaving clients on multi-tenant platforms to ensure tenant isolation works? These are not hypotheticals: our team is designing and executing chaos experiments to simulate these types of scenarios and more! You’ll get the chance to experience Stress and Chaos Engineering at an unrivaled scale that only a few companies in the world deal with. We’re fundamentally reshaping how Amazon maintains an always-ready resilience posture by helping to expose issues before our customers do!

Key job responsibilities
As a Software Development Engineer on our team, you will play an important role in building tooling to run and automate chaos and stress experiments safely, with repeatability and scalability in mind. You will solve ambiguous and technically complex problems, preferably with simple and elegant solutions. You will dive deep into a wide variety of problems and technologies to guide the right technical decisions for the products you support. You will collaborate with service owners across different organizations to design scalable solutions, with guidance from senior team members to ensure alignment with long-term strategies.

You'll be deeply involved in the full software development lifecycle, from scoping and design to coding, testing, deployment, and maintenance. Collaborating with stakeholders to understand business and customer value will be crucial as you work to deliver appropriate solutions. You'll contribute to documentation, participate actively in code reviews, and demonstrate operational excellence in all aspects of your work. Balancing new feature development with operational needs, you'll make effective priority trade-offs and help resolve root causes of issues. As you grow in this role, you'll have opportunities to mentor junior engineers and support new team members. You should be comfortable working in a dynamic environment, leveraging your problem-solving skills to tackle complex challenges, and collaborating across the Amazon ecosystem.

About the team
The Central Reliability and Response Engineering (CRRE) Red Team drives programs and software initiatives to safely probe the known and unknown failure modes of Amazon, and we do it in production. We’re fundamentally reshaping how Amazon maintains an always-ready resilience posture. Our software explores and measures the resilience of software applications, and localizes externally facing problems before customers do it for us. Our team puts a high value on work-life balance. Striking a healthy balance between your personal and professional life is crucial to your happiness and success here.