Washington, DC, US
20 hours ago
Site Reliability Engineer

Armis, the cyber exposure management & security company, protects the entire attack surface and manages an organization’s cyber risk exposure in real time. In a rapidly evolving, perimeter-less world, Armis ensures that organizations continuously see, protect and manage all critical assets - from the ground to the cloud. Armis secures Fortune 100, 200 and 500 companies as well as national governments, state and local entities to help keep critical infrastructure, economies and society safe and secure 24/7.

Armis is a privately held company headquartered in California.

Location: Remote, anywhere in the US

Join a dynamic and highly skilled team responsible for maintaining and ensuring the reliability of Armis’s cutting-edge services and applications. The Site Reliability Engineer / Production Engineer plays a key role in seamless deployments, real-time monitoring, and efficient management of critical services relied upon by federal customers. The SRE is always looking to innovate and improve our processes, procedures, and proactive measures while maintaining compliance in a highly regulated environment.

Responsibilities include:

• Build and deploy Kubernetes services using tools such as Git, Helm, and custom tools

• Guarantee uptime and reliability for production systems through proactive monitoring using Prometheus, Grafana, and Alertmanager

• Develop playbooks, tools, and scripts to streamline processes and shorten problem resolution time

• Manage vulnerability assessments and facilitate prompt remediation to maintain security and compliance

• Troubleshoot a wide range of issues, from Helm templating to SQL tuning, to determine root causes and solutions that prevent recurrence

• Collaborate closely with members of the DevOps and Engineering teams to ensure smooth upgrades

• Improve existing tools or create new ones that enable other engineers to perform their work more efficiently

• Perform ad-hoc tasks requested by other teams (e.g., running SQL scripts)

• Debug a wide range of Kubernetes services in depth

• Debug complex Python code in depth and provide an analysis of findings with potential solutions

• Develop and maintain comprehensive documentation for procedures and processes to ensure knowledge sharing and continuity

• Identify and fix gaps in processes

• Offer on-call support during off-hours (nights and weekends) to address critical incidents and ensure system stability

REQUIREMENTS

Ideal candidates will have:

• 3+ years of experience with Python and Bash

• 3+ years of experience with Kubernetes, particularly EKS

• 3+ years of experience with Helm

• Working knowledge of Git

• 3+ years of experience with Prometheus/Alertmanager/Grafana, using these tools to increase proactive monitoring and assist in debugging

• Working knowledge of AWS services

• 3+ years of experience debugging Kubernetes services, providing solutions, and communicating findings to the appropriate team(s)

• 3+ years of experience debugging Python code, providing solutions, and communicating findings to the appropriate team(s)

• Comfortable running SQL queries

Preferred Qualifications:

• Existing FedRAMP experience a plus

• Focused on creating detailed documentation

• Able to think outside the box and identify opportunities and solutions for process improvement

• Team player both within the team and with outside teams (including international teams)

• Excellent communication skills

• Self-starter
