Brentwood, TN, 37027, USA
79 days ago
Engineer, IT Cloud Site Reliability
**Overall Job Summary**

The Cloud Site Reliability Engineer role is multifaceted, combining elements of software engineering, system administration, and IT operations. Cloud SREs are responsible for ensuring the reliability, performance, and scalability of systems by focusing on system design, automation, monitoring, incident management, performance tuning, collaboration, and security. Their efforts directly impact the stability and efficiency of critical systems, enabling organizations to deliver reliable and efficient services at scale. This role requires a blend of technical expertise, problem-solving skills, and effective communication, making it essential for the success of modern, complex infrastructures.

**Essential Duties and Responsibilities**

+ Vendor Management
  + Strong negotiation skills, the ability to build better vendor relationships, network effectively, manage multiple vendors, identify financial risks, and evaluate new vendors.
  + Industry awareness, strong people skills, and the ability to make effective decisions.
  + Effective management by monitoring performance, managing risks, tracking key performance indicators, and ensuring compliance with regulations.
+ Coordinating Team Efforts
  + Coordinate efforts with teams located onsite, offshore, nearshore, and across multiple vendors, providing clear direction, setting expectations, and motivating team members to achieve common goals.
  + Ensure that tasks are assigned, schedules are aligned, and resources are allocated effectively across teams and vendors.
  + Establish regular communication channels and protocols to ensure that information is shared, feedback is provided, and issues are addressed in a timely manner.
  + Understand and respect cultural differences and work styles of team members and vendors from different regions.
+ System Design and Architecture:
  + Collaborate with software engineers to identify and mitigate risks to system availability and reliability.
+ Automation and Tooling:
  + Develop and maintain automation tools to streamline operations and reduce manual interventions.
+ Monitoring and Incident Management:
  + Help improve monitoring and alerting systems.
  + Respond to incidents, perform root cause analysis, and implement permanent fixes to prevent recurrence, maintaining detailed documentation.
+ Performance and Scalability:
  + Conduct performance tuning, optimization, and capacity management of systems to handle increasing loads and demand.
+ Collaboration and Communication:
  + Communicate effectively with stakeholders about system performance, incidents, and improvements.
  + Foster a culture of reliability and continuous improvement across the organization.
+ Security and Compliance:
  + Ensure that systems and infrastructure comply with security best practices and regulatory requirements.

**Required Qualifications**

_Experience:_ 4+ years of related work experience. Experience in the retail industry preferred.

_Education:_ Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience. Any combination of education and experience will be considered.

_Professional Certifications:_ None

_High Demand IT Specialized skills:_ _Platform knowledge (UNIX, Linux, Windows):_ Public and Private Cloud Technologies (AWS, Google Cloud, Azure) and containerization technologies (Docker, Kubernetes). Hyper-converged Platforms (Nutanix, Simplivity), VMware vSphere 6, Microsoft Applications (Active Directory, Exchange, O365 and server OS), AHV, Kubernetes, Docker, Saltstack

**Preferred knowledge, skills or abilities**

+ Knowledge of ITIL Foundation concepts, practices, and procedures preferred.
+ Knowledge of continuous improvement concepts preferred.
+ Experience with programming and scripting languages (Python, Go, Java, Bash).
+ Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack).
+ Excellent problem-solving skills and the ability to work under pressure.
+ Strong communication and collaboration skills, with a focus on teamwork and knowledge sharing.
+ Strong Enterprise Application Support experience
+ Strong Process Management skills
+ Ability to manage ITSM Tools and Enterprise Support tools
+ Understand data integration concepts.
+ SDLC Waterfall and Agile knowledge preferred

**Working Conditions**

+ Normal office working conditions

**Physical Requirements**

+ Sitting
+ Standing (not walking)
+ Walking
+ Lifting up to 20 pounds

**Disclaimer**

_This job description represents an overview of the responsibilities for the above referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor._

**ALREADY A TEAM MEMBER?**

You must apply or refer a friend through our internal portal (https://performancemanager4.successfactors.com/sf/home?company=tractorsup).

**CONNECTION**

Our Mission and Values are more than just words on the wall - they’re the one constant in an ever-changing environment and the bedrock on which we build our culture. They're the core of who we are and the foundation of every decision we make. It’s not just what we do that sets us apart, but how we do it.

**EMPOWERMENT**

We believe in managing your time for business and personal success, which is why we empower our Team Members to lead balanced lives through our benefits and total rewards offerings for full-time and eligible part-time TSC and Petsense Team Members. We care about what you care about!

**OPPORTUNITY**

A lot of care goes into providing legendary service at Tractor Supply Company, which is why our Team Members are our top priority. Want a career with a clear path for growth? Your Opportunity is Out Here at Tractor Supply and Petsense.
**Nearest Major Market:** Nashville