Mountain View, CA, 94039, USA
23 hours ago
Sr Staff Engineer, ML Infrastructure and Performance
LinkedIn is the world’s largest professional network, built to create economic opportunity for every member of the global workforce. Our products help people make powerful connections, discover exciting opportunities, build necessary skills, and gain valuable insights every day. We’re also committed to providing transformational opportunities for our own employees by investing in their growth. We aspire to create a culture that’s built on trust, care, inclusion, and fun – where everyone can succeed. Join us to transform the way the world works.

At LinkedIn, our approach to flexible work is centered on trust and optimized for culture, connection, clarity, and the evolving needs of our business. The work location of this role is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team. This role will be hybrid in LinkedIn's Sunnyvale, CA campus.

About the Role
We are seeking a Senior Staff Engineer to design, build, and maintain our large-scale GPU infrastructure for machine learning (ML) and AI workloads. In this role, you will be the technical leader responsible for selecting the appropriate hardware SKUs, designing networking and storage architectures, and ensuring the production fleet’s reliability and efficiency. You will collaborate closely with data scientists, ML engineers, and infrastructure teams to support cutting-edge AI initiatives across the organization.

Key Responsibilities

Infrastructure Architecture & Hardware Selection
- Define and evaluate GPU server SKUs, memory configurations, and CPU-GPU ratios to optimize ML training and inference workloads.
- Stay current on industry trends and emerging hardware technologies (e.g., next-gen GPUs, high-speed interconnects, NVMe storage) and provide guidance on future upgrades.
Networking & Storage Architecture
- Design high-throughput, low-latency networking solutions (e.g., InfiniBand, 100/200+ GbE, or RoCE) for large-scale distributed training.
- Assess, design, and integrate scalable storage solutions (e.g., parallel file systems, object storage, NVMe over Fabric) to meet performance and capacity requirements for ML workloads.
- Collaborate with network and storage teams on capacity planning, topology, redundancy, and data management strategies.

Operational Excellence & Production Support
- Lead end-to-end operational support for production GPU fleets, including troubleshooting hardware, networking, and application-level issues.
- Develop and maintain observability tooling (monitoring, alerting, logging) to proactively detect and resolve hardware issues and performance bottlenecks.
- Establish SLAs, SLOs, and best practices for availability, reliability, and cost optimization across the ML platform.

Collaboration & Leadership
- Partner with Data Scientists, ML Engineers, and DevOps teams to understand workload requirements and ensure infrastructure meets performance and cost goals.
- Guide junior engineers in architecture best practices, capacity planning, and operational troubleshooting.
- Advocate for and drive adoption of modern infrastructure patterns and MLOps best practices across the company.

Strategic Roadmapping
- Influence the long-term roadmap for ML/AI infrastructure, factoring in technology trends, product needs, and growth projections.
- Evaluate and introduce new tools, frameworks, and processes that can enhance the productivity of ML teams and improve system performance.

Basic Qualifications
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent industry experience.
- 8+ years of experience designing and managing large-scale distributed systems or HPC environments, with at least 3 years focused on GPU-based ML or AI workloads.
Preferred Qualifications
- Master's degree in Computer Science, Electrical Engineering, or a related field.
- Proven track record of architecting GPU-based compute clusters (NVIDIA, AMD, or other accelerators) and understanding advanced GPU features (CUDA, Tensor Cores, MIG, etc.).
- Deep knowledge of high-performance networking (InfiniBand, RoCE, or high-speed Ethernet) and parallel storage systems (Lustre, GPFS, BeeGFS, or similar).
- Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK stack, etc.) for large-scale systems.
- Experience with containerization (Docker, Singularity) and job schedulers (Kubernetes, Slurm, or other HPC schedulers).
- Excellent communication skills, with the ability to explain complex topics to technical and non-technical stakeholders.
- Demonstrated leadership in cross-functional, collaborative projects with internal and external teams.
- Strong organizational skills, capable of managing multiple projects and deadlines simultaneously.

Suggested Skills
- Infrastructure Architecture & Hardware Selection
- Strategic leadership
- Networking & Storage Architecture

LinkedIn is committed to fair and equitable compensation practices. The pay range for this role is 149,000 to 247,000. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. This may be different in other locations due to differences in the cost of labor. The total compensation package for this position may also include annual performance bonus, stock, benefits and/or other applicable incentive compensation plans. For more information, visit https://careers.linkedin.com/benefits.

Equal Opportunity Statement
LinkedIn is committed to diversity in its workforce and is proud to be an equal opportunity employer.
LinkedIn considers qualified applicants without regard to race, color, religion, creed, gender, national origin, age, disability, veteran status, marital status, pregnancy, sex, gender expression or identity, sexual orientation, citizenship, or any other legally protected class. LinkedIn is an Affirmative Action and Equal Opportunity Employer as described in our equal opportunity statement here: https://microsoft.sharepoint.com/:b:/t/LinkedInGCI/EeE8sk7CTIdFmEp9ONzFOTEBM62TPrWLMHs4J1C_QxVTbg?e=5hfhpE. Please reference https://www.eeoc.gov/sites/default/files/2023-06/22-088_EEOC_KnowYourRights6.12ScreenRdr.pdf and https://www.dol.gov/ofccp/regs/compliance/posters/pdf/OFCCP_EEO_Supplement_Final_JRF_QA_508c.pdf for more information.

LinkedIn is committed to offering an inclusive and accessible experience for all job seekers, including individuals with disabilities. Our goal is to foster an inclusive and accessible workplace where everyone has the opportunity to be successful. If you need a reasonable accommodation to search for a job opening, apply for a position, or participate in the interview process, connect with us at accommodations@linkedin.com and describe the specific accommodation requested for a disability-related limitation.

Reasonable accommodations are modifications or adjustments to the application or hiring process that would enable you to fully participate in that process. Examples of reasonable accommodations include but are not limited to:
- Documents in alternate formats or read aloud to you
- Having interviews in an accessible location
- Being accompanied by a service dog
- Having a sign language interpreter present for the interview

A request for an accommodation will be responded to within three business days. However, non-disability related requests, such as following up on an application, will not receive a response.
LinkedIn will not discharge or in any other manner discriminate against employees or applicants because they have inquired about, discussed, or disclosed their own pay or the pay of another employee or applicant. However, employees who have access to the compensation information of other employees or applicants as a part of their essential job functions cannot disclose the pay of other employees or applicants to individuals who do not otherwise have access to compensation information, unless the disclosure is (a) in response to a formal complaint or charge, (b) in furtherance of an investigation, proceeding, hearing, or action, including an investigation conducted by LinkedIn, or (c) consistent with LinkedIn's legal duty to furnish information.

Pay Transparency Policy Statement
As a federal contractor, LinkedIn follows the Pay Transparency and non-discrimination provisions described at this link: https://lnkd.in/paytransparency.

Global Data Privacy Notice for Job Candidates
This document provides transparency around the way in which LinkedIn handles personal data of employees and job applicants: https://lnkd.in/GlobalDataPrivacyNotice