Senior Software Engineer - AI Hardware
Bloomberg
The Role: We are seeking an engineer to join our hardware management team. This team is responsible for the provisioning, monitoring, and support of thousands of servers used by dozens of teams within Bloomberg, including the entire AI stack!
The ideal candidate will have experience in designing, implementing, and maintaining system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems. This role will also be responsible for overseeing the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability.
We'll trust you to:
Design, build, and maintain highly reliable, scalable, and efficient infrastructure platforms that support our engineering teams and business needs
Participate in system design discussions and contribute to architectural decisions
Ensure code quality through standard methodologies, code reviews, and alignment to clean code principles
Produce clear and consumable documentation for a wide audience
Communicate effectively across diverse teams
Participate in on-call rotations as arranged
Be a self-starter, manage priorities, and work independently
Stay up to date with the latest infrastructure technologies and industry-standard processes, and evaluate their potential impact on existing and future solutions

Who you are?
Hold yourself to high standards
Exude our ambitious, collaborative, and empathetic values
Have a self-starter mentality with an eagerness to solve previously unsolved problems
Have excellent collaboration skills and are open to giving and receiving critical feedback across teams
Scalability and reliability are hardwired into your DNA
You have publicly available writing samples, blog posts, demos, or recordings of presentations on technical topics

What's in it for you?
A unique opportunity to be part of a rapidly growing team in one of the most exciting engineering groups at Bloomberg
An inclusive and supportive work culture that fosters learning and growth
Continuous professional development, product training, and career pathing
Intra-departmental mentor and buddy program for in-house networking
An inclusive company culture, with the ability to join our Community Guilds

You'll need to have:
4+ years of proficiency in Kubernetes environments (deployments, storage, services, jobs, ingress, egress, etc.)
BA, BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field
Hands-on management of GPU-based systems, including kernel and driver management, and developing software tooling to automate provisioning and maintenance of these systems
Experience designing, implementing, and maintaining system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems
Oversee the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability
Drive system upgrades, customization, and seamless integration with software developers, network operations, and data center teams
Manage and maintain a diverse range of computer systems and application software, ensuring they meet the highest standards of functionality and efficiency
Develop and maintain expertise in low-latency/high-bandwidth interconnected infrastructure (including InfiniBand, Ethernet, RDMA/RoCE, and others)
Monitor and evaluate the efficiency and effectiveness of infrastructure service delivery methods and procedures
Partner with internal teams to develop prioritization, metrics, and processes around capacity planning and infrastructure availability. Periodically present capacity planning and performance reports to senior leaders during presentations and meetings
Benchmark, analyze, and make recommendations for improvement of IT infrastructure

We'd love to see:
Expertise with Kubernetes design patterns (operators, Helm charts, Kustomize, etc.)
Experience with data center planning, including rack elevations, cabling plans, and cables/transceivers
Experience with data center operations and management
Salary Range = 160,000 - 240,000 USD Annually + Benefits + Bonus
The referenced salary range is based on the Company's good faith belief at the time of posting. Actual compensation may vary based on factors such as geographic location, work experience, market conditions, education/training and skill level.
We offer one of the most comprehensive and generous benefits plans available, along with a range of total rewards that may include merit increases, incentive compensation [Exempt roles only], paid holidays, paid time off, medical, dental, vision, short- and long-term disability benefits, 401(k) + match, life insurance, and various wellness programs, among others. The Company does not provide benefits directly to contingent workers/contractors and interns.