Moffett Field, CA
22 days ago
Senior HPC Engineer

ASRC Federal is searching for a Senior HPC Engineer

 

ASRC Federal InuTeq provides High Performance Computing services throughout the HPC lifecycle for computational requirements, architecture, acquisition, and operations to federal government customers. Our employees embrace innovation and are committed to a culture of continuous, standards-driven process improvement, and assimilation of industry best practices. We are seeking to fill a role that primarily provides development for Supercomputing Batch Scheduling with Supercomputing Systems Administration secondary support for our NASA NACS High Performance Computing (HPC) contract.

 

Summary: The successful candidate will be an active supporting member of the ASRC Federal team, reporting directly to the HPC Systems Team Manager.

 

An individual at this skill level should have demonstrated extensive operational and design experience with advanced HPC subject areas such as HPC systems architecture, InfiniBand, and batch schedulers (e.g., PBS, Slurm, or Moab/Torque), while helping users of HPC resources with the issues they encounter in getting applications to run efficiently. This individual should demonstrate experience installing, maintaining, and upgrading HPC systems. The individual, along with the entire HPC team, will be engaged in the day-to-day operations and support of the HPC resources. Activities may include system patching, OS upgrades, deploying new systems, writing scripts, and troubleshooting system issues on the HPC system. The ability to interact with users to determine symptoms, and then reproduce their issues to isolate the causes, is a critical skill for this work. There will also be activities in testing, benchmarking, user tool scripting, and analyzing trouble tickets to find patterns indicating system or user education issues.

 

Duties and Responsibilities:

Designs, deploys, and maintains production HPC clusters with 2000+ InfiniBand-connected nodes and 100+ petabytes of data storage.
Writes and shepherds scalable feature designs through the entire software development process, from requirements and use cases to release.
Designs and develops scripts for system administration, monitoring, and usage reporting.
Modifies existing software to correct errors and/or improve performance.
Designs and develops scripts for system regression and performance testing (file systems (Lustre), scheduler (PBS), interconnect (HDR/NDR, Slingshot), high availability, etc.).
Troubleshoots, isolates, and resolves application, system, and other technical problems (hardware, software, and network).
Understands research use cases, researches and deploys new technologies, and defines cost, performance, and other trade-offs.
Manages and maintains tools for configuration management (HPCM, Ansible, and Git), resource management, scheduling, and all necessary aspects of HPC in accordance with best practices.
Researches, deploys, and manages networking and security infrastructure, including development of policies and procedures.
Assists in developing and writing proposals and publications.
Creates and provides clear documentation.
Mentors junior staff and cross-trains peers.
Provides after-hours/weekend support as required.
Performs moderate supercomputing system administration that contributes to:
  Day-to-day operations of the Linux HPC clusters and storage systems
  Proactively monitoring, analyzing, and correcting system issues
  Developing scripts to automate repetitive tasks, or tools to enhance support of the HPC systems
  System performance analysis and tuning
  Building, installing, and supporting user-requested software
  Supporting evaluation and assessment of new HPC technology
  Resolving user-reported issues and managing support ticket requests in Remedy