We are looking for a skilled Linux System Administrator to support our Machine Learning and Artificial Intelligence operations. The successful candidate will be responsible for ensuring the stability, scalability, and security of our Linux-based infrastructure, which includes, but is not limited to, clusters, grids, and clouds. This role requires strong technical expertise in Linux system administration, as well as experience with containerization (e.g., Docker) and orchestration (e.g., Kubernetes). The ideal candidate will have a passion for ML/AI and be eager to collaborate with our data science and engineering teams to optimize our workflows.
It would be nice if you also had:
Experience with ML/AI frameworks and libraries (e.g., TensorFlow, PyTorch).
Knowledge of data storage solutions (e.g., HDFS, Ceph).
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack).
In this role, you will:
Manage and maintain the health of our Linux-based infrastructure, including servers, clusters, grids, and clouds.
Ensure system uptime, performance, and security by monitoring logs, metrics, and alerts.
Implement automation tools (e.g., Ansible, SaltStack) to streamline system deployment, configuration, and management.
Collaborate with data science and engineering teams to design and implement optimized workflows for ML/AI workloads.
Provide technical guidance on Linux system administration best practices and standards.
Troubleshoot complex system issues and resolve them promptly.
Develop and maintain documentation of system configurations, procedures, and troubleshooting guides.