The "Hertie Institute for AI in Brain Health" (Hertie AI) is a research institute of the Faculty of Medicine, funded by the Gemeinnützige Hertie Stiftung, with the aim of detecting diseases of the nervous system earlier and treating them better with the help of artificial intelligence. Hertie AI is currently in a dynamic build-up phase and cooperates with the strong and innovative AI ecosystem in Tübingen (e.g. Cyber Valley, Cluster of Excellence “Machine Learning in Science”, Tübingen AI Center). Hertie AI uses and benefits greatly from infrastructure shared with these initiatives, such as the Machine Learning Cloud (ML Cloud), but has special compute requirements because of its goal of analyzing brain data and simulating neural circuits. The ML Cloud is a state-of-the-art compute infrastructure with powerful CPU and GPU capacity for AI workloads and petabyte-scale storage, used by more than 400 researchers and engineers.
We are seeking a skilled and proactive Cluster System Administrator to join our team, responsible for managing and optimizing our high-performance computing environment specifically designed for AI workloads. In this role, you will work closely with a team of HPC experts, AI researchers, and IT specialists to ensure that our systems operate at peak performance, supporting AI and ML teams with reliable, scalable computing resources.
- Cluster Management: Oversee and manage daily operations of the compute infrastructure, including configuration, deployment, and optimization of nodes and networks to maximize performance for AI workloads
- System Monitoring and Maintenance: Monitor system performance, storage, and network utilization to ensure the clusters operate efficiently. Address hardware and software issues as they arise
- User Support: Provide technical assistance to AI researchers, data scientists, and developers on efficient use of cluster resources
- Documentation and Reporting: Create and maintain comprehensive documentation on system configuration, maintenance tasks, and troubleshooting procedures. Generate regular reports on system performance, uptime, and resource usage for management
- Education and Experience: Specialist knowledge and professional experience in information technology, applied computer science or computer engineering equivalent to the level of a Master's degree
- Technical Skills: Proficiency in HPC cluster management tools (e.g., SLURM, PBS, or Torque) and in Linux system administration
- Scripting and Automation: Strong scripting skills in Python, Bash, or other languages to automate tasks, optimize processes, and improve system reliability
- Networking and Storage: Solid understanding of high-speed networking, parallel file systems, and large-scale storage solutions (e.g., Lustre, Ceph)
- Problem-Solving: Excellent troubleshooting abilities and a proactive approach to resolving system issues before they impact users
- Interest in artificial intelligence and motivation to collaborate with scientists and professionals in the field of AI research
- English proficiency
- Experience with automation tools for configuration management (e.g. Ansible, Puppet, Chef) and revision control systems (e.g. Git)
- Experience with containers (e.g. Docker, Singularity, Podman, Kubernetes)
- Collaboration in the multifaceted environment of a modern university hospital, which, in addition to patient care, also focuses on medical research and teaching
- A future-proof workplace and location, attractive remuneration including a company pension scheme (VBL), and working hours that are as flexible as possible
- Subsidization of the job ticket for public transport and attractive discounts on employee offer platforms
- Structured onboarding phase and the hospital's own academy for developing professional, social and methodological skills
- Preventive health care through a wide range of sports activities
We offer remuneration in accordance with TV-L (collective wage agreement for the Public Service of the German Federal States). In line with its internationalization agenda, the University of Tübingen welcomes applications from outside Germany. The University of Tübingen is committed to equal opportunity, diversity and inclusion and wishes to increase the share of women and under-represented groups employed in research. Applications from equally qualified candidates with disabilities will be given preference. Women are expressly encouraged to apply. In principle, the position can be shared. Employment is based on the relevant provisions of university law. Please observe the applicable vaccination regulations. Unfortunately, travel expenses for interviews cannot be reimbursed.
To apply, please send your application (a cover letter and your CV in English, together with all relevant certificates) as a single PDF file by 01.01.2025. For more information or questions about technical aspects of the position, please contact Dr. Kristina Kapanova at kristina.kapanova@uni-tuebingen.de.