Job Description
We are seeking a visionary Senior AI Infrastructure Engineer to join our elite team at Nexus Horizon AI. As we prepare for the next wave of generative intelligence, you will be responsible for architecting the scalable, high-performance systems that power our future. This role is not just about maintaining servers; it is about building the foundation for the AI advancements of 2026.
Why Join Us?
Work on cutting-edge Large Language Models (LLMs) and autonomous agents. We offer competitive benefits, equity, and the chance to shape the technological landscape of tomorrow.
Responsibilities
- Architect & Scale: Design and maintain resilient, distributed computing infrastructure optimized for training and inference of large-scale AI models.
- Model Optimization: Implement techniques to reduce latency and increase throughput for real-time generative AI applications.
- DevOps Integration: Establish CI/CD pipelines specifically tailored for machine learning workflows using Kubernetes and containerization technologies.
- Resource Management: Oversee GPU cluster management and cloud resource allocation to maximize cost-efficiency and performance.
- Reliability: Implement advanced monitoring and alerting systems to ensure 99.99% uptime for critical AI services.
- Collaboration: Partner with research scientists to translate theoretical models into production-ready software.
Qualifications
- Education: Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
- Experience: At least 5 years of experience in systems engineering, DevOps, or backend software development with a focus on AI/ML infrastructure.
- Programming: Proficiency in Python, C++, or Rust, with a deep understanding of memory management and parallel computing.
- Cloud Expertise: Strong experience with at least one major cloud provider (AWS, GCP, or Azure) and its managed ML platform services.
- Containerization: Extensive experience with Docker and Kubernetes orchestration.
- Problem Solving: Ability to troubleshoot complex performance bottlenecks in high-concurrency environments.