Job Description
We are on the cusp of the artificial intelligence revolution, and Nebula AI Systems is building the infrastructure to power it. We are seeking a visionary Senior AI Infrastructure Engineer to lead the architectural design of our next-generation neural processing systems. As we prepare for the paradigm shifts of 2026, you will play a pivotal role in deploying scalable, fault-tolerant, high-performance computing environments that support advanced generative models.
In this role, you will bridge the gap between cutting-edge AI research and robust, production-grade engineering. You will work in a high-velocity environment where your code directly affects the capabilities of AI agents worldwide. If you are passionate about optimizing compute resources and building systems that scale to petabytes of data, we want to hear from you.
Responsibilities
- Architect High-Performance Systems: Design and implement distributed computing architectures capable of handling massive inference loads for the generative models of 2026.
- Optimize Inference Latency: Work closely with ML researchers to optimize model weights and kernels for specific GPU hardware, reducing latency and increasing throughput by up to 40%.
- Cloud-Native Deployment: Leverage Kubernetes and containerization strategies to ensure seamless deployment across hybrid cloud environments with zero-downtime rollouts.
- Infrastructure Automation: Develop Infrastructure as Code (IaC) pipelines using Terraform and Ansible to automate the provisioning and scaling of GPU clusters.
- Security & Compliance: Implement rigorous security protocols to protect sensitive training data and ensure adherence to industry standards.
- Performance Monitoring: Establish real-time monitoring and alerting systems using Prometheus and Grafana to proactively identify and resolve bottlenecks.
Qualifications
- Education: BS, MS, or PhD in Computer Science, Electrical Engineering, or a related technical field.
- Experience: 5+ years of experience in software engineering, with at least 3 years specifically focused on AI infrastructure or high-performance computing.
- Programming: Expert-level proficiency in Python and C++. Deep understanding of GPU programming (CUDA, OpenCL) or NPU architectures.
- Systems: Strong working knowledge of Linux internals, distributed systems theory, and message queues (Kafka, RabbitMQ).
- Tools: Experience with containers and container orchestration (Docker, Kubernetes), cloud providers (AWS, GCP, or Azure), and ML frameworks (PyTorch, TensorFlow, JAX).
- Soft Skills: Exceptional problem-solving skills and the ability to communicate complex technical concepts to cross-functional teams.