The Machine Learning Operations Engineer supports our machine learning infrastructure by ensuring seamless model training, optimization, and deployment. This role is well suited to a tech-savvy individual who enjoys managing machine learning systems and hardware configurations rather than focusing solely on programming, although coding experience is a strong plus. The ideal candidate is a computer enthusiast with a knack for machine learning infrastructure and model optimization, and a passion for working in a collaborative, fast-paced environment.
Responsibilities
- Maintain and manage the software configuration of on-premises machine learning hardware to support optimal performance for training neural networks.
- Set up and maintain cloud-based training environments, primarily on Google Cloud Platform, to facilitate model experimentation and scalability.
- Automate training workflows to drive continuous improvement of vision models, reducing manual overhead and enhancing efficiency.
- Develop automated accuracy assessments and generate reports to evaluate and compare the performance of newly trained neural networks against existing models.
- Ensure predictable and efficient turnaround times for training models with updated datasets to meet project timelines.
- Organize and manage model weights and associated documentation in various formats for deployment across on-premises, cloud, and edge environments.
- Apply quantization and pruning techniques to models to enhance computational efficiency without sacrificing accuracy.
- Design and deploy infrastructure for low-latency inference to enable real-time performance for large-scale models (e.g., vLLMs).
Requirements
- Proven experience with Linux server maintenance in both on-premises and cloud environments.
- Proficient in scripting with Bash and Python to streamline system and model management.
- Hands-on experience with neural network training, data loaders, and data pre-processing pipelines.
- Familiar with data and model parallelism strategies for improving training speed and efficiency.
- Knowledgeable in neural network model conversion and optimization for deployment on diverse hardware.
Preferred Qualifications
- Familiarity with Google Cloud Platform for machine learning operations.
- Experience with specialized NVIDIA hardware and inference tooling, such as Jetson, Triton Inference Server, and NIM.
- Skilled in OpenVINO and ONNX for model conversion and optimization.
- Experience training or fine-tuning large language models (LLMs) would be a significant advantage.
- Software development experience in Python and C++, beyond the scripting listed under Requirements, is beneficial but not mandatory.
- Strong written and verbal communication skills for documentation and collaboration.
- Passion for machine learning technology and an aptitude for problem-solving in fast-paced environments.
At Simbe, you will be at the forefront of retail innovation, working with cutting-edge AI and robotics technologies to transform retail operations. Our culture is dynamic, inclusive, and driven by a passion for improving the way retailers operate and serve their customers. Join us to be a part of a team that is not only reshaping the future of retail but also offering immense value to our clients worldwide.
Simbe Values: R.E.T.A.I.L.
Results-Driven - We are customer-centric and results-driven. We strive to create immense value for our team, partners, customers, and investors.
Empathetic - We are sensitive and mindful. We support each other in challenging times, both professionally and personally.
Transparent - We highly value open communication, both internally and with our partners and customers. We are receptive to feedback.
Agile - We are agile and always eager to learn. We quickly adapt to changes and customer needs.
Innovative - We are bold and innovative, with an intense focus on product design and user experience.
Leaders - We strive for excellence. We are accountable, the best at what we do, and leaders in our field.