Machine Learning Engineer - MLOps Team
We are seeking a skilled Machine Learning Engineer to join our growing MLOps team. In our fast-paced startup environment, you'll work closely with our data scientists and data engineers to productionize models and build robust, scalable classification systems. You will have the autonomy to drive the evolution of our ML infrastructure—from development through deployment—and make a direct impact on our product's success.
Key Responsibilities:
• Develop and Optimize ML Pipelines:
Write efficient, clean, and maintainable Python code to implement machine learning pipelines for various classification projects. Ensure these pipelines meet production-grade standards for performance and scalability.
• Production Deployment:
Deploy and optimize diverse classification models—including cross-encoders, bi-encoders, transformers, and custom PyTorch networks—ensuring effective GPU/CPU resource management, memory optimization, and scalability tuning.
• End-to-End System Ownership:
Take full responsibility for deployed ML systems, including incident response, performance monitoring, and ongoing quality maintenance with minimal supervision.
• Data Integration and Analysis:
Collaborate with the data engineering team to analyze model inputs/outputs, validate predictions, and explore potential feature improvements using SQL.
• MLOps Infrastructure & CI/CD:
Build and maintain robust, extensible, and reproducible MLOps infrastructure. Establish and manage CI/CD pipelines, and set up observability for system metrics, logs, and alerts.
• Collaboration and Continuous Improvement:
Contribute to technical design discussions, help break down implementation tasks, and address technical debt as needed. Work cross-functionally with both engineering and data science teams to continuously refine our deployment processes.
Day-to-Day Work
You'll be responsible for implementing and maintaining the classification systems that form the backbone of our data platform. This includes:
- Translating research models into production-ready code.
- Developing and optimizing inference pipelines.
- Managing compute resources and scaling solutions.
- Responding to and resolving production incidents.
- Collaborating with data scientists to improve model deployment efficiency.
- Writing and analyzing SQL queries to monitor model inputs/outputs and validate prediction quality.
We're looking for someone who can work independently and take ownership while maintaining high standards. You'll play a key role in shaping our MLOps practices and building scalable ML systems at the core of our product. If you're passionate about turning ML research into production-ready solutions and care about writing quality code, we'd love to hear from you.
Requirements
• Experience:
- 2+ years in machine learning engineering with a strong focus on production deployment.
- Proven experience deploying and optimizing transformer-based models (e.g., RoBERTa, BERT) in a production environment.
• Technical Skills:
- Strong proficiency in Python and production-grade software development.
- Solid understanding of GPU acceleration, resource optimization, and scalable ML inference pipelines.
- Strong SQL skills and experience with data warehouses.
- Familiarity with AWS (particularly S3) and broader cloud computing concepts.
• DevOps & Testing:
- Experience setting up observability (metrics, logging, and alerting) for ML systems.
- Proficiency with CI/CD practices and testing frameworks (pytest for unit, integration, and model evaluation testing).
- Knowledge of version control (Git) and best practices in documentation.
• Problem-Solving & Communication:
- Excellent problem-solving skills with a proven ability to debug complex production issues.
- Strong communication skills to effectively collaborate with both technical and non-technical team members.
Preferred Qualifications
- Experience with containerization (e.g., Docker) and orchestration platforms (e.g., Kubernetes).
- Familiarity with serverless compute platforms (e.g., Modal) and workflow orchestration tools (e.g., Dagster).
- Knowledge of additional classification models (e.g., XGBoost, FastText) and broader data engineering concepts.
- Understanding of data versioning, experiment tracking, and model registry practices.
- Experience with observability platforms (e.g., Datadog) and data warehouses like Snowflake.
- Experience working with LLMs in production.