Solvd Inc. is a premier software engineering company. We have 8 offices across the globe and over 800 international employees on staff. With over 12 years of experience, highly skilled teams around the world and deep industry knowledge, we help clients create software that improves their operations and opens new markets. We have built an impressive roster of digital-native enterprise clients including some of the biggest brands in retail and social media.
We are looking for a seasoned ML Ops Engineer to join our dynamic team. In this role, you will be responsible for managing the machine learning infrastructure and deploying models to production environments. You will work with tools such as AWS, AWS CloudFormation, AWS SageMaker, and the Databricks Platform, including Unity Catalog, to optimize and maintain our end-to-end ML pipelines.
Key Responsibilities:
ML Infrastructure Management: Architect, manage, and maintain scalable ML infrastructure using AWS services like EC2, SageMaker, S3, and CloudFormation templates.
Model Deployment: Automate the deployment of machine learning models to production using AWS SageMaker and Databricks, ensuring continuous availability and performance.
CloudFormation Automation: Use AWS CloudFormation to define and provision infrastructure for ML workloads, following infrastructure-as-code best practices.
Data Management & Governance: Leverage Databricks Unity Catalog for data governance, security, and compliance, ensuring high data quality and streamlined model training processes.
Monitoring & Optimization: Set up and run monitoring for models in production using tools like AWS CloudWatch and Databricks monitoring solutions. Address performance bottlenecks and maintain model accuracy over time.
Collaboration with Data Teams: Work closely with data scientists and data engineers to streamline model development and production workflows.
Automated ML Pipelines: Build and maintain CI/CD pipelines for ML models using AWS and Databricks. Ensure models are consistently tested, monitored, and retrained when necessary.
Required Skills:
AWS Expertise: Strong knowledge of AWS services, including AWS SageMaker, CloudFormation, EC2, S3, and CloudWatch.
CloudFormation: Experience creating and managing AWS resources, and automating their provisioning, with AWS CloudFormation templates.
Databricks Experience: Expertise in the Databricks platform and Unity Catalog, with the ability to manage large-scale data pipelines and ensure model performance at scale.
ML Deployment: Proven experience in deploying machine learning models to production environments using AWS SageMaker and Databricks.
CI/CD Pipelines: Solid understanding of CI/CD pipelines and version control, particularly in the context of machine learning models.
Programming Skills: Strong coding skills in Python, with knowledge of ML libraries such as TensorFlow, PyTorch, and scikit-learn.
Monitoring & Logging: Experience setting up and managing monitoring and logging for production models, tracking performance and detecting anomalies.
DevOps Mindset: Familiarity with DevOps principles and infrastructure automation, with experience using Docker, Kubernetes, or other containerization/orchestration tools.