Gauss Labs is seeking a highly skilled Site Reliability Engineer (SRE) to join our team. As an SRE at Gauss Labs, you will play a critical role in ensuring the reliability, performance, and scalability of our industrial AI platform. You will be responsible for building and maintaining robust solutions that support our growing business at customer sites.

Responsibilities

  • Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
  • Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
  • Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
  • Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring the customers’ infrastructure can handle increasing workloads.
  • Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
  • Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
  • Customer Focus: Working closely with the AI Program Manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
  • Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.

Basic Qualifications

  • Bachelor's degree in computer science, engineering, or a related discipline.
  • 5+ years of industry experience as a Site Reliability Engineer.
  • Experience with cloud platforms (e.g., AWS, GCP, Azure).
  • Experience with scripting languages (e.g., Python).
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana).
  • Experience in ticket management, issue resolution, and troubleshooting.
  • Strong problem-solving and troubleshooting skills.
  • Ability to work independently and as part of a team.
  • Excellent customer communication and interpersonal skills.

Preferred Qualifications

  • Knowledge of containerization technologies (Docker, Kubernetes).
  • Knowledge of AI/ML infrastructure and workloads.
  • Knowledge of big data technologies (Hadoop, Spark).
  • Fluency in verbal and written English.

Location

Yeoksam, Seoul

Job Overview

Job Posted: 3 months ago
Job Type: Full Time