Observability Administrator

at Blackfluo.ai

Full Time Remote

Job Description:

Location: Fully remote, Central European Time zone
Start date: To be defined
Languages: English is mandatory

The primary objective of this role is to design, implement, upgrade and maintain a robust observability infrastructure using Elastic, Prometheus and Grafana, with complementary capabilities provided by SCOM and Checkmk. The successful candidate will work closely with our DevOps, IT, and development teams to ensure comprehensive monitoring, alerting, and visualization of our systems, and should have advanced experience in complex enterprise environments. The Canonical Observability Stack (COS) will be used, so advanced experience with COS would be ideal.

Duties and Responsibilities:

Assess current monitoring and observability setup and identify gaps.
Design, implement and upgrade Prometheus-based monitoring solutions in an on-premises setup designed for multiple tenants and several support teams.
Configure and maintain Grafana dashboards for real-time visualization, designed for multiple tenants and several support teams.
Integrate Prometheus with other systems and tools (e.g., Loki, Mimir, Tempo, Thanos).
Design, implement and upgrade Elastic (ELK Stack) for on-premises setups.
Develop and document monitoring and logging strategies and best practices.
Set up alerts and notification mechanisms to preemptively address system issues.
Train internal staff on the use and maintenance of Prometheus, Grafana, and Elastic.
Provide ongoing support and improvements to the observability framework.
Ensure high availability and performance of the monitoring and logging systems.
Provide stand-by services on a rotation basis during weekends, holidays and outside of normal working hours.
Perform other duties as required.

Required Qualifications & Experience:

At least 5 years in a similar role.
Proven experience in deploying and managing Elastic, Prometheus and Grafana in an on-premises setup designed for multiple tenants and several support teams.
Strong understanding of observability concepts and best practices, including APM.
Experience with related technologies (e.g., Kubernetes, Docker, Kibana, Mimir, Loki, Tempo, Thanos, on-premises infrastructure).
Proficiency in scripting and automation (e.g., Bash, Python).
DevOps experience and practice.
Familiarity with infrastructure-as-code tools (e.g., Ansible, Terraform).
Experience with log management and tracing solutions (e.g., Loki, ELK stack, Jaeger).
Knowledge of other monitoring tools is desirable, especially SCOM and Checkmk.
Programming skills are desirable, especially .NET C# and Python.

Education and Certifications:

Bachelor's or master's degree in information technology is desirable.
Monitoring certifications in SCOM, Checkmk, Elastic, Prometheus or Grafana are desirable.
Linux and/or Windows System Administration
Network Administration

Location

Cairo, Egypt

Remote Job

Job Overview

Job Posted:

1 month ago

Job Type

Full Time