Site Reliability Engineer – IDM Team

Work model: Hybrid (2 days in the office per week)
Job Type: Full Time
Job Location: Málaga, Madrid or Sevilla

As a Site Reliability Engineer (SRE) in the IDM team, you will contribute to the reliability, availability, and performance of mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, applying your technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

The role requires prior experience as an SRE or in similar functions, as well as solid knowledge of the technologies and methodologies described below. A collaborative mindset, focus on continuous improvement, and strong teamwork skills will be key to success in this role.

Candidates should ideally have a background in open-source systems and Linux, although knowledge and experience with Microsoft systems will also be considered positively.

Responsibilities

Reliability & Availability

Contribute to maintaining and improving system reliability, uptime, and performance across production environments.
Support tracking of service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
Assist in improving incident response processes and implementing fault-tolerant systems.

Automation & Infrastructure

Develop and maintain automation tools for infrastructure management.
Collaborate with development teams to integrate reliability practices into CI/CD pipelines.
Contribute to improving scalability and resilience of cloud infrastructure.

Monitoring & Observability

Implement and maintain monitoring systems and alerts to proactively identify issues.
Help define key performance metrics and support the implementation of logging and observability solutions.

Incident Management & Root Cause Analysis

Participate in incident response, assisting with root cause analysis and post-mortems.
Document findings and collaborate on improving procedures and playbooks.

Collaboration

Work closely with other SREs, software engineers, and cross-functional teams to ensure service reliability.
Contribute to continuous improvement initiatives to reduce toil and optimize resource utilization.

Requirements

Required Soft Skills

Problem-Solving & Critical Thinking

Ability to analyze and troubleshoot complex technical issues.
Continuous improvement mindset with innovative problem-solving skills.

Communication & Collaboration

Strong verbal and written communication skills to explain technical issues.
Ability to collaborate with multidisciplinary teams.

Adaptability & Flexibility

Comfortable working in dynamic environments with shifting priorities.
Open to new technologies and adaptable in improving processes.

Ownership & Accountability

Strong commitment to production system reliability.
Proactive in identifying and resolving issues.

Resilience under Pressure

Ability to remain calm and focused during critical incidents.

Required Technical Skills

Infrastructure Automation & Configuration Management

Experience with IaC tools such as Terraform, Ansible, AWX, or Puppet.
Knowledge of Docker and Kubernetes.
Familiarity with cloud platforms (AWS, GCP, or Azure) is not mandatory but will be considered a plus.
Administration of hypervisors; experience with VMware or OpenStack is a plus.
DNS management in Microsoft and open-source environments (BIND, CoreDNS, etc.).

Monitoring & Observability

Hands-on experience with tools such as Prometheus, Grafana, and Icinga.
Knowledge of logging and tracing (ELK stack, Fluentd, OpenTelemetry).

Authentication & Identity Management

Familiarity with authentication protocols: LDAP, SAML, OAuth, OpenID Connect.
Experience with tools such as Active Directory, ADFS, FreeIPA, and Keycloak is a plus.
Knowledge of MFA solutions (PrivacyIDEA, Azure MFA, Duo, Okta, etc.).

Incident Management

Experience supporting incident management and documenting post-mortems.

Operating Systems

Administration of Ubuntu and CentOS. Experience with Microsoft operating systems is viewed favorably but is not required.
Knowledge of security, performance tuning, and patch management.

Microsoft Systems Management

Knowledge of Active Directory, GPOs, DNS, and replication.

Scripting & Programming

Proficiency in PowerShell, Bash, Python, and Ansible.
Ability to automate tasks and manage infrastructure as code.

Containerization & Orchestration

Experience with Docker, Podman, and Kubernetes.
Deployment and management of containerized applications.

Performance Tuning & Optimization

Ability to identify and resolve bottlenecks in distributed systems.
