Site Reliability Engineer – IDM Team

Work model: Hybrid (2 days in the office per week)
Job Type: Full Time
Job Location: Málaga, Madrid or Sevilla

As a Site Reliability Engineer (SRE) in the IDM team, you will be responsible for contributing to the reliability, availability, and performance of mission-critical applications and systems. You will be part of a team that bridges the gap between development and operations, applying your technical expertise and problem-solving skills to implement best practices in infrastructure automation, monitoring, scaling, and incident response.

The role requires prior experience as an SRE or in similar functions, as well as solid knowledge of the technologies and methodologies described below. A collaborative mindset, focus on continuous improvement, and strong teamwork skills will be key to success in this role.

Candidates should ideally have a background in open-source systems and Linux, although knowledge and experience with Microsoft systems will also be considered positively.

Responsibilities

Reliability & Availability

  • Contribute to maintaining and improving system reliability, uptime, and performance across production environments.
  • Support tracking of service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs).
  • Assist in improving incident response processes and implementing fault-tolerant systems.

Automation & Infrastructure

  • Develop and maintain automation tools for infrastructure management.
  • Collaborate with development teams to integrate reliability practices into CI/CD pipelines.
  • Contribute to improving scalability and resilience of cloud infrastructure.

Monitoring & Observability

  • Implement and maintain monitoring systems and alerts to proactively identify issues.
  • Help define key performance metrics and support the implementation of logging and observability solutions.

Incident Management & Root Cause Analysis

  • Participate in incident response, assisting with root cause analysis and post-mortems.
  • Document findings and collaborate on improving procedures and playbooks.

Collaboration

  • Work closely with other SREs, software engineers, and cross-functional teams to ensure service reliability.
  • Contribute to continuous improvement initiatives to reduce toil and optimize resource utilization.

Requirements

Required Soft Skills

Problem-Solving & Critical Thinking
  • Ability to analyze and troubleshoot complex technical issues.
  • Continuous improvement mindset with innovative problem-solving skills.
Communication & Collaboration
  • Strong verbal and written communication skills to explain technical issues.
  • Ability to collaborate with multidisciplinary teams.
Adaptability & Flexibility
  • Comfortable working in dynamic environments with shifting priorities.
  • Open to new technologies and adaptable in improving processes.
Ownership & Accountability
  • Strong commitment to production system reliability.
  • Proactive in identifying and resolving issues.
Resilience under Pressure
  • Ability to remain calm and focused during critical incidents.

Required Technical Skills

Infrastructure Automation & Configuration Management
  • Experience with IaC tools such as Terraform, Ansible, AWX, or Puppet.
  • Knowledge of Docker and Kubernetes.
  • Familiarity with cloud platforms (AWS, GCP, or Azure) is not mandatory but will be considered positively.
  • Administration of hypervisors; experience with VMware or OpenStack is a plus.
  • DNS management in Microsoft and open-source environments (BIND, CoreDNS, etc.).
Monitoring & Observability
  • Hands-on experience with tools such as Prometheus, Grafana, and Icinga.
  • Knowledge of logging and tracing (ELK stack, Fluentd, OpenTelemetry).
Authentication & Identity Management
  • Familiarity with authentication protocols: LDAP, SAML, OAuth, OpenID Connect.
  • Experience with tools such as Active Directory, ADFS, FreeIPA, and Keycloak is a plus.
  • Knowledge of MFA solutions (PrivacyIDEA, Azure MFA, Duo, Okta, etc.).
Incident Management
  • Experience supporting incident management and documenting post-mortems.
Operating Systems
  • Administration of Ubuntu and CentOS. Experience with Microsoft operating systems will be considered favorably but is not a requirement.
  • Knowledge of security, performance tuning, and patch management.
Microsoft Systems Management
  • Knowledge of Active Directory, GPOs, DNS, and replication.
Scripting & Programming
  • Proficiency in PowerShell, Bash, Python, and Ansible.
  • Ability to automate tasks and manage infrastructure as code.
Containerization & Orchestration
  • Experience with Docker, Podman, and Kubernetes.
  • Deployment and management of containerized applications.
Performance Tuning & Optimization
  • Ability to identify and resolve bottlenecks in distributed systems.