Site Reliability Engineer – Platform Engineering

We are looking for a Site Reliability Engineer experienced with Linux systems to join our expanding IT Operations team. This role involves a high level of collaboration with other technical staff.

Responsibilities

  • Work with the Architecture Team, Principal Engineers and the Development Teams to translate company needs into infrastructure solutions that meet requirements for performance, resource usage, scalability, resilience and observability. Proposed solutions may include on-premises virtualised/bare-metal, cloud or hybrid architectures and must use Continuous Integration and Continuous Delivery, Infrastructure as Code and GitOps approaches.
  • Invest time in developing and maintaining pipelines, scripts and playbooks to continuously reduce the manual work (toil) required to operate production services.
  • Collaborate with the Architecture Team, Principal Engineers and the Development Teams on projects to move production services to cloud environments.
  • Manage OpenStack virtual cloud clusters, including performance tuning, upgrades and operational duties.
  • Troubleshoot OpenStack-related issues.
  • Provision, operate (performance, scaling, organization, routine patching, security…) and decommission Linux servers.
  • Provision, operate (performance, scaling, organization, routine patching, security…) and decommission OpenStack clusters and the VMs running on them.
  • Provision, operate (performance, scaling, organization, routine patching, security…) and decommission Kubernetes clusters and the resources deployed on them.
  • Provide comprehensive handover, top-tier technical assistance and documentation to the operating and monitoring teams.
  • Manage infrastructure services such as web, DNS, SNMP, DHCP and others.
  • Participate in the shared on-call rotation.

Requirements

  • Experience in automating configuration management tasks using Ansible playbooks.
  • Extensive experience with Unix/Linux systems (Canonical Ubuntu and Red Hat/CentOS Linux) in a large-scale, distributed Linux production environment.
  • Experience writing scripts to automate infrastructure tasks (Python, shell script…).
  • Experience working with the OpenStack platform (COA certification is a plus).
  • Strong experience managing Ceph storage clusters, including maintenance and tuning of the shared storage platform.
  • Experience with centralized log management tools (Splunk, ELK, Fluentd).
  • Experience with centralized management systems (Puppet, Canonical Landscape).
  • Experience using Terraform to apply Infrastructure as Code.
  • Experience writing automation pipelines (Argo Workflows, GitHub Actions…) is a plus.
  • Solid knowledge of enterprise-level virtualisation (VMware, KVM).
  • Demonstrated ability to troubleshoot system and network problems.
  • Nice to have: experience with and knowledge of Windows systems administration and investigation, especially of event logs and services.
  • Nice to have: experience with and knowledge of Data Analytics/AIOps.
  • Extremely organized, with strong attention to detail.
  • Ability to work well under pressure.
  • Demonstrated ability to manage multiple tasks and competing priorities.
  • Great communication, interpersonal and teamwork skills.
  • Fluent in English.

Job Category: Infrastructure
Job Type: Full Time
Job Location: Malaga, Madrid or Seville
