Lead/Senior SRE - Permanent - Fully Onsite - London - £90,000pa

Posted 3 hours 21 minutes ago by Robson Bale Ltd

£90,000 Annual

Permanent

London, United Kingdom

Job Description

Responsible for delivering end-to-end self-healing automation solutions to reduce manual effort and toil.

Primary Skills - Ansible, Terraform, Python, DevOps, SRE, Docker, AWS (Atlas), ECS-based internal tooling

Secondary Skills - Shell scripting, Linux, monitoring tools (Datadog, Splunk, Dynatrace, Grafana, ThousandEyes, Gremlin etc.)

Experience with automation principles and tools (Ansible etc.). Should have worked on toil identification and quality-of-life automation.
Advanced working experience with two or more of the following: Unix/Linux, Windows Server, Oracle, MSSQL, MongoDB.
Experience with Python, Java, curl scripting, or any other scripting language.
Experience with JIRA, Confluence, Bitbucket, GitHub, Jenkins, Jules, Terraform.
Experience with two or more of the following observability tools: AppDynamics, Geneos, Dynatrace, ECS-based internal tooling, Datadog, CloudWatch, BigPanda, Elasticsearch (ELK), Google Cloud Logging, Grafana, Prometheus, Splunk, ThousandEyes etc.
Experience creating dashboards for infrastructure, APM, and end-to-end workflows.
Experience with monitoring, logging, alerting, and error budgets (99.9%, 99.99%, 99.999%) for software, operations, and business.
Ability to define SLOs, SLIs, and SLAs with business, operations, and engineering teams.
Experience with logging, monitoring, and event detection on Cloud or Distributed platforms.
Experience creating and modifying technical documentation such as environment flows, functional requirements, and non-functional requirements.
Effective production management: incident and change management, production control, ITSM, ServiceNow, and problem-solving and analytical skills, with the ability to turn findings into strategic imperatives.
Experience in technical operations and application support, with a focus on stability, reliability, and resiliency.
Hands-on experience implementing SRE monitoring systems, including dashboard development for application reliability using Splunk, Dynatrace, Grafana, AppDynamics, Datadog, and BigPanda.
Experience working with Configuration as Code, Infrastructure as Code, and AWS (Atlas).
Provides technical direction on monitoring and logging to less experienced staff, or develops highly complex original solutions. Acts as an expert technical resource for modelling, simulation, and analysis efforts.
Overall, we are looking for an automation engineer who can reduce toil and improve the reliability and scalability of the system.

Nature of the Job:

Collaborate with the production support team to identify existing manual activities and automate them.
Identify areas of toil that can be automated to avoid manual intervention.
Build monitoring systems and an observability platform that provide richer stack traces, alerts, and dashboards.
Define SLAs, SLOs, and SLIs and implement them for better monitoring.
Scalability, reliability, and observability are the primary goals, with the aim of reducing MTTD and MTTR.