Site Reliability Engineering Manager - Illinois, Chicago

Site Reliability Engineering Manager - Permanent

Illinois, Chicago - 60290
  • Post Date: 2022-10-22
  • Job Categories: Engineering
  • Job Type: Permanent
  • Published Date: 2022-10-22
  • Salary Period: Annual
  • Company Name: Atos
  • Company Type: Agency

Job Simplification

This job was posted by Atos and was listed on greenenergyjobsonline.com on 2022-10-22.

The position is in the Engineering category and is located in Chicago, Illinois, US (ZIP code 60290).


Job Overview

Join our team!

Atos is the global leader in secure and decarbonized digital with a range of market-leading digital solutions along with consultancy services, digital security and decarbonization offerings.

  • Worldwide digital leader
  • €11 billion in revenue
  • 105,000 employees
  • 71 countries
  • Olympic & Paralympic Games Worldwide Partner

We inspire candidates and our employees to make the right choices, collectively and individually, to shape the future of the information space.

SRE (Site Reliability Engineering) Job Description

Site Reliability Engineers (SREs) are responsible for keeping production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply engineering principles, operational discipline, and mature automation to our operating environments.

SREs specialize in systems (operating systems, networks, observability), while implementing best practices to continuously improve availability, reliability, and scalability.

As an SRE you will:

Develop and run the SRE team's own tooling and observability using automation such as CI/CD and Kubernetes.

Build monitoring that alerts on symptoms rather than on outages.

Document every action so your findings turn into repeatable actions and then into automation.

Debug production issues across services and levels of the stack.

Plan the growth and reliability of services.

Use your on-call shift to prevent incidents from ever happening.

Be on an on-call rotation to respond to "Code Red" incidents and help restore customer-impacting services.

You may be a fit for this role if you have some of these inclinations:

Have an urge for delivering quickly and effectively and iterating fast.

Think about systems: edge cases, failure modes, behaviors, specific implementations.

As an engineer, when you see something broken, you cannot help but fix it.

Have an urge to document all the things so you do not need to learn the same thing twice.

Strong knowledge of SDLC (System Development Life Cycle)

Strong knowledge of git, Docker, Kubernetes, Jenkins, AWS (Amazon Web Services) or similar technologies

Know how to use configuration management systems such as Chef or Ansible

Have strong programming skills in one or more of the following languages: C, Ruby, Python, Shell, Java

Good understanding of hybrid infrastructure

Projects you could work on:

Automation like CI/CD, self-healing of services, end-to-end or performance testing

Improve monitoring (Datadog, AppD, etc.) and build new smart metrics

Develop a relationship with a product group and help define their SLO/SLI

Work directly with AppDev to improve the product through non-functional requirements and production readiness

Improve operability, latency, capacity planning, change management, and MTTR (Mean Time to Repair)

Leveling of Site Reliability Engineering

Technical

Configuration management: use Chef and Ansible to effectively manage our infrastructure

Infrastructure as code: use Terraform and GitLab CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals

Systems: manage, configure, and troubleshoot operating system issues, storage (block and object), networking VPC (Virtual Private Cloud), proxies and CDN (Content Delivery Network) and administer high-availability PostgreSQL and Redis clusters

Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations

Engineering practices: availability, reliability, and scalability, as well as disaster recovery

Use git and contribute code to git repositories

Experience coding in one or more of the following languages: C, Ruby, Python, Shell, Java

Execution

Planning: familiar with agile methodologies; use epics and issues to drive projects

Organization: workload organization, OKR (Objective and Key Result) leadership

Management: a manager of one, able to self-organize and report asynchronously

Collaboration and Communication

Leading and contributing to scope and designs for issues, epics, and OKRs (Objectives and Key Results)

Contributing to the Handbook, creating and updating runbooks and general documentation, and writing blog posts

Completing Root Cause Analysis (RCA) investigations and performing readiness reviews

Improving team practices through code reviews, handoffs of work and incidents

Influence and Maturity

Knowledge sharing, mentoring.

Self-awareness, handling conflict in the team, and providing and receiving feedback

Maintaining good relationships with other engineering teams that help improve the product

Accountability: willing to proactively step in and do the right thing while providing candid and constructive feedback

Levels for Site Reliability Engineer

Site Reliability Engineer - 1

Technical

General knowledge of 4 technical expertise areas, with deep knowledge in 1 area

a. AWS Cloud Practitioner, resources provisioning and configuration through CLI/API

b. Chef (basic syntax, recipes, cookbooks) or Ansible (basic syntax, tasks, playbooks)

c. Working knowledge of CI/CD, Jenkins, Nexus, pipelines, jobs

d. Kubernetes basic understanding, CLI (Command Line Interface), service re-provisioning

e. Provision and set up metrics in AppD, Grafana, or Datadog

f. Provision and set up logs and queries for frequent questions

g. Networking VPC, proxies and CDN (Content Delivery Network)

Working knowledge of git

Execution

Provides emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed

Proposes ideas and solutions to debug, optimize code, and to automate tasks.

Plan, design and execute solutions within Card/Bank to reach specific goals agreed within the team.

Plan and execute configuration change operations both at the application and the infrastructure levels.

Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation

Experience designing, analyzing, and debugging distributed systems

Collaboration and Communication

Self-organize through issues and epics

Improves documentation all around, either in application documentation, or in runbooks, explaining the why, not stopping with the what.

Root cause analysis and corrective actions

Influence and Maturity

Shares the learnings publicly through issues, runbooks, documentation, and blog posts.

Contributes to the hiring process by reviewing questionnaires or being part of the interview team to qualify SRE candidates

Act as a reliability champion.

Site Reliability Engineer - 2

Technical

General knowledge of most technical expertise areas, with deep knowledge in 2.

a. AWS Cloud Architect / Operations, resources provisioning and configuration via automation

b. Chef (advanced syntax, recipes, cookbooks) or Ansible (advanced syntax, tasks, playbooks)

c. Advanced knowledge of CI/CD, Jenkins, Nexus, pipelines, jobs

d. Kubernetes: cluster provisioning and new services

e. Advanced AppD, Grafana, or Datadog monitoring rules

f. Log shipping pipelines and incident debugging visualizations

g. Advanced networking: VPC, proxies and CDN

Contributes to Card/Bank codebase to resolve performance and observability issues

Hands-on experience creating self-healing and/or self-service solutions via automation and tooling

Execution

Identifies significant projects that result in substantial improvements in reliability, cost savings and/or revenue.

Identifies changes to the product architecture from the reliability, performance, and availability perspectives using a data-driven approach.

Influences the product roadmap and works with engineering and product counterparts to influence improved resiliency and reliability of the product.

Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage.

Identify parts of the system that do not scale, provide immediate palliative measures, and drive long-term resolution of these issues.

Identify Service Level Indicators (SLIs) that will align the team to meet the service level objectives.

Experience designing, analyzing, and debugging distributed systems

Collaboration and Communication:

Leads problem definition, scoping, design, and planning through epics and initiatives.

Leverage experience and technical knowledge to perform RCAs / incident reviews and give technical presentations

Perform and run blameless RCAs (Root Cause Analyses) on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.

For stable counterpart assignments, maintain awareness and actively influence stakeholder planning and execution to improve product reliability

Act as a champion for reliability.

Influence and Maturity:
..... click apply for full job details
