The Work Itself
- Collaborate with Cloud Engineering, Agile squads and developers, and sustain and business partners, contributing significantly to specifications that resolve problems and address enhancement needs, with a focus on logging, monitoring, and metrics for operational readiness
- Use technical knowledge, creativity, and company practices to reduce incidents by developing proactive alerting and monitoring
- Facilitate the day-to-day execution of real-time technical support and troubleshooting
- Drive knowledge transfer from the development team to the sustain team for each functional deployment
- Lead maintenance of the master documents (i.e., the Runbook and Playbook) and help keep application documentation accurate
- Work with IT business and development partners to gather input for new capabilities that display, monitor, and alert on key performance indicators (KPIs) by tracking business transactions (BTs) in real time
- Plan validation and verification of changes deployed by the infrastructure, development, and sustain teams
- Participate in the Discovery Phase to help application teams scope their effort
- Represent application requirements in infrastructure activities
- Participate in technical and status meetings with the business
- Coordinate with upstream applications to resolve incidents
- Communicate with several different technology areas in a highly matrixed organization
- Attend Change Advisory Board (CAB) meetings and approve changes
The Skills You Bring
- BS in Computer Science or a related field preferred; MS a plus
- 5+ years of experience in a similar sustain role and extensive knowledge of the associated processes
- Shows deep understanding of enterprise-scale platforms and architectures
- Possesses strong analytical, problem-solving, and leadership skills
- Experience with Splunk, Datadog, AppDynamics or other similar monitoring tools preferred
- Experience coordinating with upstream applications to resolve incidents
- Learns new technologies quickly and adapts to rapidly shifting priorities
- Experience implementing sustainable, audit-ready processes that support IT controls such as deployment execution, access management, audits, incident management, and change management
- Good understanding of AWS/cloud components, including monitoring
- Ability to correlate environment conditions and metrics with application events
- Experience debugging problems in a distributed system
Brooksource provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, national origin, age, sex, citizenship, disability, genetic information, gender, sexual orientation, gender identity, marital status, amnesty or status as a covered veteran in accordance with applicable federal, state, and local laws.