- Lead emergency response efforts in conjunction with the Engineering, Infrastructure, and Database teams to establish root cause
- Lead efforts to build robust monitoring solutions while expanding our current monitoring and alerting footprint
- Participate in designing solutions that increase the overall stability of NH Platforms, and identify potential risks
- Conduct Blameless Postmortems and Anomaly Investigations after incidents to further analyze root cause and create permanent solutions to improve serviceability and prevent future outages
- Establish a Don’t Repeat Incidents (DRI) culture by learning from past issues and always looking to improve monitoring and dashboarding capabilities
- Ensure applications perform efficiently by collaborating with development and architecture teams to resolve application performance issues
- Consult with management on the analysis of short- and long-range business requirements and recommend innovations
- Champion automation efforts to reduce or eliminate repetitive, manual processes
- Partner with project management to define Service Level Objectives (SLOs) and to identify and implement Service Level Indicators (SLIs) that track compliance
- Champion capacity management and disaster recovery testing efforts
- Bachelor’s degree in computer science, or 6+ years of equivalent progressive experience in IT operations and/or systems management
- 6+ years’ direct experience in a technical role dealing with complex enterprise software landscapes (DevOps-focused development)
- 6+ years’ experience with scripting and automating technical activities
- Experience with best-in-class application performance monitoring (APM) tooling (New Relic, Dynatrace, AppDynamics)
- Direct, hands-on experience with automated software delivery and system management
- Strong knowledge of change control best practices and methodologies
- Experience with Ansible, Terraform, Python, or Docker (or similar) is a plus
- Experience with Agile development methodology and/or ITIL ITSM is a plus
- REQUIRED HARDWARE EXPERIENCE
- Servers, Workstations, Load Balancers, Switches, Routers, Firewalls, SAN, NAS and other storage hardware
- PowerShell scripting and coding standards
- Best-in-class application performance monitoring (APM) tooling (New Relic, Dynatrace, AppDynamics)
- Azure and/or AWS PaaS/IaaS
- Linux OS and Apache (e.g. SALT)
- Direct, hands-on experience with automated software delivery and system management
- Agile development methodology
- Working understanding of Platform Engineering work model in a software development environment
- Proven project management skills and/or substantial exposure to project-based work structures and project lifecycle models
- Proven experience in architecting and overseeing the direction, development, and implementation of technology solutions
- O/S: Windows and Linux, VMware, PowerShell, Azure Administration, PRTG and other systems monitoring software, DNS Management, IIS, Tomcat, Docker, APM, ITSM tools, SSL/TLS certificates, JavaScript, JSON, Python, Ansible, Terraform, vSphere, Kubernetes, Service Fabric, Azure Management, Elastic, Citrix, Jira, New Relic, Project Management Tools, ADO, Duo, Secret Server, Qualys, PagerDuty, Couchbase, Redis, API gateways, DNS, Security, IP Routing, SSH, FTP, LDAP, HTTP/HTTPS, Email Routing, Jenkins, GitHub, AWS, Cloud development pipelines using CI/CD tooling, Bash scripting