Key Responsibilities:
- Production Support: Monitor, triage, and resolve production issues in real-time, ensuring minimal impact on end users.
- Incident Management: Work on the resolution of production incidents, track issues through to resolution, and ensure follow-up tasks are completed.
- Service Reliability: Work with engineering to identify recurring problems and implement long-term solutions to improve platform reliability.
- Process Improvement: Continuously improve production support processes, including escalation paths, incident tracking, and communication protocols.
- Documentation: Ensure incident reports, post-mortems, and runbooks are maintained and updated to reflect best practices.
- On-call Rotation Management: Maintain on-call schedules to ensure 24/7 support coverage.
- Basic Debugging: Perform initial technical investigation into issues, analyzing logs, understanding system behavior, and documenting findings.
Required Skills & Experience:
- 2+ years in a production support or operations role.
- Excellent Communication: Ability to communicate complex issues clearly with both technical and non-technical stakeholders.
- Collaboration & Teamwork: Proven ability to collaborate with cross-functional teams, including engineering, solutions, product and customer success.
- Incident Management Tools: Experience using tools like PagerDuty, Jira, or similar for incident tracking and management.
- Technical Understanding: Familiarity with basic debugging techniques and log analysis, and an understanding of SaaS infrastructure.
- Fast-Paced Environment: Experience working in a fast-paced, growth-oriented company.
- Familiarity with scripting languages (Python, Bash) for basic automation.
- Experience with cloud platforms (AWS, Azure, GCP) and monitoring tools like Datadog, Grafana, or Prometheus.