Principal Site Reliability Engineer

Verizon Basking Ridge, NJ

December 17, 2022

When you join Verizon

Verizon is one of the world’s leading providers of technology and communications services, transforming the way we connect across the globe. We’re a diverse network of people driven by our shared ambition to shape a better future. Here, we have the ability to learn and grow at the speed of technology, and the space to create within every role. Together, we are moving the world forward – and you can too. Dream it. Build it. Do it here.

What you’ll be doing...

Our Global Technology Solutions group is looking for a Principal Site Reliability Engineer, focused on managing AIOPS rollout to the business units (Verizon Business/Consumer) & Engineering teams.

Leading and collaborating with portfolio teams across all LOBs to support a framework that combines engineering and application development to drive operational stability.
Leveraging some of the latest AIOPS technology to develop a holistic approach to enhance systems and application reliability with a focus on superior customer experience.
Collaborating with the core teams combining software practices and engineering to strengthen the application/system reliability along with operational support.
Utilizing advanced knowledge of system architecture, network, application development, testing, and operational stability to help transform the way the teams operate today.
Utilizing advanced scripting and coding capabilities to develop artifacts that correlate alerts and events ingested from diverse monitoring sources, and leveraging AI/ML to automate recovery actions.

Where you’ll be working…

This hybrid role will have a defined work location that includes work from home and assigned office days as set by the manager.

This role can sit out of any U.S. based valid work location.

What we’re looking for...

You'll need to have:

Bachelor’s degree or four or more years of work experience.
Six or more years of relevant work experience.
Willingness to travel up to approximately 25% of the time.

Even better if you have one or more of the following:

Experience in Site Reliability Engineering and in supporting an SRE framework across multiple teams.
Experience developing automated recoveries to prevent problem recurrence and to enhance SLO trending and centralized reporting (e.g., Grafana dashboard integration).
Experience identifying opportunities to improve architecture/engineering practices.
Knowledge of MySQL, Percona, SQL queries, DB replication, Grafana reporting, systems and network architecture, virtualization technologies (ESX/VMware), AI/ML, and Splunk.
Experience mentoring staff to replace manual processes with automation.
Knowledge of data ingestion and enrichment: webhooks, REST API design, JSON, XML, SMTP.
CI/CD - Deployment pipeline experience (Jenkins, Ansible).
Good knowledge of Python, bash, or similar scripting languages.
Knowledge of Unix/Linux based systems, and experience troubleshooting applications running on these systems.
Experience in software lifecycle including design, implementation, and delivery.
Experience in designing, analyzing, and troubleshooting large-scale distributed systems.
Ability to apply a systematic approach to solve problems with a sense of ownership and focus.
Effective communication & collaboration skills with the ability to articulate technical details to a diverse audience.
Experience in AIOPS (emphasis on Moogsoft).
Installation, Infra & Config - Linux Systems Administration and Operations experience, Network Administration experience.
Experience with Moogsoft installation procedures and Apache/NGINX web servers.
Integrations & Development experience: data ingestion and integrations with webhooks, REST API, JSON, XML, SMTP, SNOW, SaaS-based solutions (New Relic, Catchpoint, AWS CloudWatch), SAP, DB Mon, AI/ML, Splunk, MySQL, Percona, SQL queries, Python, JavaScript, Unix shell scripting, Jenkins, Ansible, and Grafana reporting tools.
Experience in Clustering & Workflows: Operations (SRE) workflows, responsibilities, and organizational structures, predetermined and dynamic correlation, entropy, anomaly detection concepts, SQL/PERCONA DB.
Systems/Network/LDAP Administration and Operations experience.
Experience in Moogsoft components and data flows.
Understanding of monitoring and metrics concepts (volume, performance, capacity).
Experience with cloud technologies, such as architecting, developing, or maintaining cloud solutions in public cloud environments (AWS/OCI/GCP).
Experience in supporting enterprise container-based platforms.
Experience with DevOps container/orchestration tools (Kubernetes, Docker, Puppet, etc.).

If Verizon and this role sound like a fit for you, we encourage you to apply even if you don’t meet every “even better” qualification listed above.

Equal Employment Opportunity

We're proud to be an equal opportunity employer - and celebrate our employees' differences, including race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, and Veteran status. At Verizon, we know that diversity makes us stronger. We are committed to a collaborative, inclusive environment that encourages authenticity and fosters a sense of belonging. We strive for everyone to feel valued, connected, and empowered to reach their potential and contribute their best. Check out our diversity and inclusion page to learn more.
