Core Evolution Engineering Teams and how they work: ProdOps and SRE
Hi there, this is Evolution, and we decided to name this blog “Explore Evolution”. Here is why…
Our mission here is to introduce you, our dear reader, to our teams and people who directly contribute to Evolution’s success. We are a large company of 7000+ brilliant people worldwide, who are our true superstars and the reason we continue to grow our product portfolio and provide the best service in the industry.
Let’s start with a brief introduction to our business. We are a world-leading B2B provider of video-streamed Live Dealer gaming. Founded in 2006 in Riga, Evolution has since grown to more than 15 locations across the globe.
We work with more than 300 of the world’s biggest casino operators and have been crowned “Live Casino Supplier of the Year” for 11 consecutive years! Thousands of users interact with our products every hour, across all imaginable platforms, enjoying the lowest-latency video in our industry. Many of the solutions we implement for our customers are invented in-house and patented. With that said, we can proudly call ourselves industry leaders and trendsetters.
And now it’s time to talk about the people who stand behind our business and make it happen. One of our fundamental company values is to work together, and for us, those are not just words you put on the wall; we genuinely believe that every team is part of one big mechanism called Evolution, where every single part counts.
We would like to start our series of “Explore Evolution Engineering” blog posts with one of the core Engineering departments: the Production Operations department (ProdOps for short).
To produce this article, we spoke to Nikita (Head of ProdOps department), Renars (Infrastructure Architect) and Dmitry (SRE Department Lead), who helped us understand the topic and the very essence of their roles.
So, what is ProdOps at Evolution? In short, ProdOps handles everything infrastructure-related: production environments, networking, system engineering, 24/7 support of the business systems and much more. The platform that ProdOps builds and maintains enables stable application operations, fast development iteration, transparency and effective feedback channels via metrics and alerts.
The system engineering stack consists of Kubernetes (hosted both on-prem and in the cloud), a number of NoSQL technologies (Kafka, Cassandra and more), and various tooling around them, ranging from Prometheus for metrics to Kubernetes operators developed in-house.
Our Head of Production Operations, Nikita Duhovnijs, joined Evolution in January 2017 with a mission to create a new, efficient version of the department. Nikita is proud of the results achieved to date: the department has not only managed to address business needs (such as stable platforms) but has also kept a cool engineering culture where people truly enjoy working:
“One of the priorities for me was to build a department that I would love to be working at myself as a hands-on Engineer.”
After Nikita joined the company, the ProdOps organization kept growing rapidly and is now divided into four departments:
- The Site Reliability Engineering (SRE) department covers server infrastructure, platform orchestration and the tooling around it (more on that below).
- The Networking department takes care of the network infrastructure: internal studio infrastructure, studio connections to colocation facilities, traffic distribution and more. EVO Network Engineers write quite a lot of code to solve their challenges and make sure that everything works smoothly.
- The ITO department is responsible for business IT systems. Its engineers often create cloud-native applications to wire various business systems together. Another critical mission of this department is to provide global user support coverage. Not an easy task when there are more than 7000 employees in the company.
- The Service Support Tier 2 department provides 24/7 monitoring and the second line of escalation handling. When production problems occur, Tier 2 escalates issues to the relevant Tier 3 party (either a development team or the Systems Engineering Team).
ProdOps is a crucial foundation for our Engineering and Business operations, and a lot could be said about it and about each of its departments. For now, let’s dive deep into Site Reliability Engineering!
We spoke with Dmitry, the SRE Department Lead, and asked him to tell us more about his team and what he does on a daily basis. Dmitry joined Evolution 3 years ago as an SRE Engineer and was later promoted to SRE Team Lead. Today, Dmitry coordinates the work of the SRE teams and is responsible for the delivery and further development of the department. As a manager, he also carries a lot of people-management and organizational duties.
Dmitry’s team was in charge of migrating the production environment from the Riga server rooms to the Frankfurt data center. This was no ordinary task: it meant orchestrating and coordinating the work of many teams at once.
“It was a complex task involving most of the Evolution Engineering teams. The biggest challenge was to migrate all infrastructure components within a couple of hours of downtime and to coordinate multiple teams’ actions, such as the global network switchover and the failover of all databases to the new data center with their promotion to the primary source of data. We had to ensure that our main application backend migration went smoothly and that all supporting microservices were properly tested before launching to real customers.”
As for the team structure, SRE consists of three smaller teams:
- Data SRE Engineers are responsible for technologies such as Cassandra, Kafka and ClickHouse. This team makes sure that systems run, clusters operate and new ones get created, backups happen, and our applications work as they should.
- CI/CD Engineers are responsible for the build and delivery stack composed of GitLab, Jenkins and Artifactory. This team carries out a pretty critical mission, bearing in mind that Evo Engineering is continuously building and shipping around 200 unique microservices with more than 250 releases a month.
- SRE covers everything from server provisioning to running the Kubernetes platform in multiple locations across the globe (plus the tooling around it).
Renars, our Infrastructure Architect, is responsible for the technical infrastructure vision. Here at Evolution, we are on a constant mission to replace manual work with automation wherever we can.
Initially, there were only 6 microservices, which eventually grew into more than 200. To handle them, we are putting more effort into moving our infrastructure stack to Kubernetes, thus shifting infrastructure automation and management to Kubernetes-native tools and approaches.
For example, our SRE team started to use and develop K8s operators for automation and resource life-cycle management. Our goal is to unify technologies and approaches used in infrastructure change management and application development practices, which leads to a well-defined framework for reliable infrastructure change orchestration. We identified the “cloud-native” stack as the best approach to solve our problems in an ever-growing, global organization.
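To make the operator idea a little more concrete, here is a minimal Python sketch of the reconcile loop at the heart of the Kubernetes operator pattern: compare the desired state with the observed state and derive the actions needed to converge. This is not Evolution's code. The `Action` type, the `reconcile` and `apply_actions` helpers and the resource names are hypothetical, and a real operator would talk to the Kubernetes API instead of plain dictionaries.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One step needed to move observed state toward desired state."""
    verb: str  # "create", "update" or "delete"
    name: str  # resource name (hypothetical, for illustration only)


def reconcile(desired: dict[str, dict], observed: dict[str, dict]) -> list[Action]:
    """Return the actions that would make `observed` match `desired`.

    Both arguments map a resource name to its spec. A real operator would
    read them from the Kubernetes API; plain dicts keep the sketch runnable.
    """
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(Action("create", name))
        elif observed[name] != spec:
            actions.append(Action("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


def apply_actions(actions: list[Action], desired: dict[str, dict],
                  observed: dict[str, dict]) -> dict[str, dict]:
    """Apply the actions to a copy of `observed` and return the new state."""
    state = dict(observed)
    for action in actions:
        if action.verb == "delete":
            del state[action.name]
        else:  # "create" and "update" both install the desired spec
            state[action.name] = desired[action.name]
    return state


if __name__ == "__main__":
    desired = {"topic-a": {"replicas": 3}, "topic-b": {"replicas": 1}}
    observed = {"topic-a": {"replicas": 2}, "topic-c": {"replicas": 1}}
    for action in reconcile(desired, observed):
        print(action.verb, action.name)
```

The useful property of this design is idempotence: once the state has converged, running `reconcile` again yields no actions, so the loop can safely run over and over.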
“We are starting to practice chaos engineering techniques in SRE and other Evolution Engineering departments. We want to be confident in our systems’ reliability and to know that the failure recovery mechanisms and overall stack resilience are constantly validated. To that end, the Product and Data Engineering teams are involved, so that procedures for assessing the overall system state can be developed.”
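The basic shape of a chaos experiment can be sketched in a few lines of Python: verify a steady-state condition, inject a failure, check whether the condition still holds, and roll the fault back. The sketch below is illustrative only. The `Cluster` class, the replica names and the majority rule used as the health check are assumptions made for the example, not a description of Evolution's tooling.

```python
import random


class Cluster:
    """Toy replicated service: healthy while a majority of replicas is up."""

    def __init__(self, replicas):
        self.up = {name: True for name in replicas}

    def kill(self, name):
        self.up[name] = False

    def recover(self, name):
        self.up[name] = True

    def is_healthy(self):
        alive = sum(self.up.values())
        return alive > len(self.up) // 2


def run_experiment(cluster, rng):
    """Kill one random replica and report whether the steady state held."""
    if not cluster.is_healthy():
        raise RuntimeError("steady state not met before the experiment")
    victim = rng.choice(sorted(cluster.up))
    cluster.kill(victim)
    try:
        survived = cluster.is_healthy()
    finally:
        cluster.recover(victim)  # always roll the injected fault back
    return victim, survived


if __name__ == "__main__":
    cluster = Cluster(["replica-1", "replica-2", "replica-3"])
    victim, survived = run_experiment(cluster, random.Random())
    print(f"killed {victim}, steady state held: {survived}")
```

With three replicas the toy cluster tolerates the loss of any single one, while a cluster with a single replica fails the same experiment, which is exactly the kind of weakness such experiments are meant to surface.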
So now we have defined the scope of SRE and ProdOps teams, what they do and how they help Evolution to grow and prosper. But let’s get back to the topic of people. Technical expertise, solid knowledge and experience are crucial to succeed in IT. But what about personality traits? Which are essential and worth nurturing to become a pro in the Production Operations and SRE positions? Our experts identified the following:
- You should be an expert in your field. You have to know OS internals, practice automation and coding, have a “black belt” in systems engineering and more.
- Be open-minded and always stay curious about new technologies. In our world, everything is changing rapidly and new technologies keep emerging.
- Don’t shy away from challenges and don’t hesitate to get what you need from others. If you believe that something is not working, change it! If needed, change the processes and try different approaches. This is essential for professional growth, not only in the SRE field but as a general principle of personal development.
- Have a strong desire not to repeat manual work. Time is a very precious resource nowadays, so your task here is to learn as much as you can to turn manual work into automation. In other words: work harder, yes, but also smarter.
- Learn from others! If you come across a good solution, just take a bit of your time and understand what makes it so good and find out why it is working. Learn both from good examples and bad.
Stay tuned, because the second part of the interview is coming! Next time we will continue with a rather complex topic: how to become a good System Engineer or Site Reliability Engineer. What to learn, where to go, what to do. Nikita, Renars and Dmitry will share their career paths and tell us a bit more about how they became who they are.