Introduction
signageOS runs a cloud-based system that controls and monitors thousands of digital signage devices all around the world.
As our customer base grows, and traffic with it, we must maintain the same levels of quality and performance that our global partners have come to trust. After careful evaluation, we selected AWS as our cloud provider so that we can manage scaling, availability, and security efficiently.
The signageOS ecosystem consists mainly of two parts: devices that run in various locations around the world, and the cloud system that controls and monitors them.
Devices
Devices are signageOS’ main focus. Everything we do aims to make devices more reliable and more powerful. These devices are SoC displays, media players, and Raspberry Pi modules. They run signageOS software and each customer’s own HTML5 player.
Cloud
All devices connect to the signageOS cloud system. They send real-time information about the state of the device and its content to the cloud, where it is processed and presented to the user. Commands can be issued from the cloud and sent back to the device in real time, and the device acts upon those commands.
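The command path described above can be pictured as a small dispatch table on the device. This is only a sketch under assumptions: the command shape (`type` plus `args`), the `set_volume` handler, and the status values are hypothetical and do not describe the actual signageOS protocol.

```python
def make_device(handlers):
    """Return a function that applies a cloud command to this device's handlers.

    handlers: dict mapping a command type (hypothetical names) to a callable.
    """
    def handle(command):
        handler = handlers.get(command["type"])
        if handler is None:
            # Unknown commands are reported back instead of being silently dropped.
            return {"status": "unsupported", "type": command["type"]}
        return {"status": "ok", "result": handler(**command.get("args", {}))}
    return handle


# Hypothetical device state and a single handler acting on it.
state = {"volume": 10}

def set_volume(level):
    state["volume"] = level
    return level

handle = make_device({"set_volume": set_volume})
reply = handle({"type": "set_volume", "args": {"level": 30}})
```

The reply returned by `handle` stands in for the state report that the device would send back to the cloud after acting on a command.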
Scaling
The cloud infrastructure runs on several separate virtual machines on AWS. All of our proprietary microservice applications run on Kubernetes, managed through Amazon EKS. Drawing on our experience, we have written the infrastructure as code (Terraform and Helm charts); it is generic enough to handle all standard cases and many special cases of system load.
Autoscaling using Kubernetes and AWS EC2
We combine the auto-scaling options of Kubernetes itself with the Auto Scaling groups provided by AWS EC2, so continuous traffic is distributed efficiently across all servers based on predefined rules. When a traffic peak occurs, the system reacts automatically and immediately begins creating new server instances. These EC2 instances join the EKS cluster, and the load on overloaded instances is relieved well before they fail. All traffic changes are continuously monitored by AWS CloudWatch, which notifies the responsible DevOps engineers if something is not functioning correctly. Most peak-load cases are currently handled automatically.
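As a rough illustration of the Kubernetes side of this reaction, the Horizontal Pod Autoscaler documentation gives the replica calculation as desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces that arithmetic only. The function name, the replica bounds, and the sample numbers are assumptions for illustration, not our actual configuration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count suggested by the HPA formula, clamped to [min, max]."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical peak: 4 replicas at 90% average CPU against a 60% target.
scaled = desired_replicas(4, 90, 60)
```

When the extra replicas no longer fit on the existing nodes, the EC2 Auto Scaling group is what adds new instances to the cluster.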
Database scaling
The databases run outside the EKS cluster so that they can use the full performance of their EC2 instances. We use master-master replication for all our databases, not only to achieve high availability (HA) but also to make full use of every node for both read and write operations. In the long term, we aim to keep utilization of all database servers under 20% CPU and 60% RAM under normal traffic load. When long-term statistics show these limits being exceeded, we are prepared to scale up the database cluster, mostly horizontally. If an unexpected traffic peak hits the database instances, we rely on the CPU credits of burstable EC2 T3 instances, which absorb short peaks.
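The long-term rule above amounts to comparing averaged utilization against the two targets. A minimal sketch, assuming utilization samples are expressed as fractions between 0 and 1; the function name and the sample values are hypothetical, and only the 20% and 60% limits come from the text.

```python
from statistics import mean

CPU_LIMIT = 0.20  # long-term CPU target from the text
RAM_LIMIT = 0.60  # long-term memory target from the text


def needs_scale_up(cpu_samples, ram_samples):
    """True when the long-term average exceeds either utilization target."""
    return mean(cpu_samples) > CPU_LIMIT or mean(ram_samples) > RAM_LIMIT


# A single short spike barely moves the average, so it does not trigger scaling.
spiky_cpu = [0.10] * 23 + [0.90]
spike_triggers_scaling = needs_scale_up(spiky_cpu, [0.50])
```

Averaging over a long window is what separates the two cases in the paragraph: sustained overuse leads to a scale-up, while a brief peak is left to the burstable capacity.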
High-Availability
Our system has to be available from all locations around the world 24/7/365. This means that no component should have any downtime during planned releases of new features, bug fixes, and security fixes. We satisfy these requirements using master-master replication of all databases across different availability zones. All database maintenance is done using rolling deployment: patches are applied to one node at a time, and during patching, traffic is temporarily redirected to the remaining cluster nodes, so deployment always completes with zero downtime. The only exception is architectural changes to the signageOS system SOA. These changes are always announced in advance through our status page.
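The rolling procedure for database patches can be sketched as a loop that drains one node, patches it, and returns it to rotation before touching the next. This is a simplified model, not our deployment tooling: the node names and the `patch` callback are hypothetical.

```python
def rolling_patch(nodes, patch):
    """Patch nodes one at a time; return the in-service nodes seen at each step."""
    in_service = set(nodes)
    history = []
    for node in nodes:
        in_service.remove(node)      # redirect traffic away from this node
        if not in_service:
            raise RuntimeError("no node left to serve traffic")
        history.append(sorted(in_service))
        patch(node)                  # apply the patch while the node is drained
        in_service.add(node)         # put the node back into rotation
    return history


patched = []
steps = rolling_patch(["db-a", "db-b", "db-c"], patched.append)
```

Each entry in `steps` lists the nodes that kept serving traffic while one node was being patched, which is the zero-downtime property the paragraph describes.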
Proprietary components of the system also run in HA mode. All Kubernetes nodes are spread across multiple availability zones (AZs), and each service always runs at least one instance in every AZ. If a new version is being deployed or an AZ suffers an outage, the traffic is handled by a different AZ. Kubernetes also continuously tries to distribute services across the AZs based on the configured rules. Each AZ has its own independent power supply, internet access, cooling systems, and failover plans, including safety systems. For more details, see the official AWS article about AZs.
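The "at least one instance in every AZ" rule can be checked with a few lines of code. The sketch below assumes a list of (service, zone) pairs, one per running replica; the service and zone names are hypothetical and the function is not part of Kubernetes or any AWS API.

```python
def zones_missing_service(placements, zones):
    """Map each under-spread service to the sorted zones where it has no replica.

    placements: iterable of (service, zone) pairs, one per running replica.
    """
    running = {}
    for service, zone in placements:
        running.setdefault(service, set()).add(zone)
    missing = {}
    for service, used in running.items():
        gaps = set(zones) - used
        if gaps:
            missing[service] = sorted(gaps)
    return missing


zones = ["az-1", "az-2", "az-3"]
placements = [
    ("api", "az-1"), ("api", "az-2"), ("api", "az-3"),
    ("worker", "az-1"), ("worker", "az-1"),
]
gaps = zones_missing_service(placements, zones)
```

Here `api` satisfies the rule, while `worker` has two replicas in a single zone and would not survive an outage of that zone.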
Backup
Even with all of the practices listed above in place to keep the system up and safe, accidents happen, and we have to be prepared for these critical scenarios. Based on general industry experience, fatal failures are usually caused by hardware degradation or failure that cannot be recovered from in real time.
Access Management
Another possible cause of data loss or an outage of part of the infrastructure is human error. We grant access strictly to selected internal staff, and only for necessary operations. Most of that access is read-only, and deployment is done automatically by a defined and tested process with several review steps.
Backup process
In addition, we have an independent backup process that stores all data and infrastructure state in system snapshots every two hours. This means that everything is duplicated to separate storage and can be restored within a few minutes if needed. The recovery process is not currently automated, and we are not considering automating it right now, because such a dramatic failure of the system could happen only in very rare cases and could have a different, unknown origin each time. The first step of the failover process is therefore detecting the cause of the problem. Our engineers would then work on preparing a solution. In this case, we expect at least partial recovery of the system within several minutes.
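One consequence of the two-hour snapshot interval is that the newest available snapshot is always less than two hours older than the failure. A minimal sketch of that arithmetic, assuming snapshots are taken on a fixed schedule from a known start time; the function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(hours=2)  # "every two hours" from the text


def latest_snapshot_before(failure_time, first_snapshot):
    """Time of the most recent scheduled snapshot at or before failure_time."""
    if failure_time < first_snapshot:
        raise ValueError("failure precedes the first snapshot")
    # Integer number of whole intervals completed before the failure.
    completed = (failure_time - first_snapshot) // SNAPSHOT_INTERVAL
    return first_snapshot + completed * SNAPSHOT_INTERVAL


first = datetime(2024, 1, 1, 0, 0)      # hypothetical schedule start
failure = datetime(2024, 1, 1, 5, 30)   # hypothetical failure time
restore_point = latest_snapshot_before(failure, first)
unsaved_window = failure - restore_point
```

In this example the restore point is the 04:00 snapshot, so changes from the 90 minutes before the failure would not be in the snapshot.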
Conclusion
signageOS’ skilled team of software engineers maintains this complex system to the highest standard in order to offer you the best services possible. signageOS continually reviews its processes and performance to keep the system reliable and running at full capacity. Additionally, signageOS maintains the highest level of transparency so that our partners stay informed about the current state of signageOS’ services. Building a trustworthy relationship between signageOS and our partners is a core value of our business, and we put it at the forefront of all aspects of development and maintenance.