
12 Tips to get the most out of a Containers Environment

12/04/21 17 min. read

A few months ago I wrote a post covering the different types of Cloud services, so if you are still not clear about them, you can read it here.

What is a container environment?

As Google says, Containers offer a logical packaging mechanism in which applications can be abstracted from the environment in which they actually run. This decoupling allows container-based applications to be deployed easily and consistently.

One of the most common problems is that we do not take advantage of the full potential of these environments.

For this reason, in this post I am going to cover the operations that affect the use of Containers as a Service (CaaS), along with concepts, details and 12 tips based on my experience that you cannot forget if you want to get the most out of a Containers Environment.

In production environments, many configurations require more specific and finely tuned knowledge than in development or pre-production environments.

Working with Docker🐳

Using Docker technology requires multiple configurations and technical expertise, as well as knowledge of the architectures and systems on which it is based.

This makes provisioning and maintaining a large number of services by hand, or in isolated silos on top of traditional operating systems, quite complex.

Therefore, the best advice is to rely on a container orchestrator, which will give you the following features or functionalities already implemented, available for use or configuration out of the box.


A container orchestrator offers us:

  • Automatic configuration
  • Automatic container deployment and start-up
  • Load balancing
  • Auto-scaling capability
  • Auto-restart capability
  • Healthchecks or “health” control of each container
  • Control of data exchange and networking
  • Maintenance of secrets and configurations, e.g. config-maps

12 Tips to improve the experience on these systems


👉 1º Correctly define the application architecture

This means knowing how to discern which elements of the architecture are suitable for running as containers and which are not, either by assessing the current architecture if it is going to be containerised or by defining a new one accordingly in the case of a new product.

Working with Java/Maven microservices is not the same as working with databases or clustered dockerised products. To help make this decision we should base ourselves on the concept of Cloud Readiness.

In the case of the software to be developed – which will form part of our containers, as the content of this software will be deployed and compiled inside them – we would have to take into account the following criteria:

  • Design or use of microservices.

https://www.redhat.com/es/topics/microservices

  • Data held on external systems

Data should be held on external systems, if possible with enhanced security or encrypted.

It should also be noted that containers are ephemeral: data stored inside them is lost after a restart, and only the container itself can access it. So try to rely on shared data, external data systems and external protocols (JDBC, SFTP, S3).

The use of caching systems can also be very important (Redis and Datagrid are commonly used examples), but it is always necessary to analyse both the need for them and their efficiency, and to make sure they meet the resilience criteria explained in the next point.

  • Make implementation as resilient as possible

The software and its configuration need to be optimised as much as possible to withstand any kind of failure caused by external problems.

“These failures are usually associated with communications with other systems or other parts of the architecture”

This requires working on the proper use and configuration of connection pools, with fine-tuned parameters, as well as the ability to reconnect in the event of failures without the need for reboots or any manual or human action.
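As an illustration of reconnecting without any manual action, here is a minimal sketch of a retry with exponential backoff. The function name, the defaults and the `FlakyBackend` class are hypothetical and only stand in for a real connection pool or client; a production setup would normally rely on the retry options of the driver or pool in use.

```python
import time


def call_with_retry(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    All names and defaults are illustrative assumptions, not values from the post.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and let the caller (or the orchestrator) react
            # Wait base_delay, then twice that, then four times, and so on.
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyBackend:
    """Hypothetical dependency that fails twice before answering."""

    def __init__(self):
        self.calls = 0

    def query(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("backend unavailable")
        return "ok"
```

Calling `call_with_retry(FlakyBackend().query)` recovers on the third attempt without restarting anything, which is the behaviour we want from a containerised service when a neighbour is briefly unavailable.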

  • Performance and Scalability

It is also very important to optimise the resources used (classes, libraries, etc.). They need to be as fast and efficient as possible.

Additionally, it is very important that the software can run standalone and is compatible with the auto-scaling functionality provided by these systems, so that it can run in high availability mode and give the end user a smooth experience without failures.

  • Building security into the code from the outset

It is increasingly important, especially in cloud systems, that our systems are secure. This applies not only to the systems themselves: security can start with the code.

For this there are also code control systems that can analyse and indicate security problems from the developer’s point of view. If you want to know more details about these concepts, here are some other interesting posts.

👉 2º Deployment Environment Review

For the experience to be complete, we will need to have an environment ready to integrate the compilation of the SW, with the compilation of our docker images, as well as the integration and deployment of the same on the final orchestration environment.

This group of tools is usually defined by the nomenclature ALM (Application Lifecycle Management), and typically consists of:

  • Compilers
  • Version Management and Version Control tools (e.g. Git)
  • Task Orchestrators (e.g. Jenkins)
  • Binary repositories (e.g. Nexus)
  • QA or code analysis tools, whether technical, functional or security (SonarQube, Kiuwan, Fortify, etc…)

Figure: ALM. Source: Miro Medium

And to complete this “simple” stack of tools, we will need to use a repository of docker images and have templates associated with each of the images, which will allow our final environment to understand what resources, configurations and variables we will need to use every time we deploy one of these containers.

👉 3º Minimise the content of the container

There are several versions of Lightweight O.S. on the market that are prepared to run more efficiently on top of containers. These contain some specific design constraints: a read-only file system, a minimal set of packages and a single command to manage updates.

What we are interested in is that the container starts up as fast as possible and that it consumes as few resources as possible on the host that contains it.

This will allow us to host a larger number of containers on the same base infrastructure of the cluster.

👉 4º Manage and standardise the catalogue of available images

In large or enterprise environments, it is advisable to have a catalogue of predetermined base images that can meet 80-90% of the architectural requirements. This will also allow us to have optimised and approved images in terms of all kinds of requirements (especially security requirements).

At the same time, having a standard environment for monitoring and using them avoids having an infinite number of different images, whose behaviour or analysis in terms of support would be infinitely complicated.

👉 5º Security is very important

Just as a point to keep in mind, it is not the purpose of this post to zoom in on the security part, but it must be present at all times, especially if we are going to expose some of the deployed services to the internet. We always recommend the use of WAFs like Imperva if we want to take advantage of the cloud.

Also do not forget to secure all traffic with SSL/TLS certificates and the latest encryption algorithms, and to disallow insecure protocol versions (e.g. TLS 1.0 or 1.1 at the moment).
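As a small example of refusing old protocol versions from the client side, this sketch uses Python's standard `ssl` module. The function name is hypothetical, and in a real cluster the same policy would usually be enforced on the router, ingress or WAF rather than in each application.

```python
import ssl


def build_tls_context():
    """Return a client-side context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()  # certificate verification stays on
    context.minimum_version = ssl.TLSVersion.TLSv1_2  # rejects TLS 1.0 and 1.1
    return context
```

The returned context can be passed to standard library clients, for example `urllib.request.urlopen(url, context=build_tls_context())`.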

In terms of generic parameters such as information leakage or security headers, we must be concerned that our entire base image catalogue complies with all of these out of the box.

In addition to having the latest operating systems, patch versions, etc., we must also be concerned about having the latest versions of products or application servers deployed and configured in such a way as to avoid any kind of information leakage that would allow an attacker to know what he is up against.

👉 6º Use of Healthchecks

OpenShift offers two types of healthchecks for each group of containers or pods.

  • Readiness

It allows us to indicate when, once the container’s processes are up and running, it becomes available to the service (basically it decides whether to add it to the list of existing containers to allow traffic to enter this new instance from the moment the check is fulfilled).

Figure: Inform Service: Readiness Probe Passed. Source: blogger.com


Therefore, a bad or missing configuration of this healthcheck will cause failures during start-up, because requests can be routed to the instance before it has completely started.

  • Liveness

It allows us to periodically check the correct behaviour of our container, either by calling a URL or by checking the connection to a specific port (you can also check the output of an internal script that performs the relevant checks).

OpenShift allows us to indicate how often this check is performed, as well as how many failed attempts really count as a KO for us. After a KO, the container is restarted and returns to its original state (it will not serve traffic again until it passes the Readiness check described above).

This healthcheck gives us a quick recovery tool in the event of an obvious malfunction of the main functionality deployed.
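To make the idea of "how many failed times is really a KO" concrete, here is a simplified sketch of the decision. It is only a model of the behaviour described above, not the orchestrator's real implementation, and `should_restart` and its default threshold are hypothetical.

```python
def should_restart(probe_results, failure_threshold=3):
    """Return True once `failure_threshold` consecutive probes have failed.

    `probe_results` is an iterable of booleans (True means the probe passed).
    """
    consecutive_failures = 0
    for passed in probe_results:
        # A single success resets the counter, so isolated blips are tolerated.
        consecutive_failures = 0 if passed else consecutive_failures + 1
        if consecutive_failures >= failure_threshold:
            return True
    return False
```

With a threshold of 3, the sequence pass, fail, fail, fail triggers a restart, while fail, fail, pass, fail, fail does not. This is why the threshold and the check interval have to be tuned together.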

Source: blogger.com

Making these configurations is as important as making them well. A standard configuration is never valid for every container, so we have to look closely at the behaviour of each one and adjust the values to times that let us recover the service as soon as possible, but without going overboard: a bad configuration, or one that has not been worked on enough, can itself cause a serious loss of service.

Using URLs as a check only recognises a KO when the response is an error status code (anything outside the 200-399 range) or when no response arrives within the timeout we have indicated. It cannot recognise patterns in the response body, or visual errors that are returned with a 200 status code.

In the case of Spring Boot, and making use of Spring Cloud Config, the best URL for this type of check is /info, which simply returns information about the application itself. If we use other endpoints such as /health, which is supposed to report the health status of our Java application, we can fall into a trap: /health may validate dependencies or third party systems, so it can return a KO because of a timeout while waiting for the health information of a third party. The orchestrator would then end up killing our running pod, even though its own state was actually correct.
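The difference between the two endpoints can be sketched with a toy model. This is not how Spring Boot implements them; `probe_status` and the `dependency_checks` mapping are hypothetical names used only to show why an aggregated health endpoint is risky as a liveness target.

```python
def probe_status(path, dependency_checks):
    """Return the HTTP status a probe would see for `path`.

    `dependency_checks` maps a dependency name to a zero-argument callable
    that returns True when that third party is reachable.
    """
    if path == "/info":
        # Only proves that this process is up and answering requests.
        return 200
    if path == "/health":
        # Aggregated view: one unreachable dependency marks the whole app down.
        all_ok = all(check() for check in dependency_checks.values())
        return 200 if all_ok else 503
    return 404
```

If the database is down, `/health` answers 503 and a liveness probe pointed at it would restart a pod that is itself perfectly healthy, whereas `/info` keeps answering 200.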

👉 7º Autoscaling, criteria and best usage

We must know the behaviour of the application in order to make the best possible use of HPAs (Horizontal Pod Autoscalers), understanding whether high CPU, memory or any other monitored metric can affect us.

This will make the metrics configured for horizontal scaling more efficient, avoiding the creation of many unnecessary replicas in response to a specific peak, replicas that in the end will not be used and that make the environment more inefficient, especially during controlled load tests.
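To see how a metric turns into a number of replicas, here is a sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance. The function name and the default values are assumptions for illustration, and the real autoscaler also applies stabilisation windows that are not modelled here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA decision for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do nothing
    wanted = math.ceil(current_replicas * ratio)
    # Never go outside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 4 replicas averaging 90% CPU against a 60% target scale to 6, while the same 4 replicas at 63% stay untouched. Choosing the target and the maximum carefully is what avoids the burst of unused replicas mentioned above.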

👉 8º Use of limits and implications

These configurations are extremely important not only for the proper functioning of the container, but for the overall health of the hosts where they live, because if we do not limit consumption, a single pod could occupy the entire resources of its host, and leave the rest of the containers that coexist with it without processing capacity or memory, generating a problem in the environment.

It is important to carefully control and monitor both the memory and CPU consumed by the container itself (as an O.S.), which is what affects the host, and to know in detail which processes or services inside the pod are causing that memory or CPU to reach its limits.

Sometimes, when faced with out-of-memory failures, it seems logical to raise the limits of the pod, and we forget to check what problem inside it is causing them and why. Is it the software? Is it some problem or configuration of the application server? Is it some other process running in the container that is not working correctly?

If we can do this detailed analysis, we can probably fix the problem without increasing the container’s memory, and even optimise it so that it can be reduced without affecting the application’s performance.

CPU is just as important to control and monitor: microservices should not consume too much CPU or be too heavy in their logic.

An example host would be a 64 GB machine with 8 CPUs. Taking into account that the microservices could run with 512 MB of RAM, we would have capacity for more than 100 pods running on the same host.

If we do the calculations, taking into account the CPUs reserved by the orchestrator, the distribution of CPU per microservice would be of the order of 6 CPUs per 100 pods, so on average each one should consume about 60 millicores (practically nothing!).
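The arithmetic behind this example can be written down in a few lines. The figure of 2 CPUs reserved by the orchestrator is inferred from the 8 and 6 CPUs quoted above, and `host_capacity` is a hypothetical helper, not part of any tool.

```python
def host_capacity(ram_gb, cpus, pod_ram_mb, reserved_cpus, planned_pods):
    """Return (pods that fit by RAM, average millicores available per pod)."""
    max_pods_by_ram = (ram_gb * 1024) // pod_ram_mb
    usable_millicores = (cpus - reserved_cpus) * 1000
    return max_pods_by_ram, usable_millicores / planned_pods


# 64 GB host, 8 CPUs, 512 MB per pod, 2 CPUs reserved (assumed), 100 pods.
pods_by_ram, millicores_per_pod = host_capacity(64, 8, 512, 2, 100)
```

This gives 128 pods by memory alone and 60 millicores per pod for 100 pods, which shows that on a host like this CPU, not RAM, is the budget to watch.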

👉 9º Load testing, check your limits

It is obvious that this type of testing is needed to ensure that the environment and its capacity are working properly. Above all, it checks that the CPU/memory and autoscaling configurations work and allow the environment to change according to the load, giving reasonable or expected response times.

However, as with everything else, it is very important to carry out these tests with use cases that are as complete as possible, as we can fall into a trap, thinking that the system will hold up if we test certain operations, but not others (heavier ones).

It is advisable to have a large capacity in order to have the possibility of scaling at all times in the event of a peak, without the risk of running out of resources. Auto-scaling of cluster nodes is a very interesting option, although it is more complex.

👉 10º Complete Monitoring/Observability Environment

It is highly recommended to have tools that export the data generated by the orchestrators and that, in aggregated form, also give us access to the following:

  • Container log management
  • Management of incoming/outgoing HTTP traffic through routers.
  • Event Management (including healthchecks)
  • Audit management (who does what)
  • Infrastructure information, both node and container status at CPU/RAM/Process level.

👉 11º Explore Chaos Monkey Processes

The idea is to be able to do resilience tests on the environment or specific projects, setting up crashes and errors in a random or controlled way, in order to be able to check that the systems are robust and resilient enough to withstand this kind of “noise”.
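As a minimal sketch of the "random or controlled" part, the function below only chooses which pods a chaos experiment would target. `pick_victims` and its parameters are hypothetical; actually stopping the pods would be done through the orchestrator's own tooling, which is outside this example.

```python
import random


def pick_victims(pods, kill_ratio=0.2, seed=None):
    """Choose a random subset of pod names to disrupt.

    Passing a `seed` makes the selection repeatable, which is useful when a
    failed experiment has to be reproduced in a controlled way.
    """
    if not pods:
        return []
    rng = random.Random(seed)
    count = max(1, round(len(pods) * kill_ratio))  # always disrupt at least one
    return rng.sample(pods, count)
```

With ten pods and the default ratio, two of them are selected on each run, and the same seed always selects the same two.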

👉 12º “Self-contained” containers, is it the best option?

In conclusion, I would like to make a reflection and show that nothing is as good or as bad as it is painted.

“The most important thing is to choose the system that best suits the needs of the rest of the environment, and above all, the work model and methodology of the team, which also has a great influence on making this type of decision”

The self-contained model, where the software is bundled with the base image of the container and both are compiled together in Continuous Integration, has multiple advantages, such as the certainty that the image is exactly the same in all environments (with good version management) and not depending on external systems during the start-up of the pods (e.g. Nexus to fetch the software, or Git for the configuration files).

Those external systems are points of failure that we avoid, and they often fail at the worst moment. In exchange, this model requires a highly agile team to manage changes and issues, as any change brought to production must be defined prior to compilation.

In addition, they also require a robust continuous integration system that ensures the availability of the integration environment and continuous deployment to be able to resolve a problem or bug quickly.

This would be in most cases the best option… especially nowadays if we have DevOps teams and Agile procedures.

On the other hand, having the Dockers/SW/Configuration elements decoupled allows different teams to work in each of the areas, making it easier to delegate support to a third party, and very importantly, facilitating massive changes derived from required changes in the base images, whether due to obsolescence, malfunctioning or possible security risks.

Let’s imagine an environment with 10,000 Java Spring Boot microservices in production, all using a Java V1 base image. A new critical vulnerability is detected in this image and needs to be fixed and deployed urgently by means of a new V1.1 image…

With the self-contained model, we need 10,000 builds and 30,000 software deployments to secure the production environment. With the decoupled model, one parameter could be changed in an automated way in 10,000 yml files (in a controlled way, of course, so as not to saturate the environment). It could be done by the operations team without the need for development, implementation or deployment teams, and in only 1 step!

The same applies to the other parts: you could change the software version to apply a fix instantly without compiling a new Docker image, or you could make an urgent configuration change to fix a critical parameter in just seconds.
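A rough sketch of that kind of automated, controlled change in the decoupled model could look like this. The image name `registry.example/java-base`, the helper names and the batch size are all hypothetical, and the regular expression assumes the simple `image: name:tag` form; real manifests may need a proper YAML parser.

```python
import re


def bump_image_tag(manifest_text, image, new_tag):
    """Replace the tag of `image` in a YAML manifest, leaving other lines untouched."""
    pattern = re.compile(r"(image:\s*" + re.escape(image) + r"):[\w.\-]+")
    return pattern.sub(lambda match: f"{match.group(1)}:{new_tag}", manifest_text)


def in_batches(items, size):
    """Yield successive slices so the rollout does not hit every service at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


manifest = "containers:\n- name: app\n  image: registry.example/java-base:1.0\n"
patched = bump_image_tag(manifest, "registry.example/java-base", "1.1")
```

Iterating over the yml files with `in_batches(files, 100)` and applying `bump_image_tag` to each one keeps the change to a single parameter while still rolling it out gradually.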


Miguel Ángel Salas

Santander Global Tech

Cloud & Transactional Manager, currently managing multidisciplinary teams, especially from the middleware layer and associated with emerging technologies. I am specialized in APIs, Cloud Systems, Containers and DevOps and Agile Methodologies. One of the things I enjoy the most is being updated and always trying to do my bit, change-oriented and above all, people-driven.

 

👉 My LinkedIn profile

 
