
How-to: Complex heterogeneous deployments without loss of service

04/08/21 17 min. read

1. Introduction

One of the big problems to be solved by development teams, integrators and DevOps is how to coordinate complex deployments. This is especially true in cases where different technologies, languages, platforms and, above all, teams of people are involved.

Moreover, this must be achieved without loss of service, which in most cases lets us deploy outside of maintenance windows (we IT guys will do anything to avoid working on Saturdays!).

At Santander Global Tech we are used to this type of deployment due to the wide variety of systems on which banking application software runs successfully.

In this article we want to show how we manage to ensure that these deployments are successful in 99% of cases… and for that 1% there is a contingency with immediate resolution.

For this we will use the well-known shadowing technique: the software is deployed in a production environment but is not visible to the end customer, so we can check that the software and its configuration are working properly.

There are many other ways of testing the software (blue-green, A/B testing or canary). We chose shadowing for two reasons. First, it makes it simpler to integrate the various technologies and infrastructures, since not all of them support these more modern techniques, or supporting them would complicate the integration. Second, it lets us be 100% sure that everything works correctly before opening it to the client.

Throughout the document we will discuss the theory and use a practical example to better understand it, involving WAS and PaaS servers, in an Active/Active high availability environment, which talk to each other both via browsing and Web Services.

In our example we will detail how to manage the different technologies so that they are treated as one, and integrate them for user testing in the shadow environment as if it were a real integrated final environment.

It should be noted that DEV and PRE environments are also available to perform more extensive UATs without so much paraphernalia. What is described in this article is done in PRO in order to perform both a technical and a functional acceptance test, which ensures that the software and its configuration will be opened to the public without avoidable errors (e.g. connectivity errors).

2. Prerequisites

2.1.  Infrastructure, a big part of success

To be able to provide a continuous, resilient service and deployments without loss of service, as well as Active/Active high availability, it is necessary to use the same strategy as NASA: have everything in duplicate.

For example, with respect to the IAS infrastructure (dedicated servers), in production there are two clusters with two servers each (and, within each cluster, each server is in a different data processing centre, or DPC). Each cluster can run one or more WAS instances, for example, depending on the load requirements of the environment.

As for PaaS infrastructures with, for example, Openshift, two different clusters in different data centres are also used, and at least two PODs per service in each cluster (usually each POD runs on a different physical machine, gaining resilience).

In summary, there are at least four elements that can provide service in Active/Active mode, and load testing ensures that half of this infrastructure can support the total load needed to provide an optimal service. As a result, an unexpected outage of a physical machine or of an entire DPC would not interrupt the service.
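The sizing rule can be written down as a small check. This is only a minimal sketch: the element names, the capacities and the peak loads are hypothetical numbers chosen for illustration, and the check is slightly stricter than the rule itself because it tries every possible half, not only the loss of one DPC.

```python
from itertools import combinations


def survives_loss_of_half(capacities, peak_load):
    """Return True if every possible half of the elements can carry peak_load alone."""
    half = len(capacities) // 2
    return all(
        sum(capacities[name] for name in survivors) >= peak_load
        for survivors in combinations(capacities, half)
    )


# Hypothetical capacities (requests per second) of four Active/Active elements.
capacities = {"dpc1-a": 600, "dpc1-b": 600, "dpc2-a": 600, "dpc2-b": 600}

print(survives_loss_of_half(capacities, peak_load=1000))  # True: any half gives 1200
print(survives_loss_of_half(capacities, peak_load=1500))  # False: 1200 is not enough
```

In practice the capacities would come from the load tests, not from a hard-coded dictionary.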

In all cases a load balancer is in charge of managing the TLS layer. This network element is what lets us change where each DNS name is routed, which avoids loss of service and lets us verify the quality of the software.

In generic terms, this would be the infrastructure that gives us High Availability and also lets us carry out the passes (deployments to production) more safely:

Generic structure for High Availability

Let’s put here the necessary infrastructure for our example:

Infrastructure of our example

We will call the first cluster “Shadow”: it is the part that will first be hidden from the end user, so that we can update the software and run the tests, while “Real” (the second cluster) keeps providing service.

Throughout the document we will illustrate how the load balancer will be configured to do this.

2.2. A trick: task-specific DNS

There are several approaches to solve the problem of having to provide service through several technologies for the same application. Somehow you have to know where the requests have to go!

One approach is to use a different DNS subdomain (third level or deeper) for each of the technologies, and another is to use paths in the URL that distinguish which servers are to be addressed (configured in the load balancer, a reverse proxy or a lightweight gateway).

In our example we will choose the first way, although it is easy to adapt it to the second (for example, the subdomain name would simply become the first level of the URL path).

We will use this to perform all the UAT (user acceptance tests) correctly, so we need DNS names pointing to each environment of each system, allowing the tests to be performed in an integrated way.

As usual, we will have our own DNS that will serve our application (please, don’t mind the names, it’s an example! For your applications, use cooler, more appropriate names).

In our particular case, we will use two, one for each technology: servicewas.myapplication.com and servicepaas.myapplication.com

In addition we need to create at least 2 more for each technology (one per cluster):

  • shadowwas.myapplication.com
  • realwas.myapplication.com
  • shadowpaas.myapplication.com
  • realpaas.myapplication.com

It is easier to understand as an architecture diagram:

DNS subdomains pointing to each cluster

Actually, all DNS subdomains point to the load balancer, and the load balancer is the one that knows that servicexxx has to point to both clusters, shadowxxx to cluster 1 and realxxx to cluster 2.
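As an illustration, the knowledge the load balancer holds can be pictured as a simple routing table. This is a sketch, not the configuration of any real load balancer: the hostnames are the ones from our example, while the cluster identifiers and the `backends_for` helper are hypothetical.

```python
# Hypothetical routing table: hostname -> clusters allowed to receive the request.
ROUTES = {
    "servicewas.myapplication.com": ["was-cluster-1", "was-cluster-2"],
    "shadowwas.myapplication.com": ["was-cluster-1"],
    "realwas.myapplication.com": ["was-cluster-2"],
    "servicepaas.myapplication.com": ["paas-cluster-1", "paas-cluster-2"],
    "shadowpaas.myapplication.com": ["paas-cluster-1"],
    "realpaas.myapplication.com": ["paas-cluster-2"],
}


def backends_for(host):
    """Return the clusters behind a hostname, or raise ValueError if it is unknown."""
    try:
        return ROUTES[host.lower()]
    except KeyError:
        raise ValueError(f"unknown host: {host}") from None


print(backends_for("shadowpaas.myapplication.com"))  # ['paas-cluster-1']
```

Taking a cluster out of service then amounts to removing it from the two servicexxx entries, while the shadowxxx and realxxx entries keep pointing to a single cluster each.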

2.3. In addition, our secret ingredient… multistep pipelines

Now, please be quiet, we will proceed to introduce … our secret ingredient!

One of the keys to being able to perform these passes as unobtrusively as possible is the use of multistep pipelines.

And what does that mean?

That the deployment files are defined in several steps. This allows step-by-step execution instructions to be given, allowing different configurations or different software to be launched at each step, with progress being marked by an orchestrator process or by a human launching it manually.

The structure of one of the files is as follows:

steps:
  - name: Shadow
    next: SetUpShadowForService
    deploys:
      - deploy:
          […]
      - deploy:
          […]

  - name: SetUpShadowForService
    previous: Shadow
    next: ShadowInService
    deploys:
      - deploy:
          […]

[…]

This way you can indicate which actions are to be performed at each step automatically.

For example, in the first step you can lower the number of PODs, take Shadow out of Service on a Load balancer or configure certain IAS in a specific way.

To move to the next step (or back to the previous one if something went wrong), you simply launch it from your deployment manager (e.g. UrbanCode or Jenkins/CloudBees), either by naming the step or by following the “next” or “previous” order.
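To make the “next” and “previous” navigation concrete, here is a minimal sketch of how an orchestrator could resolve the step to launch. It is not how UrbanCode or Jenkins work internally. The step names come from the file structure shown earlier, the deploy contents are left out, and the `previous` link of `ShadowInService` and the `move` helper are assumptions made for the example.

```python
# Steps mirror the structure of the pipeline file; deploy contents are omitted.
STEPS = [
    {"name": "Shadow", "next": "SetUpShadowForService"},
    {"name": "SetUpShadowForService", "previous": "Shadow", "next": "ShadowInService"},
    {"name": "ShadowInService", "previous": "SetUpShadowForService"},  # assumed link
]


def move(steps, current, direction):
    """Return the name of the step reached from `current` via 'next' or 'previous'."""
    if direction not in ("next", "previous"):
        raise ValueError(f"unknown direction: {direction}")
    by_name = {step["name"]: step for step in steps}
    target = by_name[current].get(direction)
    if target is None:
        raise ValueError(f"step {current!r} has no {direction} step")
    if target not in by_name:
        raise ValueError(f"step {current!r} points to undefined step {target!r}")
    return target


print(move(STEPS, "Shadow", "next"))                     # SetUpShadowForService
print(move(STEPS, "SetUpShadowForService", "previous"))  # Shadow
```

Validating that the target step exists before launching it catches typos in the pipeline file before they reach production.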

3. Deployment steps

The deployment strategy is what allows all this to be carried out in an orderly fashion, so that the new software can be tested in a controlled environment and without downtime!

These would be the steps we would follow…

3.1.  Shadow Deployment

In this step we will isolate the shadow clusters in order to deploy the software and its new configuration, so that the changes we make cannot be seen by end customers but the new software can still be tested.

These would be the steps to follow:

a) Load balancers: Take Shadow Cluster out of Service and leave Real Cluster giving Service.

b) WAS Configuration: Configuration to use in Shadow (Modify to point to Shadow DNSs, e.g. redirects, logins, WS).

c) WAS Deployment: Deploy WAS software and configuration on Shadow clusters.

d) PaaS configuration: Deploy configuration files and secrets. They can be deployed in Shadow and Real (in our case they are versioned, so even if they are installed in Real they are not used until the new SW is deployed, since the version to be used is indicated in the PODs’ environment).

e) PaaS Deployment: Deploy all the services (e.g. Angular and Spring Boot) in Shadow with the secrets configured pointing to Shadow.

f) Shadow Shakedown

And this is what our architecture will look like:

Shadow out of service

The shadow part will no longer provide service and will be accessed through the Shadowxxx subdomains. As everything is configured to point to the Shadowxxx domains, the application can be navigated as if it were its “service” domain and testers can test their environment in a more natural way.

It is important to configure the software so that all the URLs it calls are those of Shadowxxx, both navigation URLs and calls to possible services (e.g. web services), so that testers really are testing the new software and configuration from end to end.
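One way to keep the URLs consistent is to generate them from a single template per target. This is a minimal sketch under assumptions: the hostnames follow the naming of our example, but the configuration keys, the URL paths and the `render` helper are hypothetical.

```python
TECHNOLOGIES = ("was", "paas")
TARGETS = ("shadow", "real", "service")


def hostnames_for(target, base_domain="myapplication.com"):
    """Return the hostname of each technology for a target such as 'shadow'."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    return {tech: f"{target}{tech}.{base_domain}" for tech in TECHNOLOGIES}


def render(template, target):
    """Fill the {was} and {paas} placeholders of every URL for the given target."""
    hosts = hostnames_for(target)
    return {key: url.format(**hosts) for key, url in template.items()}


# Hypothetical configuration keys and paths, for illustration only.
TEMPLATE = {
    "login_redirect": "https://{was}/login",
    "accounts_ws": "https://{paas}/api/accounts",
}

print(render(TEMPLATE, "shadow")["login_redirect"])  # https://shadowwas.myapplication.com/login
```

The same template rendered with `"service"` or `"real"` gives the configuration needed in the later steps, so the only thing that changes between steps is the target name.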

3.2.  Preparing Shadow for Service

Once the software has been tested and integrated, it is reconfigured to point where it should actually point, the service DNS, and we check that there are no problems before providing service.

These are the steps that will be followed in this part of the pass:

a) WAS Configuration: Configuration to use for Serving (Modify to point to Service DNSs, e.g. redirects, logins, WS).

b) PaaS Deployment: configure all services (e.g. Angular and Spring Boot) with the secrets pointing to Service (as they will work in production).

c) Shadow Shakedown ready to provide Service

At this point we test the environment configured exactly as it will run in PRO. We check that connectivity and redirections are the correct ones for the future production environment.

This is the architecture of the example:

Shadow ready to go into Service

This point is important: the software must be configured so that all the URLs it calls are those of servicexxx, both the navigation URLs and the calls to possible services (e.g. web services), so that when it enters service it already has the correct configuration.
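Part of this shakedown can be an automatic check that no URL still points to a shadow or real hostname. The following is a sketch only: the configuration keys, the paths and the `stray_urls` helper are hypothetical, and a real check would read the deployed configuration instead of a literal dictionary.

```python
from urllib.parse import urlsplit


def stray_urls(config, allowed_hosts):
    """Return the config entries whose URL host is not one of allowed_hosts."""
    return {
        key: url
        for key, url in config.items()
        if urlsplit(url).hostname not in allowed_hosts
    }


SERVICE_HOSTS = {"servicewas.myapplication.com", "servicepaas.myapplication.com"}

# Hypothetical configuration in which one entry was left pointing to Shadow.
config = {
    "login_redirect": "https://servicewas.myapplication.com/login",
    "accounts_ws": "https://shadowpaas.myapplication.com/api/accounts",
}

print(stray_urls(config, SERVICE_HOSTS))
# {'accounts_ws': 'https://shadowpaas.myapplication.com/api/accounts'}
```

An empty result means every URL already targets the service hostnames.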

3.3. Put Shadow into Service

At this point the new software comes into service.

As this is an instantaneous change in the load balancer, there is no loss of service.

These are the tasks:

a) Load balancers: Put the Shadow cluster into service and take the Real cluster out of service.

b) Service shakedown

c) Wait several days

This is what the architecture of our example would look like:

Shadow into Service

After this point, it is usually left for a few days (often referred to as “fallowing”), so that if there is a problem, it can be reversed immediately.

The way to revert is to re-point the load balancer as it was at the previous point, i.e. with Real in service and Shadow out of service. This is immediate and would leave the previous software and configuration working again in case of any serious problem.

Although only half of the infrastructure is in service during this period, it is sized to support normal usage and even occasional peaks without significant delays in response.
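The switch and its immediate rollback can be modelled with a toy class. This is not the API of any real load balancer, only a sketch of the idea: the `LoadBalancer` class and its method names are hypothetical, and the cluster names are the ones used in this article.

```python
class LoadBalancer:
    """Toy model of which clusters receive customer traffic."""

    def __init__(self, in_service):
        self.in_service = set(in_service)
        self._history = []  # previous states, most recent last

    def switch(self, in_service):
        """Change the clusters in service, remembering the previous state."""
        new_state = set(in_service)
        if not new_state:
            raise ValueError("at least one cluster must stay in service")
        self._history.append(self.in_service)
        self.in_service = new_state

    def rollback(self):
        """Restore the state that was active before the last switch."""
        if not self._history:
            raise RuntimeError("nothing to roll back")
        self.in_service = self._history.pop()


lb = LoadBalancer({"real"})   # situation during steps 3.1 and 3.2
lb.switch({"shadow"})         # step 3.3: new software goes live
lb.rollback()                 # serious problem: back to the old software
print(lb.in_service)          # {'real'}
```

The rollback works only because Real is left untouched during the waiting days, which is why it is not updated until step 3.4.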

3.4.  Deployment in Real

Now it is time to level the two clusters, i.e. leave Real with the same configuration and software as Shadow. We will follow the same strategy, although in this case Real is already out of service.

Therefore, what has to be done is to deploy the software in the Real cells and its configuration pointing to the Real DNS (realxxx).

These are the tasks:

a) WAS configuration: Configuration to use in Real (Modify to point to Real DNSs, e.g. redirects, logins, WS).

b) WAS Deployment: Deploy WAS software and configuration on Real clusters.

c) PaaS Deployment: Deploy all services (e.g. Angular and Spring Boot) in Real with the configured secrets pointing to Real.

d) Real Shakedown

The example architecture would look like this:

Real ready to be tested

Again, minimal conformance testing and some connectivity verification techniques are performed in an integrated manner between the two systems.

3.5. Preparing Real for commissioning

Once it has been verified that the software in the Real clusters is working correctly, the configuration has to be prepared for service.

These would be the tasks:

a) WAS configuration: Configuration to use giving Service (Modify to point to Service DNSs, e.g. redirects, logins, WS).

b) PaaS Deployment: configure all services (e.g. Angular and Spring Boot) with the secrets pointing to Service (as they will work in production).

c) Real Shakedown ready to provide Service.

For this we apply the configuration with the DNS pointing to Servicexxx and check that it is correct.

This is what it would look like:

Real ready to go into Service

And finally we would be ready for the last step…

3.6. Putting Real into Service

As the last step, the Real clusters are put into service, so we are back to the initial situation but with the new software deployed.

These would be the tasks:

a) Load balancers: Put the Real cluster into service (Shadow is already in service).

b) Service shakedown

c) End of the deployment and thank the participating teams for the result.

This is what the architecture looks like at the end:

All in Service

Our software will then be running at full capacity and waiting for the next deployment.
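Looking back at the six steps, two rules hold throughout: some cluster is always serving customers, and a cluster is never changed while it is serving. The sketch below encodes the sequence and checks both rules. The tuple layout and the `check_plan` helper are our own illustration, not part of any deployment tool.

```python
# Each entry: (step, clusters serving customers, cluster being changed or None).
PLAN = [
    ("3.1 Shadow deployment", {"real"}, "shadow"),
    ("3.2 Preparing Shadow for service", {"real"}, "shadow"),
    ("3.3 Put Shadow into service", {"shadow"}, None),
    ("3.4 Deployment in Real", {"shadow"}, "real"),
    ("3.5 Preparing Real for service", {"shadow"}, "real"),
    ("3.6 Putting Real into service", {"shadow", "real"}, None),
]


def check_plan(plan):
    """Return True if no step interrupts the service or changes a serving cluster."""
    for step, serving, changing in plan:
        if not serving:
            raise ValueError(f"{step}: no cluster is in service")
        if changing in serving:
            raise ValueError(f"{step}: {changing} is changed while serving customers")
    return True


print(check_plan(PLAN))  # True
```

A check like this is cheap to run against a planned pass before anyone touches the load balancer.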

4. Recommendations and next steps

And our last but not least recommendation… the preparation and documentation of the deployment to production.

It seems obvious, but it is essential to plan the go-live as thoroughly as possible and leave as little to improvisation as possible.

It is also important to get in touch with the departments that will be involved in the production run, to determine if there is anything missing or if you can anticipate any data they may need.

Therefore, all tasks that can be brought forward should be brought forward, especially if they require special configurations that take time to complete or if the implementation teams need data from third departments.

Examples that can delay a deployment to production and lead to failure: firewall rules, user or client id/client secrets of WS or APIs, access to third party applications, CA certificates of the destinations, client certificates if necessary, permissions on directories…

All these points are easy to request in advance from the corresponding teams. In many cases, it is also possible to carry out a verification test before the pass (for example, asking for a simple execution from the source machine to the destination machine will verify the connectivity, saving us many surprises).
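Such a verification can be as small as a TCP connection attempt run from the source machine. This is a minimal sketch using Python's standard `socket` module. The `can_connect` helper is hypothetical, and a successful connection only proves that the network path and firewall rules are open, not that credentials or certificates are valid.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable or name not resolved
        return False
```

For example, running `can_connect("app.otherapp.es", 443)` on the source server before the pass would tell us whether the firewall rule for that destination is already in place.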

Therefore, in particularly complex passes, or where new or special configurations are required, it is worth holding a quick meeting with the coordination departments and the executors to walk through the steps and describe the actions to be taken. This anticipates problems that we might otherwise encounter and that would delay the pass.

In addition, another maxim to follow is “give instructions as you would like to receive them yourself”. In other words, indicate the tasks unambiguously, as precisely and completely as the team would like to receive them.

That way there will be no confusion, no last-minute questions caused by missing information and no errors in the execution.

A good practice is to have a document with a summary of all the steps to be carried out, in order, followed by a section with a detailed description of each of them.

Simple example:

Deployment 22333 of the application “My application”

Order of tasks:

  1. Configure WS
  2. Modify configuration file Config.xml

Details

  1. Configure WS

Configure a new WS in the configuration of the application “My application” located on the server myserver.corp.

The URL is https://app.otherapp.es/application/ in which the credentials previously provided by XXXXX in the mail thread “Passwords for WS otherapp” will be used

  2. Modify configuration file Config.xml

Modify the file /configs/config.xml for the application “My application” located in the server myserver.corp.

Edit the <language> tag.

It now contains:

<language>es</language>

Change the content to:

<language>en</language>

This simple document, in addition to facilitating pass tasks for the teams, serves as a planning exercise and also allows for the identification of potential gaps before they occur.
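A task described as precisely as step 2 of the example is also easy to script. The sketch below uses Python's standard `xml.etree.ElementTree` module. The `set_language` helper and the minimal XML content are hypothetical, since the example does not show the rest of the file.

```python
import xml.etree.ElementTree as ET


def set_language(xml_text, new_value, expected_old=None):
    """Return xml_text with the first <language> element set to new_value."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == "language" else root.find(".//language")
    if node is None:
        raise ValueError("no <language> tag found")
    if expected_old is not None and node.text != expected_old:
        raise ValueError(f"expected {expected_old!r}, found {node.text!r}")
    node.text = new_value
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal content; the real config.xml will contain more elements.
before = "<config><language>es</language></config>"
print(set_language(before, "en", expected_old="es"))
# <config><language>en</language></config>
```

Checking the expected old value first mirrors the “It now contains” line of the document: if the file is not in the state the instructions assume, the change stops instead of being applied blindly.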

Logically, using a tool for coordinating passes and requests, for example Remedy or ServiceNow, is a great help when it comes to managing them.

As for next steps: although at Santander Global Tech the vast majority of technologies can already be deployed with full CI and many of the tasks are automated or semi-automated, we are working hard to achieve full CI even in this type of complex multiplatform pass. The goal is to minimize manual tasks and fully automate the passes by coordinating the different pipelines involved in the CIs.

This will make it possible to automate even the most complex passes with almost no human intervention.

We cannot end this article without thanking all the teams that allow such complex passes to always come to fruition, because without their great knowledge, dedication and enthusiasm it would not be possible: Change Management, Deployment Support, Web System, DevSecOps, Data Network, Service Establishment and so many others whose participation is occasional but fundamental. Working as a single great team, they make this continuous success possible.

Santander Global Tech is the global technology company, part of Santander’s Technology and Operations (T&O) division. With more than 2,000 employees and based in Madrid, we work to make Santander an open platform for financial services.



Rubén Rodríguez Martín

Santander

Computer Engineer, programming since I was a child (10 years old!), internet user and administrator since ’96, specialist in software development, systems, networks, security, home automation, complex deployments, PaaS… and a big fan of new technologies, process optimization, innovation, science fiction and much more!

 
