What is New is “Old”

For the last few weeks I have been walking around with a thought gnawing at me, and it has finally settled.

Cloud functions/Lambda, or the #serverless trend in general, is just a return to the mainframe way of working (yes, I know I am showing my age).

In the early days of computing, when you wrote some code you had to ask (or beg) the system administrator to allocate you some CPU and memory to run your application on the shared compute resource (sounds familiar?), and in many cases plead that they allocate enough time for the execution to complete.

With the proliferation of personal computers and servers, people slowly lost the “resource awareness” that was instilled in the early programmers, and each language and product became more resource hungry, until we got to software that takes more than 20GB of storage and almost as much memory to run.

And then came the (not so) new trend, #serverless: you write some code and somewhere, somehow, you get allocated the resources to run it in the #cloud. But once again you need to ask for the CPU and memory, and you are limited by the execution time allocated to your code.

In today’s “mainframe” model the allocation of these resources is far more readily available, but it only shows that there is “nothing new under the sun”.

Cloud Cost Management Vs. Optimisation

I just noticed that it has been a year since the last post; it seems it takes me a long time to linger on ideas before I publish them.

I have been working in FinOps for the last year and have seen many consulting companies offering “Cost Optimisation” services, and I find it frustrating. The true name for what they offer is “Cost Management”, which is the first step in understanding cloud spend and reducing waste.

Cost optimisation, on the other hand, is not about reducing cost; it is about ensuring that the money we spend is used well, getting as much performance as possible for the lowest spend. But we will still spend money.

The main difference is in the time and planning involved. “Management” is going over the cloud resources, cleaning up unused volumes, terminating idle virtual machines and dealing with RIs, all of which can be done in a two-hour review of the cloud account.
“Optimisation”, on the other hand, is a process that involves the product team, the architects, the developers and the cloud team, to ensure that a new feature uses cloud resources as effectively as possible while spending the least amount of money. It is an iterative process that can take anywhere from several days to months, and in some cases I have seen it end with the realisation that the proposed feature is not financially viable and should be dropped completely.

I know that selling “cost management” is harder than telling a client that we can “optimise” their costs, but we also owe our clients honesty.

Thankfully my current position lets me try to preach that change, and I am working to instil that understanding in the company, and from there in our clients.

Enter the world of FinOps

After losing my previous position due to COVID-19, I was fortunate to land a position in FinOps. The title and the concept were new to me; the practice was not, as I had been doing it both as a consultant and as part of my previous job.

It turns out that, like “DevOps”, someone took “Finance” and “Operations” and combined them into “FinOps”. Trying to explain this to someone outside the hi-tech industry is confusing: “I am a technical person responsible for overseeing the technological expenditures of the company and working on reducing them, while ensuring I do not hinder productivity or innovation.” It is that last part that most people have a hard time with, as most associate R&D with unchecked spending that gets called “research” (somewhat like throwing a rock into a lake and hoping to hit a fish).

But the FinOps role, much like the CISO role, reaches into SO many aspects of a company: reining in the AWS spending, working with the developers to ensure they get the tools they NEED but not necessarily the tools they want, negotiating and reviewing contracts for new services and tools, and evaluating those tools both for functionality and for cost effectiveness. The FinOps person has their plate full.

All this, and the roles I held before, helped me realise that many places make a crucial mistake when they hire for the FinOps position: they hire an analyst, someone who comes from a financial background rather than a technological one, placing the emphasis on the “Fin” part of the title, much like those who only want developers in DevOps positions.

That approach, in my opinion, is the wrong one.
What you get is someone with a limited scope, who has no “operational” skill set, who will be slow to learn the technology, and who will struggle to talk to the technical people they need to convince with recommendations such as “you can run the database on ECS with a persistent volume on EFS and a t2.medium backend, instead of an m5.large” (if you don’t understand what I just said, drop me a line).

A good FinOps person needs a solid operational background and a good understanding of how to weigh financial implications, but the skill most tech-savvy people lack is the hardest one to find: people skills.
A FinOps person needs to negotiate contracts and talk to sales people as well as tech people, and (I know from personal experience) a lot of us do not have that.

In summary, I am glad that I took this change in direction; I hope I will be successful in it and that I can pass on what I learn to others.

This post was originally meant to be posted on #peerlyst, but due to the demise of that community, I posted it here.

New directions

Covid-19 has caused many things to go wrong for a lot of people and a lot of industries. I worked for a company that catered to one of the industries that was hit hard: public transport.

The company I worked for had to make cutbacks and I found myself on the lookout for a new position. During the lockdown I took the time to deepen my AWS knowledge, test it against the AWS certifications, and obtained the AWS Solutions Architect Associate certification.

I am now going for the SysOps certification.

I have also found more time to pay attention to the Icinga Ansible playbooks, and I am trying to be more responsive to the PRs and issues raised.

Time to Say Goodbye

It has been a long time and a long journey for Aiki Linux, but sadly things must come to an end.

Due to many factors, some of our business ventures were not as successful as we hoped and we had to forgo them. One of those is our partnership with Icinga, not due to any bad blood or any doubt in the product, quite the opposite; it is just that the venture did not end up being profitable for us.

We will be closing Aiki Linux to future commercial activity at the start of 2020.

We thank all that worked with us and taught us along the way.

Icinga Camp Berlin 2019

We just came back from the @icinga #icingacamp #berlin 2019, and it was a great day of meeting new people and hearing great talks about how people use Icinga and what the roadmap, both for community development and for actual code features, will be.

We are very excited about the upcoming reporting module and the IcingaDB feature, which will help speed up the IcingaWeb presentation and alleviate the slowness that the IDO added to the flow.

The announcement of the two-day Icinga conference in Amsterdam next year is also exciting; it is great to see a project we have been involved with from its early stages grow so much and become such a success.

Production distributed system – pt. 2

Once we were able to get the Galera databases to sync and be aware of each other, it was time to tackle the question of “how do we register the service?”

So it was time to work on the Consul cluster. We considered using three dedicated nodes for this cluster, to add another layer of redundancy to each component, but the customer elected to run the Consul service on the same nodes as Galera. It might seem odd to have the discovery server run on the same node as the service it is monitoring, but the logic was: “if the Galera node is down, then the Consul service is also degraded, and we will address them together”.

So we built a three-node Consul service, with agents on each of the Galera nodes.

Each node was configured to join the cluster, with the two other nodes specified in the “start_join” directive:

{
  "server": false,
  "datacenter": "foo",
  "data_dir": "/var/consul",
  "encrypt": "",
  "log_level": "INFO",
  "enable_syslog": true,
  "start_join": [ "172.2.6.15", "172.2.7.10" ]
}
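
For completeness, the server side of the cluster runs essentially the same file with the server role enabled; a sketch of what such a server config.json typically looks like (the bootstrap_expect value and the peer addresses here are illustrative, not copied from the actual deployment):

{
  "server": true,
  "bootstrap_expect": 3,
  "datacenter": "foo",
  "data_dir": "/var/consul",
  "encrypt": "",
  "log_level": "INFO",
  "enable_syslog": true,
  "start_join": [ "172.2.6.15", "172.2.7.10" ]
}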

That client file was placed in /etc/consul.d/client/config.json, and it took care of the client/server sign-up. But what about knowing whether Galera is up? Simple: we created a check that queries the backend database and reports back. This file, aptly named galera.json, was placed in the main /etc/consul.d directory:

{
  "service": {
    "name": "galeradb",
    "tags": ["icinga-galera"],
    "check": {
      "id": "mysql",
      "name": "Check mysql port listening",
      "tcp": "localhost:3306",
      "interval": "10s",
      "timeout": "1s"
    }
  }
}

This ensured that Consul checked the database’s response and reported back to the cluster in case of a failure, so that a failed node would be dropped from the service and the remaining nodes would take over.
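
A quick way to confirm that the registration and the health check behave as expected is to query the agent and Consul’s DNS interface on its default port 8600; something along these lines (a sketch, not a capture from the customer’s environment):

# List the cluster members as the local agent sees them
consul members

# Ask Consul's DNS interface which nodes currently serve the galeradb service
dig @127.0.0.1 -p 8600 galeradb.service.consul SRV

A node whose check fails simply drops out of the DNS answer, which is exactly the failover behaviour described above.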

At this stage, with the backend ready, we started the Icinga installation, with two masters and two web servers in a redundant setup (that documentation is found here), but when we came to point the IDO at the Galera database we hit an issue.

We had changed /etc/resolv.conf on the Icinga nodes to use the three Consul nodes, so that Icinga would use Consul as its DNS and be able to resolve the database name.
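
A minimal sketch of that resolv.conf (the addresses below are placeholders rather than the customer’s real Consul node IPs):

nameserver 172.2.6.15
nameserver 172.2.7.10
nameserver 172.2.8.20

With that in place, the IDO connection could reference the Consul service name: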

/**
 * The db_ido_mysql library implements IDO functionality
 * for MySQL.
 */

library "db_ido_mysql"

object IdoMysqlConnection "ido-mysql" {
  user = ""
  password = ""
  host = "galeradb.service.consul"
  database = "icinga"
}

But many of the system’s checks relied on DNS resolution of external IPs, so we were stuck on how to ensure that the nodes could resolve both the Consul service names and everything else correctly.

So we had to connect Icinga to a named server, in our case BIND 9. We built a named service on the same nodes, so that we could make as few changes as possible on the Icinga servers and keep the already configured DNS requests on port 53 (UDP), going to the Consul servers, working for us.

A very basic named.conf :

options {
  directory "/var/named";
  dump-file "/var/named/data/cache_dump.db";
  statistics-file "/var/named/data/named_stats.txt";
  memstatistics-file "/var/named/data/named_mem_stats.txt";
  allow-query { any; };
  recursion yes;

  dnssec-enable no;
  dnssec-validation no;

/* Path to ISC DLV key */
  bindkeys-file "/etc/named.iscdlv.key";

  managed-keys-directory "/var/named/dynamic";
};

include "/etc/named/consul.conf";

Notice the inclusion of the consul.conf file; this is where the “magic” happens:

zone "consul" IN {
  type forward;
  forward only;
  forwarders { 127.0.0.1 port 8600; };
};

This file tells named to forward all DNS requests to an external DNS server, except those in the “consul” domain, which are forwarded to localhost on port 8600 (Consul’s default DNS port) and thus return the IP of the Galera cluster. Any other name goes to the DNS of choice configured when the Consul service was built; we chose the all too familiar 8.8.8.8 (this is added at the cluster bootstrap stage):

"recursors":[
"8.8.8.8"
]

The next stage was to test name resolution and the system’s survivability.
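
A sanity check of both resolution paths from one of the Icinga nodes (which resolve through the named instances on the Consul nodes) might look like this; a sketch of the kind of queries we ran, not a transcript:

# Names in the consul domain should be answered via the 127.0.0.1:8600 forwarder
dig +short galeradb.service.consul

# Everything else should be recursed out through 8.8.8.8
dig +short icinga.com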

Production distributed system – pt. 1

A customer came to AikiLinux requesting our assistance in designing and implementing a highly distributed and resilient monitoring system based on Icinga, with a planned scope of monitoring its own internal cloud service and some of the services it provides to its external customers.

In the initial step we evaluated the requirements of the cluster and built a small-scale lab for them (a master, two satellites and a host to monitor), and then set out to understand the network topology and the limitations that might impact performance.

The things we found were “normal” for a large, multi-continent organisation:

  • remote, separated data centres
  • a very restrictive IT department
  • ESX resources

… nothing new, and nothing we had not encountered before.

So we set out to design the solution and thought about which components would help us provide a truly redundant system, without relying on any cloud provider service, all done in house.

The stack we ended up with was fairly simple: MariaDB Galera, HashiCorp Consul and named for the database layer, and a standard HA setup for Icinga itself.

The first challenge in building this system was ensuring that the Galera cluster was up and running, so we modified /etc/my.cnf.d/server.cnf:

# this is read by the standalone daemon and embedded servers
[server]

# this is only for the mysqld standalone daemon
[mysqld]
log_error=/var/log/mariadb.log
#
# * Galera-related settings
#
[galera]
# Allow server to accept connections on all interfaces.
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://172.32.6.15,172.31.6.15,172.33.6.15"
innodb_locks_unsafe_for_binlog = 1

## Galera Cluster Configuration
wsrep_cluster_name="icinga-galera"

and started the nodes… only to find no sync between them.

We tested several solutions, modifying the security policy and the firewall, but in the end the only way to get the cluster up and running was to disable SELinux (mind you, this was behind the third firewall you have to get through to gain access to the server).
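
For reference, on a CentOS/RHEL-style node the change boils down to something like this (a sketch of the relevant commands and the well-known Galera ports, not the exact change set from this engagement):

# Disable SELinux now and across reboots
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Open the ports Galera needs between the nodes:
# 3306 (MySQL), 4567 (group replication), 4568 (IST), 4444 (SST)
firewall-cmd --permanent --add-port={3306,4567,4568,4444}/tcp
firewall-cmd --reload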

Once the nodes “saw” each other, we started testing data replication, and we saw that two nodes replicated data but the third did not.

It turned out that NTP was disabled and the time difference between the servers was more than 1900 seconds. We uncommented the NTP server records, made sure the clocks were in sync… and we had replication. YAY!!
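
The fix itself is a few lines on each node; roughly the following (the pool server below is a placeholder for whatever time source the customer actually used):

# Check how far off the clock is before touching anything
ntpdate -q pool.ntp.org

# Re-enable the NTP daemon and verify that peers are reachable
systemctl enable --now ntpd
ntpq -p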


Updates and plans for the future

It has been a couple of busy months for our team at AikiLinux since FlossUK, with good things happening:

  • We have started working with Icinga on organising an Icinga Camp in Tel Aviv later this year; the provisional dates are 10-16 of December.
  • We have expanded our personnel by bringing a new person into the fold in the UK.
  • As we strive to expand our knowledge, our team members have implemented a Prometheus monitoring solution at one of our customers, and are also building a DR solution for their AWS environment based on Terraform.
  • Two new clients have started their engagement with AikiLinux: