So I have been doing some deep dives into the microservices development area, playing with Mesos, Marathon, Consul, Docker and Traefik, as well as tools that make things easier like Ansible, Chef, Terraform and Mantl (more of a platform, though). I found it quite interesting what is and is not possible with the current tools. I also began paying cloud service providers for resources to set up these services once I realized my local VMware Workstation environment wouldn't be enough for true testing.
My first impression of microservices was that this was a revolutionary way to deal with computing loads. It essentially takes the old mainframe setup and distributes the load across a cluster of machines that you can consider cattle. Those machines are disposable, unique in name only, places to run whatever load they are capable of. The hardware you run can be new, old or refurbished; it doesn't matter, it's expendable. Redundancy? You don't really need it, you just need enough machines to stay ahead of the attrition. Need more resources? Just rack a machine and install the basics, done. The one minor limitation I saw was that there is no way to say "power this machine down if not in use". A true mechanism for scaling up and down on the physical side is not built into Mesos. I am testing a solution that's currently being pushed in OpenStack called Ironic, which allows you to provision bare metal automatically (something Mesos doesn't do either). A fleet management tool of sorts.
Mesos is easy to install; Consul is a bit more time consuming, which makes ensuring DNS is working properly in this environment even more necessary. Many folks start off with Dnsmasq, which is a good stopgap. But after realizing that Mesos by itself was incomplete (referring to the open source version here, not Mesosphere), I started looking at what others were doing about the lack of management and monitoring. I found a few projects online, along with a few brew-your-own how-to guides that seemed a bit tedious for my timeline. One project stood out among them all, one called "Mantl". It's a project that was created and published on GitHub by Cisco (of all companies). They put together a platform deployment tool which builds a Mesosphere-like environment out of various components. Interestingly enough, it seems Mesosphere is now mirroring the gist of those features in their enterprise setup. They are not using the same tools (they brewed their own), but the outcome is strikingly similar. I assume theirs is probably slightly better because what they are doing is natively integrated, but I can't test it at the moment since they don't offer the Mesosphere product on my current cloud services provider, Digital Ocean (DO).
So on the surface, you are confronted with a "wow" system. When you start asking questions about monitoring it, management of nodes, communication between components, scaling strategy and task management, things start to fall apart. Luckily, with the additional tools that Mantl has melded together using Ansible, you can answer most of those questions. For monitoring, there are two aspects: monitoring the environment and monitoring your applications. For monitoring the environment, they created an addon that uses the ELK stack, whose components are set up as microservices and run in Marathon. For microservice health checks, they have Consul, and of course you can use the native Marathon health checks per application instance. You can use another addon called Chronos for task management, though apparently in the upcoming releases of Marathon they are integrating that into the product (I am not 100% sure if that will be in the free product or if that's a Mesosphere component and part of the DCOS). Scaling is partially handled by three components: Marathon, Mesos and Traefik. Mesos provides the resources, Marathon handles task distribution and Traefik handles edge communication and load balancing. You could supplement that with HAProxy nodes on the perimeter if you really wanted to get fancy. Auto-scaling is not quite a feature yet. I did find a project on GitHub that runs in Chronos, checks the load of a task and can then initiate a scaling request to Marathon. Along with health checks, Consul also handles the communication between applications by acting as a DNS manager of sorts that keeps and distributes a list of services in the cluster. I must note that Mesosphere is constantly evolving and pulling in more components with every release. Their stated goal is to be a DCOS that includes every feature needed to manage itself.
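To give a rough idea of what such a scheduled scaling check might do, here is a minimal Python sketch. It is not the GitHub project mentioned above; the function names, the load thresholds and the example hostname are all assumptions made up for illustration. The only real piece it leans on is Marathon's REST endpoint `PUT /v2/apps/{appId}`, which accepts a JSON body with an `instances` count. The sketch only builds the request, so sending it (and gathering the load figure in the first place) is left to whatever HTTP client and metrics source you use.

```python
import json
import math


def desired_instances(current, load_percent, target_percent,
                      min_instances=1, max_instances=20):
    """Pick an instance count that brings average load near the target.

    current        -- instances running right now
    load_percent   -- average load per instance, 0-100 (hypothetical metric)
    target_percent -- load per instance we would like to settle at
    """
    if current < 1 or target_percent <= 0:
        raise ValueError("need at least one instance and a positive target")
    total_load = current * load_percent
    wanted = math.ceil(total_load / target_percent)
    # Clamp so a bad metric cannot scale to zero or to infinity.
    return max(min_instances, min(max_instances, wanted))


def marathon_scale_request(marathon_url, app_id, instances):
    """Build (method, url, body) for Marathon's PUT /v2/apps/{appId}."""
    url = "{}/v2/apps/{}".format(marathon_url.rstrip("/"), app_id.strip("/"))
    return "PUT", url, json.dumps({"instances": instances})


if __name__ == "__main__":
    # Hypothetical numbers: 4 instances at 90% load, aiming for 60%.
    count = desired_instances(4, 90, 60)
    print(marathon_scale_request("http://marathon.example:8080", "/jack", count))
```

A real job would also want a cool-down period between changes so that it does not flap up and down on every run.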
Mantl has an upside which is also a downside (just like Mesosphere): they push new releases a lot. Both are still in development, so this really should be expected, though Mantl is worse because every component they use is in development. This means that in order to fix bugs, you may have to deploy a whole new environment or attempt to upgrade just one component (which may or may not go through properly, as I have found out). While they are VERY helpful in the chat and via issues, sometimes the answer is: blow it away and start anew. While I would hate that with other software, I don't mind it as much with Mantl. In fact, with Terraform, it's kinda like "I HAVE THE POWER", He-Man style. Mantl has support for DO deployments, so once I figured out the nuances of deploying on DO, I can change or update the Terraform configuration, then tell Terraform to destroy my environment. Terraform then uses the DO API to delete the running virtual machines and create new ones. Once that is complete, I can run the Mantl install script, which uses Ansible to deploy the environment and all the tools. I of course have to manually update the external DNS entries with the new IPs that DO assigns to the new machines, but that's something I can do while the script is running. Once the script is done running (assuming no errors), I can simply log into the web UI and see the environment. If something goes wrong, I simply have Terraform blow everything away and start again.
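The rebuild cycle described above can be sketched as a tiny Python wrapper. Treat it as an outline under assumptions rather than Mantl's actual tooling: `terraform destroy`, `terraform apply` and `ansible-playbook` are real commands, but the playbook name is a placeholder you have to supply, any extra arguments your Mantl version needs are not shown, and the helper names are invented here. The manual DNS update is deliberately not automated.

```python
import subprocess


def rebuild_commands(playbook):
    """Return the commands for one destroy-and-redeploy cycle, in order.

    playbook -- path to the Ansible playbook for your Mantl checkout
                (placeholder; the right file depends on the Mantl version).
    """
    return [
        ["terraform", "destroy"],        # delete the existing droplets
        ["terraform", "apply"],          # create fresh ones from the config
        ["ansible-playbook", playbook],  # install Mantl onto the new nodes
    ]


def run_rebuild(playbook, runner=subprocess.check_call):
    """Run each step, stopping at the first one that fails.

    Update the external DNS records by hand once 'terraform apply' has
    reported the new IPs; that step is not part of this sketch.
    """
    for command in rebuild_commands(playbook):
        runner(command)
```

Passing a different `runner` (for example `print`) gives a dry run that only shows what would be executed.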
For the developers, the new version of Marathon includes a spot where they can just paste in their JSON configurations to create new applications. They could also use the Marathon API if desired. This process of creation and destruction does cause some anxiety for them, as they basically have to load all their applications again, which, given the number they currently run (75), takes an hour or so. Luckily, the 1.06 Mantl release has been super stable and has much more recent builds of all the components. I ended up having to use CentOS as the OS of choice because the developers wanted to use Java, and Ubuntu doesn't support the latest version (1.8) yet. It also doesn't handle the install well if you try to force it from a third-party source. Mantl also seems to prefer CentOS, so it was a rather easy decision to switch from my usual Ubuntu installs to CentOS. It was a little bit of an adjustment for me, but I used to use Slackware back in the day, so pretty much anything is easier to deal with than Slackware, though Slackware is pretty secure by default.
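If pasting 75 definitions back in by hand gets old, the Marathon API route could look roughly like the Python sketch below. It assumes each application's JSON definition is saved as its own file in one directory, which is an assumption for illustration and not something Mantl sets up for you; the helper names and the example hostname are made up too. `POST /v2/apps` is Marathon's real endpoint for creating an application. The sketch only prepares the requests, so you can inspect them before sending them with the HTTP client of your choice.

```python
import json
from pathlib import Path


def load_app_definitions(directory):
    """Read every *.json file in the directory as one Marathon app definition."""
    apps = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as handle:
            apps.append(json.load(handle))
    return apps


def marathon_create_requests(marathon_url, apps):
    """Build one (method, url, body) tuple per app for POST /v2/apps."""
    url = marathon_url.rstrip("/") + "/v2/apps"
    return [("POST", url, json.dumps(app)) for app in apps]


if __name__ == "__main__":
    # A minimal, hypothetical app definition, just to show the shape.
    example = {"id": "/jack", "cmd": "sleep 3600",
               "cpus": 0.1, "mem": 64, "instances": 1}
    for request in marathon_create_requests("http://marathon.example:8080",
                                            [example]):
        print(request)
```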
The one downside I can find with Mantl is that there is no user management. There is one username and password for the environment. Mesosphere has enterprise-grade user management now, so they are officially ahead in this category. Mantl has a stated goal of doing something about this, but it's medium-high hanging fruit at the moment. However, they are steadily going through their list of things to do. I would prefer they get auto-scaling working...
I have been contributing as best I can to the Mantl project, which is why I am not publishing a guide here. I have contributed to their wiki guide when possible as well. They have a great guide for DO now, and I was able to help them fine-tune their bare metal guide since I originally deployed Mantl in my VMware Workstation environment. I also plan to try their VMware vSphere guide in my vSphere cluster. I was thinking of using Mantl to run the ELK stack for the company's log management system. Elasticsearch is really impressive, and the ELK stack integration takes advantage of all the features you would need to get this done (Logstash, Elasticsearch and Kibana; hence ELK stack).
On a different note, I find the Traefik project to be quite interesting. It integrates with Marathon directly, allowing you to add tags to applications that then tell Traefik how to handle external traffic to that application. You can specify that communication is SSL only, or allow both secure and plain connections. It also automatically adds new apps to your load-balanced domain. So say your application is named "Jack" and your domain is "awesomeness.something.com"; the new URL becomes "jack.awesomeness.something.com". Simply have users or your other applications point to that URL and Traefik will route the traffic to the correct internal component. End-to-end SSL now works as well. The only complication is that you must build your application to include the Mesos host's certificate to ensure Traefik sees it as valid. Luckily, Traefik is adding a feature that will allow you to ignore internal bad certificate errors. I spent some serious time working with the devs to get the end-to-end part working, but this coming feature will make everyone's lives easier. Yes, unverified certs are bad for security, but if you don't allow third parties in your environment, it's not that bad of an issue when you consider that you get end-to-end encrypted communications. And before I forget: after Mantl finishes the install, you will have to modify the default Traefik configuration to use SSL, so just save the certs and the configuration before you destroy a deployment, then restore them after.
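As a small illustration of the naming scheme and the tags, here is a Python sketch. The helper names are invented, and the label keys are the ones recalled from the Traefik 1.x Marathon backend documentation (`traefik.frontend.rule`, `traefik.frontend.entryPoints`, `traefik.protocol`), so treat them as an assumption and check them against the Traefik version your Mantl release ships. Nested Marathon ids such as "/group/app" are rejected on purpose, because how they map to hostnames depends on the Traefik configuration.

```python
def default_hostname(app_id, domain):
    """Return the hostname an app gets under the load-balanced domain."""
    name = app_id.strip("/").lower()
    if not name or "/" in name:
        raise ValueError("expected a flat Marathon app id, got %r" % app_id)
    return "{}.{}".format(name, domain)


def ssl_only_labels(app_id, domain):
    """Marathon labels asking Traefik for HTTPS-only, end-to-end SSL routing.

    Label keys are assumed from the Traefik 1.x docs; verify before use.
    """
    return {
        "traefik.frontend.rule": "Host:" + default_hostname(app_id, domain),
        "traefik.frontend.entryPoints": "https",  # accept HTTPS only at the edge
        "traefik.protocol": "https",              # talk HTTPS to the app itself
    }


if __name__ == "__main__":
    print(default_hostname("Jack", "awesomeness.something.com"))
    # prints: jack.awesomeness.something.com
```

These labels would go into the `labels` object of the application's Marathon JSON definition.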
There is one thing I wanted to cover here: using the web UI in DO. I had to figure this out on my own because I set up multiple edge nodes as well as multiple control servers (masters). You need to add your web domain to DO so that you can manage the DNS in DO. Once this is done, you can use the DNS manager to direct traffic. This is needed for web UI access and edge node access. To do this, create a CNAME entry for your domain. Let's say your domain is "something.com" and you want to stub everything off a subdomain called "awesomeness"; then the CNAME entry needs to be "*.awesomeness" and it needs to point to "awesomeness.something.com." (that is, "*.awesomeness" in the first box in DO and "awesomeness.something.com." in the second box, if that makes more sense). This ensures that any applications you create are accessible via "appname.awesomeness.something.com". The next step is to create an A record called "awesomeness" for each of your edge nodes (Traefik servers) to ensure you can reach your applications from the outside. The other thing I do is create an A record called "control" for each master/control node instance. This allows you to connect to the control nodes via a name instead of using direct IPs. Mantl uses Nginx to redirect calls to the web UI to the correct leader node, so no worries about that.
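The record layout above can be written down as data, which makes it easier to redo after every rebuild. This Python sketch just produces the list of records to enter; the function name is made up, the IPs come from the reserved documentation ranges, and nothing here talks to the DO API.

```python
def dns_records(domain, subdomain, edge_ips, control_ips):
    """Return (type, name, value) tuples for the DNS layout described above."""
    # Wildcard CNAME: anything.<subdomain> resolves like <subdomain>.<domain>.
    records = [("CNAME", "*." + subdomain, "{}.{}.".format(subdomain, domain))]
    # One A record per edge (Traefik) node, all under the same name.
    records += [("A", subdomain, ip) for ip in edge_ips]
    # One A record per control/master node, all named "control".
    records += [("A", "control", ip) for ip in control_ips]
    return records


if __name__ == "__main__":
    for record in dns_records("something.com", "awesomeness",
                              edge_ips=["192.0.2.10", "192.0.2.11"],
                              control_ips=["198.51.100.5"]):
        print(record)
```

Because several A records share one name, DNS answers rotate between the edge nodes, which gives a crude form of load spreading in front of Traefik.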
Of course none of these systems are 100%. They don't do everything I need, so I hope they get through their checklists sooner rather than later. Auto-scaling is important to the survival of microservices. I also think there should be infrastructure auto-scaling, or at least a notification that states "hey, you need more control nodes" or "hey, you need X to ensure I can auto-scale the applications". Of course auto-scaling needs rules as well, so that's probably a pretty big feature set. It would be nice to see built-in node health dashboards (Mesosphere might have this now). Application health dashboards too. I would love to see a Calico-type system that doesn't require an IP per application (there are two camps of thought on this and I am in the port camp; it's already a pain to manage IPs, not to mention auto-scaling them on a large scale. Millions of IPs just for security?). Calico is like a firewall and router that sits on each Mesos host, with access control managed by grouping applications at the Marathon level. In microservices it is used as a distributed firewall with exceedingly easy management: you don't have to touch every server, you just manage one config. I do hope your devs or the person managing it have the presence of mind to ask what SHOULD be accessing what; otherwise, just like with a normal firewall, things will get blocked that shouldn't be, which may of course break your application. Now that Mesosphere supports IP-per-app deployments, if you plan to use Calico, you will need to have a robust IP management platform in place. I also wish they had bare metal auto-scaling; this would mean you could present hardware nodes to the cluster and it would assimilate them as needed. OpenStack Ironic helps with the bare metal deployments, but are you going to set up an OpenStack deployment next to a Mesos deployment in a data center just to have this? You could also use Ubuntu's MAAS open source project to deploy new nodes in a more seamless way.
I think that covers everything. Again, I should note that everything stated above is as of this writing. Microservices are evolving at warp speed, and a year from now I doubt the landscape will look the same. As with other open source projects, I sincerely hope everyone stays up on their docs, because having to figure out little nuances is annoying... and I WILL create issues on GitHub!