KubeFlex On-Prem Operations Guide

Updated by Walter Hanau

Basics

The essence of KubeFlex is powerful flexibility through simple operations. KubeFlex tackles complex, real-world operational problems by breaking operational needs down into smaller, simpler parts of a larger puzzle. By understanding those parts, you can unlock the power to solve larger, more complex problems.

This document, in no particular order, breaks down these important smaller parts of KubeFlex:

  • Boot Logic discusses how KubeFlex goes about booting machines
  • Assignment Logic addresses how KubeFlex attaches meaning to machines, such as roles and duties
  • Runtime OS touches on special OS requirements and how KubeFlex runs your machines
  • Enterprise explains advanced features that integrate KubeFlex into your enterprise

Terminology

These are some terms that are important to know, as they have specific meanings in KubeFlex:

  • attribute: KubeFlex metadata assigned to machines by rules
  • Boot Proxy: the on-prem component of KubeFlex
  • cluster: any Kubernetes cluster that KubeFlex is managing
  • discovery: a temporary KubeFlex boot target that gathers facts about machines
  • DHCP: a standard network protocol commonly used to assign IP addresses
  • facts: ground truth information about machines
  • machine: the base hardware that KubeFlex will be managing (eg: server, VM, computer, NUC, etc)
  • node: a machine that has successfully joined a Kubernetes cluster
  • proxyDHCP: a standard network protocol intended to allow other services to extend DHCP without changes to existing DHCP services
  • PXE: a standard network boot technology developed by Intel
  • rule: user-controlled JavaScript logic that assigns attributes to machines

KubeFlex Deployment Architecture

In terms of architecture, KubeFlex has several components:

  1. Control Plane UI and Hub CLI that are used to configure and manage KubeFlex deployments. Control Plane UI can be deployed in the cloud or on-premises.
  2. KubeFlex Prime, which is a management cluster responsible for deploying and upgrading KubeFlex clusters for user workloads. KubeFlex Prime controls all clusters in a single on-prem data center or regional datacenter for edge deployments.
  3. KubeFlex Kubernetes clusters are based on upstream Kubernetes and are managed using the kubeadm tool.
  4. KubeFlex Prime provides essential platform services to multiple on-prem Kubernetes clusters: DNS, PXE booting, execution of automation tasks such as Terraform and Ansible scripts, private Docker registry, state store for Terraform, OIDC authentication/authorization, encrypted password storage, and other services as may be required by on-prem Kubernetes clusters. KubeFlex Prime allows dynamic provisioning of infrastructure in on-prem data centers, without dependency on cloud-based infrastructure such as DNS and cloud storage.
  5. One KubeFlex Prime cluster can manage hundreds of Kubernetes clusters in the data center or edge locations. For a larger number of managed clusters, a tiered deployment architecture is recommended with multiple regional KubeFlex Prime clusters.

Boot Logic

In KubeFlex, the high-level lifecycle of a machine can be reduced to:

  1. Contact Boot Proxy for PXE boot information
  2. Boot into the special KubeFlex Discovery mode, where we learn about your machine and reboot
  3. Contact Boot Proxy again for PXE boot information
  4. Boot into the Runtime OS and query KubeFlex via Boot Proxy for assigned duties
  5. Transform the machine into a Kubernetes cluster node of your specification
  6. Later, reboot and do it all over again

So, as you may have noticed, booting is a critical part of the KubeFlex experience. In particular, KubeFlex makes heavy use of "PXE booting", which is a special way of booting a machine over one of its network connections.

PXE booting works by sending extra information about network booting to a network interface over DHCP, the same protocol used for assigning IP addresses. Once the interface has an address, it will then know to contact other parts of the network over various protocols to get what it needs until it is up and running. This can be problematic for several reasons:

  1. First, modifying DHCP services in established environments is usually not allowed, because it is easy to unknowingly destabilize existing systems. For that reason, KubeFlex makes use of the lesser-known "proxyDHCP" protocol, which is specifically intended for supplying extra DHCP information, such as what we need for PXE boot, to DHCP clients without touching existing DHCP services. It essentially listens for the normal DHCP conversation and safely adds its extra information to it.
  2. Next, some of the protocols involved operate at such a low level that they do not work over the internet. That is just as well, because those protocols are also considered insecure and not safe for the internet anyway. For those reasons, Agile Stacks has developed the "Boot Proxy" for KubeFlex, which acts as an extension of KubeFlex into your specific local environment. Boot Proxy handles all of the low-level and insecure communication with machines on behalf of KubeFlex, creating a reliable and secure system.
  3. Finally, PXE booting can introduce problems of machine identity. In the PXE process, all that is reliably known about a machine is the MAC address of the interface that is currently booting. Frequently, the MAC of a network interface is not a definitive identity for a machine: what if a NIC was replaced? What if the machine has more than one MAC? It can get complicated fast. For that reason, KubeFlex performs a machine discovery step during boot to establish identities based on what a machine actually is at a deeper level.

KubeFlex Boot Proxy

The KubeFlex Boot Proxy is software that Agile Stacks developed to act as an extension of KubeFlex in your local environment. It enables KubeFlex to interact with the low-level protocols used by the PXE boot process. In addition, it establishes a secure control connection back to the Agile Stacks Control Plane so that your on-prem machines and clusters can be easily managed.

The Boot Proxy is distributed as a container image on Docker Hub.

It can be run manually from the command line with something like the following:

docker run -d \
--env=PUBLIC_IP=$PUBLIC_IP \
--env=BP_CONFIG=${config} \
--env=BP_LISTEN_ADDR=$LISTEN_ADDR \
--net=host \
--mount=source=bprx,target=/var/lib/bootproxy \
--name=$RUNNAME \
--restart on-failure \
agilestacks/bootproxy

However, this requires a special variable to be defined for BP_CONFIG, which is a cryptographic secret provided dynamically by the Agile Stacks Control Plane. So, Boot Proxy should actually be run by a provided script called get_launch_script.sh, which connects securely to the Agile Stacks Control Plane and downloads the information required to launch the Boot Proxy. It has the following help text:

get_launch_script.sh  --- A tool to download a bootproxy launch script from the command line

Before running this script, you must run 'hub login' and then
'hub api environment get --service-account-login-token'
This will provide a login token which can be used by the script to fetch assets

Usage: ./get_launch_script.sh -u <API_URL> -l <LOGIN_TOKEN> -p <BP_UID>
-u Full API Url
-l Login Token (from 'hub api environment get --service-account-login-token')
-p Boot Proxy UID

The flags needed by the script are:

  • -u <API_URL> this is the URL of your specific Agile Stacks Control Plane; it will be provided by Agile Stacks.
  • -l <LOGIN_TOKEN> this is a cryptographic secret that allows the Boot Proxy to talk to the Agile Stacks Control Plane as a fully authenticated user associated with specific Cloud Account and Environment in the Control Plane.
  • -p <BP_UID> this is a unique identifier for this Boot Proxy that is created when a Boot Proxy is linked to a Cloud Account in the Agile Stacks Control Plane. Simply copy the value and use it here.
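
Putting the flags together, a typical invocation might look like the following (the URL is the example Control Plane endpoint used later in this document; the token and UID values are placeholders):

./get_launch_script.sh \
-u https://api.app.superhub.io \
-l "$LOGIN_TOKEN" \
-p "$BP_UID"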

The login token can be acquired using the following steps, which download and run the Agile Stacks hub CLI tool as a Docker container:

# your Control Plane user name:
API_USER='user@agilestacks.com'
# the same Control Plane API URL from above:
API_URL='https://api.app.superhub.io'

docker run -ti --rm -e API_USER=${API_USER} -e API_URL=${API_URL} --entrypoint /bin/bash agilestacks/toolbox:stable -c \
'echo -n "Password: " && read -s PASS && echo && \
$(hub login --api "${API_URL}" -u "${API_USER}" -p "${PASS}" | grep -v "#") && \
hub api environment get --service-account-login-token'
The 'hub api environment get' command will print information about all environments, so be sure to choose the 'loginToken' for the correct user and environment.

Once Boot Proxy successfully connects to the Agile Stacks Control Plane, it will begin to download the machine images that it will serve to your machines. Once the images have been downloaded, it will be fully initialized and ready to begin booting machines and building clusters.

If Boot Proxy encounters an error at any point in the process it can be safely restarted continuously until the situation is resolved. All state about booting machines and your clusters is kept within the Agile Stacks Control Plane so there will be no loss of information.

Machine Boot States

KubeFlex tracks the state of every machine because PXE booting can otherwise be an unreliable process. By tracking machine states, KubeFlex has the opportunity to manage the process much better than it otherwise could. A given machine may be in any of the following states on its way to being fully booted:

  • new the machine exists only logically and has yet to be reported by any Boot Proxy
  • boot-1 the machine is attempting to PXE boot the KubeFlex discovery process
  • discover the machine is now running the KubeFlex discovery process
  • post-discover the machine has been discovered by KubeFlex and assigned a role in a Kubernetes cluster
  • boot-2 the machine is attempting to PXE boot the runtime OS
  • post-boot-2 the machine is now running the OS and cloud-init has been seeded
  • join-cluster the machine is performing its machine-specific preparations to join its cluster as directed by KubeFlex
  • ready the machine has successfully joined its cluster and is now an available Kubernetes "node"

Assignment Logic

A KubeFlex user can expect to boot a collection of heterogeneous machines, with specialized storage nodes, GPU nodes, etc., and get a proper working Kubernetes cluster, as easy as that. But how is that possible? The answer is that Agile Stacks has integrated into KubeFlex a powerful system of logic that handles this. In KubeFlex, the power to wrangle complicated assignments of roles, resources, and special configuration to machines based on the needs of the moment is baked in. The KubeFlex assignment logic supports scenarios from the very simple, such as "if it has a GPU then it is for compute", to more sophisticated ones that involve cluster state or loading a special hardware driver.

In general, the assignment for a given machine is the result of a set of rules that evaluate a combination of machine facts and state to resolve a set of attributes, which determine what the machine will become when applied by KubeFlex. Rules are fragments of code, totally controlled by the user, that allow for powerful comparisons of state and machine facts. Here, state is KubeFlex's internal understanding of the machine and of the cluster to which it is assigned, if any. Machine facts are comprehensive ground truth information about a particular machine, collected during the discovery boot process.

Rules

Rules are essentially standard JavaScript code intended to perform arbitrary comparisons of your logic against the running state of the cluster and machine facts. The JavaScript code is safely executed in a sandbox environment and only granted access to certain state variables and functions. The goal of a rule is to resolve to either true or false. When the rule resolves to true, the rule's attributes are applied; when the rule resolves to false, they are not.

In addition to standard JavaScript, Agile Stacks provides the following objects to the rule sandbox for use in your rules:

  • candidate is an object containing information about the machine that triggered evaluation of rules. Additionally, Agile Stacks provides some convenience shortcuts to common information that would usually have to be calculated like the number of disks and gpus.
    • role holds the machine's currently assigned role in the cluster (eg: master, worker)
    • state holds the current booting state of the machine
    • facts holds the most recent information about the machine's hardware collected from discovery
    • disks is a shortcut that provides the total number of storage devices in the machine
    • gpus is a shortcut that provides the total number of GPU devices in the machine
  • members is a collection of candidate objects for all known machines in the cluster
  • quorumCount is a reference to the number of masters the cluster requires for quorum
Agile Stacks is pleased to expand the collection of objects available to rules. Let us know your use case!
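
For illustration, here is a minimal rule body built only from the sandbox objects described above. It resolves to true, and so would apply whatever attributes are configured with it (for example role->worker), only when the machine reports at least one GPU and the cluster already has enough masters for quorum. This is a sketch, not a rule shipped with KubeFlex:

// hypothetical rule: apply this rule's attributes only when the machine has a
// GPU and other members already satisfy the master quorum
candidate.gpus > 0 &&
  members.filter(m => m.role.startsWith("master")).length >= quorumCount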

Discovery

The discovery boot process is a special PXE boot target that KubeFlex uses to collect ground truth information about a machine. The discovery process consists of a static discovery boot image based on CentOS and a simple application that enumerates hardware in the machine and reports it back to KubeFlex. The hardware reports sent back to KubeFlex are known as "facts" and include information such as the make, model, and quantity of CPUs, RAM, hard drives, etc. Once facts are transmitted successfully, KubeFlex will make a number of decisions about what to do with the machine and then allow it to reboot. Many of those decisions are the machine assignment process that you are reading about now. In the event of an error, or if the cluster to which the machine is assigned is not ready, the machine will be held in a "holding pattern" during discovery until the situation resolves, at which point it will be allowed to reboot.

Facts

The machine facts collected by the discovery process are too numerous to list here; we may provide a comprehensive reference elsewhere in the near future. In the meantime, here are some common facts used when writing rules:

  • facts.processors.count is the number of CPUs
  • facts.processor0 will give the full name of the processor
  • facts.lspci contains a full enumeration of the PCI device tree
  • facts.lspci["VGA compatible controller"] lists, in this case, all "VGA compatible controller" devices

Attributes

Attributes are metadata that exist in the KubeFlex system and are used to apply, extend, or enhance the capabilities of booting machines. They consist of a type and a value, each encoded as a string. The meaning of a type is defined elsewhere in the KubeFlex system and could be anything; the current types are listed below. The meaning of the value is defined entirely by the type.

The following attribute types are available:

  • role is the KubeFlex role assigned to the machine which directly relates to the general Kubernetes role
    • Holds a single value consisting of one of the following: "master-prime", "master", "worker"
    • Subsequent uses overwrite the previous value
    • Only applicable when a machine is considered "new", not relevant on a reboot
  • label provides a way to add a Kubernetes label to the node upon joining a cluster
    • May be any valid label
    • Subsequent uses append to the previous value
    • Only applied when a machine is considered "new" because it is applied when the node joins its cluster
  • taint provides a way to add a Kubernetes taint to the node upon joining a cluster
    • May be any valid taint
    • Subsequent uses append to the previous value
    • Only applied when a machine is considered "new" because it is applied when the node joins its cluster
  • include allows the inclusion of additional cloud-init templates during OS boot
    • May be any string
    • Subsequent uses append to the previous value
    • Applicable on every boot
  • variable sets an arbitrary value in the machine's metadata that can optionally be referenced by other parts of KubeFlex
    • The value takes the form var_name=var_value
    • Subsequent uses overwrite previous var_value for a given var_name
    • Made available as machine metadata within status.attributes.var.var_name
    • Applicable on every boot
  • karg appends extra arguments to the runtime OS kernel boot command
    • May be any string
    • Subsequent uses append to the previous value
    • Applicable on every boot

Assignment Logic Example

Here is a simple example of a set of rules that will create a cluster where all nodes are accessible by ssh, but any machine that has more than one hard drive will not be allowed to become a master. Meanwhile, an extra kernel argument is defined but disabled. These rules are evaluated sequentially, in descending order, every time new facts are received.

  1. Attribute: role->worker
     Rule: true
     Description: All nodes can always be normal workers.

  2. Attribute: role->master
     Rule: members.filter(m => m.role.startsWith("master")).length < quorumCount
     Description: Any node may become a master if quorum has not been reached.

  3. Attribute: role->master-prime
     Rule: members.filter(m => m.role.startsWith("master")).length < 1
     Description: The first node assigned as master will also create the cluster.

  4. Attribute: role->worker
     Rule: candidate.disks.length > 1
     Description: Actually, if the machine has more than 1 disk, it must be a worker. This rule takes precedence over the earlier rules. It allows multi-disk machines to perhaps become storage nodes instead of a dedicated master.

  5. Attribute: include->debug
     Rule: true
     Description: Include the debug configuration to enable ssh in the runtime OS for debug purposes.

  6. Attribute: karg->console=ttyS0,115200
     Rule: false
     Description: Echo console messages to the serial port, but disable it globally for now.

Runtime OS

KubeFlex is based on the idea that machines managed by it will be running a lightweight OS that is effectively ephemeral because it runs from RAM, "just enough OS for Kubernetes". This means that when a machine is rebooted, all system state is lost, and the machine effectively performs first-boot on every boot. All of the uniqueness a system might have or need is provided by cloud-init user-data that is generated for the machine by KubeFlex on each boot. This is great because if a node becomes unhealthy it can just be rebooted. In other words, KubeFlex treats servers it manages "as cattle".

That said, KubeFlex does not always keep all of the OS in RAM, because that would require too much space. Instead, our system offloads some parts of the OS to local storage that we assume is dedicated to the system. This local storage will be automatically wiped and formatted for any system that KubeFlex initializes as a "new" machine. Once the disk is ready, certain directories are moved to it and their contents are removed from RAM. The directories moved include those that consume the most space, including the directory that Docker uses to store its images and containers. Any container using local volumes will have its data persisted to the local disk; crucially, this includes containers such as etcd. When a machine is rebooted but has not been marked as "new" by KubeFlex, the existing disk will not be wiped, but directories will still be moved, resulting in a fresh system that still contains any of the previous data that is not part of the boot image.

KubeFlex does not actually care much about which OS it uses because the OS is merely providing hardware abstraction and task management. KubeFlex only cares that the OS provides:

  • Linux kernel
  • live boot capabilities
  • cloud-init
  • systemd

All KubeFlex functionality is built on top of those standard components. At present the default OS is a custom live build of Ubuntu 16.04, but it could easily be CentOS, Red Hat, or whatever you like, as long as the components above work as we expect them to.

The default KubeFlex Ubuntu image is a custom build that enables it to operate in a live mode while providing a different mix of default tools and optimizations. The scope of the changes is not that significant, but it has allowed us to produce an image that is only about 300MB, with a RAM-resident footprint of about the same size, in exchange for a fully supported and well-known OS.

Our Ubuntu image can be easily verified and built by anyone, even from inside a container, and any organization that wishes to provide their own image based on ours may absolutely do so; just contact Agile Stacks.

Boot Modifications

Agile Stacks provides a system for modifying the OS at boot time through the use of cloud-init template fragments. These fragments are standard cloud-init user-data YAML that contain any additional cloud-init commands that are desired during boot. Fragments may be included in the boot sequence through the use of the rules system and the "include" attribute.

Self-service management of cloud-init template fragments is not yet supported, so they must be provided to Agile Stacks for inclusion before use.

While most cloud-init modules may be used and any desired behavior may be achieved, here are some example use cases where template fragments are useful:

  • write_files can be used to create or overwrite configuration files. It runs at a very early point in the boot process, so the files it creates can usually be referenced by other modules.
  • packages can be used to load additional OS packages (ie: apt-get install) that might be needed to provide special kernel support or extra non-Kubernetes services.
  • apt can be used to add external repository information, for example when loading out-of-tree drivers
  • runcmd runs arbitrary shell commands and therefore is very flexible. Runs near the end of the boot process.

Template fragments can control how they interact with each other by defining their merge characteristics. Generally we want to know whether the templates replace or append to overlapping values that have already been loaded from previous YAML. By default we try to append; however, this is configurable. See the cloud-init documentation about merging for more details. User-data include files should begin with the following block in order to guarantee that their contents are appended together:

#cloud-config
merge_how:
  - name: list
    settings: [append]
  - name: dict
    settings: [no_replace, recurse_list]
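
For illustration, a complete hypothetical fragment that adds an ssh public key for debugging and leaves a marker file near the end of boot might look like this (the key, file path, and purpose are examples only, not a fragment shipped with KubeFlex):

#cloud-config
merge_how:
  - name: list
    settings: [append]
  - name: dict
    settings: [no_replace, recurse_list]
# authorize an extra ssh public key for the default user (debug access)
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... ops@example.com
# leave a marker showing the fragment was applied; runcmd runs late in boot
runcmd:
  - [ sh, -c, 'echo "debug fragment applied" > /var/log/debug-fragment.log' ]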

Build Modifications

The Agile Stacks Ubuntu boot image is a live Linux distribution intended for network boot operation. It is created using the same process as the official "Ubuntu Core" product image with some modifications.

Our build tries to make changes only to the Debian live-build system, which is responsible for assembling the final images. The live-build system is configured primarily by files placed in well-defined directories that have implicit meaning. A full understanding requires a close reading of the live-build documentation, but most needs are covered by the following tips:

  • Drop any files that you want copied verbatim to the OS image in config/includes.chroot/ which will be deep copied to the file system root.
  • Add or delete the names of OS packages to/from text files suffixed with .list.chroot_install in the config/package-lists/ directory.
  • Extra processing of the image may be done by scripts in the config/hooks directory.
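
A minimal sketch of those conventions (the file and package names below are illustrative, not part of the Agile Stacks build):

# copy a file verbatim into the root filesystem of the image
mkdir -p config/includes.chroot/etc/sysctl.d
cp 99-custom.conf config/includes.chroot/etc/sysctl.d/

# add an extra OS package to the image
echo "htop" >> config/package-lists/extras.list.chroot_install

# run extra processing inside the image during the build
cp 050-custom-tweaks.hook.chroot config/hooks/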

Enterprise

Agile Stacks knows that your on-prem Kubernetes clusters will eventually need to integrate with the rest of your enterprise network, and that comes with its own set of challenges. That's why we've developed a set of additional tools that help you seamlessly integrate KubeFlex into your existing enterprise infrastructure.

Advanced DNS

KubeFlex supports a number of DNS providers out of the box by virtue of integrating with the External DNS project. External DNS provides a way to refer to Kubernetes services by a globally scoped DNS name that is programmed into one of a number of external DNS providers, such as Route 53.

Basic External-DNS Configuration

By default, the KubeFlex deployment uses an external-dns configuration which takes two steps to configure in order to provide dynamic DNS updates to your DNS infrastructure. It uses the protocol defined in RFC2136 to dynamically send updates to your DNS server. The inputs required for external-dns are as follows:

  1. The Zone - This is the actual DNS zone that external-dns will be updating within your system.
  2. Host Name - This is the DNS server that external-dns will be contacting with updates.
  3. Port - The port on which the DNS server is listening for updates. This is usually port 53.
  4. TSIG Secret - This is a pre-shared key as defined in RFC2136. On Unix systems, it is often generated using the dnssec-keygen tool. If you're not sure what this is, we recommend reading the documentation for your specific DNS server; BIND users can find a quick tutorial in the BIND documentation.
  5. TSIG Key Name - This is the name of the handle to the TSIG key above.
  6. TSIG Secret Algorithm - The valid options for this are:
  • hmac-sha1
  • hmac-sha224
  • hmac-sha256 (default)
  • hmac-sha384
  • hmac-sha512

If this is something other than hmac-sha256, you'll need to specify that.

To get started, consult your DNS server's documentation to create a new TSIG secret key (or use an existing one).
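
For example, on a host with the BIND utilities installed, a key could be generated with something like the following (a sketch; the key name is an example, and the tool differs between BIND releases):

# newer BIND releases (9.13+): prints a ready-to-use key block with the secret
tsig-keygen -a hmac-sha256 externaldns-key

# older BIND releases: writes K<name>.+*.key/.private files containing the secret
dnssec-keygen -a HMAC-SHA256 -b 256 -n HOST externaldns-key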

Once you've done that, you must use either the SuperHub UI or the hub CLI tool to create a new secret in your Environment. To add one in the UI, navigate to Cloud -> Environments -> List, then click the environment you care about. Then click the Add button next to the Secrets heading.

In the popup that appears, select Token as the Secret Type, then enter this value for the name:

components.external-dns.rfc2136.tsig.secret

Paste the value into the Token field, and click Add.

After that, you will need to create additional parameters to describe the TSIG key.

component.external-dns.rfc2136.tsig.keyname
component.external-dns.rfc2136.tsig.secret-alg (optional)
component.external-dns.rfc2136.host
component.external-dns.rfc2136.port
component.external-dns.rfc2136.zone
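
For instance, with a hypothetical DNS server at 10.0.0.53 serving the zone k8s.example.corp and a key named externaldns-key, the parameters would be filled in roughly as follows (illustrative values only):

component.external-dns.rfc2136.tsig.keyname: externaldns-key
component.external-dns.rfc2136.tsig.secret-alg: hmac-sha512
component.external-dns.rfc2136.host: 10.0.0.53
component.external-dns.rfc2136.port: 53
component.external-dns.rfc2136.zone: k8s.example.corp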

Then press "Save" at the bottom.

Important: You must click the "save" button at the bottom of the environment screen to persist your changes.

The last part is to ensure that the "External DNS" button is selected when creating your KubeFlex cluster in the UI.

Split Horizon DNS

While External DNS is a great solution for a common problem, many enterprises employ a scheme known as "split-horizon" DNS for assigning addresses to services, where any single service might actually exist at any number of IP addresses depending on how you are accessing it.

For example, if you ask for a service at k8s-service.internal.corp from the corporate network you will get one address, but if you ask for the same service from the VPN, you could get a totally different address because of the split-horizon configuration. External DNS does not provide a way to update the DNS system differently for the same service, so Agile Stacks provides a tool called "Deep Horizon", which can update RFC2136-compliant DNS servers (such as BIND) in a split-horizon-compatible way.

Deep Horizon operates the same way as External DNS but complements it by providing a method for configuring multiple views (ie "horizons") and multiple DNS servers. Additionally, it provides a way to "map" a given IP to a totally different IP subnet. These features combine to provide powerful and expressive capabilities to Kubernetes for your split horizon configuration.

Deep Horizon is a Kubernetes Operator and is configured as a simple YAML document that is saved as a ConfigMap. Here is an example Deep Horizon configuration for two views, a "k8s_native" view and a "corp_intranet" view. The idea here is that Deep Horizon will be configuring the same DNS server for zone "k8s.agilestacks.vdc" with two different addresses for the same Kubernetes service in a cluster named "cluster1".

zone: k8s.agilestacks.vdc
cluster: cluster1
# whether or not the default view should be updated (eg: false if paired with external-dns)
updateDefaultView: true || false
# configure all servers that should be updated/monitored
servers:
  - addr: 192.168.123.99
    port: 53
# configure shared secrets for the "horizons" that will be managed
views:
  - keyname: k8s_native
    key: 8YYclPBY4CnV/SlG4OZSSMrkR5KokkvpNbbJjhIV/JemLm7J2FOcziQawHt65KUj8S2AWtOW7KWmrpBGfswWrg==
    hmac: hmac-sha512
  - keyname: corp_intranet
    key: nvKJxasg7hi40jijuqbywMwPz6JpLzTbo0VbQdPlyUWesfkhujsjBwW3jCe9LVTQk5ReEwiQil5NC4AXX2LUEg==
    hmac: hmac-sha512
# optional additional updates to make based on simple matching rules
# <k8s native IP>, <mapping #1>, <mapping #2>, ...
poolMap:
  - [ '192.168.123.201', '10.23.45.6' ]
  - [ '192.168.123.202', '10.23.45.16' ]

Agile Stacks is able to integrate Deep Horizon for your environment, so please reach out if you think you might need it.

Like what you see? Request a demo today!

