Kubernetes and Magnum

JupyterHub can be deployed to a Kubernetes cluster provisioned using OpenStack Magnum.

The main steps are:

  1. Provision a new Kubernetes cluster via OpenStack Magnum

  2. Configure external access

  3. Install helm

  4. Install and configure JupyterHub

  5. Install and configure Dask, either standalone or integrated with JupyterHub

Provision new cluster

To provision a new cluster, first log into the Horizon interface on the IRIS cloud that you are configuring.

  1. Check that you have selected the correct project under which to create the cluster by selecting the project in the menu at the top left of the web page.

  2. Click on “Container Infra” to list the available cluster templates. While this will vary from system to system, generally it is best to select the most recent template.

  3. Click “Create Cluster” to start setting the cluster parameters.
    • Navigate through the options, ensuring that “Docker Volume Size” is set to at least 10 GB; the default size is too small for practical use.

    • In “Keypair”, select the keypair that you will use to ssh to the created cluster.

    • Create 1 master instance and at least one node instance.

    • Once all required parameters have been supplied, click “Submit” to begin cluster creation.

    • Cluster creation may take some time. Clusters can be viewed by navigating to “Container Infra->Clusters”; see also the command line sketch after this list.

    • Once creation is complete, the compute instances within the cluster can be listed by navigating to “Compute”->”Instances”.

    • The cluster master node can be identified by a name <..>-master-0 or similar.
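
If you have the OpenStack command line client with the Magnum plugin installed and your credentials sourced, cluster status can also be checked from a shell; a minimal sketch (the cluster name is whatever you chose above):

openstack coe cluster list
openstack coe cluster show <cluster name>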

To log into the master node via ssh, you will need to allocate an external floating IP address. Navigate to “Compute”->”Instances” in Horizon to find your master node. Under “Actions” select “Associate Floating IP”. Either select an existing unassociated floating IP address, or click “+” to allocate a new address.
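
The same can be done with the OpenStack command line client; a sketch, assuming the client is installed and noting that the external network name is site-specific:

# allocate a new floating IP from the external network
openstack floating ip create <external network name>
# attach it to the master node instance
openstack server add floating ip <cluster name>-master-0 <allocated floating IP>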

Firewall

On the Cumulus cloud, external floating IP addresses are automatically opened for ingress through the firewall. On the STFC cloud, you will need to contact system support to request that the firewall be opened for ingress on the IP address that you have just allocated to your master node. Similarly, when deploying kubernetes services via helm, there are differences in firewall behaviour between the clouds. On Cumulus, external floating IP addresses are automatically obtained for kubernetes services that require them. On the STFC cloud, externally accessible services must be configured with existing floating IP addresses that have already been opened for ingress.

Install helm

Helm is one of the most popular tools for managing applications on kubernetes clusters. Applications are configured via helm charts; usually these contain a large number of default values, and only a simple additional configuration file is needed to override any specific settings.
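
As a purely illustrative sketch (the chart, release and value names below are hypothetical), an override file might contain only:

# my-values.yaml: only the settings to change; everything else
# falls back to the chart's defaults
replicaCount: 2

and would be applied with:

helm install my-release example-repo/example-chart --values my-values.yaml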

ssh into your master node, using the keypair that you specified during cluster creation.

Confirm that kubernetes was correctly installed during the cluster setup:

kubectl --help

Install helm by following the instructions at https://zero-to-jupyterhub.readthedocs.io/en/latest/kubernetes/setup-helm.html (the simplest way is to execute 'curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash'). Then confirm that helm is correctly installed:

helm list

Create default persistent volume

Some applications, such as postgres, need a default StorageClass to be present in order to install successfully.

The setup will depend on the type of storage system used on your cloud; contact your system support for precise details.

For the Cumulus system, which uses cinder for storage, the following definition can be used:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/cinder
parameters:
  availability: nova

Save this into a file called storage.yaml, and run the following command:

kubectl apply -f storage.yaml

Run the following command to verify that a default storage class has been created. The output should be similar to what is shown below:

kubectl get storageclass
NAME                 PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
standard (default)   kubernetes.io/cinder   Delete          Immediate           false                  2d18h
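
To confirm that dynamic provisioning works end to end, a small test claim can optionally be created and then removed; a sketch, where the claim name is arbitrary:

# test-claim.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

kubectl apply -f test-claim.yaml
kubectl get pvc test-claim      # STATUS should become Bound
kubectl delete -f test-claim.yaml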

Install and configure JupyterHub

A PostgreSQL instance is used to give JupyterHub a persistent database. Execute the following to add the bitnami repository in order to access the postgresql charts:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

Execute the following to create a jhub namespace containing a postgresql instance:

helm install postgresql --namespace=jhub --create-namespace bitnami/postgresql --set persistence.size=1Gi

The installation will automatically create a postgresql password. This can be obtained by executing the following:

kubectl get secret --namespace jhub postgresql -o jsonpath="{.data.postgresql-password}" | base64 --decode
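
The connection command below reads the password from a POSTGRES_PASSWORD environment variable; it can be exported directly in the same shell, for example:

export POSTGRES_PASSWORD=$(kubectl get secret --namespace jhub postgresql -o jsonpath="{.data.postgresql-password}" | base64 --decode)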

Postgresql requires some initial setup before it can be used by JupyterHub. Connect to the running instance by executing the following:

kubectl run postgresql-client --rm --tty -i --restart='Never' --namespace jhub --image docker.io/bitnami/postgresql:11.11.0-debian-10-r0 --env="PGPASSWORD=$POSTGRES_PASSWORD" --command -- psql --host postgresql -U postgres -d postgres -p 5432

If POSTGRES_PASSWORD was not exported, psql will prompt for a password; copy and paste the password that was output from the “get secret” command above, and press enter. You will then be logged into the postgresql instance.

Once logged in, run the following commands to create the JupyterHub database in postgresql:

CREATE DATABASE jhub;
CREATE USER jhub;
GRANT ALL ON DATABASE jhub TO jhub;

Then exit from postgresql by entering quit (or \q) and pressing return.

To install JupyterHub, follow the instructions at https://zero-to-jupyterhub.readthedocs.io/en/latest/jupyterhub/index.html

Note that as we have already created the jhub namespace during the postgresql installation, the --create-namespace flag should not be provided during the JupyterHub installation.
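
For reference, with Helm 3 the installation then typically looks something like the following sketch (assuming the jhub namespace, a release name of jhub, chart version 0.9.0 and the config.yaml described below; check the linked instructions for the current chart repository and version):

helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
helm upgrade --install jhub jupyterhub/jupyterhub --namespace jhub --version=0.9.0 --values config.yaml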

JupyterHub provides many options for user environments and persistent storage. Here we will use our postgresql instance for persistence.

Note that JupyterHub is sensitive to differences between the version of the installed hub component and the version of the hub component used to build the docker image specified in the config.yaml file. The following config file was successfully used for installation with helm chart version=0.9.0 and a compatible version of the jupyter datascience-notebook image:

proxy:
  secretToken: <insert your value>
hub:
  db:
    type: postgresql
singleuser:
  image:
    name: jupyter/datascience-notebook
    tag: hub-1.1.0
  storage:
    type: postgresql
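
The secretToken value should be a random hex string; one way to generate it is:

openssl rand -hex 32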

Note that if you are installing on the STFC cloud, you will need to specify an existing floating IP address for the external load balancer, for example:

proxy:
  secretToken: <insert your value>
  service:
    loadBalancerIP: 130.246.212.235
hub:
  db:
    type: postgresql
singleuser:
  image:
    name: jupyter/datascience-notebook
    tag: hub-1.1.0
  storage:
    type: postgresql

kubectl can then be used to verify that JupyterHub is running and accessible (assuming that you specified jhub as the namespace during installation):

kubectl get service --namespace jhub
NAME           TYPE           CLUSTER-IP       EXTERNAL-IP       PORT(S)                      AGE
hub            ClusterIP      10.254.174.51    <none>            8081/TCP                     112s
proxy-api      ClusterIP      10.254.104.162   <none>            8001/TCP                     112s
proxy-public   LoadBalancer   10.254.102.185   128.232.227.148   443:31962/TCP,80:30774/TCP   112s

You should then be able to browse to the external address shown for the LoadBalancer, and start to create notebooks.

Authentication and authorisation

Note that these instructions will install JupyterHub without any authentication or authorisation; anyone with access to the load balancer IP address can create notebooks. Clearly this is undesirable.

A simple way to implement user control is to configure OAuth via github, and require that users have a github account and are members of a github organisation for which you have management privileges.

Follow the instructions at https://zero-to-jupyterhub.readthedocs.io/en/stable/administrator/authentication.html#github to set up a new OAuth app in github. The callback URL required in the OAuth app setup is based on the external IP address of your JupyterHub load balancer (see oauth_callback_url below).

Add the following to your JupyterHub config.yaml, where clientId, clientSecret and oauth_callback_url should match the values you have configured in the OAuth app in github.

You should include a list of github organisations; a user must be a member of at least one of them to be granted access:

auth:
  type: github
  github:
    clientId: ***
    clientSecret: ***
    oauth_callback_url: <your load balancer ip address>/hub/oauth_callback
    allowed_organizations:
      - <your github organisation 1>
      - <your github organisation 2 etc>
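
After updating config.yaml, the change can be applied by upgrading the existing release, for example (assuming the jhub release name, namespace and chart version used above):

helm upgrade jhub jupyterhub/jupyterhub --namespace jhub --version=0.9.0 --values config.yaml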

Dask

Dask provides a powerful framework allowing parallel processing and automated scaling accessed from a lightweight python notebook. dask-gateway provides a set of services that can be easily installed onto kubernetes using helm. dask-gateway can be installed either as a standalone service or tightly integrated with an instance of JupyterHub.

For standalone installation, follow the instructions at https://gateway.dask.org/install-kube.html#install-dask-gateway, based on the default config file at https://github.com/dask/dask-gateway/blob/master/resources/helm/dask-gateway/values.yaml. If you are installing on Cumulus, Dask will automatically try to acquire an external floating IP address for its load balancer. This can be listed using kubectl, e.g.:

kubectl get service --namespace dask-gateway
NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP       PORT(S)        AGE
api-dask-gateway       ClusterIP      10.254.61.172   <none>            8000/TCP       3h55m
traefik-dask-gateway   LoadBalancer   10.254.138.19   128.232.227.222   80:31454/TCP   3h55m

If you are installing on the STFC cloud, you will need to manually specify an existing floating ip address in your config file, e.g.

# Additional configuration for the traefik service
service:
    type: LoadBalancer
    annotations: {}
    spec: { externalIPs: [130.246.212.201] }

Your dask installation can be tested by creating a new Jupyter notebook in your JupyterHub instance and running a simple dask task from within it, pointing the gateway client at the external IP address shown for your dask load balancer. Note that dask is sensitive to mismatches in versions between libraries on the worker images and in the calling client. The client object returned by cluster.get_client() provides a get_versions method which checks for any potential mismatches.

The following code provides a simple test case for running in a notebook:

# install the client library (run once in the notebook)
!pip install dask_gateway

from dask_gateway import Gateway
import dask.array as da

try:
  # connect to the standalone gateway via its external address
  gateway = Gateway("http://<address of your dask load balancer external IP>")
  cluster = gateway.new_cluster()
  cluster.scale(5)
  client = cluster.get_client()
  # check that client and worker library versions match
  client.get_versions(check=True)
  print('created cluster, allocating random array')
  a = da.random.normal(size=(10000, 10000), chunks=(500, 500))
  print('starting calculation')
  print('mean {0}'.format(a.mean().compute()))
  print('shutting down cluster')
  cluster.shutdown()
  print('done')
except Exception as e:
  print(e)

Execute this, and via kubectl you should be able to see dask starting and stopping worker pods on demand, e.g.:

kubectl get pods --namespace dask-gateway
NAME                                                 READY   STATUS    RESTARTS   AGE
api-dask-gateway-86f78b7bf-8knfn                     1/1     Running   0          4h36m
controller-dask-gateway-775b47fffc-bmq77             1/1     Running   0          4h36m
dask-scheduler-a21b3dcd471c402ab3e53a8eac625a5e      1/1     Running   0          47s
dask-worker-a21b3dcd471c402ab3e53a8eac625a5e-4k8nb   1/1     Running   0          39s
dask-worker-a21b3dcd471c402ab3e53a8eac625a5e-4xnml   1/1     Running   0          39s
dask-worker-a21b3dcd471c402ab3e53a8eac625a5e-6pf5g   1/1     Running   0          39s
dask-worker-a21b3dcd471c402ab3e53a8eac625a5e-94z2w   0/1     Pending   0          39s
dask-worker-a21b3dcd471c402ab3e53a8eac625a5e-vm2sp   1/1     Running   0          39s
traefik-dask-gateway-754b78c8-fqcx6                  1/1     Running   0          4h3

Integrated dask and JupyterHub

Although an instance of dask in one namespace can be configured to use authentication from a JupyterHub instance in a separate namespace, if the intention is to provide access to dask purely via JupyterHub, dask and JupyterHub can more easily be installed into the same namespace and integrated using the daskhub charts. Dask is then only accessible via the JupyterHub instance in the same namespace, which makes it easier to manage.

Follow the instructions at https://docs.dask.org/en/latest/setup/kubernetes-helm.html#helm-install-dask-for-mulitple-users. This will provision a single namespace containing a dask instance, a JupyterHub instance, a shared external load balancer and access to dask only via the JupyterHub instance.

Follow the previous instructions to create a dhub namespace containing a postgresql instance, and set up the postgresql database.

Important note: as written, the instructions will install into the default namespace. This is inadvisable! When following the installation instructions, specify a namespace for your installation, e.g. to install into a dhub namespace:

helm upgrade --debug --wait --namespace dhub --create-namespace --install --render-subchart-notes  dhub dask/daskhub     --values=secrets.yaml
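
Note that the dask chart repository must have been added before running the command above; at the time of writing the Dask documentation uses the following repository:

helm repo add dask https://helm.dask.org/
helm repo update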

An example secrets.yaml file:

jupyterhub:
  proxy:
    secretToken: <token1>
  hub:
    services:
      dask-gateway:
        apiToken: <token2>
    db:
      type: postgresql
  singleuser:
    storage:
      type: postgresql

dask-gateway:
  gateway:
    auth:
      jupyterhub:
        apiToken: <token2>
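
The two tokens in secrets.yaml (<token1> and <token2>) can each be generated as random hex strings, for example:

openssl rand -hex 32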

Following installation, all the services for dask and jupyterhub should be visible in the namespace, including a single load balancer serving both JupyterHub and Dask:

$ kubectl --namespace=dhub get services
NAME                        TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
api-dhub-dask-gateway       ClusterIP      10.254.196.19    <none>           8000/TCP       46h
hub                         ClusterIP      10.254.247.105   <none>           8081/TCP       46h
proxy-api                   ClusterIP      10.254.133.141   <none>           8001/TCP       46h
proxy-public                LoadBalancer   10.254.172.75    128.232.224.75   80:32574/TCP   46h
traefik-dhub-dask-gateway   ClusterIP      10.254.122.252   <none>           80/TCP         46h

The code from the previous example can be used with a couple of small changes.

There is now no need to specify an address, as the notebook will default to using the dask instance in the same namespace, and we use a GatewayCluster object instead of Gateway:

# install the client library (run once in the notebook)
!pip install dask_gateway

from dask_gateway import GatewayCluster
import dask.array as da

try:
  # no address needed: defaults to the gateway in the same namespace
  cluster = GatewayCluster()
  cluster.scale(5)
  client = cluster.get_client()
  # check that client and worker library versions match
  client.get_versions(check=True)
  print('created cluster, allocating random array')
  a = da.random.normal(size=(10000, 10000), chunks=(500, 500))
  print('starting calculation')
  print('mean {0}'.format(a.mean().compute()))
  print('shutting down cluster')
  cluster.shutdown()
  print('done')
except Exception as e:
  print(e)