MLRun Serving Pipelines

Real-time Serving Pipelines and Model Monitoring with MLRun and Nuclio

MLRun serving graphs allow you to build, test, deploy, and monitor real-time data processing and advanced model serving pipelines with minimal effort. MLRun Serving is built on top of the real-time serverless framework Nuclio, and is API compatible with KFServing v2. MLRun’s serving functions can be deployed automatically using the CLI, the SDK, or Kubeflow Pipelines (KFP) operations.

With MLRun Serving you compose a graph of steps built from pre-defined graph blocks or native Python classes/functions. A graph can include data processing steps, model ensembles, model servers, post-processing, etc. (see example). MLRun Serving also supports complex and distributed graphs (see example) that may involve streaming, data/document/image processing, NLP, model monitoring, and more.
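
For illustration, here is a minimal sketch of composing such a flow graph with the SDK; the step names, handler functions, file name, and model path are hypothetical:

import mlrun

# create a serving function from code (pipeline.py would hold the step handlers/classes)
fn = mlrun.code_to_function("my-pipeline", kind="serving",
                            filename="pipeline.py", image="mlrun/mlrun")

# build a flow graph: pre-processing -> model inference -> post-processing
graph = fn.set_topology("flow", engine="async")
graph.to(name="preprocess", handler="preprocess") \
     .to(class_name="ClassifierModel", name="infer", model_path="<model-url>") \
     .to(name="postprocess", handler="postprocess").respond()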

MLRun is natively integrated with Kubeflow and Kubeflow Pipelines; MLRun function objects can be deployed, tested, and executed through Kubeflow (see the example below).

Accelerate performance and time to production

MLRun’s underlying serverless engine (Nuclio) uses a high-performance parallel-processing engine that maximizes the utilization of CPUs and GPUs.

MLRun Serving provides native model monitoring, including automatic drift detection and custom metrics; models can be tracked via the Grafana plug-in or in the MLRun UI (see details).
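
As a minimal sketch, monitoring can be turned on for a serving function before it is deployed (assuming model monitoring is enabled in the cluster; the model path is a placeholder):

import mlrun

fn = mlrun.import_function("hub://v2_model_server")
fn.add_model("model1", model_path="<model1-url>")

# enable model monitoring/tracking for the deployed model endpoints
fn.set_tracking()
fn.deploy()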

Serving pipelines can be tested locally or in a notebook and deployed across one or more managed serverless functions with a single command (deploy()). The deployed functions are fully managed, with logging, monitoring, auto-scaling, security, etc., which eliminates deployment overhead, improves performance and scalability, and accelerates time to production.

MLRun serving is natively integrated with the MLRun Online Feature Store, which can be used to generate and/or enrich real-time feature vectors, and to write production features back for later analysis and re-training.
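
For example, here is a minimal sketch of looking up an online feature vector to enrich a request; the feature-vector URI and the entity key are hypothetical:

import mlrun.feature_store as fstore

# open an online feature service for a stored feature vector
svc = fstore.get_online_feature_service("store://feature-vectors/default/my-vector")
try:
    # fetch real-time features for the incoming entity
    features = svc.get([{"customer_id": "42"}])
finally:
    svc.close()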

MLRun allows developers to focus on code and deploy faster by supporting:

  • 13 protocols and invocation methods (HTTP, Cron, Kafka, Kinesis, etc.; see the trigger sketch after this list),
  • Dynamic auto-scaling for HTTP and streaming,
  • Optimal resource management for CPUs and GPUs,
  • Full life-cycle management, including auto-generation of micro-services, APIs, load balancing, logging, monitoring, and configuration management.
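
As a sketch, non-HTTP triggers can be attached to a serving function before it is deployed; the cron interval and worker count below are illustrative assumptions:

import nuclio
import mlrun

fn = mlrun.import_function("hub://v2_model_server")
fn.add_model("model1", model_path="<model1-url>")

# add a cron trigger alongside the default HTTP trigger
fn.add_trigger("periodic", nuclio.triggers.CronTrigger(interval="1h"))

# tune the HTTP trigger (number of workers per replica)
fn.with_http(workers=2)
fn.deploy()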

Examples

Loading a library serving function, adding models to it, testing the pipeline, deploying to the cluster, and testing the live endpoint:

import mlrun

# load the sklearn model serving function and add models to it
# (replace the {model1-url}/{model2-url} placeholders with the model artifact paths/URLs)
fn = mlrun.import_function('hub://v2_model_server')
fn.add_model("model1", model_path={model1-url})
fn.add_model("model2", model_path={model2-url})

# test the serving pipeline using the graph simulator
server = fn.to_mock_server()
result = server.test("/v2/models/model1/infer", {"inputs": x})

# deploy the function to the cluster
fn.deploy()

# test the live model endpoint
fn.invoke('/v2/models/model1/infer', body={"inputs": [5]})

Building your own serving class:

MLRun Model Serving classes look and behave like KFServing classes, but are faster, support advanced graphs and capabilities, and eliminate all the deployment overhead.

from cloudpickle import load
import numpy as np
import mlrun

class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self):
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model('.pkl')
        self.model = load(open(model_file, 'rb'))

    def predict(self, body: dict) -> list:
        """Generate model predictions from sample"""
        feats = np.asarray(body['inputs'])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()
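
The class can then be packaged into a serving function, tested locally with the mock server, and deployed; a minimal sketch, assuming the class is saved in model_server.py and the model path is a placeholder:

import mlrun

# package the class above as a serverless serving function
fn = mlrun.code_to_function("my-serving", kind="serving",
                            filename="model_server.py", image="mlrun/mlrun")
fn.add_model("mymodel", class_name="ClassifierModel", model_path="<model-url>")

# test locally with the mock server, then deploy to the cluster
server = fn.to_mock_server()
print(server.test("/v2/models/mymodel/infer", {"inputs": [[5.1, 3.5, 1.4, 0.2]]}))
fn.deploy()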

Deploy and Test Model Serving using Kubeflow Pipelines:

The following Kubeflow pipeline uses MLRun serverless functions from the MLRun marketplace and executes a simple flow of training, serving deployment, and serving testing (see the full example).

from kfp import dsl
import mlrun

@dsl.pipeline(name="Demo pipeline")
def kfpipeline():

    # train with hyper-parameters (LABELS and DATASET are assumed to be defined earlier)
    train = mlrun.import_function('hub://sklearn_classifier').as_step(
        name="train",
        params={"sample": -1,
                "label_column": LABELS,
                "test_size": 0.10,
                "model_pkg_class": "sklearn.ensemble.RandomForestClassifier"},
        inputs={"dataset": DATASET},
        outputs=['model', 'test_set'])

    # deploy our model as a serverless function; we can pass a list of models to serve
    deploy = mlrun.import_function('hub://v2_model_server').deploy_step(
        models=[{"key": "mymodel:v1", "model_path": train.outputs['model']}])

    # test our new model server (via REST API calls)
    tester = mlrun.import_function('hub://v2_model_tester').as_step(
        name='model-tester',
        params={'addr': deploy.outputs['endpoint'], 'model': "mymodel:v1"},
        inputs={'table': train.outputs['test_set']})
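
To submit the pipeline for execution you can use MLRun's pipeline helper; a minimal sketch, assuming a configured MLRun/KFP cluster (the experiment name is arbitrary):

import mlrun

# compile and submit the pipeline to Kubeflow Pipelines
run_id = mlrun.run_pipeline(kfpipeline, experiment="serving-demo", arguments={})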

See the documentation links below for more advanced examples.

Additional Resources

Further Documentation for Serving

Further Documentation for Monitoring