TensorFlow Serving

After a model has been trained, it often needs to be deployed in a production environment. The most common way to do this is to provide an API on a server: the client sends a request in a specific format to one of the server's APIs, the server receives the request data, runs it through the model, and returns the result. If all we need is a demo, the server API can be implemented very easily with a Python web framework such as Flask, without worrying about concurrency or performance. Most real production environments, however, are not like that. TensorFlow therefore provides TensorFlow Serving, a serving system that helps us deploy machine learning models flexibly and with high performance in real production environments.
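If all you need is such a demo, a Flask-based API could look roughly like the sketch below. This is not TensorFlow Serving; the model path, route name and request format are illustrative assumptions only.

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)
# Load the exported model once at startup (the path 'saved/1' is a hypothetical example)
model = tf.keras.models.load_model('saved/1')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"instances": [...]}, similar to the request format used later in this chapter
    instances = np.array(request.get_json()['instances'], dtype=np.float32)
    predictions = model(instances).numpy().tolist()
    return jsonify({'predictions': predictions})

if __name__ == '__main__':
    app.run(port=5000)    # Flask's development server is single-threaded and not meant for production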

Installation of TensorFlow Serving

TensorFlow Serving can be installed using either apt-get or Docker. In a production environment, it is recommended to use Docker to deploy TensorFlow Serving. However, as a tutorial, we will introduce the apt-get installation, which does not rely on Docker.
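For reference, deploying with Docker roughly amounts to pulling the official tensorflow/serving image and mounting the SavedModel directory into it, along the lines of the sketch below; the source path and model name are placeholders to adapt to your own model.

docker pull tensorflow/serving
docker run -t --rm -p 8501:8501 \
    --mount type=bind,source=/path/to/saved_model_files,target=/models/MLP \
    -e MODEL_NAME=MLP \
    tensorflow/serving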

Hint

Installation methods are time-sensitive; this section was last updated in August 2019. If you encounter problems, it is recommended to refer to the latest installation instructions on the TensorFlow website.

First add the package source:

# Add the TensorFlow Serving package source from Google
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
# Add gpg key
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -

Then you can use apt-get to install TensorFlow Serving

sudo apt-get update
sudo apt-get install tensorflow-model-server

Hint

You can use the Windows Subsystem for Linux (WSL) to install TensorFlow Serving on Windows for development purposes.

TensorFlow Serving model deployment

TensorFlow Serving can directly read models in the SavedModel format for deployment (see the previous chapter for exporting models to SavedModel files). The command is as follows

tensorflow_model_server \
    --rest_api_port=PORT (e.g., 8501) \
    --model_name=MODEL_NAME \
    --model_base_path="Absolute path of the folder containing the SavedModel (without the version number)"

Note

TensorFlow Serving supports hot updating of models. A typical model folder structure is as follows.

/saved_model_files
    /1      # model files of version 1
        /assets
        /variables
        saved_model.pb
    ...
    /N      # model files of version N
        /assets
        /variables
        saved_model.pb

The subfolders 1 to N above hold models with different version numbers. When specifying --model_base_path, you only need to give the absolute path (not a relative path) of the root directory, without the version number. For example, if the folder structure above is located at /home/snowkylin, then --model_base_path should be set to /home/snowkylin/saved_model_files. TensorFlow Serving will automatically load the model with the largest version number.
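As a sketch of how a hot update could be triggered under this layout, exporting a new model into a higher-numbered subfolder of the same base directory is enough for a running TensorFlow Serving instance to switch to it. The model definition and paths below are illustrative placeholders.

import tensorflow as tf

# Placeholder model: in practice this would be a retrained model with the same inputs and outputs as the earlier versions
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Export under a new version number (2 here as an example; use a number larger than any existing version);
# TensorFlow Serving watches the base directory and loads the new version automatically
tf.saved_model.save(model, "/home/snowkylin/saved_model_files/2")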

Keras Sequential model deployment

Since a Sequential model has fixed inputs and outputs, this type of model is easy to deploy and requires no additional steps. For example, to deploy the MNIST handwritten digit classification model from the previous chapter (built with the Keras Sequential API and exported as a SavedModel) under the model name MLP on port 8501, you can use the following command directly.

tensorflow_model_server \
    --rest_api_port=8501 \
    --model_name=MLP \
    --model_base_path="/home/.../.../saved"  # The absolute address of the SavedModel folder without version number

The model can then be called from a client using gRPC or the RESTful API, as described later.
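Before writing a full client, you can check that the model has loaded by querying TensorFlow Serving's model status endpoint (GET /v1/models/MODEL_NAME). A minimal sketch with the Requests library, assuming the server above is running locally on port 8501:

import requests

# Query the status of the deployed model; address and model name match the deployment command above
response = requests.get('http://localhost:8501/v1/models/MLP')
print(response.json())    # the loaded version should be reported with state "AVAILABLE"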

Custom Keras model deployment

Custom Keras models built by inheriting from the tf.keras.Model class are more flexible. Therefore, when deploying such a model with TensorFlow Serving, there are additional requirements on the exported SavedModel file.

  • Methods that need to be exported to the SavedModel format (e.g. call) must not only be decorated with @tf.function, but also specify the input_signature parameter at decoration time to describe the input shape explicitly, using a list of tf.TensorSpec that specifies the shape and type of each input tensor. For example, for MNIST handwritten digit classification, the input is a four-dimensional tensor of shape [None, 28, 28, 1] (None means that the first dimension, i.e. the batch size, is not fixed). We can decorate the call method of the model as follows.

class MLP(tf.keras.Model):
    ...

    @tf.function(input_signature=[tf.TensorSpec([None, 28, 28, 1], tf.float32)])
    def call(self, inputs):
        ...
  • When exporting the model with tf.saved_model.save, we also need to provide the "signature of the function to be exported" via the signatures parameter. In short, since a custom model class may have several methods that need to be exported, TensorFlow Serving has to be told which method should be called when a request arrives from the client. For example, to assign the signature call to the model.call method, we pass the signatures parameter when exporting, as a dict of key-value pairs mapping signature names to the methods to be exported. The following code is an example; a quick way to verify the exported signature is sketched right after it.

model = MLP()
...
tf.saved_model.save(model, "saved_with_signature/1", signatures={"call": model.call})
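To double-check what has been exported, one option is to load the SavedModel back in Python and list its signatures; a minimal sketch, assuming the export path used above:

import tensorflow as tf

# Load the exported SavedModel and list the signatures it provides
loaded = tf.saved_model.load("saved_with_signature/1")
print(list(loaded.signatures.keys()))    # expected to contain 'call'

The saved_model_cli tool shipped with TensorFlow can also be used to inspect a SavedModel's signatures.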

Once both of these steps have been completed, you can deploy the model with the following command

tensorflow_model_server \
    --rest_api_port=8501 \
    --model_name=MLP \
    --model_base_path="/home/.../.../saved_with_signature"  # 修改为自己模型的绝对地址

Calling models deployed with TensorFlow Serving from a client

Models deployed with TensorFlow Serving can be called via gRPC or a RESTful API. This handbook mainly introduces the more general RESTful API method.

The RESTful API uses the standard HTTP POST method, with both the request and the response being JSON objects. To call the server-side model, the client sends a request to the server in the following format.

Server URI: http://SERVER_ADDRESS:PORT/v1/models/MODEL_NAME:predict

Content of the request:

{
    "signature_name": "the signature of the method to be called (do not request for Sequential models)",
    "instances": input data
}

The format of the response is

{
    "predictions": the returned value
}

A Python client example

The following example uses Python's Requests library (which you may need to install with pip install requests) to send the first 10 images of the MNIST test set to the local TensorFlow Serving server, get back the predicted results, and compare them with the actual labels of the test set.

import json
import numpy as np
import requests
from zh.model.utils import MNISTLoader


data_loader = MNISTLoader()
data = json.dumps({
    "instances": data_loader.test_data[0:3].tolist()
    })
headers = {"content-type": "application/json"}
json_response = requests.post(
    'http://localhost:8501/v1/models/MLP:predict',
    data=data, headers=headers)
predictions = np.array(json.loads(json_response.text)['predictions'])
print(np.argmax(predictions, axis=-1))
print(data_loader.test_label[0:10])

Output:

[7 2 1 0 4 1 4 9 6 9]
[7 2 1 0 4 1 4 9 5 9]

It can be seen that the predicted results are very close to the true label values.

For a custom Keras model, simply add the signature_name field to the data being sent, i.e. change the data construction in the code above to

data = json.dumps({
    "signature_name": "call",
    "instances": data_loader.test_data[0:10].tolist()
    })

A Node.js client example

The following example uses Node.js to convert the image below into a 28×28 grayscale image, send it to the local TensorFlow Serving server, and print the returned prediction and its probability. This program uses the image processing library jimp and the HTTP library superagent, which can be installed with npm install jimp and npm install superagent.


test_pic_tag_5.png: a handwritten digit 5. (Download this image and place it in the same directory as the code before running the example below.)

const Jimp = require('jimp')
const superagent = require('superagent')

const url = 'http://localhost:8501/v1/models/MLP:predict'

const getPixelGrey = (pic, x, y) => {
  const pointColor = pic.getPixelColor(x, y)
  const { r, g, b } = Jimp.intToRGBA(pointColor)
  const gray = +(r * 0.299 + g * 0.587 + b * 0.114).toFixed(0)
  return [ gray / 255 ]
}

const getPicGreyArray = async (fileName) => {
  const pic = await Jimp.read(fileName)
  const resizedPic = pic.resize(28, 28)
  const greyArray = []
  for ( let i = 0; i< 28; i ++ ) {
    let line = []
    for (let j = 0; j < 28; j ++) {
      line.push(getPixelGrey(resizedPic, j, i))
    }
    console.log(line.map(_ => _[0] > 0.3 ? ' ' : '1').join(' '))
    greyArray.push(line)
  }
  return greyArray
}

const evaluatePic = async (fileName) => {
  const arr = await getPicGreyArray(fileName)
  const result = await superagent.post(url)
    .send({
      instances: [arr]
    })
  result.body.predictions.map(res => {
    const sortedRes = res.map((_, i) => [_, i])
    .sort((a, b) => b[0] - a[0])
    console.log(`We guess the number is ${sortedRes[0][1]}, the probability is ${sortedRes[0][0]}`)
  })
}

evaluatePic('test_pic_tag_5.png')

The output is

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1               1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1                 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1       1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1       1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1     1                 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1                         1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1         1 1 1 1 1 1     1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1       1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1     1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1         1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1         1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1     1 1 1         1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1                 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1         1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
We guess the number is 5, the probability is 0.846008837

As can be seen, the output is as expected.

Note

If you are not familiar with HTTP POST, you can refer to this article. In fact, when you fill out a form in your browser (say, a personality test), click the "Submit" button, and get a result back (say, "Your personality type is ISTJ"), you are most likely sending an HTTP POST request to a server and receiving its response.

RESTful API is a popular style of API design; it is briefly described in this article.

The complete RESTful API of TensorFlow Serving is described in the documentation.