Deploying Models with TensorFlow Serving: A Comprehensive Guide
TensorFlow Serving is a high-performance, flexible system designed for serving machine learning models in production environments. It enables scalable, efficient inference for TensorFlow models, supporting REST and gRPC APIs, model versioning, and integration with various platforms. This blog provides a detailed exploration of TensorFlow Serving, covering its mechanics, practical deployment scenarios, and optimization techniques. Aimed at TensorFlow users with basic familiarity with the framework and Python, this guide assumes knowledge of SavedModel, tf.estimator, and tf.data APIs.
Introduction to TensorFlow Serving
TensorFlow Serving is part of the TensorFlow Extended (TFX) ecosystem, built to deploy SavedModel-format models for real-time or batch inference. It handles model loading, versioning, and request processing, making it ideal for production systems requiring low-latency predictions, high throughput, or dynamic model updates. TensorFlow Serving supports TensorFlow models natively and integrates with Docker, Kubernetes, and cloud platforms for scalable deployment.
This blog demonstrates how to deploy models using TensorFlow Serving, including setting up the server, serving Keras and estimator models, and optimizing performance. Practical examples cover common use cases like classification and regression, ensuring you can implement robust serving pipelines.
For foundational context, see SavedModel and Model Deployment.
Why Use TensorFlow Serving?
TensorFlow Serving offers several advantages for production deployment:
- High Performance: Optimized for low-latency, high-throughput inference with batching and hardware acceleration.
- Model Versioning: Supports multiple model versions, enabling A/B testing and rollback.
- Scalability: Integrates with Kubernetes and cloud platforms for distributed serving.
- Flexibility: Supports REST and gRPC APIs, catering to diverse client needs.
However, setting up TensorFlow Serving requires careful configuration of model formats, input signatures, and server settings to avoid issues like latency spikes or resource overuse. We’ll address these challenges with practical solutions.
External Reference
- [TensorFlow Serving Overview](https://www.tensorflow.org/tfx/guide/serving) – Official documentation on TensorFlow Serving setup and usage.
Mechanics of TensorFlow Serving
TensorFlow Serving operates as a server that loads SavedModel-format models and exposes them via REST or gRPC endpoints. Key components include:
- Model Server: The core server process that loads and serves models.
- SavedModel: The model format containing the computation graph, weights, and signatures.
- Servables: Abstractions for models or data, allowing dynamic loading and versioning.
- Signatures: Define input and output tensors for inference, specified in the SavedModel.
- Batching: Combines multiple requests into batches to improve throughput.
To deploy a model, you save it in SavedModel format, configure the server to load it, and send inference requests via API calls.
Practical Applications of TensorFlow Serving
Let’s explore how to deploy models with TensorFlow Serving, with detailed examples for common scenarios.
1. Serving a Keras Model
Keras models saved in SavedModel format can be served efficiently with TensorFlow Serving.
Example: Serving a Keras Classification Model
Suppose you have a Keras model for binary classification.
import tensorflow as tf
import numpy as np
# Sample data
data = {
    "feature1": np.array([1.0, 2.0, 3.0, 4.0]),
    "feature2": np.array([10.0, 20.0, 30.0, 40.0]),
    "label": np.array([0, 1, 0, 1])
}
# Define Keras model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Train model
features = np.stack([data["feature1"], data["feature2"]], axis=1)
labels = data["label"]
model.fit(features, labels, epochs=5, batch_size=2)
# Save to SavedModel
tf.saved_model.save(model, "saved_model_keras/1") # Version 1
This saves the model in SavedModel format with version 1. For Keras models, see Keras in TensorFlow.
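Before pointing the server at the export, it can help to confirm which signature keys and tensor names it exposes, since those are the names REST and gRPC clients must use. A minimal check (the exact key names depend on your layer names):
import tensorflow as tf

# Load the export and list its serving signatures
loaded = tf.saved_model.load("saved_model_keras/1")
print(list(loaded.signatures.keys()))  # typically ['serving_default']

# Inspect the input and output tensor names the signature expects
infer = loaded.signatures["serving_default"]
print(infer.structured_input_signature)
print(infer.structured_outputs)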
Setting Up TensorFlow Serving
Run TensorFlow Serving using Docker, mounting the model directory:
docker run -p 8500:8500 -p 8501:8501 --mount type=bind,source=/path/to/saved_model_keras,target=/models/my_model -e MODEL_NAME=my_model -t tensorflow/serving
This starts the server, exposing a REST API at http://localhost:8501/v1/models/my_model and a gRPC endpoint on port 8500.
Sending Inference Requests
Use the REST API to send a prediction request:
curl -d '{"instances": [[2.5, 25.0]]}' -X POST http://localhost:8501/v1/models/my_model:predict
Example response:
{
  "predictions": [[0.7321]]
}
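The same call can be made from Python with the requests library, which is often more convenient inside an application. A small sketch, assuming the server from the previous step is running:
import requests

# Equivalent of the curl call above
payload = {"instances": [[2.5, 25.0]]}
response = requests.post("http://localhost:8501/v1/models/my_model:predict", json=payload)
response.raise_for_status()
print(response.json()["predictions"])  # e.g. [[0.7321]]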
For gRPC, use the TensorFlow Serving Python client:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

# Connect to the gRPC endpoint (port 8500)
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Create request
request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"
request.model_spec.signature_name = "serving_default"
# The input and output key names ("inputs", "dense_1") come from the SavedModel
# signature; check your model's signature and adjust them if they differ.
request.inputs["inputs"].CopyFrom(tf.make_tensor_proto([[2.5, 25.0]], dtype=tf.float32))
response = stub.Predict(request, 10.0)  # 10s timeout
print(response.outputs["dense_1"].float_val)  # e.g. [0.7321]
For REST and gRPC APIs, see Serving REST API and gRPC Serving.
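If you are unsure which input and output names a running server expects, the REST metadata endpoint returns the model's signature definitions. A quick check, assuming the server above is running:
import requests

# Returns the signature_def with input/output tensor names and dtypes
metadata = requests.get("http://localhost:8501/v1/models/my_model/metadata")
print(metadata.json())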
External Reference
- [TensorFlow Serving REST API Tutorial](https://www.tensorflow.org/tfx/serving/api_rest) – Guide to using the REST API for inference.
2. Serving an Estimator Model
Estimators are commonly used for structured data and can be exported to SavedModel for serving.
Example: Serving a DNNClassifier
Suppose you have a DNNClassifier for classification.
import tensorflow as tf
import pandas as pd
# Sample data
data = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "region": ["NY", "SF", "LA", "NY"],
    "label": [0, 1, 0, 1]
})
# Define feature columns
age_col = tf.feature_column.numeric_column("age")
region_col = tf.feature_column.categorical_column_with_vocabulary_list(
    "region", ["NY", "SF", "LA"]
)
region_indicator = tf.feature_column.indicator_column(region_col)
feature_columns = [age_col, region_indicator]
# Define input function
def input_fn(data, batch_size=32, shuffle=True):
    features = {"age": data["age"], "region": data["region"]}
    labels = data["label"]
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(data))
    dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return dataset
# Create estimator
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[16, 8],
    n_classes=2,
    model_dir="model_dir"
)
# Train
estimator.train(lambda: input_fn(data, batch_size=2), steps=100)
# Export to SavedModel
feature_spec = tf.feature_column.make_parse_example_spec(feature_columns)
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
export_path = estimator.export_saved_model("saved_model_estimator", serving_input_fn)
export_saved_model writes the SavedModel into a timestamped subdirectory under saved_model_estimator; because the directory name is numeric, TensorFlow Serving treats it as the model version. For estimators, see Keras to Estimator.
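export_saved_model also returns the export directory it created, which is handy for scripting the hand-off to the serving layout. A small sanity check (the timestamp in the comment is illustrative):
# export_saved_model returns the timestamped export directory as bytes
print(export_path.decode("utf-8"))  # e.g. saved_model_estimator/1712345678

# Verify the directory actually contains a servable SavedModel
print(tf.saved_model.contains_saved_model(export_path.decode("utf-8")))  # True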
Serving and Inferring
Run TensorFlow Serving:
docker run -p 8501:8501 --mount type=bind,source=/path/to/saved_model_estimator,target=/models/my_model -e MODEL_NAME=my_model -t tensorflow/serving
Send a REST request with a serialized tf.train.Example:
import base64
import requests
import tensorflow as tf

# Build and serialize a tf.train.Example matching the parsing signature
example = tf.train.Example(features=tf.train.Features(feature={
    "age": tf.train.Feature(float_list=tf.train.FloatList(value=[30.0])),
    "region": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"SF"]))
}))
request_data = {"instances": [{"examples": {"b64": base64.b64encode(example.SerializeToString()).decode()}}]}
response = requests.post("http://localhost:8501/v1/models/my_model:predict", json=request_data)
print(response.json())  # e.g. {"predictions": [{"probabilities": [0.3, 0.7], "classes": ["1"], ...}]}
This uses the REST API to infer with a tf.train.Example.
Optimizing TensorFlow Serving
To ensure efficient and scalable deployment, apply these optimization strategies:
1. Enable Batching
Configure TensorFlow Serving for batch inference to improve throughput:
docker run -p 8501:8501 --mount type=bind,source=/path/to/saved_model_keras,target=/models/my_model --mount type=bind,source=/path/to/batching_config.txt,target=/models/batching_config.txt -e MODEL_NAME=my_model -t tensorflow/serving --enable_batching=true --batching_parameters_file=/models/batching_config.txt
Example batching_config.txt:
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 4 }
This batches up to 32 requests with a 5ms timeout. For batch inference, see Batch Inference.
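Server-side batching only pays off when requests arrive concurrently, so that several of them fall inside the timeout window. A rough way to exercise it from Python (a sketch, not a benchmark; it assumes the Keras model served earlier):
import concurrent.futures
import requests

def predict(row):
    resp = requests.post(
        "http://localhost:8501/v1/models/my_model:predict",
        json={"instances": [row]},
    )
    return resp.json()["predictions"]

# Fire concurrent requests so the server can group them into batches
rows = [[float(i), float(i * 10)] for i in range(64)]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(predict, rows))
print(len(results))  # 64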
2. Optimize Model Performance
Apply model optimization techniques like quantization or pruning before saving:
import tensorflow_model_optimization as tfmot

# Wrap the model for quantization-aware training, then fine-tune before exporting
quantized_model = tfmot.quantization.keras.quantize_model(model)
quantized_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
quantized_model.fit(features, labels, epochs=2, batch_size=2)
tf.saved_model.save(quantized_model, "saved_model_quantized/1")
For optimization, see Quantization.
3. Model Versioning
Store multiple model versions in subdirectories (e.g., saved_model_keras/1, saved_model_keras/2). TensorFlow Serving automatically loads the latest version or allows version-specific requests:
curl -d '{"instances": [[2.5, 25.0]]}' -X POST http://localhost:8501/v1/models/my_model/versions/1:predict
For versioning, see Model Versioning.
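You can confirm which versions the server has loaded via the model status endpoint:
import requests

# Lists each version and its state (e.g. AVAILABLE)
status = requests.get("http://localhost:8501/v1/models/my_model")
print(status.json())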
4. Scalable Deployment with Kubernetes
Deploy TensorFlow Serving on Kubernetes for scalability:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tf-serving
  template:
    metadata:
      labels:
        app: tf-serving
    spec:
      containers:
      - name: tf-serving
        image: tensorflow/serving
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: "my_model"
        volumeMounts:
        - mountPath: /models/my_model
          name: model-volume
      volumes:
      - name: model-volume
        hostPath:
          path: /path/to/saved_model_keras
This deploys three replicas for load balancing. For Kubernetes, see TensorFlow on Kubernetes.
5. Monitor and Profile
Monitor serving performance with TensorFlow Serving’s Prometheus metrics endpoint (enabled by passing a --monitoring_config_file, which sets the metrics path, e.g. /monitoring/prometheus), and profile inference with the TensorFlow Profiler:
# Profile inference on a locally loaded copy of the SavedModel
tf.profiler.experimental.start("logdir")
loaded_model = tf.saved_model.load("saved_model_keras/1")
infer = loaded_model.signatures["serving_default"]
infer(tf.constant([[2.5, 25.0]], dtype=tf.float32))
tf.profiler.experimental.stop()
For profiling, see Profiler Advanced.
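Once the Prometheus endpoint is enabled, a quick scrape confirms that metrics are being exported (the path must match the one set in your monitoring config):
import requests

# Fetch the Prometheus-format metrics exposed by the model server
metrics = requests.get("http://localhost:8501/monitoring/prometheus")
print(metrics.text.splitlines()[:10])  # first few metric lines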
External Reference
- [TensorFlow Serving with Kubernetes](https://www.tensorflow.org/tfx/serving/serving_kubernetes) – Deploying TensorFlow Serving on Kubernetes.
Advanced Use Cases
1. Multi-Model Serving
Serve multiple models from one server by mounting a directory that contains the model folders plus a model config file, and pointing the server at that file:
docker run -p 8501:8501 --mount type=bind,source=/path/to/models,target=/models -t tensorflow/serving --model_config_file=/models/models.config
Example models.config:
model_config_list {
  config {
    name: "model1"
    base_path: "/models/saved_model_keras"
    model_platform: "tensorflow"
  }
  config {
    name: "model2"
    base_path: "/models/saved_model_estimator"
    model_platform: "tensorflow"
  }
}
Request specific models via /v1/models/model1:predict or /v1/models/model2:predict.
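From Python, each model is addressed by its configured name; note that model2 (the estimator) expects the serialized tf.train.Example payload shown earlier, so this sketch only calls model1:
import requests

# Call the Keras model registered as "model1" in models.config
resp = requests.post(
    "http://localhost:8501/v1/models/model1:predict",
    json={"instances": [[2.5, 25.0]]},
)
print(resp.json())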
2. A/B Testing
Use model versioning for A/B testing:
# Save new model version
tf.saved_model.save(model, "saved_model_keras/2")
Test the new version:
curl -d '{"instances": [[2.5, 25.0]]}' -X POST http://localhost:8501/v1/models/my_model/versions/2:predict
For A/B testing, see A/B Testing.
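A simple client-side traffic split between the two versions might look like the sketch below (in production this routing usually lives in a gateway or the serving configuration rather than in each client):
import random
import requests

def predict_ab(row, new_version_fraction=0.1):
    # Route roughly 10% of traffic to version 2, the rest to version 1
    version = 2 if random.random() < new_version_fraction else 1
    url = f"http://localhost:8501/v1/models/my_model/versions/{version}:predict"
    resp = requests.post(url, json={"instances": [row]})
    return version, resp.json()["predictions"]

print(predict_ab([2.5, 25.0]))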
3. Custom Signatures
Define custom signatures for specific inference tasks:
@tf.function(input_signature=[tf.TensorSpec(shape=[None, 2], dtype=tf.float32)])
def custom_serving(x):
    return {"prediction": model(x)}

tf.saved_model.save(model, "saved_model_custom/1", signatures={"custom": custom_serving})
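To invoke the custom signature instead of serving_default, pass signature_name in the REST request body (this assumes saved_model_custom is being served under the name my_model):
import requests

payload = {
    "signature_name": "custom",   # select the custom signature
    "instances": [[2.5, 25.0]],
}
resp = requests.post("http://localhost:8501/v1/models/my_model:predict", json=payload)
print(resp.json())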
For custom models, see tf.Module.
Common Pitfalls and Solutions
1. Signature Mismatches:
- Pitfall: Incorrect input signatures cause serving errors.
- Solution: Define input_signature in tf.function. See [tf.function Optimization](/tensorflow/intermediate/tf-function-optimization).
2. High Latency:
- Pitfall: Inefficient models or small batch sizes increase latency.
- Solution: Enable batching and optimize models. See [Inference Optimization](/tensorflow/production/inference-optimization).
3. Resource Overuse:
- Pitfall: Multiple model versions consume excessive memory.
- Solution: Configure model unloading or limit versions. See [Model Monitoring](/tensorflow/production/model-monitoring).
For debugging, see Debugging Tools.
Conclusion
TensorFlow Serving is a robust solution for deploying TensorFlow models in production, offering high-performance inference, model versioning, and scalability. By saving models in SavedModel format and configuring TensorFlow Serving with Docker or Kubernetes, you can serve Keras, estimator, or custom models efficiently. Optimizing with batching, model optimization, and profiling ensures low-latency, high-throughput serving. Whether you’re deploying classification models or implementing A/B testing, TensorFlow Serving empowers you to build production-ready machine learning systems.
For further exploration, dive into Scalable Inference or Production Best Practices.