Serving Deep Networks in Production: Balancing Productivity vs Efficiency Tradeoff

A new project provides an alternative modality for serving deep neural networks. It enables using eager-mode (a.k.a. define-by-run) model code directly in production workloads by embedding CPython interpreters in the serving process. The goal is to reduce the engineering effort needed to bring models from the research stage to end users, and to create a proof-of-concept platform for migrating future numerical Python libraries. The initial library is also available as a prototype feature (torch::deploy) in the PyTorch C++ API (version 1.11).

The two common practices for deploying deep networks for API inference have been direct containerization of model code with a REST/RPC server (e.g. using Python, CUDA, or ROCm base images) and construction of a static model graph (e.g. using TensorFlow graph mode or TorchScript). Containerization brings agility when carrying a model from development to production. Graph mode (a.k.a. define-and-run), on the other hand, allows optimized deployment within a larger serving infrastructure. Both methods come with tradeoffs in required engineering cost, API latency, and resource constraints (e.g. available GPU memory).

Depending on the application field and company infrastructure requirements, the deployment method may be customized to specific needs (e.g. graph mode may be enforced, or containers may be created from compiled models instead). But the rapidly evolving nature of deep networks demands relatively loose development environments in which quick experimentation is important to a project's success, which limits how far the development cost of bringing such models to production can be reduced. The question the paper raises is: can we minimize machine learning engineering effort by enabling eager-mode serving within the C++ ecosystem without incurring large performance penalties?

The authors' answer to this question is not very surprising: what if incoming requests were handed off to multiple CPython workers packaged within the C++ application? With a proxy load-balancing requests across the workers, the models can be used directly without further engineering work. Such a setup would also let machine learning engineers keep the external libraries used while developing the model (e.g. NumPy, Pillow), since a full CPython interpreter is available (although this is not yet supported in the current prototype). The approach may seem similar to containerization with Python base images, since both sidestep Python's GIL by using multiple decoupled interpreters, but the new method also allows execution in C++; in other words, it combines the two aforementioned approaches in a unique way.

An example of packaging in Python can be seen below:
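The original code listing is not reproduced here; the following is a minimal sketch of what such packaging can look like with torch.package, where the ResNet-18 model and the archive and resource names are illustrative assumptions rather than the report's exact code:

import torchvision
from torch.package import PackageExporter

# An eager-mode model; ResNet-18 stands in for whatever network is being served.
model = torchvision.models.resnet18().eval()

# Bundle the model and its Python source into a single self-contained archive.
with PackageExporter("resnet18.pt") as exporter:
    exporter.intern("torchvision.**")   # ship torchvision's code inside the archive
    exporter.extern("numpy.**")         # resolve these against the serving environment
    exporter.extern("sys")
    exporter.extern("PIL.*")
    exporter.save_pickle("model", "model.pkl", model)

The intern/extern patterns control which dependencies are bundled into the archive and which are expected to exist in the serving environment.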

After saving the model artifact, it can be loaded through the C++ API for inference:
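Again as a sketch rather than the report's exact listing, the C++ side could look roughly like the following, based on the torch::deploy prototype in PyTorch 1.11; the header path, interpreter count, archive name, and input shape are assumptions and may differ between versions:

#include <torch/csrc/deploy/deploy.h>   // prototype header location in PyTorch 1.11
#include <torch/torch.h>
#include <iostream>
#include <vector>

int main() {
    // A pool of embedded CPython interpreters; a serving proxy can spread
    // concurrent requests across them, each with its own GIL.
    torch::deploy::InterpreterManager manager(4);

    // Load the torch.package archive produced above and unpickle the model.
    torch::deploy::Package package = manager.loadPackage("resnet18.pt");
    torch::deploy::ReplicatedObj model = package.loadPickle("model", "model.pkl");

    // Run eager-mode inference from C++; the input shape matches ResNet-18.
    std::vector<torch::jit::IValue> inputs{torch::randn({1, 3, 224, 224})};
    at::Tensor output = model(inputs).toTensor();

    std::cout << output.sizes() << std::endl;
    return 0;
}

The InterpreterManager keeps multiple independent interpreters alive in one process, which is what lets a C++ server dispatch concurrent requests to eager-mode Python code without contending on a single GIL.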

The benchmarks carried out in the report show that such CPython packaging can be a good alternative, especially for serving large models. As the project is still a work in progress, it has several shortcomings. For example, external library support is currently limited to the Python standard library and PyTorch. Also, the shared interpreter library has to be copied and loaded for each interpreter, so the size and number of workers may become a scalability concern. In the future, the contributors plan to package dependencies directly from pip/Conda environments, allowing even easier production deployment.
