Building Scikit-Learn for AWS Lambda

Using the Amazon Linux Image to Build Lambda Packages

Posted by Ryan S. Brown on Sat, Jan 21, 2017
In Mini-Project
Tags: lambda, scikitlearn, python, numpy, scipy, docker

Last year, I posted instructions for building scikit-learn for AWS Lambda. Since then, the way scikit-learn has to be built has changed. The project has started shipping as a different kind of wheelfile (bdist_wheel). According to this GitHub issue, that breaks build processes that use strip to reduce the code size of numpy and scipy.

Amazon has also released a container edition of Amazon Linux. The Amazon Linux container is a full container version of the same Amazon Linux that runs in the AWS Lambda environment. In this post, we’ll use the new container image to build the same scikit-learn artifact as the last post, which used an EC2 instance.

The New Script

The script itself is relatively unchanged, but to use it you’ll need to have Docker installed on your computer. Installing Docker is beyond the scope of this post; if you don’t have it, check out the docs.

Avoiding the prebuilt binary wheels means changing the pip commands to use the (new in pip version 8) --no-binary option, which forces pip to build the named package from source. The new command is pip install --use-wheel --no-binary numpy numpy: the first numpy is the argument to --no-binary, and the second is the package to install.

To run it, instead of using an Ansible playbook, you’ll need to pull down the Amazon Linux image with docker pull amazonlinux:2016.09. As of January 2017, the 2016.09 image matches the Lambda execution environment.

Running in Docker

Once the image is downloaded, we can run the build script in the container. Clone my ryansb/sklearn-build-lambda repository and change into its directory. Once that’s done, we can use docker run to build the artifacts and dump them in the working directory. The $(pwd) part of the volume argument mounts the current directory at /outputs inside the container.

$ docker run -v "$(pwd)":/outputs -it amazonlinux:2016.09 \
    /bin/bash /outputs/build.sh

After a few minutes (depending on your hardware) you’ll have a venv.zip file containing scikit-learn and all its dependencies. The artifact is still hovering around 40MB, which is large but not unmanageable. For trained models, I’ve had success storing them separately in S3 and downloading them at function start.
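
A minimal sketch of that S3 pattern might look like the following. It assumes the model was saved with pickle and that boto3 is available (it is included in the Lambda Python runtime). The bucket name, key, and function names are placeholders for illustration, not taken from the build repository.

```python
import os
import pickle

# Hypothetical placeholders; replace with your own bucket and object key.
MODEL_BUCKET = "my-model-bucket"
MODEL_KEY = "models/classifier.pkl"

# In-memory cache keyed by (bucket, key). Module-level state survives
# between invocations while the Lambda container stays warm.
_models = {}


def load_model(bucket=MODEL_BUCKET, key=MODEL_KEY, s3_client=None, cache_dir="/tmp"):
    """Fetch a pickled model from S3 once, then reuse the loaded object."""
    cache_key = (bucket, key)
    if cache_key in _models:
        return _models[cache_key]

    # /tmp is the writable scratch space inside a Lambda function.
    local_path = os.path.join(cache_dir, os.path.basename(key))
    if not os.path.exists(local_path):
        if s3_client is None:
            import boto3  # imported lazily so the module loads without it

            s3_client = boto3.client("s3")
        s3_client.download_file(bucket, key, local_path)

    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(local_path, "rb") as f:
        _models[cache_key] = pickle.load(f)
    return _models[cache_key]


def handler(event, context):
    """Example Lambda entry point; assumes event["features"] is a list of rows."""
    model = load_model()
    # scikit-learn estimators return a numpy array from predict();
    # tolist() converts it to plain Python types for the JSON response.
    return {"predictions": model.predict(event["features"]).tolist()}
```

Keeping the loaded model in a module-level dictionary means only a cold start pays the download cost; warm invocations reuse the object already in memory.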

Wrapping Up

With the Docker container, you don’t need to worry about whether your base OS is the right flavor of Linux, and you don’t need to run an instance in AWS. The build script is also easy to expand to include more dependencies: just add libraries to the do_pip function of build.sh.

Keep up with future posts via RSS. If you have suggestions, questions, or comments, email me at ryan@serverlesscode.com.

