Update: there’s a newer version of this article using Docker & the Amazon Linux image. It’s both faster and cheaper, since it runs on your local machine with no remote instance.
Lambda is a neat tool for running infrequent jobs, and not having to maintain any servers is a blessing. There is a tradeoff for this ease of use: you give up some control, and need to build standalone packages of stacks like scikit-learn if you need them. Scikit-learn depends on numpy and scipy, which in turn require C and Fortran (!!!) libraries. This makes getting all these dependencies into one Lambda deploy package interesting.
This post covers getting scikit-learn and its dependencies built and packaged for Lambda using Ansible and EC2. This means there’s no need to set up a build environment on your local box, and you can get the package from S3 when it’s complete to reuse again and again.
If you want to skip ahead, all the code to build sklearn (and a ready-to-use .zip) is on Github at ryansb/sklearn-build-lambda. Feature requests (and improvements) are welcome via pull requests, issues, or tweets @ryan_sb.
Constraints
In AWS Lambda, there are some limitations to contend with when using Python. Broadly, they are:
- C libraries for Python modules can’t be installed ahead of time
- There is a size limit (50MB at this time) on the zipped code, including libraries
- You don’t have the ability to change the LD_LIBRARY_PATH of your script before it runs, so you’re stuck with what you get
- You can’t write to the system-level library paths
The numpy/scipy/scikit-learn stack has trouble with just about all of these constraints. The installed libraries, zipped, come out to 65MB, the lapack and blas shared libraries must be loaded at runtime, and everything must be compiled to target the Amazon Linux runtime environment.
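You can see the size and linking problems for yourself on any Linux box with the stack installed. A quick check (the path here assumes the sklearn_build virtualenv created below; numpy’s lapack_lite module is just one example of a compiled extension):
$ du -sh sklearn_build/lib64/python2.7/site-packages/
$ ldd sklearn_build/lib64/python2.7/site-packages/numpy/linalg/lapack_lite.so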
Building Sklearn on Amazon Linux
First, I had to sort out the instance itself. I used Ansible to create the
instance because I wanted to be able to use this as part of an automated build
process. Ansible has a handy ec2
module for creating/terminating instances.
- ec2:
    key_name: "{{ ssh_key }}"
    instance_type: t2.micro
    image: ami-60b6c60a
    wait: yes
    instance_profile_name: "{{ profile_name }}"
    user_data: "{{ lookup('file', 'sklearn_user_data.sh') }}"
    # Give EBS volume a little more space than default
    volumes:
      - device_name: /dev/xvda
        delete_on_termination: true
        volume_type: gp2
        volume_size: 10
    # networking biz
    region: us-east-1
    vpc_subnet_id: "{{ subnet_id }}"
    assign_public_ip: yes
    groups:
      - default
  register: ectwo
The dependencies for sklearn/scipy/numpy are well-documented, making this part easy.
$ yum install -y atlas-devel atlas-sse3-devel blas-devel gcc gcc-c++ lapack-devel python27-devel
Then make a virtualenv for the build process, so all the dependencies are contained in one place, and install the libraries into it.
$ /usr/bin/virtualenv \
    --python /usr/bin/python \
    --always-copy \
    --no-site-packages \
    sklearn_build
$ source sklearn_build/bin/activate
$ pip install --use-wheel numpy
$ pip install --use-wheel scipy
$ pip install --use-wheel sklearn
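Before going further, it’s worth a quick smoke test inside the virtualenv to confirm the compiled extensions actually import (a minimal check, not part of the build scripts):
$ python -c "import numpy, scipy, sklearn; print(sklearn.__version__)"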
If you use a t2.micro or t2.nano instance (as I did, since t2.nano instances cost less than a penny per hour), you’ll also need a swapfile, because 1GB of RAM isn’t enough to build all the C libraries.
$ dd if=/dev/zero of=/swapfile bs=1024 count=1500000
$ mkswap /swapfile
$ chmod 0600 /swapfile
$ swapon /swapfile
We have all the dependencies in the virtualenv, but they’re still too large to use in Lambda. Next, we’ll work on trimming down the size.
10 Pounds of Libraries in a 5 Pound Zip
To reduce the size of all the shared libraries, we’ll use the strip command and apply it to every library we can find. This shaves about 40MB off the total.
$ find "$VIRTUAL_ENV/lib64/python2.7/site-packages/" -name "*.so" | xargs strip
$ pushd "$VIRTUAL_ENV/lib64/python2.7/site-packages/"
$ zip -r -9 -q ~/venv.zip *
$ popd
Note also that we’re using the -9 compression level with the zip command, which provides the highest compression ratio.
Ansible for Build Processes
On small instances, it takes a while to compile all the dependencies, so we need to set up Ansible to wait while that’s happening. It takes around 15 minutes to build the full stack and upload to S3.
- s3:
    bucket: tmp.serverlesscode.com
    object: "sklearn/{{ ectwo.instance_ids[0] }}-site-pkgs.zip"
    dest: /tmp/sklearn-site-packages.zip
    mode: get
  register: result
  until: result.failed is not defined or result.failed == false
  retries: 15
  delay: 90
Using the until-retry pattern in Ansible, I have it check S3 for the zipfile every 90 seconds; with 15 retries 90 seconds apart, it will wait up to about 22 minutes for the instance to finish the build.
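The other half of that handshake happens on the instance: the user_data script has to finish by uploading the zip to the key the playbook polls for. A sketch of what that last step might look like (the exact commands in sklearn_user_data.sh may differ; the bucket and key naming follow the playbook above):
$ INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
$ aws s3 cp ~/venv.zip "s3://tmp.serverlesscode.com/sklearn/${INSTANCE_ID}-site-pkgs.zip"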
- name: Terminate instances that were previously launched
  ec2:
    state: 'absent'
    region: us-east-1
    # get the instance ID from earlier to terminate now that the build
    # artifacts are in S3
    instance_ids: '{{ ectwo.instance_ids }}'
Once the archive is downloaded, Ansible kills the EC2 instance so you aren’t charged for any more time than you use.
Using Sklearn
To actually import sklearn from your Python code in the Lambda environment, you need to add your handler (mine is called demo.py), and it needs to load the .so files before running import sklearn. I used the ctypes module to load them.
import os
import ctypes

# Load every shared library bundled under lib/ so the dynamic linker can
# resolve lapack, blas, and friends before sklearn is imported. Static
# archives (.a files) are skipped, since they can't be dlopen()ed.
for d, dirs, files in os.walk('lib'):
    for f in files:
        if f.endswith('.a'):
            continue
        ctypes.cdll.LoadLibrary(os.path.join(d, f))

import sklearn


def handler(event, context):
    # do sklearn stuff here
    return {'yay': 'done'}
I had to walk the entire lib directory created by my build script, because there isn’t a way in Python to manipulate the LD_LIBRARY_PATH in time for the loader to accept the changes. This might result in more libraries than you need being loaded, which is fine.
Loading all the libraries outside the handler saves a significant amount of time for subsequent executions. In my tests with a 128MB Lambda execution environment, the first execution took up to 6.2 seconds to load the code (39MB zipped) and then import all the C libraries. Subsequent executions ran as quickly as 4ms when reusing a warm container.
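You can see the cold/warm difference for yourself by invoking the function twice in a row from the CLI (sklearn-demo is a placeholder for whatever you named your function):
$ time aws lambda invoke --function-name sklearn-demo /tmp/out.json
$ time aws lambda invoke --function-name sklearn-demo /tmp/out.json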
Building Deps Yourself
To build all these yourself, clone ryansb/sklearn-build-lambda and install Ansible 2.0+ and boto 2.
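Getting set up is just a couple of commands (the pip line is one way to do it; any install of Ansible 2.0+ and boto 2 will work):
$ git clone https://github.com/ryansb/sklearn-build-lambda.git
$ cd sklearn-build-lambda
$ pip install 'ansible>=2.0' boto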
To build everything, run ansible-playbook launch_sklearner.yml and expect to wait about 15 minutes for the built files to appear in S3. They’ll be named with the instance ID and “site-pkgs.zip”, for example i-0c9dab6cd6dcb0d9d-site-pkgs.zip. You can then download the zip, add your custom code, and upload the whole thing as a Lambda function.
The Ansible playbook will also download the dependencies to /tmp/sklearn-site-packages.zip for your convenience.
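Putting the last mile together looks something like this (a sketch: the function name, role ARN, and handler file are placeholders you’d replace with your own):
$ cp /tmp/sklearn-site-packages.zip my-function.zip
$ zip -9 my-function.zip demo.py
$ aws lambda create-function \
    --function-name sklearn-demo \
    --runtime python2.7 \
    --role arn:aws:iam::123456789012:role/lambda-execution-role \
    --handler demo.handler \
    --timeout 30 --memory-size 128 \
    --zip-file fileb://my-function.zip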
Wrapping Up
Eventually, I’ll probably expand the playbook to take any requirements and build them on an EC2 instance, so you don’t need to worry about any unexpected differences between your build machine and your EC2 instances.
Keep up with future posts via RSS. If you have suggestions, questions, or comments feel free to email me ryan@serverlesscode.com.