Using Scikit-Learn in AWS Lambda

Use Ansible and Boto to Build Scikit-Learn for Use in Lambda

Posted by Ryan S. Brown on Sat, Feb 20, 2016
In Mini-Project
Tags: lambda, scikitlearn, python, numpy, scipy

Update: there's a newer version of this article that uses Docker and the Amazon Linux image; it's both faster and cheaper since it runs on your local machine with no remote instance.

Lambda is a neat tool for running infrequent jobs, and not having to maintain any servers is a blessing. There is a tradeoff for this ease of use: you give up some control, and need to build standalone packages of stacks like scikit-learn if you need them. Scikit-learn depends on numpy and scipy, which in turn require C and Fortran (!!!) libraries. This makes getting all these dependencies into one Lambda deploy package interesting.

This post covers getting scikit-learn and its dependencies built and packaged for Lambda using Ansible and EC2. This means there’s no need to set up a build environment on your local box, and you can get the package from S3 when it’s complete to reuse again and again.

If you want to skip ahead, all the code to build sklearn (and a ready-to-use .zip) is on GitHub at ryansb/sklearn-build-lambda. Feature requests (and improvements) are welcome via pull requests, issues, or tweets @ryan_sb.

Constraints

In AWS Lambda, there are some limitations to contend with when using Python. Broadly, they are:

  1. C libraries for Python modules can’t be installed ahead of time
  2. There is a size limit (50MB at the time of writing) on the zipped deployment package, including libraries
  3. You don’t have the ability to change the LD_LIBRARY_PATH of your script before it runs, so you’re stuck with what you get
  4. You can’t write to the system-level library paths

The numpy/scipy/scikit-learn stack has trouble with just about all of these constraints. The installed libraries, zipped, come out to 65 MB, the lapack and blas shared libraries must be loaded, and everything must be compiled targeting the Amazon Linux runtime environment.

Building Sklearn on Amazon Linux

First, I had to sort out the instance itself. I used Ansible to create the instance because I wanted to be able to use this as part of an automated build process. Ansible has a handy ec2 module for creating/terminating instances.

- register: ectwo
  ec2:
    key_name: "{{ ssh_key }}"
    instance_type: t2.micro
    image: ami-60b6c60a
    wait: yes
    instance_profile_name: "{{ profile_name }}"
    user_data: "{{ lookup('file', 'sklearn_user_data.sh')}}"

    # Give EBS volume a little more space than default
    volumes:
      - device_name: /dev/xvda
        delete_on_termination: true
        volume_type: gp2
        volume_size: 10

    # networking biz
    region: us-east-1
    vpc_subnet_id: "{{ subnet_id }}"
    assign_public_ip: yes
    groups:
      - default

The dependencies for sklearn/scipy/numpy are well-documented, making this part easy.

$ yum install -y atlas-devel atlas-sse3-devel blas-devel gcc gcc-c++ lapack-devel python27-devel

Then make a virtualenv for the build process to make sure all the dependencies are contained and you can install the libraries.

$ /usr/bin/virtualenv \
      --python /usr/bin/python \
      --always-copy \
      --no-site-packages \
      sklearn_build
$ source sklearn_build/bin/activate
$ pip install --use-wheel numpy
$ pip install --use-wheel scipy
$ pip install --use-wheel scikit-learn
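
Once those installs finish, it's worth a quick check that the whole stack imports cleanly inside the virtualenv (this isn't part of the build script, just a sanity check):

$ python -c "import numpy, scipy, sklearn; print(sklearn.__version__)"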

If you use a t2.micro or t2.nano instance (which I did), you'll also need a swap file, because 1GB of RAM or less isn't enough to build all the C libraries. The small instance is worth the extra step, since a t2.nano costs less than a penny per hour.

$ dd if=/dev/zero of=/swapfile bs=1024 count=1500000
$ mkswap /swapfile
$ chmod 0600 /swapfile
$ swapon /swapfile
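
You can confirm the swap is active with free before kicking off the builds (again, just a sanity check, not part of the build script):

$ free -m    # the Swap line should show roughly 1.5GB total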

We have all the dependencies in the virtualenv, but they’re still too large to use in Lambda. Next, we’ll work on trimming down the size.

10 Pounds of Libraries in a 5 Pound Zip

To reduce the size of all the shared libraries, we’ll use the strip command and apply it to every library we can find - this will shave about 40MB off the total.

$ find "$VIRTUAL_ENV/lib64/python2.7/site-packages/" -name "*.so" | xargs strip
$ pushd "$VIRTUAL_ENV/lib64/python2.7/site-packages/"
$ zip -r -9 -q ~/venv.zip *
$ popd

Note also that we’re using the -9 compression level in the zip command, which provides the highest compression ratio.
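
The last step on the instance is pushing the archive to S3 so the playbook can pick it up. Roughly, the upload looks something like this; the bucket and key format match the S3 task shown in the next section, but the real script lives in the repo and may differ slightly:

$ INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
$ aws s3 cp ~/venv.zip "s3://tmp.serverlesscode.com/sklearn/${INSTANCE_ID}-site-pkgs.zip"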

Ansible for Build Processes

On small instances, it takes a while to compile all the dependencies, so we need to set up Ansible to wait while that’s happening. It takes around 15 minutes to build the full stack and upload to S3.

- s3:
    bucket: tmp.serverlesscode.com
    object: "sklearn/{{ ectwo.instance_ids[0] }}-site-pkgs.zip"
    dest: /tmp/sklearn-site-packages.zip
    mode: get
  register: result
  until: result.failed is not defined or result.failed == false
  retries: 15
  delay: 90

Using the until-retry pattern in Ansible, the playbook checks S3 for the zipfile every 90 seconds, waiting up to about 22 minutes (15 retries) for the instance to finish the build.

- name: Terminate instances that were previously launched
  ec2:
    state: 'absent'
    region: us-east-1
    # get the instance ID from earlier to terminate now that the build
    # artifacts are in S3
    instance_ids: '{{ ectwo.instance_ids }}'

Once the archive is downloaded, Ansible kills the EC2 instance so you aren’t charged for any more time than you use.

Using Sklearn

To actually import sklearn from your Python code in the Lambda environment, you need to add your handler module (mine is called demo.py) to the zip, and it needs to load the .so files before running import sklearn. I used the ctypes module to load them.

import os
import ctypes

# Load the compiled shared libraries (lapack, blas, and friends) that the
# build script copied into lib/, so they're available when sklearn imports
# numpy and scipy.
for d, dirs, files in os.walk('lib'):
    for f in files:
        if f.endswith('.a'):
            # skip static archives; only the .so files need to be loaded
            continue
        ctypes.cdll.LoadLibrary(os.path.join(d, f))

import sklearn

def handler(event, context):
    # do sklearn stuff here
    return {'yay': 'done'}

I had to walk the entire lib directory created by my build script, because there isn’t a way in Python to manipulate the LD_LIBRARY_PATH in time for the loader to accept the changes. This might result in more libraries than you need being loaded, which is fine.

Loading all the libraries outside the handler saves a significant amount of time for subsequent executions. In my tests with a 128MB Lambda execution environment, the first execution took up to 6.2 seconds to load the code (39MB zipped) and then import all the C libraries. Subsequent executions ran as quickly as 4ms when reusing a warm container.
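
From there, deploying is just a matter of adding your handler to the archive and creating the function. A minimal sketch with the AWS CLI might look like the following; the function name, role ARN, memory size, and timeout are placeholders, not values from the original build:

$ zip -9 venv.zip demo.py
$ aws lambda create-function \
      --function-name sklearn-demo \
      --runtime python2.7 \
      --handler demo.handler \
      --memory-size 128 \
      --timeout 30 \
      --role arn:aws:iam::123456789012:role/lambda-basic-execution \
      --zip-file fileb://venv.zip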

Building Deps Yourself

To build all these yourself, clone ryansb/sklearn-build-lambda and install Ansible 2.0+ and boto 2.

To build everything, run ansible-playbook launch_sklearner.yml and expect to wait about 15 minutes before the built files appear in S3. They'll be named with the instance ID followed by "site-pkgs.zip", for example i-0c9dab6cd6dcb0d9d-site-pkgs.zip. You can then download the zip, add your custom code, and upload the whole thing as a Lambda function.

The Ansible playbook will also download the dependencies to /tmp/sklearn-site-packages.zip for your convenience.
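
For reference, a full invocation might look something like the following; the variable names (ssh_key, subnet_id, profile_name) are the ones used in the playbook snippets above, so check the repo in case they differ:

$ pip install "ansible>=2.0" boto
$ ansible-playbook launch_sklearner.yml \
      -e ssh_key=my-keypair \
      -e subnet_id=subnet-abc12345 \
      -e profile_name=sklearn-builder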

Wrapping Up

Eventually, I’ll probably expand the playbook to take any requirements and build them on an EC2 instance, so you don’t need to worry about any unexpected differences between your build machine and your EC2 instances.

Keep up with future posts via RSS. If you have suggestions, questions, or comments feel free to email me ryan@serverlesscode.com.

