Prod-to-Staging Data Pipeline With Yesterdaytabase

Use Lambda to Get the Freshest Data Daily

Posted by Ryan S. Brown on Mon, Nov 7, 2016
In Mini-Project
Tags: python, rds, backup, cron, workflows

When developing SaaS (or other) products, it’s critical to see your product the same way your users do. Sometimes, that means giving your team the ability to step right into a user’s shoes and see their account. For your support team, it’s handy to be able to try out new features just like clients would – same data, same interface.

Facilitating this kind of transfer is easy if you have the right tools:

  • A way to run code on a schedule
  • Code to restore the latest production snapshot to a new database
  • DNS to point to the new database so you can use it in staging

Lambda and CloudWatch Events can run code on a schedule, check. Python and boto3 are pretty much a universal remote for AWS, check. Route53 for DNS, check. Now let’s put it all together.

For those of you who skip to the last page of books, here’s a button to set all this up. Just fill in the info about your database and security groups. The full code is also on GitHub.

[Launch Stack]

First Steps – Serverless Setup

With v1.1, the Serverless framework makes it pretty quick to set up one-off functions and associated resources. Use serverless create -t aws-python to create a service, then set up the serverless.yml file. Some bits are excluded from this example, so follow the link at the bottom for the full setup.

# serverless.yml
service: yesterdaytabase
provider:
  name: aws
  cfLogs: true
  runtime: python2.7
  stage: prod
  region: us-east-1

functions:
  DbManager:
    handler: handler.handler
    events:
        # run daily at 11:00 UTC (6am US Eastern)
      - schedule: "cron(0 11 * * ? *)"
        enabled: true

The full serverless.yml is available here.

That’s almost all the configuration we’ll need for the whole project: it sets up our Lambda function and a schedule to run it daily. From there, serverless deploy ships everything to AWS.

A Security/Authorization Aside

One last thing before we deploy: IAM configuration. Below are the policies that allow our Lambda function to manage Route53 records, RDS instances (as long as they’re db.t2.small), and CloudFormation stacks. Note the Condition key that checks the instance class before allowing destructive RDS actions.

# serverless.yml
iamRoleStatements:
  - Effect: "Allow"
    Action:
      - "rds:CreateDBInstance"
      - "rds:ModifyDBInstance"
      - "rds:RestoreDBInstanceFromDBSnapshot"
      - "rds:DeleteDBInstance"
    Resource: "*"
    Condition:
      StringEquals: {"rds:DatabaseClass": "db.t2.small"}
  - Effect: "Allow"
    Action:
      - "cloudformation:Describe*"
      - "ec2:Describe*"
      - "rds:Describe*"
      - "cloudformation:CreateStack"
      - "cloudformation:DeleteStack"
      - "cloudformation:GetTemplate"
      - "cloudformation:UpdateStack"
      - "cloudformation:ValidateTemplate"
      - "route53:ChangeResourceRecordSets"
      - "route53:Get*"
      - "route53:List*"
    Resource: "*"

Note that if your staging database needs to be bigger than a t2.small, you’ll need to change the condition on the RDS actions to match, e.g. StringEquals: {"rds:DatabaseClass": "db.m4.large"}.

Hold Up! What About Privacy?

The last thing you need to do is app-specific. When you move data over from production, you need to sanitize any personally identifiable information: developers don’t need to see real customer names, street addresses, or credit card numbers. The way to go is a Lambda (or other scheduled job) that connects to the new database and replaces all the emails, passwords, addresses, and so on with safe, dummy data.

There are two reasons that’s not built in to the yesterdaytabase code:

  1. The yesterdaytabase function can’t wait for the fresh DB copy to finish creating: Lambda caps execution at 5 minutes, and RDS restores often take longer.
  2. The script to actually clean up your app data is specific to your data model, so you’ll need to build that yourself (a sketch follows below).

Reminder: I am not a lawyer. If you think you’re treading anywhere near legally protected PII like health data, financials, and so forth, you should consult one.
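
For illustration, here’s a minimal sketch of what such a sanitizer could look like. It assumes a MySQL database reachable from wherever the job runs and the pymysql driver, and every table and column name in it is a placeholder for your own schema.

# sanitize.py - a sketch, not part of yesterdaytabase
# Assumes MySQL + pymysql; all table/column names are placeholders.
import os

import pymysql

conn = pymysql.connect(
    host=os.environ["STAGING_DB_HOST"],  # e.g. the CNAME created below
    user=os.environ["STAGING_DB_USER"],
    password=os.environ["STAGING_DB_PASS"],
    db=os.environ["STAGING_DB_NAME"],
)

with conn.cursor() as cur:
    # overwrite PII with deterministic dummy values keyed on the row id
    cur.execute("""
        UPDATE users SET
            email = CONCAT('user', id, '@example.com'),
            full_name = CONCAT('Test User ', id),
            street_address = '123 Example St'
    """)
conn.commit()
conn.close()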

The Code

The actual code is relatively simple. I made use of CloudFormation for its async nature: starting a stack takes a second at most, and it wouldn’t be reasonable to have a Lambda wait around for the RDS database to spin up. One goal was to minimize how long yesterday’s database is down during the transition, and having CloudFormation create the new RDS instance, wait for it to be ready, and then move the DNS record was just the ticket.

First, here’s the stack template that creates the RDS database and its DNS record (with some omissions; see here for the full template):

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  DBName:
    Description: Name for the new DB instance
    Type: String
  HostName:
    Description: Host label for the DNS record (the part before DomainRoot)
    Type: String
  DomainRoot:
    Description: Domain to put the DB name record under (not the FQDN)
    Type: String
  SnapshotID:
    Description: Identifier for DB snapshot
    Type: String
  DiskSize:
    Description: How many gigs of storage to make available for the snapshot restore
    Type: String
Resources:
  DBAliasRecord:
    Type: 'AWS::Route53::RecordSet'
    Properties:
      HostedZoneName: !Sub "${DomainRoot}."
      Name: !Sub "${HostName}.${DomainRoot}"
      ResourceRecords: [!GetAtt Database.Endpoint.Address]
      TTL: 300
      Type: CNAME
  Database:
    Type: 'AWS::RDS::DBInstance'
    Properties:
      AllocatedStorage: !Ref DiskSize
      BackupRetentionPeriod: 0
      DBInstanceClass: 'db.t2.small'
      DBInstanceIdentifier: !Ref DBName
      DBSnapshotIdentifier: !Ref SnapshotID
      DBSubnetGroupName: !Ref SubnetGroup  # defined in the full template
      VPCSecurityGroups:
      - !Ref SecurityGroup  # defined in the full template

The actual handler is also fairly pedestrian. It has to pull in information about the database and DNS zone. That can come either from a config.json, so the function can be deployed with its config baked in, or from the invoking event. If you have multiple databases to snapshot and roll out to development environments, you can pass a different config on each invoking event.
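
As a sketch of that second mode, an invocation that carries its config on the event might look like this. The function name follows Serverless’s service-stage-function convention from the serverless.yml above, and everything under config besides db.name (the one key the handler excerpt below reads) is illustrative:

import json

import boto3

client = boto3.client("lambda")
client.invoke(
    FunctionName="yesterdaytabase-prod-DbManager",  # service-stage-function
    InvocationType="Event",  # asynchronous, fire-and-forget
    Payload=json.dumps({
        "config": {
            # db.name is read by the handler; DNS and size settings
            # depend on your config schema
            "db": {"name": "my-production-db"},
        }
    }),
)

On the handler side, picking between the baked-in config.json and the event’s config is straightforward: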

# imports and clients used by this excerpt
import json
import os

import boto3

cwd = os.path.dirname(os.path.realpath(__file__))
rds = boto3.client("rds")

# ... inside the handler ...
if event.get('config') is None:
    cfg = json.load(open(os.path.join(cwd, 'config.json')))
else:
    cfg = event.get('config')

template = open(os.path.join(cwd, "template.yml")).read()

# describe_db_snapshots returns at most 100 records per call; paginate
# if you keep more snapshots than that
snapshots = rds.describe_db_snapshots()["DBSnapshots"]
db_snapshots = [snap for snap in snapshots
                if snap["DBInstanceIdentifier"] == cfg['db']['name']]
# get the most recent snapshot by sorting by creation time
latest_snapshot = sorted(db_snapshots, reverse=True,
                         key=lambda x: x["SnapshotCreateTime"])[0]
identifier = latest_snapshot["DBSnapshotIdentifier"]

Grabbing the most recent snapshot is just a sort by creation time; with the snapshot identifier in hand, we can update the stack, or create it if it doesn’t exist.


import botocore.exceptions

cfn = boto3.client("cloudformation")

# stack_params maps the template parameters (DBName, SnapshotID, and
# friends) to values from cfg and the snapshot lookup above
cfn_params = dict(
    StackName=STACK_NAME,
    TemplateBody=template,
    Parameters=[{"ParameterKey": k, "ParameterValue": v}
                for k, v in stack_params.items()]
)

try:
    stacks = cfn.describe_stacks(StackName=STACK_NAME)
except botocore.exceptions.ClientError as exc:
    if 'does not exist' not in exc.message:
        raise  # unexpected failure, don't swallow it
    # stack isn't there yet, so create it
    cfn.create_stack(**cfn_params)
    return {"action": "create", "error": None, "stack_args": cfn_params}
else:
    if stacks["Stacks"][0]["StackStatus"].endswith("COMPLETE"):
        # stack is in a stable state and can be updated
        cfn.update_stack(**cfn_params)
        return {"action": "update", "error": None,
                "stack_args": cfn_params}
    else:
        msg = "stack in state {}, not updating".format(
            stacks["Stacks"][0]["StackStatus"])
        return {"action": "update", "message": msg,
                "error": True, "stack_args": cfn_params}

There is, of course, a bit more glue code around those key parts that you can find on GitHub, but these two samples pretty much complete the picture.
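
Since the function returns as soon as the stack operation starts, checking whether the restore actually finished is a separate step. A minimal sketch, assuming the same STACK_NAME the handler is configured with:

import boto3

STACK_NAME = "..."  # whatever STACK_NAME the handler uses

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName=STACK_NAME)["Stacks"][0]
# e.g. CREATE_IN_PROGRESS while RDS restores, UPDATE_COMPLETE when done
print(stack["StackStatus"])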

Wrapping Up

Come back for part 2 (coming soon) of this series on distributing single-button Serverless projects. As always, if you have an idea, question, or comment, tweet @ryan_sb or send ryan@serverlesscode.com a note.

