How to Use Heroku to Automate Updating of Your CartoDB Tables

by Raphael Dumas on June 15, 2016 under bcu

4 minute read

This post has been updated 2016-06-22 to include python logging best practices

Bikeways receives data from users, but this data must eventually be processed and aggregated in order to make maps. After figuring out how to do that processing, there were two possibilities for how to automate it (so I didn’t have to periodically run a processing query):

Set a Trigger that gets triggered everytime a new piece of data gets added in order to properly process each new element to the aggregate table.
~~Schedule the bulk update using pgAGent~~ (unavailable in CartoDB)
Schedule the bulk update using an external process.

I pondered #1 a fair bit. But I felt the data-processing sql was complex enough, that atomizing the procedure so that it could run on individual inserts wasn’t worth the development effort. Especially since, as of this writing, the raw input table has just 700 rows, which is trivial to process repeatedly in bulk (thanks CartoDB!). So I combined all the processing into one query that I exposed to the CartoDB API. If I had a linux server, I could use a cron job that would run something like wget https://username.cartodb.com/api/v2/sql?q=SELECT process_data(). But I didn’t. So I asked the friendly folks at Code for Boston and was suggested to set up a script on Heroku that runs occasionally using the Heroku Scheduler.

Because I’m trying to improve my Python skills, I wrote the script to ping the API in Python:

import requests
import logging
import json
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
username = 'raphaeld'
url = 'https://' + username + '.cartodb.com/api/v2/sql'
sql = {'q':'SELECT process_data();'}
logger.info('Pinging {url} with query: {sql}'.format(url=url,sql=sql['q']))
res = requests.post(url, data=sql)
resj = res.json()
errors = resj.get('error', 'No errors')
if errors == 'No errors':  
    logger.info('Query completed in: %s', resj.get('time'))
else:
    logger.warning('Error: %s', errors)

The script is broken up into its component parts for legibility. First the url is constructed, next the sql query, and the two are combined into an HTTP request that pings the CartoDB API and returns a response. Finally the response text is printed to stdout.

Heroku set up

Here’s how to set up a Heroku app so that it will run the above script on a schedule.

Get a Heroku account and provide your credit card details (don’t worry, this should all be free), create a new app.
Install the Heroku Toolbelt
Create process-data-call.py from the python code above in a new folder. Don’t forget to change your username ;)
Login to heroku heroku login and then initialize the git repositroy with git init and add the heroku remote with heroku git:remote -a process-data
Add a requirements.txt file to the folder with the contents requests==2.9.1. This tells the Python package manager to include the requests library, which handles the HTTP requests
Add the file Procfile to the folder with the contents web: python process-data-call.py. This tells Heroku what command to run at the root folder of your “application” when it runs it.
Tell git to track all the files in your folder with git add ., commit them with git commit -m "Initial commit and then push them to heroku with git push heroku master
On the dashboard for your app, add the Heroku scheduler app and then create a new task with task python process-data-call.py. I set the schedule to be every day.
Monitor progress with heroku logs --tail

bikeways4everybody, python, cartodb, heroku

I feedback.
Let me know what you think of this article on twitter @dumasraphael!