How to Use Heroku to Automate Updating of Your CartoDB Tables
This post has been updated 2016-06-22 to include python logging best practices
Bikeways receives data from users, but this data must eventually be processed and aggregated in order to make maps. After figuring out how to do that processing, there were two possibilities for how to automate it (so I didn’t have to periodically run a processing query):
- Set a Trigger that gets triggered everytime a new piece of data gets added in order to properly process each new element to the aggregate table.
Schedule the bulk update using pgAGent(unavailable in CartoDB)- Schedule the bulk update using an external process.
I pondered #1 a fair bit. But I felt the data-processing sql was complex enough, that atomizing the procedure so that it could run on individual inserts wasn’t worth the development effort. Especially since, as of this writing, the raw input table has just 700 rows, which is trivial to process repeatedly in bulk (thanks CartoDB!). So I combined all the processing into one query that I exposed to the CartoDB API. If I had a linux server, I could use a cron job that would run something like wget https://username.cartodb.com/api/v2/sql?q=SELECT process_data()
. But I didn’t. So I asked the friendly folks at Code for Boston and was suggested to set up a script on Heroku that runs occasionally using the Heroku Scheduler.
Because I’m trying to improve my Python skills, I wrote the script to ping the API in Python:
import requests
import logging
import json
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
username = 'raphaeld'
url = 'https://' + username + '.cartodb.com/api/v2/sql'
sql = {'q':'SELECT process_data();'}
logger.info('Pinging {url} with query: {sql}'.format(url=url,sql=sql['q']))
res = requests.post(url, data=sql)
resj = res.json()
errors = resj.get('error', 'No errors')
if errors == 'No errors':
logger.info('Query completed in: %s', resj.get('time'))
else:
logger.warning('Error: %s', errors)
The script is broken up into its component parts for legibility. First the url
is constructed, next the sql
query, and the two are combined into an HTTP request that pings the CartoDB API and returns a response. Finally the response text is printed to stdout.
Heroku set up
Here’s how to set up a Heroku app so that it will run the above script on a schedule.
- Get a Heroku account and provide your credit card details (don’t worry, this should all be free), create a new app.
- Install the Heroku Toolbelt
- Create process-data-call.py from the python code above in a new folder. Don’t forget to change your username ;)
- Login to heroku
heroku login
and then initialize the git repositroy withgit init
and add the heroku remote withheroku git:remote -a process-data
- Add a
requirements.txt
file to the folder with the contentsrequests==2.9.1
. This tells the Python package manager to include therequests
library, which handles the HTTP requests - Add the file
Procfile
to the folder with the contentsweb: python process-data-call.py
. This tells Heroku what command to run at the root folder of your “application” when it runs it. - Tell git to track all the files in your folder with
git add .
, commit them withgit commit -m "Initial commit
and then push them to heroku withgit push heroku master
- On the dashboard for your app, add the
Heroku scheduler
app and then create a new task with taskpython process-data-call.py
. I set the schedule to be every day. - Monitor progress with
heroku logs --tail
Let me know what you think of this article on twitter @dumasraphael!