Connect Python and Pandas To Redshift

There are many situations where you might like to get data from a database and into a pandas data frame. The simplest way to is to initialize a dataframe via the pandas read_sql_query method. The ORM of choice for pandas is SQLAlchemy. By installing a few more packages, you can query Redshift data and read that into a dataframe with just a few lines of of Python code.

The Setup

This post assumes you have a number of Python packages already installed, such as pandas, numpy, sqlalchemy and iPython. I’ve used the Python distribution Anacoda, which can be downloaded here. It comes packed full of all the libraries you might need for data analysis.

The Dialects documentation for SQLAlchemy mentions that Redshift is supported through another Python package; this package also depends on a Postgresql driver. So, the additional packages needed for connecting to Redshift are redshift-sqlalchemy and psycopg2. If you already have Anaconda, you can install psycopg2 quickly using conda. For getting redshift-sqlalchemy installed, here are some docs on using conda/pip in conjunction with each other. TBH I just used downloaded the package and did this in my terminal:

Of course, I did this in my default environment. Everything seems to work, but the Anaconda docs are very useful if you have multiple Python environments on your machine.

On To The Data

Start iPython in your terminal (or Notebook if you prefer). Note that the connection string follows the pattern ‘flavor+DBAPI://username:password@host:port/database’. Make sure to change the connection string below to your own!

If we look at the “test” object we see:

  • mdisc

    Thanks! This was super helpful – got me connected pretty quickly. I skipped over the part about installing the redshift-sqlalchemy library so I had to go back and ‘sudo pip install redshift-sqlalchemy’ after getting some error message.

  • Sarang Manjrekar

    Supercool !