Python for Data Science

Axial HQ

23 July 2013

Aditya Mukerjee

Personal Background

Studied computer science and statistics at Columbia
Worked on the server team at foursquare (Explore recommendation engine)
Previously on the OkCupid data team (conducted research for OkTrends reports)
Most recently, Hacker-in-Residence at Quotidian Ventures
hackNY fellow 2011

Some common data science tools

Python
R
MATLAB
Stata
SAS
Julia
Go

Chances are, you'll end up having to use more than one of these

Python is great for data science!

High-level language
Munging made easy
Full-featured standard library
Highly comprehensive third-party libs for data science

Python can be problematic for data science

The GIL poses, at best, an extra hoop to jump through
Slow (compared to C/C++/Java)
Academics often prefer R (stats), MATLAB (science), or STATA (quant. social sciences)

Some problems can be avoided by avoiding pure python

Why not pure python?

Slow
Non-standard language for some academics
Most third-party libs aim to solve one or both of these problems

(PyPy may also help in the future, but its use in data science is currently experimental)

The big question: python2 or python3?

Scientists (particularly academics) tend to be the slowest to change any technology

6-12 months ago: Definitely python2
6 months from now: probably python3
Right now - tossup (try python3)
Most major libraries have now been ported

Tools from the Python Standard Library

Iterating with Itertools

Provides a number of utilities (not just useful for data science)
Faster than using custom Python code, but still slow

Array

Just kidding - never use this!

Bridging the Gap

SciPy (e.g. NumPy)

A set of free software packages for scientific computation tasks
If you need it, it's in SciPy!
NumPy provides support for arrays and matrices of arbitrary dimensions
SciPy provides the basics of scientific computation
Matplotlib provides MATLAB-esque graphing and plotting
Pandas provides manipulation of data structures
scikit-learn provides machine learning in python
Also available: symbolic mathematics (SymPy), interactive console (IPython), image processing (scikit-image), and more

IPython

Optional for Python work; all but essential for data science
IPython Notebook, in their own words:

"The IPython Notebook is a web-based interactive computational environment where you can 
combine code execution, text, mathematics, plots and rich media into a single document"

Allows easy profiling, debugging, sharing, and more

NumPy

Basic arrays and matrices in arbitrary dimensions
Enough goodies here for a talk on their own
NumPy arrays are implemented in C, and laid out in memory in C-idiomatic form
Very interoperable with native Python functions
Can usually be passed in where a Python list is expected, etc.

Working with arrays

If you're used to MATLAB, do not use matrices (http://wiki.scipy.org/NumPy_for_Matlab_Users)

Why use np.array

NumPy gives you speed

NumPy gives you flexibility: eg. indexing

NumPy gives you flexibility: eg. reshaping

The above is instantaneous!

NumPy Flags

Don't Fear the C

Thinking (just a bit!) about memory will save much time & headache

Python memory layout

NumPy memory layout

NumPy Caveats

Always use binary installation if possible

If compiling from scratch, use GNU tools or stock up on Tylenol
This applies to the entire SciPy toolchain

No free lunch - you're in C-land now!

Kiss goodbye to that!

NumPy costs you memory

The above operation will create three temporary values

Pandas

Mitigates some of the need for RPy (allows a similar interface for manipulating data)

Pandas' data frames are very similar to R's data frames

RPy

Interface with R without leaving the comfort of Python
Many statistical libraries are implemented first in R, the language of choice for most academic statisticians
R also allows a flexible interface for manipulating data as 'data frames', which can be convenient

High-level: Scraping, processing, and analysis

Scraping: beautifulsoup/lxml

beautifulsoup and lxml are both tools for "parsing" XML/HTML
beautifulsoup is not a true parser
lxml wraps both libxml and libxslt and is usually faster
beautifulsoup has better support for encoding detection
lxml provides an interface to beautifulsoup, so you can switch between the two
lxml can use beautifulsoup as a fallback

Processing: NLTK

NLTK is the go-to library for NLP (natural-language processing) work
Very full-featured
Provides sample corpora, tokenizing, parse trees, as well as advanced functionality
Very easy to use
Be careful of runtime complexity

MapReduce

Could spend an entire talk about the MapReduce paradigm
Python is a very convenient language for using with Hadoop
Amazon EMR (Elastic MapReduce) even uses Python for their demo

Analysis: SciKit-learn

Provides fundamental machine learning utilities
Supervised/unsupervised learning, model selection and evaluation, etc.
Can also do some preprocessing, though not the focus of scikit-learn

Graphing: Matplotlib

Not the only (or necessarily best) graphing library for Python
Compatible with the rest of SciPy, great for 'polished WIP'
Highly customizable

Matplotlib example

Futher Resources

Navigating the Postmodern Python World

Python for Data Analysis (O'Reilly)

Thank you

Aditya Mukerjee

http://www.adityamukerjee.net

@chimeracoder