Python for Data Science

Axial HQ

23 July 2013

Aditya Mukerjee

Personal Background

Some common data science tools

Chances are, you'll end up having to use more than one of these

Python is great for data science!

Python can be problematic for data science

Some problems can be avoided by avoiding pure Python

Why not pure Python?

(PyPy may also help in the future, but its use in data science is currently experimental)

The big question: Python 2 or Python 3?

Scientists (particularly academics) tend to be among the slowest to switch technologies

Tools from the Python Standard Library

Iterating with Itertools
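
A minimal sketch of a few itertools helpers that come up often in data work. The sample list `readings` is made up for illustration; the functions themselves (`islice`, `chain`, `combinations`, `groupby`) are part of the standard library.

```python
import itertools

# Hypothetical sample data, used only for illustration.
readings = [3, 1, 4, 1, 5, 9, 2, 6]

# islice takes a lazy slice of any iterable without copying it.
first_three = list(itertools.islice(readings, 3))

# chain walks several iterables as if they were one.
combined = list(itertools.chain([1, 2], [3, 4]))

# combinations yields every unordered pair.
pairs = list(itertools.combinations("abc", 2))


def parity(x):
    """Key function: 0 for even numbers, 1 for odd numbers."""
    return x % 2


# groupby only groups *consecutive* items, so sort by the same key first.
by_parity = {
    key: list(group)
    for key, group in itertools.groupby(sorted(readings, key=parity), key=parity)
}

print(first_three, combined, pairs, by_parity)
```

Everything here is lazy until it is wrapped in `list(...)`, which is what makes these helpers useful on inputs too large to hold in memory.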

Array
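
Assuming this slide refers to the standard library's `array` module, here is a small sketch of how it differs from a list: every element shares one C type, so storage is compact and values of the wrong type are rejected. The values are made up for illustration.

```python
from array import array

# "d" is the type code for C doubles; every element must be a float.
values = array("d", [1.0, 2.5, 4.0])
values.append(8.0)

total = sum(values)           # arrays are ordinary iterables
item_bytes = values.itemsize  # bytes used per element

# Unlike a list, an array refuses values of the wrong type.
try:
    values.append("not a number")
    rejected = False
except TypeError:
    rejected = True

print(total, item_bytes, rejected)
```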

Bridging the Gap

SciPy (e.g. NumPy)

IPython

"The IPython Notebook is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document"

NumPy

Working with arrays

Why use np.array

NumPy gives you speed
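
One way to see the speed difference for yourself is sketched below: the same sum of squares, computed once with a pure-Python generator and once with a vectorized NumPy call. The problem size and function names are arbitrary choices for illustration, and the timings depend entirely on your machine, so none are quoted here.

```python
import timeit

import numpy as np

n = 100000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)


def python_sum_squares():
    """Sum of squares using a pure-Python generator expression."""
    return sum(x * x for x in py_list)


def numpy_sum_squares():
    """Same computation as one vectorized dot product."""
    return int(np.dot(np_arr, np_arr))


py_time = timeit.timeit(python_sum_squares, number=5)
np_time = timeit.timeit(numpy_sum_squares, number=5)

print("pure Python: {:.4f}s  NumPy: {:.4f}s".format(py_time, np_time))
```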

NumPy gives you flexibility: e.g. indexing

NumPy gives you flexibility: e.g. reshaping
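
A short sketch of the indexing styles NumPy supports beyond plain list indexing. The array `grid` is a made-up example.

```python
import numpy as np

# A 3x4 grid holding 0..11, row by row.
grid = np.arange(12).reshape(3, 4)

second_column = grid[:, 1]   # slice along one axis
corner = grid[:2, -2:]       # slices on both axes at once
evens = grid[grid % 2 == 0]  # boolean mask selects matching elements
picked_rows = grid[[0, 2]]   # a list of indices picks whole rows

print(second_column, corner, evens, picked_rows, sep="\n")
```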

The above is instantaneous!
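
A likely reason reshaping is so cheap: for a contiguous array, `reshape` returns a view onto the same buffer rather than a copy. The sketch below checks this on a small made-up array.

```python
import numpy as np

flat = np.arange(6)
matrix = flat.reshape(2, 3)  # new shape, same underlying buffer

# No data was copied: both names refer to the same memory.
shares = np.shares_memory(flat, matrix)

# Writing through the view is therefore visible in the original.
matrix[0, 0] = 99

print(shares, flat)
```

Only the shape and stride metadata change, so the cost does not grow with the size of the array.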

NumPy Flags
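
Every array exposes a `flags` attribute describing its memory. The sketch below inspects a few flags on made-up arrays; which flags the talk focused on is not recorded here, so this selection is an assumption.

```python
import numpy as np

a = np.zeros((2, 3))

owns = a.flags["OWNDATA"]             # a allocated its own buffer
c_order = a.flags["C_CONTIGUOUS"]     # rows are laid out back to back
t_f_order = a.T.flags["F_CONTIGUOUS"] # the transpose is column-major

# A column slice is a view: it neither owns data nor is contiguous.
view = a[:, :2]
view_owns = view.flags["OWNDATA"]
view_c_order = view.flags["C_CONTIGUOUS"]

# Clearing the writeable flag makes an array read-only.
frozen = np.zeros(3)
frozen.flags.writeable = False
try:
    frozen[0] = 1.0
    write_blocked = False
except ValueError:
    write_blocked = True

print(a.flags)
```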

Don't Fear the C

Python memory layout

NumPy memory layout
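
A rough illustration of the layout: a NumPy array is one contiguous buffer plus a shape and a tuple of strides (bytes to step along each axis). The array below is made up for the example.

```python
import numpy as np

m = np.arange(12, dtype=np.int32).reshape(3, 4)

item_bytes = m.itemsize   # 4 bytes per int32
total_bytes = m.nbytes    # 12 elements * 4 bytes
row_major = m.strides     # (bytes to next row, bytes to next column)

# Transposing swaps the strides; the buffer itself is untouched.
transposed = m.T.strides

print(item_bytes, total_bytes, row_major, transposed)
```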

NumPy Caveats

Always use a binary installation if possible

No free lunch - you're in C-land now!

Kiss goodbye to that!

NumPy costs you memory

Pandas

Pandas' data frames are very similar to R's data frames
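
For readers coming from R, a minimal sketch of the same ideas in pandas: named columns, boolean row filtering, and grouped aggregation. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data; column names are made up for this example.
df = pd.DataFrame(
    {
        "city": ["NYC", "NYC", "SF", "SF"],
        "sales": [10, 20, 5, 15],
    }
)

# Filter rows with a boolean mask, much like df[df$sales > 8, ] in R.
big = df[df["sales"] > 8]

# Group and aggregate, roughly what aggregate() or tapply() do in R.
mean_by_city = df.groupby("city")["sales"].mean()

print(big)
print(mean_by_city)
```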

RPy

High-level: Scraping, processing, and analysis

Scraping: BeautifulSoup/lxml
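
A minimal sketch of parsing with BeautifulSoup, assuming the `bs4` package (Beautiful Soup 4) is installed. The HTML is an inline, made-up snippet so the example runs without network access; a real scraper would fetch the page first.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Listings</h1>
  <a href="/a">First</a>
  <a href="/b">Second</a>
</body></html>
"""

# "html.parser" is the parser bundled with Python; lxml can be
# swapped in by passing "lxml" if that package is installed.
soup = BeautifulSoup(html, "html.parser")

title = soup.h1.get_text()
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(title, links)
```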

Processing: NLTK

MapReduce
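
The slide does not say which MapReduce framework was shown, so the sketch below only illustrates the pattern itself in plain Python: a map step that counts words per document and a reduce step that merges the counts. The documents and function names are made up.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: one string per document.
documents = ["the cat sat", "the dog sat", "the end"]


def map_phase(doc):
    """Map step: count the words in a single document."""
    return Counter(doc.split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


word_counts = reduce(reduce_phase, map(map_phase, documents), Counter())

print(word_counts.most_common(2))
```

Because the map step treats each document independently, a real framework can run it across many machines and only combine results in the reduce step.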

Analysis: scikit-learn
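
A small sketch of the fit/predict workflow that scikit-learn estimators share, assuming the `scikit-learn` package is installed. The data is synthetic (points on the line y = 2x + 1) and chosen only so the result is easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y = 2.0 * X.ravel() + 1.0

# Every estimator follows the same pattern: construct, fit, predict.
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[4.0]]))

print(model.coef_, model.intercept_, prediction)
```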

Graphing: Matplotlib

Matplotlib example
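
The original example from this slide is not preserved here. As a stand-in, this is a minimal sketch of a labeled line plot. It uses the non-interactive Agg backend and writes the PNG to an in-memory buffer so it runs without a display; passing a file path to `savefig` works the same way.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render without a display

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A sine wave")
ax.legend()

# Save to an in-memory buffer; a file path works here as well.
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
plt.close(fig)
```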

Further Resources

Thank you

Aditya Mukerjee