STAT C100 Lecture Notes - Lecture 2: Apache Spark, Ipython, Data Science
Data100 Lecture02
Goals For Today
• Introduce Pandas, with emphasis on:
o Key Data Structures (data frames, series, indices).
o How to index into these structures.
o How to read files to create these structures.
o Other basic operations on these structures.
• Go over some important and handy IPython features and concepts:
o Shell commands (e.g. !dir or !ls).
o Portable vs. operating system specific code.
o Shift-tab.
• Solve some very basic data science problems using Jupyter/pandas.
Pandas Data Structures: Data Frames, Series, and Indices
There are three fundamental data structures in pandas:
• Data Frame: 2D tabular data.
• Series: 1D data. I usually think of it as columnar data.
• Index: A sequence of row labels.
• We can think of a Data Frame as a collection of Series that all share the same Index.
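A minimal sketch of how the three structures relate, assuming pandas is installed. The column names, row labels, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not from the lecture.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Votes": [10, 20, 30]},
    index=["x", "y", "z"],
)

votes = df["Votes"]   # a Series: one column of 1D data
labels = df.index     # the Index: the sequence of row labels

# Every column is a Series that shares the Data Frame's Index.
shares_index = votes.index.equals(df.index)
print(type(df).__name__, type(votes).__name__, type(labels).__name__)
print(shares_index)
```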
Indices Are Not Necessarily Row Numbers
Indices (a.k.a. row labels) can also:
• Be non-numeric.
• Have a name, e.g. “State”.
The row labels that constitute an index do not have to be unique.
• Left: The index values are all unique and numeric, acting as a row number.
• Right: The index values are named and non-unique.
Column names in Pandas are almost always unique! Pandas does permit duplicate column names, but they make column selection ambiguous.
• Example: Shouldn’t have two columns named “Candidate”.
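The index properties above can be checked directly. This is a small sketch with made-up data; the "State" labels and vote counts are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: a named, non-numeric index with a repeated label.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Votes": [10, 20, 30]},
    index=pd.Index(["CA", "CA", "NY"], name="State"),
)

index_name = df.index.name              # the index has a name
index_is_unique = df.index.is_unique    # row labels repeat here
columns_unique = df.columns.is_unique   # column names do not repeat here

# Looking up a repeated label returns every matching row.
ca_rows = df.loc["CA"]
print(index_name, index_is_unique, columns_unique, len(ca_rows))
```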
Indexing with The [] Operator
Given a Data Frame, it is common to extract a Series or a collection of Series. This process is also known as “Column Selection” or sometimes “indexing by column”.
• Column name argument to [] yields Series.
• List argument (even of one name) to [] yields a Data Frame.
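The two column-selection rules can be sketched as follows, again with hypothetical column names and values.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"Candidate": ["A", "B", "C"], "Votes": [10, 20, 30]})

as_series = df["Candidate"]            # column name -> Series
as_frame = df[["Candidate"]]           # list of one name -> Data Frame
two_cols = df[["Candidate", "Votes"]]  # list of names -> Data Frame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(list(two_cols.columns))
```

The list form is handy when later code expects a Data Frame even though only one column is needed.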
We can also index by row numbers using the [] operator.
• Numeric slice argument to [] yields rows.
• Example: [0:3] yields rows 0 to 2.
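A short sketch of row slicing with []. The data is made up, and the string row labels are an assumption chosen to show that the slice counts row positions rather than matching labels.

```python
import pandas as pd

# Hypothetical data with non-numeric row labels.
df = pd.DataFrame(
    {"Votes": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

# A numeric slice selects rows by position; the end point is excluded,
# so [0:3] yields rows 0, 1, and 2.
first_three = df[0:3]
print(first_three)
```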
Summary