STAT C100 Lecture Notes - Lecture 2: Apache Spark, Ipython, Data Science


coralhamster848

13 Oct 2018


Data100 Lecture02

Goals For Today

• Introduce Pandas, with emphasis on:

o Key Data Structures (data frames, series, indices).

o How to index into these structures.

o How to read files to create these structures.

o Other basic operations on these structures.

• Go over some important and handy IPython features and concepts:

o Shell commands (e.g. !dir or !ls).

o Portable vs. operating system specific code.

o Shift-tab.

• Solve some very basic data science problems using Jupyter/pandas.
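As a rough illustration of the "portable vs. operating system specific" point: `!ls` only works on macOS/Linux and `!dir` only on Windows, whereas the Python standard library behaves the same everywhere. The sketch below is one possible portable alternative, not something taken from the lecture; the names `data` and `elections.csv` are hypothetical.

```python
import os
from pathlib import Path

# Portable replacement for `!ls` / `!dir`: list the current directory.
entries = sorted(os.listdir("."))
print(entries)

# Build paths with pathlib instead of hard-coding "/" or "\\".
# "data" and "elections.csv" are hypothetical names used for illustration.
csv_path = Path("data") / "elections.csv"
print(csv_path)
```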

Pandas Data Structures: Data Frames, Series, and Indices

There are three fundamental data structures in pandas:

• Data Frame: 2D tabular data.

• Series: 1D data. I usually think of it as columnar data.

• Index: A sequence of row labels.

• We can think of a Data Frame as a collection of Series that all share the same Index.
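A minimal sketch of how the three structures relate, using a tiny made-up table (the candidate names, vote counts, and row labels are invented for illustration and are not from the lecture):

```python
import pandas as pd

# Made-up data: two columns, three rows, explicit row labels.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Votes": [120, 95, 40]},
    index=["x", "y", "z"],
)

# Each column of the Data Frame is a Series.
candidates = df["Candidate"]
votes = df["Votes"]

print(type(df).__name__)          # DataFrame
print(type(candidates).__name__)  # Series

# Every column Series shares the Data Frame's Index.
print(df.index.equals(candidates.index))
print(df.index.equals(votes.index))
```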

Indices Are Not Necessarily Row Numbers

Indices (a.k.a. row labels) can also:

• Be non-numeric.

• Have a name, e.g. “State”.

The row labels that constitute an index do not have to be unique.

• Left: The index values are all unique and numeric, acting as a row number.

• Right: The index values are named and non-unique.

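Column names in pandas are almost always unique! pandas does not forbid duplicate column names, but duplicates make column selection ambiguous, so avoid them.

• Example: You should not have two columns named “Candidate”.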
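The properties of an index described above can be checked directly. This is a small sketch on made-up data (the state codes and vote counts are invented), assuming we build the index by hand with `pd.Index`:

```python
import pandas as pd

# Made-up data with a named, non-numeric, non-unique index.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C"], "Votes": [120, 95, 40]},
    index=pd.Index(["CA", "CA", "NY"], name="State"),
)

print(df.index.name)         # the index has a name
print(df.index.is_unique)    # row labels repeat here
print(df.columns.is_unique)  # column names do not repeat here

# Looking up a repeated label returns every matching row.
ca_rows = df.loc["CA"]
print(ca_rows)
```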

Indexing with The [] Operator

Given a dataframe, it is common to extract a Series or a collection of Series. This process is also known as “Column Selection” or sometimes “indexing by column”.

• Column name argument to [] yields Series.

• List argument (even of one name) to [] yields a Data Frame.
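A short sketch of the two column-selection cases, again on made-up data; the only difference between the first two calls is whether the argument to [] is a name or a list of names:

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"Candidate": ["A", "B", "C"], "Votes": [120, 95, 40]})

as_series = df["Candidate"]          # column name -> Series
as_frame = df[["Candidate"]]         # list of one name -> Data Frame
two_cols = df[["Candidate", "Votes"]]  # list of names -> Data Frame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(two_cols.shape)
```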

We can also index by row numbers using the [] operator.

• Numeric slice argument to [] yields rows.

• Example: [0:3] yields rows 0 through 2; the end of the slice is excluded.
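Row slicing with [] can be sketched as follows, on a made-up five-row table with the default row numbering:

```python
import pandas as pd

# Made-up data with the default 0..4 row numbers.
df = pd.DataFrame(
    {"Candidate": ["A", "B", "C", "D", "E"], "Votes": [120, 95, 40, 33, 12]}
)

# A numeric slice selects rows; like Python lists, the end is excluded.
first_three = df[0:3]
print(first_three)
```

Note that the result is still a Data Frame, even though a single column name passed to [] would give a Series.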

Summary
