STAT C100 Lecture 5: Data 100 lecture05

13 Oct 2018
Data 100
Lecture 05: Data Cleaning & Exploratory Data Analysis
- Structure -- the “shape” of a data file
- Granularity -- how fine/coarse is each datum
- Scope -- how (in)complete is the data
- Temporality -- how is the data situated in time
- Faithfulness -- how well does the data capture “reality”
Scope:
- Does my data cover my area of interest?
o Example: I am interested in studying crime in California but I only have Berkeley crime data.
- Is my data too expansive?
o Example: I am interested in student grades for DS100 but have student grades for all statistics classes.
o Solution: Filtering. What are the implications for the sample?
o If the data is a sample I may have poor coverage after filtering
- Does my data cover the right time frame?
o More on this in temporality
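
As a minimal sketch of the filtering step above, the snippet below keeps only the rows for the course of interest and reports what fraction of the data survives. The records, field layout, and the helper name filter_to_scope are all invented for illustration; they are not from the lecture.

```python
# Hypothetical records: (course, student_id, grade). Invented for illustration.
records = [
    ("DS100", "s1", 91),
    ("STAT134", "s2", 78),
    ("DS100", "s3", 85),
    ("STAT20", "s4", 88),
    ("STAT134", "s5", 69),
]


def filter_to_scope(rows, course):
    """Keep only the rows that belong to the course of interest."""
    return [row for row in rows if row[0] == course]


ds100 = filter_to_scope(records, "DS100")

# Fraction of the original rows that remain after filtering.
# If `records` was itself a sample, a small fraction here is a warning
# that coverage of the target group may be poor.
coverage = len(ds100) / len(records)
print(len(ds100), coverage)
```

Checking how many rows remain after filtering is a cheap way to notice when a sample that looked large covers the group of interest only thinly.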
Temporality
- Data changes: when was the data collected?
- What is the meaning of the time and date fields?
o When the “event” happened?
o When the data was collected or was entered into the system?
o Date the data was copied into a database (look for many matching timestamps)
- Time depends on where! (Time zones & daylight savings)
o Learn to use the Python datetime library
o Multiple string representations (depends on region): 07/08/09?
- Are there strange null values?
o January 1st 1970, January 1st 1900
- Is there periodicity (e.g., diurnal patterns)?
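
To make the ambiguity of 07/08/09 concrete, here is a small sketch using the standard datetime module. It parses the same string under three regional conventions and flags the two suspicious "null" dates listed above. The dictionary keys and the helper looks_like_null are hypothetical names chosen for this example.

```python
from datetime import date, datetime

raw = "07/08/09"

# Three plausible readings of the same string. Which one is correct
# depends on the regional convention of whoever produced the data.
formats = {
    "mdy": "%m/%d/%y",  # month/day/year (common in the US)
    "dmy": "%d/%m/%y",  # day/month/year (common in Europe)
    "ymd": "%y/%m/%d",  # year/month/day
}
readings = {
    label: datetime.strptime(raw, fmt).date()
    for label, fmt in formats.items()
}
for label, parsed in readings.items():
    print(label, parsed.isoformat())

# Dates that often stand in for a missing value rather than a real event.
SENTINEL_DATES = {date(1970, 1, 1), date(1900, 1, 1)}


def looks_like_null(d):
    """Return True if the date matches a common 'null' placeholder."""
    return d in SENTINEL_DATES
```

A date matching a sentinel is not guaranteed to be a null, so a flag like this is a prompt to inspect the records rather than a rule for dropping them.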
Unix Time / POSIX Time
- Time measured in seconds since January 1st 1970
o Minus leap seconds …
- Unix time follows Coordinated Universal Time (UTC)
o International time standard
o Measured at 0 degrees longitude (the prime meridian)
o Similar to Greenwich Mean Time (GMT)
o No daylight savings
o Time codes
- Time Zones:
o San Francisco (UTC-8) without daylight savings
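
The relationship between Unix time, UTC, and a local offset can be sketched with the standard datetime module. This example uses a fixed UTC-8 offset to match the San Francisco entry above, so it deliberately ignores daylight saving time; the names SF_STANDARD and from_unix are assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset: San Francisco without daylight saving time.
# A fixed offset never shifts for DST; a time zone database
# (for example the zoneinfo module in Python 3.9+) would be needed for that.
SF_STANDARD = timezone(timedelta(hours=-8), name="UTC-8")


def from_unix(seconds, tz=timezone.utc):
    """Convert seconds since 1970-01-01 00:00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(seconds, tz=tz)


# Unix time 0 is the epoch itself, seen from two time zones.
epoch_utc = from_unix(0)
epoch_sf = from_unix(0, SF_STANDARD)
print(epoch_utc.isoformat())  # 1970-01-01T00:00:00+00:00
print(epoch_sf.isoformat())   # 1969-12-31T16:00:00-08:00

# Round trip: aware datetime -> Unix seconds -> aware datetime.
noon_utc = datetime(2018, 10, 13, 12, 0, tzinfo=timezone.utc)
round_tripped = from_unix(noon_utc.timestamp())
```

Both printed values describe the same instant; only the wall-clock representation differs, which is why a time field is hard to interpret without knowing its time zone.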
Faithfulness: Do I trust this data?
- Does my data contain unrealistic or “incorrect” values?
o Examples?
o Dates in the future for events in the past
o Locations that don’t exist
o Negative counts
o Misspellings of names
o Large outliers
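
Two of the checks above, future dates for past events and negative counts, are simple enough to sketch directly. The record layout (event_date, count), the sample values, and the helper find_problems are hypothetical; real data would need checks tailored to its own fields.

```python
from datetime import date


def find_problems(record, today):
    """Return a list of reasons a record looks untrustworthy (empty if none)."""
    problems = []
    if record["event_date"] > today:
        problems.append("date in the future")
    if record["count"] < 0:
        problems.append("negative count")
    return problems


# Hypothetical records, invented for illustration.
records = [
    {"event_date": date(2018, 9, 1), "count": 3},
    {"event_date": date(2031, 1, 1), "count": 2},
    {"event_date": date(2018, 9, 2), "count": -1},
]
today = date(2018, 10, 13)

# Map each suspicious record's index to the reasons it was flagged.
flagged = {}
for i, record in enumerate(records):
    problems = find_problems(record, today)
    if problems:
        flagged[i] = problems
print(flagged)
```

Returning reasons rather than silently dropping rows keeps the decision about what to do with suspicious values in the analyst's hands.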