Dealing With Data

Coding: systematically reorganizing raw numerical data into a format that is easy to analyze

using computers

Codebook: a document describing the coding procedure and the location of data for variables

in a format that computers can use

Pre-coding: placing the code categories on the questionnaire

Entering Data

In a grid, each row represents a respondent, subject, or case. A column or set of columns represents a specific variable. This layout makes it possible to go from a column and row location back to the original source of data
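
As a rough illustration of the grid idea, the sketch below stores a few made-up cases as rows and uses a small codebook to say which column holds which variable. The variable names, codes, and the `locate` helper are hypothetical and only show how a row and column location leads back to one case and one variable.

```python
# Hypothetical codebook: which column holds which variable.
codebook = {0: "case_id", 1: "age", 2: "sex"}

# Hypothetical data grid: one row per respondent, one column per variable.
grid = [
    [101, 34, 1],
    [102, 27, 2],
    [103, 45, 2],
]

def locate(row, col):
    """Return (case id, variable name, coded value) for one grid cell."""
    return grid[row][0], codebook[col], grid[row][col]

cell = locate(1, 2)
print(cell)
```

With the case id in hand, a researcher could go back to that respondent's original questionnaire to verify the coded value.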

Four ways to get raw quantitative data into a computer

o Code Sheet: gather the information, then transfer it from the original source onto a

grid format, type it in line by line

o Direct Entry Method: as information is collected, enter the information instantly.

o Optical Scan: Gather the information, then enter it onto optical scan sheets, use

optical scanner to enter information

o Bar Code: convert the information into different widths of bars associated with

numeric values and use a bar-code reader

Cleaning Data

Code Checking: involves checking the categories of all variables for impossible codes

Contingency Cleaning: involves cross-classifying two variables and looking for logically

impossible combinations
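
The two cleaning checks above can be sketched in a few lines of Python. The records, the variable names, and the lists of valid codes are invented for illustration and do not come from a real codebook.

```python
# Hypothetical coded records: one dict per case, one key per variable.
records = [
    {"id": 1, "sex": 1, "pregnant": 0},
    {"id": 2, "sex": 2, "pregnant": 1},
    {"id": 3, "sex": 4, "pregnant": 0},  # 4 is not a valid code for sex
    {"id": 4, "sex": 1, "pregnant": 1},  # logically impossible combination
]

# Assumed codebook: legal codes per variable (sex: 1 = male, 2 = female).
valid_codes = {"sex": {1, 2}, "pregnant": {0, 1}}

def code_check(rows, codebook):
    """Code checking: list (case id, variable, value) for every impossible code."""
    return [
        (row["id"], var, row[var])
        for row in rows
        for var, legal in codebook.items()
        if row[var] not in legal
    ]

def contingency_clean(rows):
    """Contingency cleaning: ids of cases coded as both male and pregnant."""
    return [r["id"] for r in rows if r["sex"] == 1 and r["pregnant"] == 1]

bad_codes = code_check(records, valid_codes)
bad_combos = contingency_clean(records)
print(bad_codes)
print(bad_combos)
```

Flagged cases would then be checked against the original source rather than corrected by guesswork.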

Results with One Variable (univariate)

Statistics: a set of collected numbers, and also the branch of applied mathematics used to manipulate and summarize the features of those numbers

Descriptive Statistics: describe numerical data. Can be categorized by the number of

variables involved:

o univariate

o bivariate

o multivariate

Frequency Distribution: the easiest way to describe the numerical data of one variable. It can be presented graphically as a:

o Histogram

o Bar Chart

o Pie Chart
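
A frequency distribution can be built by simply counting how often each code occurs. The sketch below assumes a made-up survey item coded 1, 2, or 3 and prints a count, a percentage, and a crude text bar for each category.

```python
from collections import Counter

# Hypothetical coded answers to one question (1 = agree, 2 = neutral, 3 = disagree).
answers = [1, 3, 2, 1, 1, 2, 3, 1, 2, 1]

counts = Counter(answers)
total = len(answers)

# Frequency distribution: each category with its count and percentage.
freq_table = {code: (n, 100 * n / total) for code, n in sorted(counts.items())}

for code, (n, pct) in freq_table.items():
    # A row of '#' characters stands in for a simple bar chart.
    print(f"{code}  {n:2d}  {pct:5.1f}%  {'#' * n}")
```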

Frequency Polygon: a line graph often used for interval or ratio level data, which a researcher often first groups into mutually exclusive categories

Measures of Central Tendency

Mode: the easiest to use; it can be used with nominal, ordinal, interval, or ratio data. It is the most common (most frequently occurring) value

Median: the middle point, also called the 50th percentile, or the point at which half the cases

are above and half the cases are below it

Mean: the average, the most widely used measure of central tendency, can be used only with

interval/ratio level data

If the frequency distribution forms a normal curve, the three measures of central tendency

equal each other. If it is skewed they will be different
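
The three measures can be compared on a small example using Python's standard `statistics` module. The scores are hypothetical and were chosen so that one high score skews the distribution and pulls the measures apart.

```python
import statistics

# Hypothetical scores, skewed by one high value.
scores = [2, 3, 3, 4, 5, 6, 12]

mode = statistics.mode(scores)      # most frequently occurring value
median = statistics.median(scores)  # middle value of the sorted scores
mean = statistics.mean(scores)      # sum of scores divided by number of cases

print(mode, median, mean)
```

In this skewed example the mode is the smallest of the three and the mean is the largest, because the single high score pulls the mean upward.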

Measures of Variation

Spread: another characteristic of a distribution which is the variability/dispersion around the

center

Zero Variation: if there is zero variation, all the cases have the same value, and the mean and median are exactly the same

Range: the simplest measure of variation; the smallest score subtracted from the largest score

Percentiles: tell the score at a specific place within the distribution
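
A minimal sketch of the range and of percentiles follows. Several conventions exist for computing percentiles; the nearest-rank rule used here is only one of them, and the scores and function names are assumptions for illustration.

```python
import math

# Hypothetical raw scores, unsorted.
scores = [12, 3, 5, 2, 6, 4, 3]

def score_range(values):
    """Range: largest score minus smallest score."""
    return max(values) - min(values)

def percentile(values, p):
    """Nearest-rank percentile: the score at the p-th percent position."""
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted list
    return ordered[rank - 1]

print(score_range(scores))
print(percentile(scores, 50))  # the median is the 50th percentile
```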

Standard Deviation: the most difficult to compute measure of dispersion, but it is the most

comprehensive and widely used.

o It is based on the mean and gives an average distance between all scores and the

mean

o Used for comparison purposes

Steps in computing the standard deviation

o Compute the mean

o Subtract the mean from each score

o Square the resulting difference for each score

o Total up the squared differences to get the sum

o Divide the sum of squares by the number of cases to get the variance

o Take the square root of the variance to get the standard deviation

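o Formula: σ = √( Σ(x − x̄)² / N ), where x = each score, x̄ (x bar) = the mean, N = the number of cases, and σ = the standard deviation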
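
The steps listed above can be followed one at a time in code. The scores are hypothetical. Note that dividing by the number of cases, as the steps describe, gives what is usually called the population standard deviation; some texts and software divide by the number of cases minus one instead.

```python
import math

# Hypothetical scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(scores) / len(scores)          # compute the mean
deviations = [x - mean for x in scores]   # subtract the mean from each score
squared = [d ** 2 for d in deviations]    # square each difference
sum_of_squares = sum(squared)             # total the squared differences
variance = sum_of_squares / len(scores)   # divide by the number of cases
std_dev = math.sqrt(variance)             # square root of the variance

print(mean, sum_of_squares, variance, std_dev)
```

For these scores the mean is 5, the sum of squares is 32, the variance is 4, and the standard deviation is 2.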

The standard deviation is used to create z-scores, which are standardized scores that locate a score on a frequency distribution in terms of the number of standard deviations it lies above or below the mean

o Formula: z = (x − x̄) / σ

o Where x = score, x̄ (x bar) = mean, and σ = the standard deviation (SD)
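
As a sketch of how z-scores are obtained, the code below converts a set of hypothetical scores into z-scores, using a standard deviation computed by dividing by the number of cases. The function name is an assumption for illustration.

```python
import math

def z_scores(values):
    """Return the z-score of each value: (score - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / sd for x in values]

# Hypothetical scores with mean 5 and standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
zs = z_scores(scores)
print(zs)
```

A score of 9 is two standard deviations above the mean (z = 2.0), and a score of 2 is one and a half standard deviations below it (z = -1.5).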

Results with Two Variables

Bivariate Statistics: much more valuable than univariate statistics. They allow a researcher to consider two variables together and describe the relationship between them

o Bivariate statistical analysis shows a relationship between variables

Covariation: things go together or are associated

Independence: the opposite of Covariation, it means there is no association or no relationship

between variables

o Null Hypothesis: there is independence

Three techniques exist to help researchers decide whether a relationship exists between two variables

o Scattergram/graph/plot

o Cross-tabulation/percentage table

o Measures of association/statistical measures that express the amount of Covariation

by a single number
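
One common measure of association for interval/ratio data is Pearson's correlation coefficient, r. The notes do not name a specific measure, so its use here is an assumption; the data and the function name are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariation of x and y scaled to lie between -1 and +1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations measures how x and y vary together.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]             # hypothetical independent variable
score_up = [2, 4, 6, 8, 10]         # rises with hours
score_down = [10, 8, 6, 4, 2]       # falls as hours rise

print(pearson_r(hours, score_up))
print(pearson_r(hours, score_down))
```

A value near +1 or -1 indicates strong positive or negative covariation, and a value near 0 indicates no linear association.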

The Scattergram

Scattergram: a graph on which a researcher plots each case or observation, where each axis

represents the value of one variable.

o Used for variables measured at the interval/ratio level

o Usually the independent variable goes on the X-axis and the dependent variable goes

on the Y-axis

What can be learned from a Scattergram

Form: Relationships can take three forms

o Independent: no relationship exists, looks like random scatter with no pattern

o Linear: means that a straight line can be visualized

o Curvilinear: a curve through the center of the cases would form a U shape, either right side up or upside down

Direction: Linear relationships can have positive or negative direction

o Positive: a diagonal line from the bottom left to top right

o Negative: a diagonal line from the top left to bottom right

Precision: the amount of spread in the points on the graph. A high level of precision occurs when the points hug the line that summarizes the relationship. A low level occurs when the points are widely spread around the line

Bivariate Table

Bivariate Contingency Table: presents the same information as a Scattergram in a more condensed form

o Based on cross-tabulation: the cases are organized in the table on the basis of two

variables at the same time

Contingency Table: formed by cross-tabulating two or more variables

o Shows how the cases are contingent upon the categories of the variables
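
Cross-tabulation amounts to counting cases for each combination of categories. The sketch below builds a small contingency table from made-up cases; the two variables and their categories are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (age group, attitude) for each respondent.
cases = [
    ("under 30", "agree"), ("under 30", "agree"), ("under 30", "disagree"),
    ("30 and over", "agree"), ("30 and over", "disagree"),
    ("30 and over", "disagree"), ("30 and over", "disagree"),
]

cell_counts = Counter(cases)  # one count per combination of categories
row_labels = sorted({age for age, _ in cases})
col_labels = sorted({attitude for _, attitude in cases})

# Contingency table: rows are age groups, columns are attitudes.
table = {
    r: {c: cell_counts[(r, c)] for c in col_labels}
    for r in row_labels
}

for r in row_labels:
    row_total = sum(table[r].values())
    # Row percentages make the two age groups comparable.
    cells = ", ".join(
        f"{c}: {table[r][c]} ({100 * table[r][c] / row_total:.0f}%)"
        for c in col_labels
    )
    print(f"{r}: {cells}")
```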
