# top answer: Please make sure that it is your own work and not copy and paste. Please watch out for Spelling and


Book Reference: Fox, J. (2017). Using the R Commander: A point-and-click interface for R. CRC Press. https://online.vitalsource.com/#/books/9781498741934

Discuss how you would use the various types of summarizing and graphing to present your data. Make sure you discuss the type of data you would have and the type of display you would select.

Summarizing and Graphing Data

This chapter explains how to use the R Commander to compute simple numerical summaries of data, to construct and analyze contingency tables, and to draw common statistical graphs. Most of the statistical content of the chapter is covered in a typical basic statistics course, although a few topics, such as quantile-comparison plots (in Section 5.3.1) and smoothing scatterplots (in Section 5.4.1), are somewhat more advanced.

Although most of the graphs produced by the R Commander use color, most of the figures in this chapter are rendered in monochrome.1

The R Commander Statistics > Summaries menu (see Figure A.4 on page 202) contains several items for summarizing data. I’ll use the Canadian occupational prestige data (introduced in Section 4.2.3) to illustrate. This data set is most conveniently available in the Prestige data frame in the car package, which is one of the packages loaded when the R Commander starts up. I read the data via Data > Data in packages > Read data set from an attached package (as described in Section 4.2.4). Because the default alphabetic order of the levels of the type factor in the data set—“bc” (blue-collar), “prof” (professional and managerial), “wc” (white-collar)—is not the natural order, I reorder the levels of the factor with Data > Manage variables in active data set > Reorder factor levels (see Section 3.4).

Selecting Statistics > Summaries > Active data set produces the brief summary in Figure 5.1. There’s a “five-number summary” for each numeric variable—reporting the minimum, first quartile, median, third quartile, and maximum of the variable—plus the mean, and the frequency distribution of the factor type, including a count of NAs.
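For readers who want to see roughly what such a summary involves outside the menus, here is a minimal base-R sketch. It uses the built-in mtcars data frame as a stand-in (an assumption made so the example is self-contained; the chapter itself uses the Prestige data), and it is not necessarily the exact command the R Commander generates.

```r
# Stand-in data: built-in mtcars (the chapter uses Prestige from the car package)
dat <- mtcars[, c("mpg", "hp", "wt")]
dat$gear <- factor(mtcars$gear)   # a factor, so a frequency distribution is shown

# summary() reports min, quartiles, median, mean, and max for numeric columns,
# and a count per level for factors (plus NAs, if there are any)
print(summary(dat))

# fivenum() returns Tukey's five-number summary for one numeric variable
five <- fivenum(dat$mpg)
print(five)
```

The quartiles printed by summary() and the hinges returned by fivenum() are computed by slightly different rules, so they can differ a little in small samples.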

Statistics > Summaries > Numerical summaries brings up the dialog box in Figure 5.2. I select the variables education, income, prestige, and women in the Data tab and retain the default choices in the Statistics tab. Clicking OK results in the output in Figure 5.3. Were I to press the Summarize by groups button in the Data tab, I could compute summary statistics separately for each level of type.

Choosing Statistics > Summaries > Table of statistics allows you to calculate a statistic for one or more numeric variables within levels or combinations of levels of one or more factors. To illustrate, I’ll use the Adler data set from the car package. The data are from a social-psychological experiment, reported by Adler (1973), on “experimenter effects” in psychological research—that is, how researchers’ expectations can influence the data that they collect. Adler recruited “research assistants,” who showed photographs of individuals’ faces to respondents; the respondents were asked by the research assistants to rate the apparent “successfulness” of the individuals in the photographs. In fact, Adler chose photographs that were average in their appearance of success, and the true subjects in the study were the research assistants. Adler manipulated two factors, named expectation and instruction in the data set.

FIGURE 5.1: Summary output for the Prestige data set.

FIGURE 5.2: Numerical Summaries dialog box: Data tab (top) and Statistics tab (bottom).

FIGURE 5.3: Numerical summaries for several variables in the Prestige data set.

FIGURE 5.4: The Table of Statistics dialog box.

•  expectation: Some assistants were told to expect high ratings, while others were told to expect low ratings.

•  instruction: In addition, the assistants were given different instructions about how to collect data. Some were instructed to try to collect “good” data, others were instructed to try to collect “scientific” data, and still others were given no special instruction of this type.

Adler randomly assigned 18 research assistants to each of six experimental conditions—combinations of the two levels of the factor expectation (“HIGH” or “LOW”) and the three levels of the factor instruction (“GOOD”, “SCIENTIFIC”, or “NONE”). I deleted 11 of the 108 subjects at random to produce the “unbalanced” Adler data set.2 After reading the data into the R Commander in the usual manner, I reorder the levels of the factor instruction from the default alphabetic ordering.

The Table of Statistics dialog box appears in Figure 5.4. I select both expectation and instruction in the Factors list box; because there’s just one numeric variable in the data set—rating—it’s preselected in the Response variables list box. The dialog includes radio buttons for calculating the mean, median, standard deviation, and interquartile range, along with an Other button, which allows you to enter any R function that computes a single number for a numeric variable. I retain the default Mean, and press the Apply button. Then, when the dialog reappears, I select Standard deviation and press OK. The output is shown in Figure 5.5. I’ll defer interpreting the Adler data to Section 5.4 on graphing means and Section 6.1 on hypothesis tests for means.
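A table of statistics of this kind can be sketched in base R with tapply(). The example below is only an illustration under stated assumptions: it uses the built-in warpbreaks data (one numeric response, breaks, and two factors, wool and tension) as a stand-in for rating classified by expectation and instruction.

```r
# Stand-in data: built-in warpbreaks plays the role of the Adler data
# (numeric response classified by two factors)
means <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))
sds   <- with(warpbreaks, tapply(breaks, list(wool, tension), sd))

print(round(means, 2))   # cell means
print(round(sds, 2))     # cell standard deviations

# Any function that returns a single number can be used in the same way,
# for example the range width of the response in each cell
widths <- with(warpbreaks,
               tapply(breaks, list(wool, tension), function(x) diff(range(x))))
print(widths)
```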

Several of the Statistics > Summaries menu items and associated dialogs are very straightforward, and so, in the interest of brevity, I won’t demonstrate their use here:3

FIGURE 5.5: Tables of means and standard deviations for rating in the Adler data set, classified by expectation and instruction.

•  The Frequency Distributions dialog produces frequency and percentage distributions for factors, along with an optional chi-square goodness-of-fit test with user-supplied hypothesized probabilities for the levels of a factor.

•  The Count missing observations menu item simply reports the number of NAs for each variable in the active data set.

•  The Correlation Matrix dialog calculates Pearson product-moment correlations, Spearman rank-order correlations, or partial correlations for two or more numeric variables, along with optional pairwise p-values, computed with and without correction for simultaneous inference.
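As a rough illustration of the correlation options listed above, the following base-R sketch computes Pearson and Spearman correlation matrices and pairwise p-values. The mtcars variables are stand-ins, and Holm's method is used here as one common correction for simultaneous inference; treat that choice as an assumption of the sketch.

```r
# Stand-in data: three numeric variables from built-in mtcars
vars <- mtcars[, c("mpg", "hp", "wt")]

pearson  <- cor(vars, method = "pearson")    # product-moment correlations
spearman <- cor(vars, method = "spearman")   # rank-order correlations
print(round(pearson, 3))
print(round(spearman, 3))

# Unadjusted pairwise p-values, one test per pair of variables
p_raw <- c(
  mpg_hp = cor.test(vars$mpg, vars$hp)$p.value,
  mpg_wt = cor.test(vars$mpg, vars$wt)$p.value,
  hp_wt  = cor.test(vars$hp,  vars$wt)$p.value
)

# Adjusted p-values (Holm's method, assumed here for illustration)
p_adj <- p.adjust(p_raw, method = "holm")
print(rbind(raw = p_raw, adjusted = p_adj))
```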

The Statistics > Contingency tables menu (see Figure A.4 on page 202) has items for constructing two-way and multi-way tables from the active data set. I demonstrated the Two-Way Table dialog in Section 3.5, and there is no need to repeat that demonstration here. Moreover, the Multi-Way Table dialog is similar, except that, in addition to selecting row and column factors for the contingency table, you can pick one or more “control” factors: A separate two-way partial table, optionally percentaged by rows or columns, is reported for each combination of levels of the control factors.

In contrast, the Enter Two-Way Table dialog (in Figure 5.6), selected via Statistics > Contingency tables > Enter and analyze two-way table, is unusual for the R Commander, in that it doesn’t use the active data set. The dialog allows you to enter frequencies (counts) from an existing two-way contingency table, typically from a printed source such as a textbook. The sliders at the top of the Table tab control the number of rows and columns in the table. Initially, the table has 2 rows and 2 columns, and the cells of the table are empty.

Setting the sliders to 3 rows and 2 columns, I enter a frequency table taken from The American Voter, a classic study of electoral behavior by Campbell et al. (1960). The data originate in a panel study of the 1956 U. S. presidential election. During the campaign, survey respondents were asked how strongly (weak, medium, or strong) they preferred one candidate to the other, and after the election they were asked whether or not they had voted.

FIGURE 5.6: The Enter Two-Way Table dialog: Table tab (top) and Statistics tab (bottom).

FIGURE 5.7: Output produced by the Enter Two-Way Table dialog, having entered a contingency table from The American Voter.

The Statistics tab appears at the bottom of Figure 5.6. I check the box for Row percentages because the row variable in the table, intensity of preference, is the explanatory variable; the Chi-square test of independence checkbox is selected by default. I also check Print expected frequencies, which is not selected by default.

The output from the dialog is shown in Figure 5.7. Reported voter turnout increases with intensity of partisan preference, and the relationship between the two variables is highly statistically significant, with a very small p-value for the chi-square test of independence. All of the expected counts are much larger than necessary for the chi-square distribution to be a good approximation to the distribution of the test statistic; had that not been the case, a warning would have appeared, whether or not expected frequencies are printed.
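The analysis of an entered two-way table can be sketched in base R as follows. The counts below are hypothetical, invented only to make the example runnable; they are not the frequencies from The American Voter, which appear only in the figure.

```r
# HYPOTHETICAL counts for a 3 x 2 table (not the published data)
counts <- matrix(c(300, 100,
                   400,  80,
                   500,  50),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(preference = c("weak", "medium", "strong"),
                                 turnout    = c("voted", "did not vote")))

# Row percentages: each row sums to 100, because the row variable
# (intensity of preference) is the explanatory variable
row_pct <- 100 * prop.table(counts, margin = 1)
print(round(row_pct, 1))

# Chi-square test of independence, with expected frequencies
test <- chisq.test(counts)
print(test)
print(round(test$expected, 1))
```

A common rule of thumb is that the chi-square approximation is adequate when expected counts are not too small; chisq.test() itself issues a warning when the approximation may be poor.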

5.3 Graphing Distributions of Variables

I’ll use the Canadian occupational prestige data, read from the car package earlier in this chapter, to illustrate graphing distributions. There are, at this point in the chapter, two data sets in memory, the Prestige data set and the Adler data set, and the latter is the active data set. To change the active data set, I click on the Data set button in the R Commander toolbar and select Prestige in the resulting dialog.4

5.3.1 Graphing Numerical Data

The R Commander Graphs menu (see Figure A.6 on page 203) is divided into several groups of items, the second of which leads to dialogs for constructing graphs of the distribution of a numerical variable: Index plot, Dot plot, Histogram, nonparametric Density estimate, Stem-and-leaf display, Boxplot, and theoretical Quantile-comparison plot. Many of these graphs (specifically, dot plots, histograms, density estimates, and boxplots) can also show the distribution of a numeric variable within levels of (i.e., conditional on) a factor, and stem-and-leaf displays can be drawn back-to-back for the two levels of a dichotomous factor (see the example in Section 6.1.1).

Selecting Graphs > Histogram produces the dialog box in Figure 5.8. The Data tab, at the top of the figure, allows you to choose a numeric variable; I select income. Clicking the Plot by groups button brings up the Groups sub-dialog shown at the center of the figure; because there is only one factor in the data set, type, it is preselected. Clicking OK in the Groups sub-dialog returns to the main dialog, and now the Plot by button reads Plot by: type. The Options tab is at the bottom of Figure 5.8. Leaving all of the options at their defaults and clicking OK produces the vertically aligned histograms in Figure 5.9.

If you don’t like the default number of bins, which results from leaving the Number of bins text box at <auto>, you can type a target number for the number of bins.5 As a general matter, as you increase the number of bins, the width of each bin decreases. You can conveniently experiment with the number of bins by pressing the Apply button rather than the OK button in the dialog.
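The effect of the target number of bins can be explored with base R's hist(). This is a sketch under stated assumptions: it uses the built-in faithful data as a stand-in for income, and base hist() rather than whatever function the R Commander calls.

```r
# Stand-in data: eruption durations from the built-in faithful data set
x <- faithful$eruptions

# Sturges's rule gives the default target number of bins
k_default <- nclass.Sturges(x)
print(k_default)

# plot = FALSE returns the bin boundaries and counts without drawing
h_auto <- hist(x, plot = FALSE)               # default target
h_20   <- hist(x, breaks = 20, plot = FALSE)  # target of 20 bins

# The request is only a target: the boundaries are moved to "nice" numbers
print(h_auto$breaks)
print(h_20$breaks)

# Draw the two histograms one above the other for comparison
op <- par(mfrow = c(2, 1))
plot(h_auto, main = "Default number of bins", xlab = "eruptions")
plot(h_20,   main = "Target of 20 bins",      xlab = "eruptions")
par(op)
```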

The dialogs for the other distributional displays differ only in their Options tabs and whether or not (as noted above) they support plotting by groups. Figure 5.10 shows the default distributional displays for education in the Canadian occupational prestige data set.6 There is also a “rug plot” at the bottom of the density estimate (center-right panel), showing the location of the data values. By default the quantile-comparison plot (lower-right) compares the distribution of the data to the normal distribution, but you can also plot against other theoretical distributions.7
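Several of these displays have simple base-R counterparts, sketched below. The built-in faithful data stand in for education, and the base functions shown are assumptions for illustration; the R Commander's own versions add refinements such as automatic point labels.

```r
# Stand-in data: waiting times from the built-in faithful data set
x <- faithful$waiting

op <- par(mfrow = c(2, 2))

# Index plot: data values against their position in the data set
plot(x, type = "h", xlab = "index", ylab = "waiting", main = "Index plot")

# Nonparametric density estimate with a rug plot marking the data values
d <- density(x)
plot(d, main = "Density estimate")
rug(x)

# Boxplot
boxplot(x, horizontal = TRUE, main = "Boxplot")

# Quantile-comparison plot against the normal distribution
qq <- qqnorm(x, main = "Normal quantile-comparison plot")
qqline(x)

par(op)
```

Points that fall close to the reference line in the last panel are consistent with a normal distribution; systematic curvature suggests skewness or heavy tails.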

FIGURE 5.8: Histogram dialog, showing the Data tab (top), Groups sub-dialog (center), and Options tab (bottom).

FIGURE 5.9: Histograms of average income by type of occupation, for the Canadian occupational prestige data.

In the index plot (at the upper-left) and quantile-comparison plot (at the lower-right), the two most extreme values are automatically identified by default, but because these values are close to each other in the graphs, the labels for the points are over-plotted. The case labels are also displayed, however, in the R Commander Output pane (not shown), and they are university.teachers and physicians.

The default stem-and-leaf display for education appears in Figure 5.11; it is text output and so is printed in the Output pane.

FIGURE 5.10: Various default distributional displays for average education in the Canadian occupational prestige data. From top to bottom and left to right: index plot, dot plot, histogram, nonparametric density estimate with rug plot, boxplot, and quantile-comparison plot comparing the distribution of education to the normal distribution.

FIGURE 5.11: Default “Tukey-style” stem-and-leaf display for education in the Canadian occupational prestige data. The column of numbers to the left of the stems represents “depths”—counts in to the median from both ends of the distribution—with the parenthesized value (4) giving the count for the stem containing the median. Note the divided stems, with x. stems containing leaves 0–4 and x * stems leaves 5–9. Five-part stems are similarly labelled x. with leaves 01, x t with leaves 23, x f with leaves 45, x s with leaves 67, and x * with leaves 89.

FIGURE 5.12: Bar Graph dialog, showing the Data tab (top) and Options tab (bottom). I previously pressed the Plot by button and selected the factor vote.

I’ll demonstrate graphing the distribution of a categorical variable by using the Chile data set from the car package. This data set is from a poll conducted about six months before the 1988 Chilean plebiscite on the continuation of military rule: voting “yes” in the plebiscite represented support for Pinochet’s military government, while “no” represented support for a return to electoral democracy. Two of the variables in the Chile data set are the factors vote, with levels “N” (no), “Y” (yes), “U” (undecided), and “A” (abstain), and education, with levels “P” (primary), “S” (secondary), and “PS” (post-secondary). In both cases, the default alphabetic ordering of the factor levels isn’t the natural ordering, and so, after reading the data, I change the orderings via Data > Manage variables in active data set > Reorder factor levels (see Section 3.4).

The Graphs menu includes two simple distributional plots for factors: frequency bar graphs and pie charts. Because it allows for dividing bars by the value of a second factor, the Bar Graph dialog, shown in Figure 5.12, is the more complex of the two. In the Data tab, at the top of the figure, I select the factor education to define the bars. I previously pressed the Plot by button and chose vote in the resulting Groups sub-dialog, and so the button displays Plot by: vote. I retain all of the default choices in the Options tab at the bottom of Figure 5.12. Clicking OK produces the graph in Figure 5.13. It’s apparent that relative support for the military government declined with education, but that overall the plebiscite appeared close (visually summing and comparing the “N” and “Y” areas across the bars).
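A divided bar graph and a pie chart of this kind can be sketched in base R. The twelve responses below are hypothetical toy values that merely reuse the level codes mentioned in the text; they are not the Chile data, and the level orderings are chosen only for illustration.

```r
# HYPOTHETICAL toy responses using the level codes from the text
education_raw <- c("P", "P", "P", "P", "S", "S", "S", "S", "PS", "PS", "PS", "PS")
vote_raw      <- c("Y", "Y", "N", "U", "Y", "N", "N", "A", "N",  "N",  "Y",  "U")

# Set the level order explicitly instead of accepting alphabetic order
education <- factor(education_raw, levels = c("P", "S", "PS"))
vote      <- factor(vote_raw,      levels = c("N", "Y", "U", "A"))

# Two-way frequency table: columns define the bars, rows divide each bar
tab <- table(vote, education)
print(tab)

barplot(tab, legend.text = TRUE, xlab = "education", ylab = "Frequency")

# Pie chart of the overall distribution of vote
pie(table(vote), main = "vote")
```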

FIGURE 5.13: Bar graph for education in the Chilean plebiscite data, with bars divided by vote. A color version of this figure appears in the insert at the center of the book.

Overall voting intentions are displayed in the pie chart in Figure 5.14. The Pie Chart dialog, not shown, simply allows you to pick a factor and, optionally, provide axis labels and a graph title.

FIGURE 5.14: Pie chart for vote in the Chilean plebiscite data. A color version of this figure appears in the insert at the center of the book.

The third section of the Graphs menu is for graphing relationships between and among variables, including scatterplots, scatterplot matrices, and 3D scatterplots for numeric variables, line plots, which are typically for time series data, plots of means of a numeric variable classified by one or more factors, strip charts, which are similar to conditional dot plots (discussed in Section 5.3.1), and conditioning plots, which are capable of representing the relationships between one or more numeric response variables and explanatory variables that are both numeric and factors.8 I’ll focus here on scatterplots for two numeric variables, scatterplot matrices for several numeric variables, 3D scatterplots for three numeric variables, and plots of means of a numeric variable classified by one or two factors.

In addition, and as mentioned previously, some of the distributional graphs discussed in Section 5.3.1 can be used to examine the relationship between a numeric response variable and a factor. These include dot plots, histograms, stem-and-leaf displays (with a dichotomous factor), and boxplots.

To illustrate the construction of scatterplots, scatterplot matrices, and 3D scatterplots, I return to the Canadian occupational prestige data in the previously read Prestige data set. Choosing Graphs > Scatterplot from the R Commander menus brings up the dialog box in Figure 5.15. As you can see, there are many options in the dialog, some of which I’ll describe presently. In the Data tab, I select income (which is the explanatory variable) as the x-variable and prestige (the response variable) as the y-variable. I retain all of the defaults in the Options tab, clicking Apply to draw the simple scatterplot in Figure 5.16. Occupational prestige apparently increases with income, but the relationship is nonlinear, with the rate of increase declining with income.

To draw the scatterplot in Figure 5.17, I click on the Plot by groups button in the Data tab; because it’s the only factor in the data set, type is preselected in the resulting Groups variable sub-dialog (not shown). The sub-dialog also has a checkbox for plotting lines by group, which is selected by default. In the Options tab, I check the boxes for Least-squares line, Smooth line, and Plot concentration ellipses. I also change the Legend Position from the default Above plot to Bottom right.

The smooth line is produced by a method of nonparametric regression called loess, an acronym for local regression, which traces how the average value of y changes with x without assuming that the relationship between y and x takes a specific form. The span of the loess smoother is the percentage of the data used to compute each smoothed value: The larger the span, the smoother the resulting loess regression. The default span is 50%, a value that I increase to 100% because of the small number of cases in each level of occupational type. As a general matter, you want to select the smallest span that produces a reasonably smooth regression, a value that you can determine by trial and error, pressing the Apply button in the dialog each time you adjust the Span slider.
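The role of the span can be illustrated with base R's loess() function. This is a sketch only: it uses the built-in cars data as a stand-in for prestige and income, and base loess() may differ in detail from the smoother the R Commander draws.

```r
# Stand-in data: built-in cars (stopping distance vs. speed)
fit_50  <- loess(dist ~ speed, data = cars, span = 0.5)  # 50% of the data
fit_100 <- loess(dist ~ speed, data = cars, span = 1.0)  # 100% of the data

# Sort by x so that the fitted curves are drawn left to right
ord <- order(cars$speed)

plot(dist ~ speed, data = cars)
lines(cars$speed[ord], fitted(fit_50)[ord],  lty = 1)
lines(cars$speed[ord], fitted(fit_100)[ord], lty = 2)
legend("topleft", legend = c("span = 0.5", "span = 1.0"), lty = 1:2)

# The equivalent number of parameters shrinks as the span grows,
# which is one way to see that a larger span gives a smoother fit
print(c(span_50 = fit_50$enp, span_100 = fit_100$enp))
```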

Concentration ellipses are summaries of the variational and correlational structure of the points. For bivariately normally distributed data, concentration ellipses enclose specific fractions of the data—50% and 90% by default; the ellipses are computed robustly, however, to reduce the impact of outliers. To avoid an overly cluttered graph, I set the Concentration levels to 0.5, to draw only one ellipse for each occupational type.

The scatterplot in Figure 5.17 suggests that the apparently nonlinear relationship between prestige and income is due to occupational type: Within levels of type, the relationship is reasonably linear, but with the slope changing across levels.

FIGURE 5.15: Scatterplot dialog: Data tab (top) and Options tab (bottom).

FIGURE 5.16: Simple scatterplot of prestige vs. income for the Prestige data.

FIGURE 5.17: Enhanced scatterplot of prestige vs. income by occupational type, showing 50% concentration ellipses, least-squares lines, and loess lines. A color version of this figure appears in the insert at the center of the book.

5.4.3 Scatterplot Matrices

A scatterplot matrix displays the pairwise relationships among several numeric variables; it is the graphical analog of a correlation matrix. The Scatterplot Matrix dialog, shown in Figure 5.18, is similar in most respects to the Scatterplot dialog. I select several variables in the Data tab and leave all of the choices in the Options tab at their defaults. Each off-diagonal panel in the resulting scatterplot matrix in Figure 5.19 displays the pairwise scatterplot for two variables, while the diagonal panels show the marginal distributions of the variables. The plots in the first row, for example, have education on the vertical axis, while those in the first column have education on the horizontal axis, and similarly for the other variables in the graph. Thus, the scatterplot in the second row, first column has income on the vertical axis and education on the horizontal axis.

5.4.4 Point Identification in Scatterplots and Scatterplot Matrices

Both the Scatterplot and the Scatterplot Matrix dialogs provide an option for automatic identification of noteworthy cases. Automatic point identification uses a robust method to find the most unusual points in each scatterplot, with the number of points to be identified set by the user.
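To give a feel for automatic identification, the sketch below flags the two points with the largest Mahalanobis distances from the center of a scatterplot. This is a simplified, non-robust stand-in: it uses the classical mean and covariance matrix and the built-in mtcars data, both of which are assumptions of the example rather than the robust method the dialogs use.

```r
# Stand-in data: weight and fuel economy from built-in mtcars
xy <- as.matrix(mtcars[, c("wt", "mpg")])

# Squared Mahalanobis distance of each case from the centroid
# (classical, NON-robust estimates; a simplification for illustration)
d2 <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))
names(d2) <- rownames(xy)

# The number of points to identify is set by the user; here, 2
n_id <- 2
flagged <- names(sort(d2, decreasing = TRUE))[seq_len(n_id)]
print(flagged)

# Label the flagged cases with their row names
plot(xy[, "wt"], xy[, "mpg"], xlab = "wt", ylab = "mpg")
text(xy[flagged, "wt"], xy[flagged, "mpg"], labels = flagged, pos = 3)
```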

The Scatterplot dialog additionally supports interactive point identification, selected by pressing the corresponding radio button in the Options tab. Under Windows or Linux/Unix, interactive point identification displays a message box with the text Use left mouse button to identify points, right button to exit; under Mac OS X, the message reads Use left mouse button to identify points, esc key to exit. In either case, click the OK button to dismiss the message box. On all operating systems, the mouse cursor turns into “cross-hairs” (a +) when the cursor is over the scatterplot. Left-clicking near a point labels the point with the row name of the corresponding case.

To produce Figure 5.20, I use the Scatterplot dialog (not repeated) to plot income (on the y-axis) versus education (on the x-axis), checking the box to identify points Interactively with the mouse. I click near two of the points, which are identified as general.managers and physicians. These are, incidentally, precisely the two points that are flagged if automatic point identification is employed. Both occupations have unusually high values of income given their levels of education.

There are two issues to keep in mind about interactive point identification:

1.  It is necessary to exit from point identification mode before you can do anything else in the R Commander. If you forget to exit, the R Commander will appear to freeze!

2.  Scatterplots using interactive point identification do not appear in the R Markdown document produced by the R Commander (see Section 3.6). As in the example, however, automatic point identification usually works quite well.

FIGURE 5.18: Scatterplot Matrix dialog, with Data tab (top) and Options tab (bottom).

FIGURE 5.19: Default scatterplot matrix for education, income, prestige, and women in the Prestige data set.

FIGURE 5.20: Scatterplot for education and income in the Prestige data set, with two points identified interactively by mouse clicks.

You can draw a dynamic three-dimensional scatterplot for three numeric variables by choosing Graphs > 3D graph > 3D scatterplot from the R Commander menus, bringing up the dialog box in Figure 5.21. The structure of the 3D Scatterplot dialog is very similar to that of the Scatterplot dialog.

In the Data tab, I select two numeric explanatory variables, education and income, along with a numeric response variable, prestige. As in a 2D scatterplot, I could (but don’t) use the factor type to plot by groups—for example, using different colors for points in the various levels of type.

In addition to the default selections in the Options tab, I opt to plot the least-squares regression plane and an additive nonparametric regression, which allows nonlinear partial relationships between prestige and each of education and income; the degrees of freedom for these terms are analogous to the span of the 2D loess smoother: smaller df produce a smoother fit to the data. I allow df to be selected automatically. Concentration ellipsoids (which are not selected) are 3D analogs of 2D concentration ellipses. I also elect to identify 2 points automatically.9

Clicking OK produces the 3D scatterplot in Figure 5.22, which appears in an RGL device window.10 In the original image (which appears in the insert at the center of the book), points in the plot are represented by small yellow spheres. A static image doesn’t do justice to the 3D dynamic plot, which you can manipulate in the RGL device window: Left-clicking and dragging allows you to rotate the plot (in effect grabbing onto an invisible sphere surrounding the data), while right-clicking and dragging changes perspective.

The Plot Means dialog, selected by Graphs > Plot of means, displays the mean of a numeric variable as a function of one or two factors. For an example, I return to the Adler experimenter-expectations data (introduced in Section 5.1), which I make the active data set in the R Commander. The Plot Means dialog appears in Figure 5.23. In the Data tab, I select the factors expectation and instruction; the response variable rating is preselected because it’s the only numeric variable in the data set. I leave the Options tab in its default state.

Clicking OK yields the graph in Figure 5.24, where the error bars represent ±1 standard error around the means. Apparently, instructing the subjects to obtain “good” data produces a bias consistent with expectation, while instructing subjects to obtain “scientific” data or providing no instruction produces a smaller bias in the opposite direction.
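A plot of means with error bars of ±1 standard error can be sketched in base R as follows. The built-in warpbreaks data stand in for the Adler data (an assumption for the sake of a runnable example), with tension on the horizontal axis and one line per level of wool.

```r
# Stand-in data: built-in warpbreaks (numeric response, two factors)
m  <- with(warpbreaks, tapply(breaks, list(tension, wool), mean))
se <- with(warpbreaks, tapply(breaks, list(tension, wool),
                              function(x) sd(x) / sqrt(length(x))))

x_pos <- seq_len(nrow(m))   # one x position per level of tension

# One line of cell means per level of wool
matplot(x_pos, m, type = "b", pch = 1:2, lty = 1:2, col = 1,
        xaxt = "n", xlab = "tension", ylab = "mean of breaks",
        ylim = range(m - se, m + se))
axis(1, at = x_pos, labels = rownames(m))

# Error bars: mean minus and plus one standard error
# (as.vector() runs down the columns, so x positions repeat per column)
arrows(rep(x_pos, ncol(m)), as.vector(m - se),
       rep(x_pos, ncol(m)), as.vector(m + se),
       angle = 90, code = 3, length = 0.05)

legend("topright", legend = colnames(m), pch = 1:2, lty = 1:2, title = "wool")
```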

FIGURE 5.21: The 3D Scatterplot dialog: Data tab (top) and Options tab (bottom).

FIGURE 5.22: Three-dimensional scatterplot for education, income, and prestige in the Prestige data set, showing the least-squares plane (nearly edge-on, in blue) and an additive nonparametric regression surface (in green). The points are yellow spheres, two of which (general.managers and physicians) were labelled automatically. A color version of this figure appears in the insert at the center of the book.

FIGURE 5.23: Plot Means dialog box: Data tab (top) and Options tab (bottom).

FIGURE 5.24: Mean rating by instruction and expectation for the Adler data set. The error bars represent ±1 standard error around the means.

1. Some figures in the chapter appear in color in the insert at the center of the book.

2. Of course, it isn’t sensible to discard data like this, but I wanted to produce a more complex two-way analysis of variance example, with unequal numbers of cases in the combinations of levels of two factors; see Section 6.1.

3. Two of the items in this menu, Correlation test and Shapiro-Wilk test of normality, perform simple hypothesis tests; I’ll take them up in Chapter 6.

4. Although the R Commander can switch among data sets in this manner, you’ll no doubt work with a single data set in most of your R Commander sessions.

5. The number of bins specified is just a target because the program that creates the histogram also tries to use “nice” numbers for the boundaries of the bins. The default target number of bins is determined by Sturges’s rule (Sturges, 1926).

6. To obtain the marginal histogram for education (that is, not to plot by occupational type), either press the Reset button in the Histogram dialog, or press the Plot by: type button, and deselect type in the resulting Groups sub-dialog by Ctrl-clicking on it in the Groups variable list.

7. Probability distributions are discussed in Chapter 8.

8. The R Commander dialogs for strip charts and conditioning plots were originally contributed by Richard Heiberger.

9. Interactive point identification in 3D works differently than in a 2D scatterplot: You right-click and drag a box around the point or points you want to identify, repeating this procedure as many times as you want. To exit from point identification mode, you must right-click in an area of the plot where there are no points. You can identify points interactively by checking the appropriate box in the 3D Scatterplot Options tab or, after drawing a 3D scatterplot, by selecting Graphs > 3D graph > Identify observations with mouse from the R Commander menus.

10. The R Commander uses the scatter3d function in the car package to draw 3D scatterplots; scatter3d, in turn, employs facilities provided by the rgl package (Adler and Murdoch, 2015) for constructing 3D dynamic graphs.

RCH 8303, Quantitative Data Analysis

Course Learning Outcomes for Unit III

Upon completion of this unit, students should be able to:

1. Perform statistical tests using software tools.
1.1 Describe the procedures to summarize and display data.
1.2 Report normality statistics.

2. Explain results of statistical tests.

2.1 Describe the process to determine whether data are normally distributed or not.
2.2 Demonstrate the procedures necessary to successfully create a histogram, bar chart, boxplot, Q-Q plot, and cross-tabulation test.

3. Judge whether null hypotheses should be rejected or maintained.
3.1 Discuss how graphing data can help determine conclusions of our data.
3.2 Discuss differences between one-sided and two-sided hypotheses and when to use them.
3.3 Explain how to rule out rival hypotheses.
3.4 Discuss what contingency tables are and what they are used for.

| Course/Unit Learning Outcomes | Learning Activity |
|---|---|
| 1.1, 1.2 | Unit Lesson; Chapter 5; Unit III Assignment 2 |
| 2.1, 2.2, 3.1, 3.2, 3.3, 3.4 | Unit Lesson; Unit III Assignment 1 |

Required Unit Resources

Chapter 5: Summarizing and Graphing Data

Unit Lesson

Introduction

In Unit III, we turn our focus to how a researcher can display and understand data using visual methods.
Visualization of data is important because it facilitates interpretation. The form or type of data the
researcher needs to analyze determines which display methods can be used. This unit
demonstrates various display methods that allow researchers to better understand their data.
From a researcher’s point of view, it is very important to view the data; by doing so, the researcher can
make observations about the data rather than simply relying on numerical output. The reader of
the material can also view the data and form opinions based on it.

This unit will focus on how to summarize and graph data; specifically, how to create a histogram, bar chart,
boxplot, and a QQ Plot. In addition, instruction will focus on how to perform and report a cross-tabulation.

The Unit III Assignment will be in three parts.

UNIT III STUDY GUIDE

Summarizing and Graphing Data

Part 1 of your assignment requires you to complete the Contingency Tables and Chi-Square Tests (ID 17630)
module of the Collaborative Institutional Training Initiative (CITI) Program Essentials of Statistical Analysis (EOSA),
located in Part 3. This module explains whether a contingency table can be analyzed using a chi-square test.
It describes what contingency tables are and how they display the relationship between categorical variables. The module
then demonstrates how to analyze relationships between categorical variables using chi-square tests.

Part 2 will require you to construct a histogram, bar chart, boxplot, and QQ Plot. The results will be compiled
and submitted in a single Microsoft Word file.

Part 3 of your assignment is to perform a cross-tabulation, analyze the results, and report the result in APA
format by submitting a single Microsoft Word file.

What is a Histogram?

Once data are collected, a researcher needs to be able to describe, summarize, and, potentially, detect
patterns in the data they have recorded with meaningful numerical scales (McClave & Sincich, 2006). To do
this, the researcher can utilize a histogram. A histogram allows a researcher to show the distribution of
a variable that is continuous in nature (Gall et al., 2003). Huck (2004) notes that histograms are used to
indicate how many times a score appears in a data set.

The Essentials of Statistical Analysis (EOSA) module Distribution and Probability (ID 17613), presented in Unit
I, provides an example of how a histogram is used to display data. A histogram plots the values of the variable
on the x-axis (horizontal) and their frequencies on the y-axis (vertical). R Commander provides an Options tab to
define labels for the x- and y-axes and to determine the width of the display bins.
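
To make the idea of bins concrete, the following minimal Python sketch counts how many values fall into each equal-width bin, which is the calculation a histogram draws as bars. It is independent of R Commander; the function name `histogram_counts` and the scores are invented for illustration and are not taken from the Duncan data.

```python
def histogram_counts(values, bin_width):
    """Count the values in each equal-width bin [lower, lower + bin_width).

    Returns a list of (bin_lower_edge, count) pairs, including empty bins.
    """
    start = (min(values) // bin_width) * bin_width
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [(start + i * bin_width, c) for i, c in enumerate(counts)]


# Hypothetical scores, invented for illustration (not the Duncan data).
scores = [7, 12, 15, 21, 22, 29, 42, 44, 48, 55, 64, 67, 72, 76, 81]

# Print a rough text histogram: one '#' per observation in the bin.
for lower, count in histogram_counts(scores, bin_width=20):
    print(f"{lower:>3} to {lower + 20:<3} {'#' * count} ({count})")
```

Changing `bin_width` changes how coarse the picture is: wider bins hide detail, and narrower bins show more of it but can look noisy with small samples.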

If you are not comfortable using R and R Commander, you may use whatever statistical software program
you choose. The answers you submit for your assignment must be correct regardless of the software you
choose.

R and R Commander make it easy and quick to construct a histogram. Using the data file “Duncan” provided
by the text, one can quickly construct a histogram of any of the variables.

Once the data file has been accessed, creating a histogram is straightforward. Make sure that when you
start R you also load R Commander: type library(Rcmdr), or see Unit I for a refresher on how to do so.

When R and R Commander have been loaded, selecting Graphs in the menu will present various options.
Selecting Histogram allows us to utilize a data set to create a Histogram (Figure 1).

Figure 1
Creating a Histogram From R Commander Menu System

As depicted in Figure 2, once Histogram is selected, a user has two types of options. First, a user must select
the variable to be displayed from the Data tab. In this case, the income variable was selected.

Figure 2
Histogram Variable Selection

A user could click “OK,” and the histogram would be created. This histogram would illustrate all income data
regardless of groups/categories of data.

However, if a user wanted to create a histogram of income by groups, such as occupation type, the Groups tab
could be accessed and the type variable selected (Figure 3).

Figure 3
Selection of Type From the Groups tab

Selecting Options (Figure 4) allows the researcher to label the x- and y-axes and define the width of the bins
of the data display.

Figure 4
Histogram Options Selection tab

Selecting a histogram of income by type would result in a three-histogram display (Figure 5).

Figure 5
Histograms of Income by Occupation Type

This approach may provide a researcher a better grasp of the data.
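
Plotting by groups amounts to splitting the records by the grouping variable before binning. The Python sketch below shows that step under stated assumptions: the (type, income) records and the function name `grouped_bin_counts` are hypothetical, and the code is not what R Commander runs internally.

```python
from collections import defaultdict


def grouped_bin_counts(rows, bin_width):
    """Return {group: {bin_lower_edge: count}} for (group, value) rows."""
    table = defaultdict(lambda: defaultdict(int))
    for group, value in rows:
        lower = (value // bin_width) * bin_width
        table[group][lower] += 1
    return {g: dict(sorted(bins.items())) for g, bins in table.items()}


# Hypothetical (occupation type, income) records, invented for illustration.
records = [
    ("bc", 12), ("bc", 18), ("bc", 25),
    ("prof", 55), ("prof", 62), ("prof", 78),
    ("wc", 30), ("wc", 34), ("wc", 41),
]

by_type = grouped_bin_counts(records, bin_width=20)
for group, bins in by_type.items():
    print(group, bins)
```

Each group gets its own set of bin counts, which corresponds to one panel in a display like Figure 5.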

What is a Bar Graph?

Researchers need to create displays that are informative and visually appealing without being misleading
or oversimplified (Gall et al., 2003). A bar graph is normally used to display the distribution of a categorical variable.
Refer to Introduction to Statistics (ID 17609) for a refresher on categorical variables. Gall et al. (2003) note that a
bar graph shows the relationship between variables. In a bar graph, the horizontal axis represents the categories
of a qualitative variable, whereas in a histogram the horizontal axis represents a quantitative
variable (Huck, 2004).

Using the data set Nations, navigate to the Graphs menu and select Bar graph (Figure 6).

Figure 6
Bar Graph Dialog Box

Next, select region (Figure 7) and click “OK.”

Figure 7

The resulting graph depicts the frequency of records by region (Figure 8).

Figure 8
Bar Graph Output Display
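
The heights of the bars in a bar graph are simply the frequencies of each category. As a minimal sketch, assuming a made-up list of region labels (not the Nations data), the counts can be tallied in Python like this:

```python
from collections import Counter

# Hypothetical region labels, invented for illustration (not the Nations data).
regions = [
    "Africa", "Asia", "Europe", "Africa", "Americas",
    "Asia", "Africa", "Europe", "Oceania", "Asia",
]

# Counter tallies how many records belong to each category.
freq = Counter(regions)

# Print a rough text bar graph, one row per category.
for region, count in sorted(freq.items()):
    print(f"{region:<9} {'#' * count} ({count})")
```

Because the categories have no numeric order, the bars can be arranged alphabetically, by frequency, or in any order that suits the research question.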

What is a Boxplot?

Visual representations of a data set are much more effective than numerical representations of data (Hartwig
& Dearling, 1979). A boxplot displays the variability within a data set. McClave and Sincich (2006) note that
boxplots are based on the quartiles of an existing data set and are very good for detecting outliers in data.

McClave and Sincich (2006) remind the reader that a boxplot has four main components: the median, the
interquartile range, the minimum and maximum, and outliers (Figure 9).

Figure 9
Explanation of a Boxplot

The EOSA modules Central Tendency and Variability (ID 17611) and Normal Distribution and Z-Scores (ID
17615) presented in Unit I have examples of how a boxplot is used.
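
The components of a boxplot can be computed directly from the data. The Python sketch below is a rough illustration that assumes the common 1.5 × IQR rule for flagging outliers and the "inclusive" quartile method; statistical packages may compute quartiles or hinges slightly differently, so results can differ in small samples. The incomes are invented and are not the Prestige data.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Five-number summary plus outliers flagged by the 1.5 * IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        # Whiskers extend to the most extreme values that are not outliers.
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical incomes, invented for illustration (not the Prestige data).
incomes = [12, 15, 17, 18, 20, 22, 24, 26, 60]
summary = boxplot_summary(incomes)
for name, value in summary.items():
    print(f"{name:<9} {value}")
```

Here the value 60 lies above the upper fence, so it would be drawn as a separate point rather than as part of the whisker.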

Using the data set Prestige, navigate to the Graphs menu and select Boxplot (Figure 10).

Figure 10
Box Plot Dialog Box

Next, choose the income variable and click “OK” (Figure 11).

Figure 11

Note that the boxplot identifies five outliers for further investigation (Figure 12).

Figure 12
Box Plot Display With Visible Outliers

Q-Q Plot

Instead of relying only on a histogram to judge whether your data are normally distributed, you can use a
quantile-quantile (Q-Q) plot to assess normality. A Q-Q plot allows a visual comparison of the
actual distribution of the data with a theoretical distribution (Mayor, 2015). Using the data set Prestige, navigate
to the Graphs menu and select Quantile-comparison plot (Figure 13).

Figure 13
Quantile-Comparison Plot (Q-Q Plot) Menu Selection

Next, choose the income variable to display and select “OK” (Figure 14).

Figure 14
Q-Q Plot Variable Selection Option

As depicted in Figure 15, the Q-Q plot identifies data points that depart not only from the theoretical normal
distribution (straight line) but also from its 95% confidence interval (CI; dotted lines). Finally, two outliers are
identified by group.

Figure 15
Q-Q Plot Display With Visible Outliers
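
A Q-Q plot pairs each sorted observation with the value a normal distribution would place at the same position. The Python sketch below shows one way to compute those pairs; it assumes the plotting positions (i − 0.5)/n, which is only one of several conventions, and it does not compute the confidence envelope. The function name `normal_qq_points` and the incomes are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Pair each sorted sample value with its theoretical normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal distribution with the sample's own mean and SD.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n are an assumed convention.
    return [
        (reference.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


# Hypothetical incomes, invented for illustration (not the Prestige data).
incomes = [12, 15, 17, 18, 20, 22, 24, 26, 60]
points = normal_qq_points(incomes)
for theoretical, observed in points:
    print(f"theoretical {theoretical:7.2f}   observed {observed:5d}")
```

If the data were close to normal, the two columns would rise together and the points would fall near a straight line. In this made-up sample the largest observation sits well above its theoretical quantile, which is the pattern a right-skewed variable produces.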

What is Cross Tabulation?

A cross-tabulation, which is also known as a contingency table, displays the joint frequencies of two
categorical variables; a chi-square test applied to the table is used to determine whether the variables are
associated. View page 35 in your textbook. The researcher would normally use this
table to examine relationships between categorical variables. The Contingency Tables and Chi-Square Tests
(ID 17630) module of the CITI Program EOSA located in Part 3 will demonstrate this form of data analysis.

Using the data set Nations, navigate to the Statistics menu, and select Contingency tables: Two-way table
(Figure 16).

Figure 16
Contingency Tables Dialog Box

Choose the two variables, and select “OK” (Figure 17).

Figure 17

Before selecting the “OK” button, click on the Statistics tab (Figure 18), where you can choose whether to
compute percentages and which hypothesis tests to perform. Make sure the Chi-square test of
independence has a check mark in it. Now you are ready to select “OK.”

Figure 18

The resulting output would be the statistical test. There are three values: the chi-square statistic (χ²), the
degrees of freedom (df), and the p-value (Figure 19).

Figure 19
Cross-Tabulation Display With Pearson’s Chi-Squared Test Results
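
For readers who want to see where the chi-square statistic and the degrees of freedom come from, here is a minimal Python sketch of the Pearson calculation for a two-way table. The observed counts are made up, the function name `chi_square_independence` is hypothetical, and the p-value is left out because it requires the chi-square distribution, which statistical software provides.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table.

    `observed` is a list of rows, each a list of cell counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df


# Hypothetical 2 x 3 table of counts, invented for illustration.
table = [
    [10, 20, 30],
    [20, 20, 20],
]
chi2, df = chi_square_independence(table)
print(f"chi-square = {chi2:.2f}, df = {df}")
```

The degrees of freedom depend only on the shape of the table: (rows − 1) × (columns − 1).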

To report the results of a Pearson chi-square test following the American Psychological Association (APA)
Style Guide (7th ed.), a researcher would state:

A chi-square test of independence was performed to examine the relationship between Total Fertility
Rate (TFR) and region. The relation between these variables was significant, χ²(596) = 692.33, p =
.004.
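
The formatting conventions in that statement (two decimals for the statistic, no leading zero on the p-value) can be captured in a small helper. This Python sketch is only an illustration; the function name `format_apa_chi_square` is hypothetical, and it omits the sample size, which full APA reporting normally includes inside the parentheses.

```python
def format_apa_chi_square(chi2, df, p):
    """Format a chi-square result in an APA-like style (sample size omitted)."""
    if p < 0.001:
        p_text = "p < .001"
    else:
        # APA style drops the leading zero for values that cannot exceed 1.
        p_text = "p = " + f"{p:.3f}".lstrip("0")
    return f"χ²({df}) = {chi2:.2f}, {p_text}"


# Values taken from the example statement in the text.
report = format_apa_chi_square(692.33, 596, 0.004)
print(report)
```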

References

Gall, M. D., Gall, J. P., & Borg, W. R. (2003). Educational research: An introduction (7th ed.). Allyn and Bacon.

Hartwig, F., & Dearling, B. E. (1979). Quantitative applications in the social sciences: Exploratory data analysis. SAGE Publications.

Huck, S. W. (2004). Reading statistics and research. Allyn and Bacon.

Mayor, E. (2015). Learning predictive analytics with R. Packt Publishing.

McClave, J. T., & Sincich, T. (2006). Statistics (10th ed.). Prentice Hall.

Nongraded Learning Activities are provided to aid students in their course of study. You do not have to submit
them. If you have questions, contact your instructor for further guidance and information.

When studying APA formatting, pay particular attention to the sections that pertain to formatting for research
and statistics. Review these sections as needed.
