Please make sure that it is your own work and not copy and paste. Please watch out for Spelling and Grammar errors. Please read the study guide. Please use the APA 7th edition.

Book Reference: Fox, J. (2017). *Using the R Commander: A point-and-click interface for R*. CRC Press. https://online.vitalsource.com/#/books/9781498741934

Discuss how you would use the various types of summarizing and graphing to present your data. Make sure you discuss the type of data you would have and the type of display you would select.

*Summarizing and Graphing Data*

This chapter explains how to use the R Commander to compute simple numerical summaries of data, to construct and analyze contingency tables, and to draw common statistical graphs. Most of the statistical content of the chapter is covered in a typical basic statistics course, although a few topics, such as quantile-comparison plots (in Section 5.3.1) and smoothing scatterplots (in Section 5.4.1), are somewhat more advanced.

Although most of the graphs produced by the R Commander use color, most of the figures in this chapter are rendered in monochrome.¹

5.1 Simple Numerical Summaries

The R Commander *Statistics > Summaries* menu (see Figure A.4 on page 202) contains several items for summarizing data. I’ll use the Canadian occupational prestige data (introduced in Section 4.2.3) to illustrate. This data set is most conveniently available in the Prestige data frame in the car package, which is one of the packages loaded when the R Commander starts up. I read the data via *Data > Data in packages > Read data set from an attached package* (as described in Section 4.2.4). Because the default alphabetic order of the levels of the type factor in the data set—“bc” (blue-collar), “prof” (professional and managerial), “wc” (white-collar)—is not the natural order, I reorder the levels of the factor with *Data > Manage variables in active data set > Reorder factor levels* (see Section 3.4).

Selecting *Statistics > Summaries > Active data set* produces the brief summary in Figure 5.1. There’s a “five-number summary” for each numeric variable—reporting the minimum, first quartile, median, third quartile, and maximum of the variable—plus the mean, and the frequency distribution of the factor type, including a count of NAs.
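To make the "five-number summary" concrete, here is a minimal sketch in Python (standard library only). It is not the command the R Commander generates; the function name `five_number_summary` and the toy values are hypothetical. The sketch assumes quartiles are interpolated between order statistics at positions (n - 1) * p, which is my understanding of R's default quantile definition.

```python
import statistics


def five_number_summary(values):
    """Return the minimum, quartiles, and maximum of a sample, plus its mean."""
    data = sorted(values)
    # method="inclusive" interpolates at positions (n - 1) * p between
    # order statistics (assumed here to match R's default definition).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "mean": statistics.fmean(data),
    }


# Made-up values for illustration; NOT taken from the Prestige data.
toy = [2, 4, 4, 5, 7, 9, 10, 12, 15]
summary = five_number_summary(toy)
print(summary)
```

Comparing the mean with the median in such a summary gives a quick hint about skewness: a mean well above the median suggests a long right tail.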

*Statistics > Summaries > Numerical summaries* brings up the dialog box in Figure 5.2. I select the variables education, income, prestige, and women in the *Data* tab and retain the default choices in the *Statistics* tab. Clicking *OK* results in the output in Figure 5.3. Were I to press the *Summarize by groups* button in the *Data* tab, I could compute summary statistics separately for each level of type.

Choosing *Statistics > Summaries > Table of statistics* allows you to calculate a statistic for one or more numeric variables within levels or combinations of levels of one or more factors. To illustrate, I’ll use the Adler data set from the car package. The data are from a social-psychological experiment, reported by Adler (1973), on “experimenter effects” in psychological research—that is, how researchers’ expectations can influence the data that they collect. Adler recruited “research assistants,” who showed photographs of individuals’ faces to respondents; the respondents were asked by the research assistants to rate the apparent “successfulness” of the individuals in the photographs. In fact, Adler chose photographs that were average in their appearance of success, and the true subjects in the study were the research assistants. Adler manipulated two factors, named expectation and instruction in the data set.

FIGURE 5.1: Summary output for the Prestige data set.

FIGURE 5.2: *Numerical Summaries* dialog box: *Data* tab (top) and *Statistics* tab (bottom).

FIGURE 5.3: Numerical summaries for several variables in the Prestige data set.

FIGURE 5.4: The Table of Statistics dialog box.

• expectation: Some assistants were told to expect high ratings, while others were told to expect low ratings.

• instruction: In addition, the assistants were given different instructions about how to collect data. Some were instructed to try to collect “good” data, others were instructed to try to collect “scientific” data, and still others were given no special instruction of this type.

Adler randomly assigned 18 research assistants to each of six experimental conditions: combinations of the two levels of the factor expectation (“HIGH” or “LOW”) and the three levels of the factor instruction (“GOOD”, “SCIENTIFIC”, or “NONE”). I deleted 11 of the 108 subjects at random to produce the “unbalanced” Adler data set.² After reading the data into the R Commander in the usual manner, I reorder the levels of the factor instruction from the default alphabetic ordering.

The Table of Statistics dialog box appears in Figure 5.4. I select both expectation and instruction in the Factors list box; because there’s just one numeric variable in the data set—rating—it’s preselected in the Response variables list box. The dialog includes radio buttons for calculating the mean, median, standard deviation, and interquartile range, along with an Other button, which allows you to enter any R function that computes a single number for a numeric variable. I retain the default Mean, and press the Apply button. Then, when the dialog reappears, I select Standard deviation and press OK. The output is shown in Figure 5.5. I’ll defer interpreting the Adler data to Section 5.4 on graphing means and Section 6.1 on hypothesis tests for means.
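The idea behind a table of statistics (one summary number per combination of factor levels) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R Commander's own code: `table_of_statistics` is a hypothetical helper, and the ratings below are invented rather than drawn from the Adler data.

```python
from collections import defaultdict
import statistics


def table_of_statistics(rows, factors, response, stat=statistics.fmean):
    """Apply `stat` to `response` within each combination of factor levels."""
    cells = defaultdict(list)
    for row in rows:
        key = tuple(row[f] for f in factors)  # e.g. ("HIGH", "GOOD")
        cells[key].append(row[response])
    return {key: stat(values) for key, values in cells.items()}


# Invented ratings for illustration only; NOT the Adler data.
toy_rows = [
    {"expectation": "HIGH", "instruction": "GOOD", "rating": 4.0},
    {"expectation": "HIGH", "instruction": "GOOD", "rating": 6.0},
    {"expectation": "HIGH", "instruction": "NONE", "rating": -2.0},
    {"expectation": "HIGH", "instruction": "NONE", "rating": 0.0},
    {"expectation": "LOW", "instruction": "GOOD", "rating": -5.0},
    {"expectation": "LOW", "instruction": "GOOD", "rating": -1.0},
    {"expectation": "LOW", "instruction": "NONE", "rating": 1.0},
    {"expectation": "LOW", "instruction": "NONE", "rating": 3.0},
]

factors = ["expectation", "instruction"]
means = table_of_statistics(toy_rows, factors, "rating")
sds = table_of_statistics(toy_rows, factors, "rating", stat=statistics.stdev)
print(means)
print(sds)
```

Passing a different function as `stat` plays the role of the dialog's *Other* option: any function that reduces a list of numbers to a single number will do.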

Several of the *Statistics > Summaries* menu items and associated dialogs are very straightforward, and so, in the interest of brevity, I won’t demonstrate their use here:³

FIGURE 5.5: Tables of means and standard deviations for rating in the Adler data set, classified by expectation and instruction.

• The *Frequency Distributions* dialog produces frequency and percentage distributions for factors, along with an optional chi-square goodness-of-fit test with user-supplied hypothesized probabilities for the levels of a factor.

• The *Count missing observations* menu item simply reports the number of NAs for each variable in the active data set.

• The *Correlation Matrix* dialog calculates Pearson product-moment correlations, Spearman rank-order correlations, or partial correlations for two or more numeric variables, along with optional pairwise p-values, computed with and without correction for simultaneous inference.
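The difference between Pearson and Spearman correlations is easiest to see on data that are monotone but not linear. The following Python sketch is a hand-rolled illustration (the helper names `pearson`, `average_ranks`, and `spearman` are my own, and the data are made up); it treats the Spearman coefficient as the Pearson correlation of the ranks.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: y = x**3 is perfectly monotone in x but not linear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 4), round(r_spearman, 4))
```

Here the rank-order correlation is perfect while the product-moment correlation falls short of 1, which is the usual reason to prefer Spearman's coefficient for curved monotone relationships.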

5.2 Contingency Tables

The *Statistics > Contingency tables* menu (see Figure A.4 on page 202) has items for constructing two-way and multi-way tables from the active data set. I demonstrated the *Two-Way Table* dialog in Section 3.5, and there is no need to repeat that demonstration here. The *Multi-Way Table* dialog is similar, except that, in addition to selecting row and column factors for the contingency table, you can pick one or more “control” factors: A separate two-way partial table, optionally percentaged by rows or columns, is reported for each combination of levels of the control factors.

In contrast, the *Enter Two-Way Table* dialog (in Figure 5.6), selected via *Statistics > Contingency tables > Enter and analyze two-way table*, is unusual for the R Commander, in that it doesn’t use the active data set. The dialog allows you to enter frequencies (counts) from an existing two-way contingency table, typically from a printed source such as a textbook. The sliders at the top of the *Table* tab control the number of rows and columns in the table. Initially, the table has 2 rows and 2 columns, and the cells of the table are empty.

Setting the sliders to 3 rows and 2 columns, I enter a frequency table taken from *The American Voter*, a classic study of electoral behavior by Campbell et al. (1960). The data originate in a panel study of the 1956 U.S. presidential election. During the campaign, survey respondents were asked how strongly (weak, medium, or strong) they preferred one candidate to the other, and after the election they were asked whether or not they had voted.

FIGURE 5.6: The *Enter Two-Way Table* dialog: *Table* tab (top) and *Statistics* tab (bottom).

FIGURE 5.7: Output produced by the *Enter Two-Way Table* dialog, having entered a contingency table from *The American Voter*.

The *Statistics* tab appears at the bottom of Figure 5.6. I check the box for *Row percentages* because the row variable in the table, intensity of preference, is the explanatory variable; the *Chi-square test of independence* checkbox is selected by default. I also check *Print expected frequencies*, which is *not* selected by default.

The output from the dialog is shown in Figure 5.7. Reported voter turnout increases with intensity of partisan preference, and the relationship between the two variables is highly statistically significant, with a very small p-value for the chi-square test of independence. All of the expected counts are much larger than necessary for the chi-square distribution to be a good approximation to the distribution of the test statistic; had that *not* been the case, a warning would have appeared, whether or not expected frequencies are printed.
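The quantities reported for a two-way table (row percentages, expected frequencies, and the chi-square statistic) can be reproduced by hand. The Python sketch below is only an illustration: the counts are hypothetical and are NOT the figures from *The American Voter*, and the helper names are my own. For a 3 × 2 table the test has 2 degrees of freedom, and in that special case the upper-tail chi-square probability has the closed form exp(-X²/2), which lets the sketch avoid any statistical library.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic, df, and expected counts for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: (row total * column total) / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def row_percentages(table):
    """Percentage of each row's total falling in each column."""
    return [[100 * cell / sum(row) for cell in row] for row in table]


def p_value_df2(stat):
    """Upper-tail chi-square probability; this closed form is valid ONLY for df = 2."""
    return math.exp(-stat / 2)


# Hypothetical counts: rows = weak / medium / strong preference,
# columns = voted / did not vote.
toy_table = [[60, 40], [70, 30], [80, 20]]
stat, df, expected = chi_square_independence(toy_table)
p_value = p_value_df2(stat)
print(row_percentages(toy_table))
print(round(stat, 3), df, round(p_value, 4))
```

Percentaging within rows, as in the dialog, makes the comparison across levels of the explanatory variable direct even when the row totals differ.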

__5.3 Graphing Distributions of Variables__

I’ll use the Canadian occupational prestige data, read from the car package earlier in this chapter, to illustrate graphing distributions. There are, at this point in the chapter, two data sets in memory, the Prestige data set and the Adler data set, and the latter is the active data set. To change the active data set, I click on the *Data set* button in the R Commander toolbar and select Prestige in the resulting dialog.⁴

__5.3.1 Graphing Numerical Data__

The R Commander *Graphs* menu (see Figure A.6 on page 203) is divided into several groups of items, the second of which leads to dialogs for constructing graphs of the distribution of a numerical variable: *Index plot, Dot plot, Histogram*, nonparametric *Density estimate, Stem-and-leaf display, Boxplot*, and theoretical *Quantile-comparison plot*. Many of these graphs, specifically dot plots, histograms, density estimates, and boxplots, can also show the distribution of a numeric variable within levels of (i.e., *conditional on*) a factor, and stem-and-leaf displays can be drawn back-to-back for the two levels of a dichotomous factor (see the example in Section 6.1.1).

Selecting *Graphs > Histogram* produces the dialog box in Figure 5.8. The *Data* tab, at the top of the figure, allows you to choose a numeric variable; I select income. Clicking the *Plot by groups* button brings up the *Groups* sub-dialog shown at the center of the figure; because there is only one factor in the data set, type, it is preselected. Clicking *OK* in the *Groups* sub-dialog returns to the main dialog, and now the *Plot by* button reads *Plot by: type*. The *Options* tab is at the bottom of Figure 5.8. Leaving all of the options at their defaults and clicking *OK* produces the vertically aligned histograms in Figure 5.9.

If you don’t like the default number of bins, which results from leaving the *Number of bins* text box at <auto>, you can type a target number of bins.⁵ As a general matter, as you increase the number of bins, the width of each bin decreases. You can conveniently experiment with the number of bins by pressing the *Apply* button rather than the *OK* button in the dialog.
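The trade-off between the number of bins and the bin width can be sketched as follows. Sturges's rule, which the chapter's footnote names as the source of the default target, is usually stated as ceil(log2(n)) + 1 bins. The Python code below is a simplified illustration with hypothetical helper names and made-up values: it cuts the range into equal-width bins and does not attempt the adjustment to "nice" bin boundaries that the footnote describes.

```python
import math


def sturges_bins(n):
    """Target number of histogram bins under Sturges's rule."""
    return math.ceil(math.log2(n)) + 1


def equal_width_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum would land one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return width, counts


# Made-up values for illustration.
toy = list(range(1, 17))           # 16 observations
target = sturges_bins(len(toy))    # ceil(log2(16)) + 1
width, counts = equal_width_counts(toy, target)
print(target, width, counts)

# Doubling the number of bins halves the bin width.
narrow_width, _ = equal_width_counts(toy, 2 * target)
print(narrow_width)
```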

The dialogs for the other distributional displays differ only in their *Options* tabs and whether or not (as noted above) they support plotting by groups. Figure 5.10 shows the default distributional displays for education in the Canadian occupational prestige data set.⁶ There is also a “rug plot” at the bottom of the density estimate (center-right panel), showing the location of the data values. By default the quantile-comparison plot (lower-right) compares the distribution of the data to the normal distribution, but you can also plot against other theoretical distributions.⁷
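A quantile-comparison plot against the normal distribution pairs each sorted data value with the standard-normal quantile at a matching cumulative probability. The Python sketch below shows how such pairs could be computed; the function name is hypothetical, the data are made up, and the plotting positions (i - 0.5) / n are an assumption (several conventions exist, and the R Commander's choice may differ).

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair sorted data with standard-normal quantiles for a Q-Q plot."""
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, data))


# Made-up values for illustration.
toy = [12.1, 9.8, 10.4, 15.0, 11.2]
points = normal_qq_points(toy)
for theo, obs in points:
    print(f"{theo:7.3f}  {obs:5.1f}")
```

If the data were roughly normal, the points would fall close to a straight line; systematic curvature suggests skewness or heavy tails.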

FIGURE 5.8: *Histogram* dialog, showing the *Data* tab (top), *Groups* sub-dialog (center), and *Options* tab (bottom).

FIGURE 5.9: Histograms of average income by type of occupation, for the Canadian occupational prestige data.

In the index plot (at the upper-left) and quantile-comparison plot (at the lower-right), the two most extreme values are automatically identified by default, but because these values are close to each other in the graphs, the labels for the points are over-plotted. The case labels are also displayed, however, in the R Commander *Output* pane (not shown), and they are university.teachers and physicians.

The default stem-and-leaf display for education appears in Figure 5.11; it is text output and so is printed in the *Output* pane.

FIGURE 5.10: Various default distributional displays for average education in the Canadian occupational prestige data. From top to bottom and left to right: index plot, dot plot, histogram, nonparametric density estimate with rug plot, boxplot, and quantile-comparison plot comparing the distribution of education to the normal distribution.

FIGURE 5.11: Default “Tukey-style” stem-and-leaf display for education in the Canadian occupational prestige data. The column of numbers to the left of the stems represents “depths”—counts in to the median from both ends of the distribution—with the parenthesized value (4) giving the count for the stem containing the median. Note the divided stems, with *x*. stems containing leaves 0–4 and *x* * stems leaves 5–9. Five-part stems are similarly labelled *x*. with leaves 01, *x* t with leaves 23, *x* f with leaves 45, *x* s with leaves 67, and *x* * with leaves 89.

FIGURE 5.12: *Bar Graph* dialog, showing the *Data* tab (top) and *Options* tab (bottom). I previously pressed the *Plot by* button and selected the factor vote.

5.3.2 Graphing Categorical Data

I’ll demonstrate graphing the distribution of a categorical variable by using the Chile data set from the car package. This data set is from a poll conducted about six months before the 1988 Chilean plebiscite on the continuation of military rule: voting “yes” in the plebiscite represented support for Pinochet’s military government, while “no” represented support for a return to electoral democracy. Two of the variables in the Chile data set are the factors vote, with levels “N” (no), “Y” (yes), “U” (undecided), and “A” (abstain), and education, with levels “P” (primary), “S” (secondary), and “PS” (post-secondary). In both cases, the default alphabetic ordering of the factor levels isn’t the natural ordering, and so, after reading the data, I change the orderings via *Data > Manage variables in active data set > Reorder factor levels* (see Section 3.4).

The *Graphs* menu includes two simple distributional plots for factors: frequency bar graphs and pie charts. Because it allows for dividing bars by the value of a second factor, the *Bar Graph* dialog, shown in Figure 5.12, is the more complex of the two. In the *Data* tab, at the top of the figure, I select the factor education to define the bars. I previously pressed the *Plot by* button and chose vote in the resulting *Groups* sub-dialog, and so the button displays *Plot by: vote*. I retain all of the default choices in the *Options* tab at the bottom of Figure 5.12. Clicking *OK* produces the graph in Figure 5.13. It’s apparent that relative support for the military government declined with education, but that overall the plebiscite appeared close (visually summing and comparing the “N” and “Y” areas across the bars).

FIGURE 5.13: Bar graph for education in the Chilean plebiscite data, with bars divided by vote. A color version of this figure appears in the insert at the center of the book.

Overall voting intentions are displayed in the pie chart in Figure 5.14. The *Pie Chart* dialog, not shown, simply allows you to pick a factor and, optionally, provide axis labels and a graph title.

FIGURE 5.14: Pie chart for vote in the Chilean plebiscite data. A color version of this figure appears in the insert at the center of the book.

5.4 Graphing Relationships

The third section of the *Graphs* menu is for graphing relationships between and among variables, including *scatterplots, scatterplot matrices*, and *3D scatterplots* for numeric variables, *line plots*, which are typically for time series data, *plots of means* of a numeric variable classified by one or more factors, *strip charts*, which are similar to conditional dot plots (discussed in Section 5.3.1), and *conditioning plots*, which are capable of representing the relationships between one or more numeric response variables and explanatory variables that are both numeric and factors.⁸ I’ll focus here on scatterplots for two numeric variables, scatterplot matrices for several numeric variables, 3D scatterplots for three numeric variables, and plots of means of a numeric variable classified by one or two factors.

In addition, and as mentioned previously, some of the distributional graphs discussed in Section 5.3.1 can be used to examine the relationship between a numeric response variable and a factor. These include dot plots, histograms, stem-and-leaf displays (with a dichotomous factor), and boxplots.

To illustrate the construction of scatterplots, scatterplot matrices, and 3D scatterplots, I return to the Canadian occupational prestige data in the previously read Prestige data set. Choosing *Graphs > Scatterplot* from the R Commander menus brings up the dialog box in Figure 5.15. As you can see, there are many options in the dialog, some of which I’ll describe presently. In the *Data* tab, I select income (which is the explanatory variable) as the *x-variable* and prestige (the response variable) as the *y-variable*. I retain all of the defaults in the *Options* tab, clicking *Apply* to draw the simple scatterplot in Figure 5.16. Occupational prestige apparently increases with income, but the relationship is nonlinear, with the rate of increase declining with income.

To draw the scatterplot in Figure 5.17, I click on the *Plot by groups* button in the *Data* tab; because it’s the only factor in the data set, type is preselected in the resulting *Groups variable* sub-dialog (not shown). The sub-dialog also has a checkbox for plotting lines by group, which is selected by default. In the *Options* tab, I check the boxes for *Least-squares line, Smooth line*, and *Plot concentration ellipses*. I also change the *Legend Position* from the default *Above plot* to *Bottom right*.

The smooth line is produced by a method of *nonparametric regression* called *loess*, an acronym for *lo*cal regression, which traces how the average value of *y* changes with *x* without assuming that the relationship between *y* and *x* takes a specific form. The *span* of the loess smoother is the percentage of the data used to compute each smoothed value: The larger the span, the smoother the resulting loess regression. The default span is 50%, a value that I increase to 100% because of the small number of cases in each level of occupational type. As a general matter, you want to select the smallest span that produces a reasonably smooth regression, a value that you can determine by trial and error, pressing the *Apply* button in the dialog each time you adjust the *Span* slider.
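To show what the span controls, here is a deliberately simplified loess-style smoother in Python. It is a sketch under stated assumptions, not the implementation the R Commander uses: at each x value it fits a weighted least-squares line to the nearest span × n points using tricube weights, and it omits refinements that production implementations may add. The function name `loess_sketch` and the data are hypothetical.

```python
import math


def loess_sketch(x, y, span=0.5):
    """Local linear smoother: one weighted straight-line fit per x value."""
    n = len(x)
    k = max(2, min(n, math.ceil(span * n)))  # points used in each local fit
    fitted = []
    for x0 in x:
        # Indices of the k points closest to x0.
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        h = max(abs(x[i] - x0) for i in nearest) or 1.0  # guard against h == 0
        # Tricube weights: 1 at x0, falling to 0 at the farthest neighbour.
        w = [(1.0 - (abs(x[i] - x0) / h) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the weighted mean
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Made-up data: an exactly linear relationship, which a local linear
# fit should reproduce regardless of the span.
xs = list(range(10))
ys = [2 * v + 1 for v in xs]
smooth_half = loess_sketch(xs, ys, span=0.5)
smooth_full = loess_sketch(xs, ys, span=1.0)
print(smooth_half)
```

With noisy data, rerunning the function at several spans mimics pressing *Apply* after each adjustment of the slider: small spans follow local wiggles, while large spans average over more of the data.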

Concentration ellipses are summaries of the variational and correlational structure of the points. For bivariately normally distributed data, concentration ellipses enclose specific fractions of the data—50% and 90% by default; the ellipses are computed robustly, however, to reduce the impact of outliers. To avoid an overly cluttered graph, I set the *Concentration levels* to 0.5, to draw only one ellipse for each occupational type.
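For bivariate-normal data, the size of a concentration ellipse follows from the chi-square distribution with 2 degrees of freedom, whose quantile has the closed form -2 ln(1 - level). The Python sketch below illustrates that relationship using an ordinary (non-robust) mean and covariance matrix, which is an assumption made for simplicity; the text notes that the plotted ellipses are computed robustly. The function names and numbers are hypothetical.

```python
import math


def ellipse_radius_squared(level):
    """Squared Mahalanobis radius enclosing `level` of a bivariate normal.

    This is the chi-square quantile with 2 df, which has a closed form.
    """
    return -2.0 * math.log(1.0 - level)


def mahalanobis_squared(point, mean, cov):
    """Squared Mahalanobis distance of a 2D point for a 2x2 covariance matrix."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form with the inverse of the covariance matrix.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def inside_ellipse(point, mean, cov, level=0.5):
    """True if the point lies within the concentration ellipse for `level`."""
    return mahalanobis_squared(point, mean, cov) <= ellipse_radius_squared(level)


# Illustrative numbers only: uncorrelated variables with unit variances.
mean = (0.0, 0.0)
cov = ((1.0, 0.0), (0.0, 1.0))
r50 = ellipse_radius_squared(0.5)
r90 = ellipse_radius_squared(0.9)
print(round(r50, 4), round(r90, 4))
print(inside_ellipse((1.0, 0.5), mean, cov, level=0.5))
print(inside_ellipse((1.0, 1.0), mean, cov, level=0.5))
```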

The scatterplot in Figure 5.17 suggests that the apparently nonlinear relationship between prestige and income is due to occupational type: Within levels of type, the relationship is reasonably linear, but with the slope changing across levels.

FIGURE 5.15: *Scatterplot* dialog: *Data* tab (top) and *Options* tab (bottom).

FIGURE 5.16: Simple scatterplot of prestige vs. income for the Prestige data.

FIGURE 5.17: Enhanced scatterplot of prestige vs. income by occupational type, showing 50% concentration ellipses, least-squares lines, and loess lines. A color version of this figure appears in the insert at the center of the book.

__5.4.3 Scatterplot Matrices__

A *scatterplot matrix* displays the pairwise relationships among several numeric variables; it is the graphical analog of a correlation matrix. The *Scatterplot Matrix* dialog, shown in Figure 5.18, is similar in most respects to the *Scatterplot* dialog. I select several variables in the *Data* tab and leave all of the choices in the *Options* tab at their defaults. Each off-diagonal panel in the resulting scatterplot matrix in Figure 5.19 displays the pairwise scatterplot for two variables, while the diagonal panels show the marginal distributions of the variables. The plots in the first row, for example, have education on the vertical axis, while those in the first column have education on the horizontal axis, and similarly for the other variables in the graph. Thus, the scatterplot in the second row, first column has income on the vertical axis and education on the horizontal axis.

__5.4.4 Point Identification in Scatterplots and Scatterplot Matrices__

Both the *Scatterplot* and the *Scatterplot Matrix* dialogs provide an option for automatic identification of noteworthy cases. Automatic point identification uses a robust method to find the most unusual points in each scatterplot, with the number of points to be identified set by the user.

The *Scatterplot* dialog additionally supports *interactive* point identification, selected by pressing the corresponding radio button in the *Options* tab. Under Windows or Linux/Unix, interactive point identification displays a message box with the text *Use left mouse button to identify points, right button to exit*; under Mac OS X, the message reads *Use left mouse button to identify points, esc key to exit*. In either case, click the *OK* button to dismiss the message box. On all operating systems, the mouse cursor turns into “cross-hairs” (a +) when the cursor is over the scatterplot. Left-clicking near a point labels the point with the row name of the corresponding case.

To produce Figure 5.20, I use the *Scatterplot* dialog (not repeated) to plot income (on the y-axis) versus education (on the x-axis), checking the box to identify points *Interactively with the mouse*. I click near two of the points, which are identified as general.managers and physicians. These are, incidentally, precisely the two points that are flagged if automatic point identification is employed. Both occupations have unusually high values of income given their levels of education.

There are two issues to keep in mind about interactive point identification:

1. It is necessary to exit from point identification mode before you can do anything else in the R Commander. If you forget to exit, the R Commander will appear to freeze!

2. Scatterplots using interactive point identification do not appear in the R Markdown document produced by the R Commander (see Section 3.6). As in the example, however, automatic point identification usually works quite well.

FIGURE 5.18: *Scatterplot Matrix* dialog, with *Data* tab (top) and *Options* tab (bottom).

FIGURE 5.19: Default scatterplot matrix for education, income, prestige, and women in the Prestige data set.

FIGURE 5.20: Scatterplot for education and income in the Prestige data set, with two points identified interactively by mouse clicks.

You can draw a dynamic three-dimensional scatterplot for three numeric variables by choosing *Graphs > 3D graph > 3D scatterplot* from the R Commander menus, bringing up the dialog box in Figure 5.21. The structure of the *3D Scatterplot* dialog is very similar to that of the *Scatterplot* dialog.

In the *Data* tab, I select two numeric explanatory variables, education and income, along with a numeric response variable, prestige. As in a 2D scatterplot, I could (but don’t) use the factor type to plot by groups—for example, using different colors for points in the various levels of type.

In addition to the default selections in the *Options* tab, I opt to plot the least-squares regression plane and an additive nonparametric regression, which allows nonlinear partial relationships between prestige and each of education and income; the degrees of freedom for these terms are analogous to the span of the 2D loess smoother: smaller *df* produce a smoother fit to the data. I allow *df* to be selected automatically. Concentration ellipsoids (which are not selected) are 3D analogs of 2D concentration ellipses. I also elect to identify 2 points automatically.⁹

Clicking *OK* produces the 3D scatterplot in Figure 5.22, which appears in an *RGL device* window.¹⁰ In the original image (which appears in the insert at the center of the book), points in the plot are represented by small yellow spheres. A static image doesn’t do justice to the 3D dynamic plot, which you can manipulate in the *RGL device* window: Left-clicking and dragging allows you to rotate the plot (in effect grabbing onto an invisible sphere surrounding the data), while right-clicking and dragging changes perspective.

The *Plot Means* dialog, selected by *Graphs > Plot of means*, displays the mean of a numeric variable as a function of one or two factors. For an example, I return to the Adler experimenter-expectations data (introduced in Section 5.1), which I make the active data set in the R Commander. The *Plot Means* dialog appears in Figure 5.23. In the *Data* tab, I select the factors expectation and instruction; the response variable rating is preselected because it’s the only numeric variable in the data set. I leave the *Options* tab in its default state.

Clicking *OK* yields the graph in Figure 5.24, where the error bars represent ±1 standard error around the means. Apparently, instructing the subjects to obtain “good” data produces a bias consistent with expectation, while instructing subjects to obtain “scientific” data or providing no instruction produces a smaller bias in the opposite direction.
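The error bars in a plot of means can be reproduced from the group data: each bar spans the group mean plus and minus one standard error, where the standard error is the sample standard deviation divided by the square root of the group size. The Python sketch below uses invented ratings (NOT the Adler data) and a hypothetical helper name.

```python
import math
import statistics


def mean_and_se(values):
    """Sample mean and its standard error, sd / sqrt(n)."""
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Invented ratings grouped by one factor, for illustration only.
groups = {
    "GOOD": [2.0, 4.0, 4.0, 6.0],
    "NONE": [1.0, 1.0, 3.0, 3.0],
}

error_bars = {}
for level, ratings in groups.items():
    mean, se = mean_and_se(ratings)
    # Lower and upper ends of the +/- 1 standard error bar.
    error_bars[level] = (mean - se, mean + se)
    print(f"{level}: mean={mean:.3f}, SE={se:.3f}")
```

Bars of ±1 standard error are narrower than conventional 95% confidence intervals, so overlapping or non-overlapping bars should be read as a rough visual guide rather than a formal test.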

FIGURE 5.21: The *3D Scatterplot* dialog: *Data* tab (top) and *Options* tab (bottom).

FIGURE 5.22: Three-dimensional scatterplot for education, income, and prestige in the Prestige data set, showing the least-squares plane (nearly edge-on, in blue) and an additive nonparametric regression surface (in green). The points are yellow spheres, two of which (general.managers and physicians) were labelled automatically. A color version of this figure appears in the insert at the center of the book.

FIGURE 5.23: Plot Means dialog box: Data tab (top) and Options tab (bottom).

FIGURE 5.24: Mean rating by instruction and expectation for the Adler data set. The error bars represent ±1 standard error around the means.

¹ Some figures in the chapter appear in color in the insert at the center of the book.

² Of course, it isn’t sensible to discard data like this, but I wanted to produce a more complex two-way analysis of variance example, with unequal numbers of cases in the combinations of levels of two factors; see Section 6.1.

³ Two of the items in this menu, *Correlation test* and *Shapiro-Wilk test of normality*, perform simple hypothesis tests; I’ll take them up in Chapter 6.

⁴ Although the R Commander can switch among data sets in this manner, you’ll no doubt work with a single data set in most of your R Commander sessions.

⁵ The number of bins specified is just a *target* because the program that creates the histogram also tries to use “nice” numbers for the boundaries of the bins. The default target number of bins is determined by Sturges’s rule (Sturges, 1926).

⁶ To obtain the marginal histogram for education (that is, *not* to plot by occupational type), either press the *Reset* button in the *Histogram* dialog, or press the *Plot by: type* button, and deselect type in the resulting *Groups* sub-dialog by *Ctrl*-clicking on it in the *Groups variable* list.

⁷ Probability distributions are discussed in Chapter 8.

⁸ The R Commander dialogs for strip charts and conditioning plots were originally contributed by Richard Heiberger.

⁹ Interactive point identification in 3D works differently than in a 2D scatterplot: You right-click and drag a box around the point or points you want to identify, repeating this procedure as many times as you want. To exit from point identification mode, you must right-click in an area of the plot where there are no points. You can identify points interactively by checking the appropriate box in the *3D Scatterplot Options* tab or, after drawing a 3D scatterplot, by selecting *Graphs > 3D graph > Identify observations with mouse* from the R Commander menus.

¹⁰ The R Commander uses the scatter3d function in the car package to draw 3D scatterplots; scatter3d, in turn, employs facilities provided by the rgl package (Adler and Murdoch, 2015) for constructing 3D dynamic graphs.

RCH 8303, Quantitative Data Analysis 1

Course Learning Outcomes for Unit III

Upon completion of this unit, students should be able to:

1. Perform statistical tests using software tools.

1.1 Describe the procedures to summarize and display data.

1.2 Report normality statistics.

2. Explain results of statistical tests.

2.1 Describe the process to determine whether data are normally distributed or not.

2.2 Demonstrate the procedures necessary to successfully create a histogram, bar chart, boxplot, Q-Q plot, and cross-tabulation test.

3. Judge whether null hypotheses should be rejected or maintained.

3.1 Discuss how graphing data can help determine conclusions of our data.

3.2 Discuss differences between one-sided and two-sided hypotheses and when to use them.

3.3 Explain how to rule out rival hypotheses.

3.4 Discuss what contingency tables are and what they are used for.

| Course/Unit Learning Outcomes | Learning Activity |
|---|---|
| 1.1, 1.2 | Unit Lesson; Chapter 5; Unit III Assignment 2 |
| 2.1, 2.2, 3.1, 3.2, 3.3, 3.4 | Unit Lesson; Unit III Assignment 1 |

Required Unit Resources

Chapter 5: Summarizing and Graphing Data

Unit Lesson

Introduction

In Unit III, we turn our focus to how a researcher can display and understand data using visual methods. Visualization of data is important because it facilitates interpretation. The form or type of data the researcher needs to analyze determines which data display methods can be used. This unit will demonstrate various types of data display methods to allow researchers to better understand their data.

From a researcher’s point of view, it is very important to view the data; by doing so, the researcher can make observations about the data rather than simply relying on numerical output. The reader of the material can also view the data and form their own opinions based on it.

This unit will focus on how to summarize and graph data; specifically, how to create a histogram, bar chart, boxplot, and Q-Q plot. In addition, instruction will focus on how to perform and report a cross-tabulation.

The Unit III Assignment will be in three parts.

UNIT III STUDY GUIDE

Summarizing and Graphing Data

RCH 8303, Quantitative Data Analysis 2

Part 1 of your assignment requires you to complete the Contingency Tables and Chi-Square Tests (ID 17630) module of the Collaborative Institutional Training Initiative (CITI) Program Essentials of Statistical Analysis (EOSA) located in Part 3. This module explains whether a contingency table can be analyzed using a chi-square test, what contingency tables are, and how they display the relationship between categorical variables. The module then demonstrates how to analyze relationships between categorical variables using chi-square tests.

Part 2 will require you to construct a histogram, bar chart, boxplot, and Q-Q plot. The results will be compiled and submitted in a single Microsoft Word file.

Part 3 of your assignment is to perform a cross-tabulation, analyze the results, and report the results in APA format in a single Microsoft Word file.

What is a Histogram?

Once data are collected, a researcher needs to be able to describe, summarize, and, potentially, detect patterns in the data they have recorded with meaningful numerical scales (McClave & Sincich, 2006). To do this, the researcher can use a histogram. A histogram allows a researcher to show how the values of a variable that is continuous in nature are distributed (Gall et al., 2003). Huck (2004) notes that histograms are used to indicate how many times a score appears in a data set.

The Essentials of Statistical Analysis (EOSA) module Distribution and Probability (ID 17613), presented in Unit I, provides an example of how a histogram is used to display data. A histogram displays data using values on an x-axis (horizontal) and a y-axis (vertical). R Commander provides an Options tab to define labels for the x- and y-axes and determine the width of the display bins.

If you are not comfortable using R and R Commander, you may use whatever statistical software program you choose. The answers you submit for your assignment must be correct regardless of the software you choose.

R and R Commander make it easy and quick to construct a histogram. Using the data file “Duncan” provided by the text, one can quickly construct a histogram of any of its numeric variables.

Once the data file has been accessed, the process to create a histogram is very easy. Make sure when you access R that you also load R Commander. Type in library(Rcmdr) or see Unit I for a refresher on how to gain access to R Commander.

When R and R Commander have been loaded, selecting Graphs in the menu will present various options. Selecting Histogram allows us to use a data set to create a histogram (Figure 1).

Figure 1

Creating a Histogram From R Commander Menu System

As depicted in Figure 2, once Histogram is selected, the user has two types of options. First, the user must select the variable to be displayed from the Data tab. In this case, the income variable was selected.

Figure 2

Histogram Variable Selection

At this point, the user could click “OK,” and the histogram would be created. This histogram would display all income data regardless of group or category.
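
For readers who prefer typed commands, the menu steps above can be approximated in plain R. The following is a minimal sketch, not the exact command that R Commander writes; it assumes the carData package (normally installed alongside R Commander) supplies the Duncan data, and it uses the base R hist() function.

```r
# Assumption: the carData package is installed; it provides Duncan.
library(carData)

# Histogram of income for all occupations, ignoring groups.
# xlab and ylab play the role of the axis labels set on the Options tab.
h <- hist(Duncan$income,
          breaks = "Sturges",      # base R's default binning rule
          xlab = "income",
          ylab = "frequency",
          main = "Histogram of income")

# Number of observations that fall in each bin
h$counts
```

The counts stored in h add up to the number of rows in the data set, which is a quick check that no cases were dropped.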

However, if a user wanted to create a histogram of income by groups, such as type of occupation, the Groups tab could be accessed and the type variable selected (Figure 3).

Figure 3

Selection of Type From the Groups Tab

Selecting Options (Figure 4) allows the researcher to label the x- and y-axes and define the width of the bins of the data display.

Figure 4

Histogram Options Selection Tab

Selecting a histogram of income by type would result in a three-histogram display (Figure 5).

Figure 5

Histograms of Income by Occupation Type

This approach may provide a researcher a better grasp of the data.
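
A grouped display like this one can also be sketched in base R. The example below is an illustration under assumptions rather than the command R Commander generates: it assumes the carData package provides Duncan, splits income by the type factor, and draws one histogram per group on a shared x-axis scale.

```r
# Assumption: the carData package is installed; it provides Duncan.
library(carData)

# Split income into one vector per occupation type.
income_by_type <- split(Duncan$income, Duncan$type)

# Stack one histogram per group in a single plotting window.
old_par <- par(mfrow = c(length(income_by_type), 1))
for (type_name in names(income_by_type)) {
  hist(income_by_type[[type_name]],
       xlim = range(Duncan$income),   # same x scale for easy comparison
       xlab = "income",
       main = paste("type =", type_name))
}
par(old_par)  # restore the previous plot layout
```

Keeping the same x-axis range in every panel makes it easier to compare where each group's incomes fall.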

What is a Bar Graph?

Researchers need to create displays that are not misleading or oversimplified, but informative and visually appealing (Gall et al., 2003). A bar graph is normally used to display the distribution of a categorical variable. Refer to Introduction to Statistics (ID 17609) for a refresher on categorical variables. Gall et al. (2003) note that a bar graph shows the relationship between variables. In a bar graph, the horizontal axis represents the categories of a qualitative variable, whereas in a histogram the horizontal axis represents a quantitative variable (Huck, 2004).

Using the data set Nations, navigate to the Graphs menu and select Bar graph (Figure 6).

Figure 6

Bar Graph Dialog Box

Next, select region (Figure 7) and click “OK.”

Figure 7

Bar Graph Variable Selection Sub-Menu

The resulting graph depicts the frequency of records by region (Figure 8).

Figure 8

Bar Graph Output Display
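
The same kind of display can be sketched in base R. Because the Nations file comes from the textbook, the example below uses a small made-up region variable (an assumption for illustration only, not the real data); table() counts the records in each category and barplot() draws one bar per category.

```r
# Hypothetical stand-in for the region variable of the Nations data set.
region <- factor(c("Africa", "Americas", "Asia", "Europe", "Africa",
                   "Asia", "Africa", "Oceania", "Europe", "Americas"))

# Frequency of records in each category
counts <- table(region)

# One bar per category; bar height is the frequency.
barplot(counts, xlab = "region", ylab = "Frequency")
```

Because the bars represent separate categories rather than intervals on a numeric scale, their order along the horizontal axis carries no numeric meaning.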

What is a Boxplot?

Visual representations of a data set are much more effective than numerical representations of data (Hartwig & Dearling, 1979). A boxplot displays the variability within a data set. McClave and Sincich (2006) note that boxplots are based on the quartiles of an existing data set and are very good for detecting outliers in data.

McClave and Sincich remind the reader that a boxplot is made up of four components: the median, the interquartile range, the minimum and maximum, and outliers (Figure 9).

Figure 9

Explanation of a Boxplot

The EOSA modules Central Tendency and Variability (ID 17611) and Normal Distribution and Z-Scores (ID 17615), presented in Unit I, have examples of how a boxplot is used.

Using the data set Prestige, navigate to the Graphs menu and select Boxplot (Figure 10).

Figure 10

Box Plot Dialog Box

Next, choose the income variable and click “OK” (Figure 11).

Figure 11

Box Plot Variable Selection Sub-Menu

Note that the boxplot identifies five outliers for further investigation (Figure 12).

Figure 12

Box Plot Display With Visible Outliers
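
The quantities behind a boxplot can be inspected directly in base R. This is a minimal sketch that assumes the carData package provides the Prestige data; it uses base boxplot(), which, unlike the plot produced through the menus, does not label the outlying cases.

```r
# Assumption: the carData package is installed; it provides Prestige.
library(carData)

# Draw the boxplot and keep the numbers used to draw it.
b <- boxplot(Prestige$income, ylab = "income")

b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # values drawn individually as outliers
```

By default, boxplot() flags a value as an outlier when it lies more than 1.5 times the height of the box beyond either end of the box.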

Q-Q Plot

Instead of relying only on a histogram to judge whether your data are normally distributed, you can use a quantile-quantile (Q-Q) plot to assess normality. A Q-Q plot visually compares the distribution of the observed data with a theoretical distribution (Mayor, 2015). Using the data set Prestige, navigate to the Graphs menu and select Quantile-comparison plot (Figure 13).

Figure 13

Quantile-Comparison Plot (Q-Q Plot) Menu Selection

Next, choose the income variable to display and select “OK” (Figure 14).

Figure 14

Q-Q Plot Variable Selection Option

As depicted in Figure 15, the Q-Q plot shows data points that depart not only from the theoretical normal distribution (straight line) but also fall outside the 95% confidence interval (CI; dotted line). Finally, two outliers are identified by group.

Figure 15

Q-Q Plot Display With Visible Outliers
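
A simpler version of this plot can be drawn with base R. The sketch below assumes the carData package provides Prestige and uses qqnorm() and qqline(); it is a stand-in for the menu-driven plot and draws only the points and a reference line, without the confidence envelope or the labeled outliers.

```r
# Assumption: the carData package is installed; it provides Prestige.
library(carData)

# Sample quantiles of income plotted against normal quantiles.
q <- qqnorm(Prestige$income,
            ylab = "income",
            main = "Normal Q-Q plot of income")

# Reference line through the first and third quartiles.
qqline(Prestige$income)
```

Points that follow the line closely are consistent with a normal distribution; points that bend away from it, especially at the ends, suggest skewness or outliers.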

What is Cross Tabulation?

A cross-tabulation, which is also known as a contingency table, displays the frequencies for each combination of categories of two variables; a chi-square test applied to the table determines whether the variables are associated. View page 35 in your textbook. The researcher would normally use this table to examine relationships between categorical variables. The Contingency Tables and Chi-Square Tests (ID 17630) module of the CITI Program EOSA located in Part 3 will demonstrate this form of data analysis.

Using the data set Nations, navigate to the Statistics menu and select Contingency tables > Two-way table (Figure 16).

Figure 16

Contingency Tables Dialog Box

Choose the two variables (Figure 17).

Figure 17

Cross-Tabulation Variable Selection Sub-Menu

Before selecting the “OK” button, click on the Statistics tab (Figure 18), where you can choose which percentages to compute and the type of hypothesis test you are performing. Make sure the Chi-square test of independence is checked. Now you are ready to select “OK.”

Figure 18

Two-Way Table Statistics Selection Menu

The resulting output reports the statistical test. There are three values: the chi-square statistic (χ²), the degrees of freedom (df), and the p-value (Figure 19).

Figure 19

Cross-Tabulation Display With Pearson’s Chi-Squared Test Results
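
To see where the three values come from, the test can be run on a small table in base R. The counts below are hypothetical (they are not the Nations data, and the row and column names are made up for illustration); the sketch uses prop.table() for row percentages and chisq.test() for the Pearson chi-square test of independence.

```r
# Hypothetical counts (NOT the Nations data): rows are three regions,
# columns are two fertility categories.
tab <- matrix(c(30, 10,
                20, 20,
                10, 30),
              nrow = 3, byrow = TRUE,
              dimnames = list(region = c("A", "B", "C"),
                              fertility = c("low", "high")))

# Row percentages, comparable to the percentages option in the dialog
round(100 * prop.table(tab, margin = 1), 1)

# Pearson chi-square test of independence
res <- chisq.test(tab)
res$statistic  # chi-square statistic
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # p-value
```

For this made-up table, every expected count is 20, so the statistic works out to 20 with 2 degrees of freedom and a p-value below .001.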

To report the results of a Pearson chi-square test following the American Psychological Association (APA) Style Guide (7th ed.), a researcher would state:

A chi-square test of independence was performed to examine the relationship between Total Fertility Rate (TFR) and region. The relation between these variables was significant, χ²(596) = 692.33, p = .004.
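
A small helper can assemble the statistical part of such a statement from the three values in the output. This is only a sketch: apa_chisq is a hypothetical function name, not part of R or R Commander, and it formats just the statistic, degrees of freedom, and p-value.

```r
# Hypothetical helper: formats chi-square results in APA style.
apa_chisq <- function(statistic, df, p) {
  # APA style drops the leading zero of p and reports very small
  # values as p < .001.
  p_text <- if (p < 0.001) {
    "p < .001"
  } else {
    paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
  }
  # "\u03c7\u00b2" is the Greek letter chi followed by a superscript 2.
  paste0("\u03c7\u00b2",
         sprintf("(%d) = %.2f, %s", as.integer(df), statistic, p_text))
}

# Values taken from the example statement above
apa_chisq(692.33, 596, 0.004)
```

The surrounding sentence, including the variables examined and whether the relationship was significant, still has to be written by the researcher.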

References

Gall, M. D., Gall, J. P., & Borg, W. R. (2003). *Educational research: An introduction* (7th ed.). Allyn and Bacon.

Hartwig, F., & Dearling, B. E. (1979). *Quantitative applications in the social sciences: Exploratory data analysis*. SAGE Publications.

Huck, S. W. (2004). *Reading statistics and research*. Allyn and Bacon.

Mayor, E. (2015). *Learning predictive analytics with R*. Packt Publishing.

McClave, J. T., & Sincich, T. (2006). *Statistics* (10th ed.). Prentice Hall.

Learning Activities (Nongraded)

Nongraded Learning Activities are provided to aid students in their course of study. You do not have to submit them. If you have questions, contact your instructor for further guidance and information.

When studying APA formatting, pay particular attention to the sections that pertain to formatting for research and statistics. Review these sections as needed.
