Here we walk through the components involved in understanding what scatter plots are, how to read them, and how to interpret them. This is not a list of questions to ask every time your students work with scatter plots, but rather a resource of scatter plot skills that we need to help our students build over time.
What are scatter plots?
A scatter plot places a quantitative attribute on each axis to investigate the relationship between the two attributes in the data. If desired, a third attribute can be shown through the formatting of the data points.
How do we read scatter plots?
- What is the range of the quantitative scales/axes?
Tuva tip: Besides just reading the lowest and highest values on the y- and x-axes, students can adjust the size of the axes by moving the “T” bars at each end to see if that is the full extent of the data, and/or look at the data values in the corresponding attribute columns in the Table View (below graph).
- What does each dot on the graph represent?
Tuva tip: Have students open the Table View (below graph), click on a dot within the graph, and find the corresponding case that gets highlighted in the table. Reinforce the connection between the graph, data table, and the case card.
- How many data values are in the graph?
Tuva tip: Use the Count function or have students look through the number of cases in the Table View (below graph). Having an understanding of how many data points you have will influence your confidence in the patterns observed in the data, so having a sense of this before you interpret the data is helpful.
- Is there an additional attribute in the plot?
- Is there a categorical variable as well in the plot? If so, what are the categories?
Tuva tip: Categorical variables in scatter plots most often appear through the formatting of the data points, such as different colors, shapes, or sizes. For example, months is plotted by color below.
- Is there an additional quantitative variable (other than those on the x- and y-axes)? If so, how is it represented?
Tuva tip: Additional quantitative variables in scatter plots most often appear through the formatting of the data points, such as different colors, shapes, or sizes, but they can also be plotted on a secondary y-axis. For example, mean high air temperature is plotted as color below.
- Moving from left to right (x-axis) and from top to bottom (y-axis), do the values tend to increase or decrease?
Tuva tip: While the majority of graphs have axes set up to increase going up the y-axis and to the right along the x-axis, this is not always the case. For example, when plotting water depth data, depth values increase going down the y-axis (to better mimic going down into the water), and when plotting millions of years ago, the time values increase going to the left on the x-axis (to better show going back in time). Therefore, it is important for students to actually look at how the axes are oriented.
- On which end of the scale will you find the highest data values? Where will you find the smallest data values?
Tuva tip: Encourage students to click on different dots or hover over them to read the data values of the dots.
- To what extent are data clumped together or evenly spread out? What patterns do you see in clumping?
Tuva tip: Encourage students to look for clusters in the data and to use the shape options in the Annotate feature to mark the clusters. For example, it seems there are two clusters of data in the scatter plot below.
- Where are any isolated or extreme points (outliers in the data)?
Tuva tip: Have students identify these points by eye, then encourage them to study the context of the data to see whether the far-out values make sense.
- What are the maximum values?
Tuva tip: Always use the context of the data for framing all data interpretation related questions. For example, rather than asking “What are the maximum values?” ask questions like “What is the longest duration of an eruption at Old Faithful?” or “What is the longest wait time recorded for eruptions at Old Faithful?”
- What are the minimum values?
Tuva tip: Similar to the maximum value questions, always use the context of the data for framing all data interpretation related questions.
- Are there ranges of case values that occur more frequently?
Tuva tip: Again, always use the context of the data for framing all data interpretation related questions. For example, the question “At what duration length are there many cases across a range of wait times?” helps students key into frequency of cases of one attribute for a range of values of the other attribute that is relevant for their interpretation of the data.
How do we interpret relationships using scatter plots?
- Describe the direction of the relationship between the attributes.
Tuva tip: Have the students use their knowledge of the direction of the axes, scale of axes, and maximum/minimum values for each attribute to talk about the general nature of the relationship. Prompts might be: “As values along the x-axis attribute increase, what happens to the values of the y-axis attribute? Do they also increase, do they decrease, or do they not change?” This helps your students make the connection between the data values and whether the relationship is positive, negative, or no relationship, and what that means in the context of the dataset.
- What is the mathematical relationship between the attributes?
Tuva tip: Use Stats to add the Least Squares Line to the plot as the mathematical calculation of the relationship. Beginning students might focus on the direction of the slope of the line (rising or falling to the right). More advanced students can hover over the Least Squares Line to find the equation with the calculated intercept and slope of the line.
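For educators who want to see the arithmetic behind a least squares line, here is a minimal Python sketch. It is not how Tuva computes the line internally; the function name `least_squares_line` and the wait/duration values are hypothetical and are not real Old Faithful measurements.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustration values (assumption, not measured data)
wait_minutes = [50, 60, 70, 80, 90]
duration_minutes = [2.0, 3.0, 3.5, 4.0, 4.5]

slope, intercept = least_squares_line(wait_minutes, duration_minutes)
direction = "rising" if slope > 0 else "falling" if slope < 0 else "flat"
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, line is {direction} to the right")
```

The sign of the slope answers the beginner question (rising or falling), while the slope and intercept together give the equation more advanced students read off the line.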
- How well does the line explain the relationship between the two attributes?
Tuva tip: For beginners, have them explain how close or far the data points are overall from the Least Squares Line. The farther the data points are from the line, the less well the line explains the changes or variation in the values of each attribute. For more advanced students, have them look at the r² value to assess how strong the relationship between these attributes is. For example, at Old Faithful r² = 0.7874, meaning that 78.74% of the variation in the duration of an eruption can be explained by the wait time between eruptions. That is a high amount of explanation!
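To make the meaning of r² concrete, the sketch below computes it by comparing the leftover variation around the least squares line with the total variation in the y values. The function name `r_squared` and the data are hypothetical, chosen only for illustration, so the result will not match the Old Faithful value quoted above.

```python
def r_squared(xs, ys):
    """Share of the variation in ys explained by a least squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after using the line to predict each y value
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of the y values around their mean
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Made-up illustration values (assumption, not measured data)
wait_minutes = [50, 60, 70, 80, 90]
duration_minutes = [2.0, 3.0, 3.5, 4.0, 4.5]

r2 = r_squared(wait_minutes, duration_minutes)
print(f"r squared = {r2:.4f}")
```

A value near 1 means the points sit close to the line; a value near 0 means the line explains little of the variation.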
- How far are the data points from the relationship line?
Tuva tip: Have students use the Annotate / Pencil feature to draw a line along the upper bound of the data (blue below) and another line along the lower bound of the data (yellow below). The key is not to make the lines perfectly straight or to hit every one of the farthest-out data points, but to capture in general the range of data points farthest above and below the Least Squares Line. Have students make observations about how close or far the upper and lower bound lines are from the Least Squares Line. If they are close, that indicates a strong relationship; if they are far away, that indicates a weak relationship. For more advanced students, you can have them look at the distribution of points between the Least Squares Line and the upper or lower bound lines. Are the data points uniformly spread away from the line, are only a few far away while most are close, or are most far away and only a few close? Have them think about what this means for their confidence in the strength of the relationship.
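A rough numeric analog of the hand-drawn bound lines is to measure each point's vertical distance from the least squares line (its residual) and look at the largest distances above and below. This is only a sketch under that assumption, not a Tuva feature; the function name `residuals` and the data are hypothetical.

```python
def residuals(xs, ys):
    """Vertical distance of each point from the least squares line (positive = above)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Made-up illustration values (assumption, not measured data)
wait_minutes = [50, 60, 70, 80, 90]
duration_minutes = [2.0, 3.0, 3.5, 4.0, 4.5]

res = residuals(wait_minutes, duration_minutes)
farthest_above = max(res)  # rough stand-in for the upper bound line
farthest_below = min(res)  # rough stand-in for the lower bound line
print(f"farthest above the line: {farthest_above:.2f} minutes")
print(f"farthest below the line: {farthest_below:.2f} minutes")
```

Listing or sorting the residuals also shows whether most points hug the line with a few far away, or whether the points are spread evenly between the line and the bounds.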
- How well can you predict (or interpolate) the value of a new case within the range of data we have?
Tuva tip: To help students see the opportunity to make a prediction/interpolation of what a new measured case could be within the existing range, encourage them to use the Stats / Add Reference Line on X/Y feature to make predictions (either from the Least Squares Line or from the range of options based on the upper and lower bounds). For example, if you ask “what eruption duration would likely occur after 75 minutes of wait time?” your students could use the Least Squares Line (grey) to predict around 3.7 minutes, or they could use the upper (blue) and lower (yellow) bound lines to predict between about 2.5 and 5.2 minutes.
- Can you make a prediction/extrapolation of the value of a future case beyond the scales you have?
Tuva tip: To help students see the opportunity to make a prediction/extrapolation of a future measured case beyond the scale of the measured data you currently have, encourage them to use the Stats / Add Reference Line on X/Y feature to make predictions. Note that they may need to adjust the scales of the axes to move beyond the measured data values in your dataset. For example, if you ask “what would the eruption duration be after 40 minutes of wait time?” your students could extrapolate that it would be around 1 minute.
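The difference between interpolating and extrapolating can be sketched in a few lines of Python: the same line equation is used in both cases, and the only change is whether the new x value falls inside the range of the measured data. The function name `predict` and the data are hypothetical, so the printed predictions are illustrative and will not match the values read from the Old Faithful plot.

```python
def predict(x_new, xs, ys):
    """Predict y at x_new from the least squares line and label the kind of prediction."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Inside the measured x range is interpolation; outside it is extrapolation
    kind = "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"
    return slope * x_new + intercept, kind


# Made-up illustration values (assumption, not measured data)
wait_minutes = [50, 60, 70, 80, 90]
duration_minutes = [2.0, 3.0, 3.5, 4.0, 4.5]

for wait in (75, 40):
    duration, kind = predict(wait, wait_minutes, duration_minutes)
    print(f"wait {wait} min: predicted duration {duration:.1f} min ({kind})")
```

Extrapolated values deserve extra caution in discussion, because no measured cases exist in that range to confirm that the pattern continues.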