4 Data Visualization
Data visualization is the practice of translating information into a visual context, such as a chart or map, to make data easier for the human brain to understand. In any context – whether in a business dashboard, a scientific report, or a student project – effective visualizations help reveal insights and tell a story. In the International Baccalaureate (IB) curriculum and other academic programs, there is a strong emphasis on clear communication and integrity of information. Visualizing data is not just about making charts look appealing; it’s about conveying truth and meaning in a way that an audience can grasp quickly. The IB learner profile values of being principled and communicative are directly relevant here: one must present data honestly and in a globally understandable way. With that in mind (along with some new tools like Posit’s Positron IDE for data science, which we’ll touch on later), let’s explore the fundamentals of effective data visualization.
4.1 Key Principles of Data Visualization
Experts and educators often highlight a handful of core principles that good data visualizations should follow. We will discuss five key principles and why they matter:
1. Graphical Integrity
Graphical integrity means that a visual representation of data should tell the truth about the numbers. In other words, never let the design of a chart distort or mislead the viewer about what the data actually says. Visualizations must be accurate, honest, and transparent about the data. This principle aligns with what statistician Edward Tufte famously emphasized: “Visual representations of data must tell the truth.” In Tufte’s list of fundamentals, one is explicitly Graphical Integrity, i.e. ensuring that charts and graphs are truthful and clear. Here are some aspects of maintaining integrity in visuals:
Use appropriate axes and scales: Perhaps the most common breach of integrity is manipulating the axis scales. For example, starting a bar chart’s y-axis at a value higher than zero can exaggerate differences. Tufte notes that a non-zero baseline can amplify small differences and mislead the viewer. If two values differ by only 1%, but the y-axis is truncated to a narrow range, a bar for the higher value might look several times taller than the other – a deceptive impression. Always consider if your axes should include zero (for bar charts, usually yes) and use a consistent scale. Log scales or breaks can be used when appropriate, but these choices must be clearly communicated.
Avoid “chart junk” that distorts data: Unnecessary three-dimensional effects, overly fancy graphics, or pictograms that don’t scale linearly with the data can misrepresent values. For instance, using area or volume to represent 1D data (like showing one number with a larger 3D object) can trick the eye about magnitude differences.
Proportion and context: Ensure that visual proportions correspond to the numerical proportions. If one category is twice as large as another, its bar or slice or point should appear twice as large (not 10 times larger). Provide context like baselines or reference lines so the audience can interpret the magnitude correctly. Label axes clearly with units, so the viewer isn’t guessing scale.
Disclose uncertainty or variability: Integrity also means not hiding the fact that data has uncertainty. Use error bars, confidence intervals, or annotations to indicate data quality where relevant. In an academic or IB context, being honest about the limitations of data is part of ethical communication.
To illustrate the importance of graphical integrity, consider the following example of a misleading vs. an accurate chart. The first chart (left) shows Company X’s profits over 5 months with a truncated y-axis that starts at 20 million instead of 0, making the growth look dramatic:
Misleading visualization: The y-axis begins at 20 rather than 0, exaggerating a small increase in profits. In reality, profits only rose from 20.1M to 20.4M (a very modest change), but the bar heights differ greatly, giving a false impression of a big jump.
In contrast, the second chart (right) shows the same data with a y-axis starting at zero:
Accurate visualization: With the y-axis starting at 0, it’s clear that the profit increase is minimal – the bars for each month are almost the same height. This chart maintains graphical integrity by not overstating the change.
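The two versions can be sketched in ggplot2. This is a minimal sketch using the illustrative profit figures from the text (20.1M rising to 20.4M), not real data. Note that coord_cartesian() zooms the view without discarding data, whereas ylim() would silently drop any bars outside the range:

```r
library(ggplot2)

# Hypothetical profit data matching the example in the text (in $ millions)
profits <- data.frame(
  month  = factor(month.abb[1:5], levels = month.abb[1:5]),
  profit = c(20.10, 20.15, 20.22, 20.31, 20.40)
)

# Misleading: zoom the y-axis to 20-20.5, exaggerating a ~1.5% change
p_misleading <- ggplot(profits, aes(month, profit)) +
  geom_col(fill = "steelblue") +
  coord_cartesian(ylim = c(20, 20.5)) +
  labs(title = "Truncated axis: growth looks dramatic", y = "Profit ($M)")

# Honest: bars start at zero, so heights are proportional to the values
p_honest <- ggplot(profits, aes(month, profit)) +
  geom_col(fill = "steelblue") +
  labs(title = "Zero baseline: change is modest", y = "Profit ($M)")
```

Printing the two plots side by side makes the distortion obvious: the same data, two very different impressions.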
As these examples demonstrate, a dishonest tweak in scaling can lead viewers to incorrect conclusions. Integrity is paramount: A visualization should never sacrifice truth for aesthetic or dramatic effect. For IB students and professionals alike, this means your charts and graphs should uphold the same standards of honesty as your written or spoken statements. Misleading visuals can be considered as serious as misquoting a source. Always double-check: are we “telling the truth” with this graphic?
Finally, keep in mind global and cultural fairness in visualization. An aspect of integrity is not only what you show, but how you show it to a diverse audience. For example, using color (addressed later) or symbols that have different meanings in different cultures could inadvertently mislead or confuse. In a globally minded framework like IB, ensuring your visuals are interpreted correctly by people from various backgrounds is part of maintaining integrity.
2. Use the Right Display (Appropriate Chart Type)
Choosing the appropriate type of chart or graph for your data is crucial. The effectiveness of a visualization hinges on using a form that suits the structure of your data and the message you want to convey. As one data visualization guide puts it, the goal is to match “different chart types [to] different kinds of data and analytical purposes”. In practice, this means:
Match the chart to your question: Figure out what you want to show. Is it a comparison between categories? A trend over time? A distribution of values? A relationship between variables? Different goals suggest different chart types:
- Use a bar chart or column chart to compare quantities across categories (e.g. sales by region). Bars make it easy to compare lengths.
- Use a line chart to show trends over continuous data, typically time. Line charts excel at showing how something changes sequentially (e.g. monthly unemployment rate over years).
- Use a scatter plot to show relationships or correlations between two quantitative variables (e.g. height vs weight).
- Use a pie chart (or its variation, a donut chart) to show parts of a whole (proportions), but only when you have a few categories and want to emphasize percentages. Avoid pie charts with too many slices – they become hard to read.
- Use a histogram or box plot to show distributions of a single variable (histogram for the frequency distribution shape, box plot for summary statistics like median and quartiles).
- Use a scatter plot or bubble chart for multivariate relationships (two variables on axes, perhaps a third as color or size).
- Use a map if your data has a geographic component and you want to show spatial patterns.
- This list can go on: heatmaps, tree maps, network graphs, etc., each suited for specific data structures.
Consider the nature of your data: The type of variables matters. If your x-axis is time or an ordered sequence, a line graph makes sense to connect points in order. If your data is categorical (unordered categories like types of fruit, or industries), bars might be better. If you have a part-to-whole relationship, a pie or stacked bar could work (though use with caution). If you have a hierarchy, maybe a tree map. Always ask: What is the data type (time series? categories? continuous? proportions? spatial?) and what chart types handle that well? A helpful summary from one source: identify whether you’re showing a comparison, a distribution, a correlation, or a composition, and “choose the simplest chart that accomplishes your goal”. For example, if comparing a few categories, a simple bar chart is often clearer than, say, a radar chart or some novelty visual. Overly complex or uncommon chart types can confuse more than clarify.
Leverage conventions: Some chart types are so standard that audiences immediately grasp them. For instance, line = trend, bar = compare sizes, etc. Using the expected chart for a given data/story can make comprehension faster. Conversely, using an unconventional chart (when a simpler one would do) may require more effort from the viewer to decode. Aim for clarity over cleverness. A famous guideline is “Do not reinvent the wheel unnecessarily – if a basic chart type communicates well, use it.”
Avoid chart-type pitfalls: Each chart type has its pitfalls. Pie charts, for example, are notorious when there are many slices or the differences are subtle; our eyes are not as good at comparing angles or areas as they are at comparing lengths (bars) or positions. If you find yourself adding data labels to each tiny pie slice because the visual differences aren’t clear, that’s a sign another chart might be better. Stacked area charts can obscure data if too many segments are stacked. Know the limitations: for instance, a single pie chart can show composition at one point in time, but to compare compositions across many points, multiple pies are hard to compare – a stacked bar or line might be better in that case.
In short, choose wisely. There isn’t one chart type that’s always best; it must fit the data. A good practice is to identify your goal (comparison, distribution, trend, relationship, etc.), consider the data’s nature (categorical, numeric, time-based, part-whole, etc.), and then select the simplest visual form that highlights what you want to show. Simplicity here also means familiarity – what form will your audience intuitively understand? By using the right display for the job, you ensure the audience “gets it” quickly and correctly.
(One handy reference: there are tables that map “what do I want to show?” to possible chart types. For example, the Ajelix blog provides a chart suggesting which charts are best at which tasks. Such guidelines reiterate: bar for comparisons, line for trends, scatter for relationships, etc. Adhering to these tried-and-true matches is usually a safe bet.)
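The goal-to-chart mapping above translates directly into ggplot2 geoms. A quick sketch using ggplot2’s built-in example datasets (mpg and economics), purely for illustration:

```r
library(ggplot2)

# Comparison across categories: bar chart of counts by vehicle class
ggplot(mpg, aes(class)) + geom_bar()

# Trend over time: line chart of U.S. unemployment
ggplot(economics, aes(date, unemploy)) + geom_line()

# Relationship between two quantitative variables: scatter plot
ggplot(mpg, aes(displ, hwy)) + geom_point()

# Distribution shape of one variable: histogram
ggplot(mpg, aes(hwy)) + geom_histogram(bins = 20)

# Distribution summary by group: box plot (median, quartiles, outliers)
ggplot(mpg, aes(class, hwy)) + geom_boxplot()
```

Each line is a complete plot; swapping the geom while keeping the same data is often the fastest way to test which display answers your question best.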
3. Keep it Simple (Clarity Over Clutter)
The third principle is all about simplicity and clarity. As the saying goes, “simplicity is the ultimate sophistication.” In data visualization, this translates to designs that are clean and free of unnecessary clutter. A simple chart is easier to read and interpret correctly. In contrast, a busy, cluttered chart can overwhelm or confuse the viewer, obscuring the data’s message. One set of guidelines succinctly advises: “Focus on Clarity: Keep it simple and avoid clutter.” Let’s break down what that means in practice:
Remove non-essential elements (Declutter): Chart junk refers to all the extraneous or decorative elements that don’t improve understanding of the data. Gridlines, heavy backgrounds, excessive text, glossy effects, superfluous icons – ask if each element serves a purpose. If not, consider removing it. For example, in a line graph, light gridlines can help estimate values, but too many lines or very dark gridlines become visual noise. In bar charts, unnecessary shading or 3D effects add nothing but distraction. Keep your data-ink ratio high (Tufte’s principle of maximizing “data-ink”, the ink used to display data, while minimizing non-data-ink). Every pixel on your chart should earn its keep. Simpler fonts, fewer borders, and minimalistic design help the data shine. As a tip: after drawing a chart, step back and see if removing something (a border, a tick mark, a label) would still convey the information – if yes, perhaps it wasn’t needed.
Highlight what’s important: Simplicity doesn’t mean you show less information than necessary, but it does mean emphasizing the key insights and not everything with equal weight. Use annotations or callouts sparingly to draw attention to critical data points (like an all-time high on a time series, or a significant benchmark). Keep other elements muted. For example, in a trend line chart with many categories, you might gray out most lines and use a bright color for the one line you want the audience to focus on. This technique, sometimes called “soften the background to let the focus pop,” is part of simplifying the viewer’s task – they’re not wading through a jungle, their path is lit to what matters.
Use clear labels and titles: A simple chart still needs context. A direct, informative title (perhaps stating the key message of the chart), clearly labeled axes with units, and a legend (if needed) placed intuitively all contribute to clarity. Don’t make viewers guess what they’re looking at. However, avoid verbose labels or redundant text that could clutter. For instance, if a bar chart already has labels on each bar, you might not need value labels on the axis as well (or vice versa). Strike a balance so that the chart can stand on its own without a paragraph of explanation, but also without drowning in text. In IB assessments or professional reports, you often have captions – utilize those for additional explanation so the graph itself can remain lean.
One idea per chart: Often, simplicity means narrowing the scope. Each visualization should ideally communicate one main idea or a tightly related set of insights. If you find a single graph trying to convey everything (e.g., comparing categories, showing trend, distribution, and highlighting outliers all at once), it might be better to split that into multiple visuals. As one expert advises, “Each visualization should convey a single key insight. If you have multiple messages, use multiple charts.” This way, each chart can be simpler, and together they can still tell a comprehensive story.
To visualize the difference, imagine two versions of the same data plot. The first is cluttered: a busy background, dozens of tick marks, overlapping labels, maybe a cute but irrelevant image in the corner, and too many colors. The second version uses a clean minimal theme, only essential gridlines, concise labeling, and maybe two colors (one for highlight, one neutral for the rest). The latter will feel easier to read. In fact, experiments show that removing clutter improves comprehension and retention. The Ajelix article on data viz principles provides a compelling example contrasting a cluttered vs. clean chart – the clean chart highlights the important information and enables quick decision-making, whereas the cluttered one leaves viewers lost in details.
In R’s ggplot2, using themes like theme_minimal() or theme_classic() can instantly remove heavy backgrounds and gridlines, giving a simpler look (more on themes soon). You can then add back only what you need. For example, you might start with theme_minimal() and then use theme() to turn off even the few gridlines it has if they’re not needed. Simplicity is often achieved through iterative refinement: make the plot, then edit it down.
IB Context: In an IB Mathematics or Science project, presenting data simply can also demonstrate your understanding. It shows you can focus on what matters and communicate effectively to an audience that might not be technical. Also, remember the IB core value of clarity in communication – a de-cluttered visualization exemplifies clear communication, just as a well-structured essay does.
Key takeaway: Clarity trumps complexity. By keeping designs simple and clean, you allow the data to speak. As one source succinctly advises: “Focus on clarity – keep it simple and avoid clutter”. Your audience (be it examiners, colleagues, or the public) will thank you, because you’ve made the information accessible and easy to digest.
4. Use Color Strategically
Color is a powerful tool in data visualization, but with great power comes great responsibility. Using color strategically means applying color deliberately to enhance comprehension, not to decorate the chart arbitrarily. A well-chosen color palette can highlight key patterns or group related data; a poor choice can distract or even mislead. As a rule of thumb, “color counts”: use color strategically to highlight key data points, and ensure consistency and meaning in your color choices.
Consider these best practices for color in charts:
Use color to differentiate or categorize: Color is very effective for distinguishing categories or data series, especially when shape or position alone isn’t sufficient. For example, in a line chart with multiple lines, using different colors for each line (with a clear legend) helps the viewer track which is which. In a bar chart, color can group bars into logical sets (e.g. shading all bars for 2020 in one color and 2021 in another in a grouped bar chart). Without color, one might have to rely on patterns or labels that can be harder to follow. That said, don’t overload on colors – a limited palette (perhaps 4-5 max distinct colors in one view) is usually enough. If you need more than about 6 distinct categories, consider whether the chart is too busy or if there’s a way to break it into multiple charts.
Highlight with color: Our eyes are naturally drawn to color, especially bright or saturated color. Use this to your advantage by coloring the most important data point or series differently from the rest. For example, you might make all bars grey but one bar (representing, say, “our company” vs. competitors) in a bold blue. Or in a time series, highlight the current year’s line in color and previous years in light gray. This technique directs attention where you want it. However, use sparingly – if you highlight too many things, nothing stands out. A common approach is to have a neutral/base color for most data and a contrasting accent color for the focus. This is an intentional use of color to tell a story (connecting to principle 5): e.g., red line for a critical threshold or event, green to highlight a target being met, etc.
Maintain consistency: If the same category or variable appears in multiple charts, use the same color for it throughout your report/dashboard. Inconsistency can confuse the viewer (imagine one chart where “Asia” is represented by blue and another where “Asia” is orange – a reader might momentarily mistake it for a different category). Consistent coloring becomes a visual language or legend that persists across visuals. Many data viz tools (like ggplot2 in R) do this automatically if you keep the factor levels consistent, but if you manually tweak colors, be mindful to apply the scheme everywhere. The Ajelix tips mention “maintain consistent fonts, colors, and styles throughout” – unity in design helps a viewer focus on the data, not re-learn the color encoding for each chart.
Contrast and readability: Ensure there is enough contrast between colors (and between the data and background). If you have a plot with a light background, use strong colors for lines/points; if you have a dark background, use light-colored lines/points. Contrast also matters for overlapping elements: in a scatter plot, if points are semi-transparent, choose fill colors that stand out but still allow overlaps to be seen. Avoid light yellow on white, for instance, which is nearly invisible, or red on green for small text, which many people can’t read (and is problematic for colorblind viewers, as we’ll note below). When in doubt, test your chart on a projector or printed in grayscale – do things still distinguish? Tools like ColorBrewer (and scale_color_brewer() in R) are designed to give palettes with good contrast and visual appeal.
Be mindful of color meaning and culture: Colors carry associations. Red often implies bad or down in financial contexts (or urgent, or negative), while green implies good or up (like profit). Blue is often seen as neutral or positive. But these interpretations can vary by context and culture. For example, red in some cultures is a sign of prosperity or luck (e.g., in China), and green can have negative connotations in some contexts. Also, think of domain-specific uses: in healthcare, blue vs red might be used for cold vs hot, or certain diseases; in politics (at least in the U.S.), red and blue represent opposing parties. If you inadvertently use a color strongly associated with something else, your audience might subconsciously misinterpret your intent. The key is to use color intentionally and be aware of your audience. If your visualization is part of an IB global politics project comparing countries, check if the colors you use align or clash with any national or cultural colors that could introduce bias. One guideline is: “Different colors can evoke different emotional responses and cultural associations. Use these to your advantage but be aware of differences in perception.” For instance, using bright red to highlight one bar might signal “danger/alert” to many viewers – which is fine if that bar is something like a failing score or an alarming trend, but not fine if it’s just “Team Red’s result” unless you intend that connotation.
Color for quantitative data (gradients): If you’re showing a heatmap or choropleth map (color representing a value), choose an appropriate color scale. Sequential data (e.g., low-to-high values) usually uses a single-hue or multi-hue gradient (light to dark of one color, or from one color to another). Divergent data (deviations from a midpoint, say above/below zero or above/below average) should use a diverging palette (two hues with a neutral middle, like blue–white–red). For example, temperature anomalies might use blue for colder than average, red for hotter than average, with white at zero difference. This encodes meaning in the color. Avoid arbitrary gradients that have no intuitive order (like rainbow for ordered data – it can be misleading). Ensure the gradient is perceptually uniform (ColorBrewer or scale_fill_viridis() in R provide good options). And always include a legend for color so the viewer knows what the colors signify in terms of numbers.
Accessibility (colorblind-friendly): A significant percentage of the population (around 8% of males, for example) have some form of color vision deficiency (often red-green colorblindness). Thus, a color scheme that relies on red vs green alone could be problematic – those viewers might not distinguish them well. To be strategic with color, you need to ensure your visualization is still effective for those viewers. Use colorblind-friendly palettes (there are well-known ones, like the “Okabe-Ito” palette, or using blues and oranges, which are generally distinguishable). Alternatively, use differences in lightness or pattern in addition to hue. Tools like Color Oracle can simulate how your chart looks to someone who is colorblind. Additionally, never use color as the sole means of conveying information if possible. For example, instead of just a red dot and a green dot on a plot, you could also use different shapes or a clear legend that labels them, so even if someone can’t see the color difference, they can still tell by another cue.
Don’t go crazy with the rainbow: Using too many distinct colors is a common novice mistake. A chart with 10 differently colored lines is hard to follow – the viewer has to keep looking back and forth to the legend, and some colors might be hard to tell apart. If you truly have that many categories, consider another approach (like small multiples or labeling lines directly). If you must, try to group them using shade variations of a few main hues rather than 10 unrelated colors. Remember that color printing might turn them all to different shades of gray – will the chart still make sense? Often, less is more.
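Two of the tactics above – a colorblind-safe categorical palette and the gray-plus-accent highlight – can be sketched in ggplot2. The Okabe-Ito hex values are the published palette; the data (mtcars) is just a stand-in:

```r
library(ggplot2)

# A subset of the Okabe-Ito colorblind-friendly palette
okabe_ito <- c("#E69F00", "#56B4E9", "#009E73", "#D55E00")

# Categorical coloring: one distinguishable hue per group
ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  scale_color_manual(values = okabe_ito, name = "Cylinders")

# Highlighting: everything neutral gray except the group in focus
mtcars$focus <- ifelse(mtcars$cyl == 8, "8-cylinder", "other")
ggplot(mtcars, aes(wt, mpg, color = focus)) +
  geom_point(size = 3) +
  scale_color_manual(values = c("8-cylinder" = "#D55E00", "other" = "grey70"))
```

In the second plot, the single accent color does the storytelling work: the viewer’s eye lands on the highlighted group before reading a single label.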
By using color thoughtfully, you enhance the visualization’s clarity and impact. Color can guide the eye to the story in the data, which is what you want. For example, a Medium article on data viz principles notes how color can be used to differentiate data sets, emphasize key points, and add emotional impact, but warns that improper use can confuse and detract. They also provide sensible guidelines: use contrasting colors for important distinctions, limit the palette, be mindful of cultural color associations, and ensure accessibility.
In summary, color should be used with purpose. Before adding a color to a chart, ask: Why am I using this color? What does it convey? If you have a good answer (to separate these two groups, to highlight this outlier, to encode a third dimension of data), go ahead. If the answer is “it looks pretty” or “I just felt like it,” reconsider. A purposeful approach to color ensures that your visuals are both attractive and meaningful. Remember the adage from the guidelines: “Color counts.” Use it to make your critical data stand out while maintaining a cohesive, easy-to-read design. When done right, color is often the element that takes a visualization from good to great – making it not only clear, but also engaging and memorable.
5. Tell a Story with Your Data
At the end of the day, data visualization is a form of storytelling. It’s not just about plotting points or bars; it’s about conveying a message or insight to your audience. The best visualizations lead the viewer to an “aha” moment – they illuminate something interesting or important in the data. Therefore, our fifth principle is to tell a story with your data. Rather than showing data for data’s sake, think about what narrative or insight you want the visualization to communicate. As one author on data visualization notes, this transforms numbers into a narrative that “engages and informs,” turning a chart into a compelling story.
Here’s how to infuse storytelling into your visuals:
Have a clear message or focus: Before you create the visualization, ask yourself: What is the main point I want someone to take away from this? Is it that a certain trend is increasing? That one category is outperforming others? That two variables are correlated? Once you identify the key message, design the visual to highlight that. This might mean choosing a specific chart type that best shows that message (tying back to principle #2), and using annotations or color (principle #4) to draw attention to it. For example, if the story is “Company A’s market share has been steadily rising over 10 years,” a line chart with an annotation at the end (“Market share doubled from 10% to 20%”) tells that story clearly. Every element in the chart should support that message or at least not distract from it.
Structure and sequence (if multiple visuals): In many cases, a single chart may not tell the whole story; you might have a series of visuals. Think of this like chapters in a story or scenes in a film. There should be a logical flow. Perhaps you start with a broad overview (e.g., “Overall sales are growing”) and then zoom into specific components (e.g., “Online sales are driving this growth, whereas retail is flat”). Each subsequent visualization builds on or complements the previous, so together they form a coherent narrative. In a presentation setting, you might literally have multiple slides that step through a data story. In a static report, you can achieve a similar effect by the order and arrangement of figures, and through explanatory captions.
Use annotations and context: A good data story doesn’t leave interpretation entirely to the viewer. Guide them with subtle narration on the chart. This could be a note on a spike (“Promotion campaign here boosted sales”) or shading a region of a timeline (“Recession period”) to give context. These annotations act like the narrator’s voice in a story, helping the audience understand causation or significance. Storytelling with data often involves combining the visual (show what happened) with a hint of why it happened or why it matters. In academic contexts, linking data to real events (e.g., marking policy changes on a graph of emissions) is powerful. In business, annotating when a new product launched on a revenue chart can make the story clear (e.g., “after launch of X, growth accelerated”). Instead of making the audience guess or recall these contextual details, provide them on the chart so the story is self-contained.
Engage the audience’s emotions or curiosity: While maintaining accuracy (never sacrifice integrity for drama), you can design visuals that are intriguing or impactful. Perhaps start with a question that the data answers, either in the title or leading text: “Which energy source grew the fastest in the last decade?” Then the chart (maybe a slopegraph or line chart with multiple energy sources) visually answers it, and you highlight the winner (e.g., solar in bright color surging upward). This question-and-answer format creates a mini-story and invites the viewer to think and conclude – engaging them actively. Another technique is to highlight human implications of data: instead of just “30%,” an annotation might say “30% (that’s 1 in 3 people)”. It connects data to real-life understanding, making the story relatable.
Make it memorable: Stories stick in the mind better than raw facts. If you craft a visualization story well, viewers will remember it. Use metaphors or relatable framing if appropriate (but carefully). For instance, comparing the scale of something to known objects (“the area lost to deforestation is equivalent to X football fields per day”) can drive a point home. Visual metaphors (icons or pictograms) can sometimes be used, but ensure they don’t distort the data – they should supplement, not replace, standard quantitative encoding.
Example – Story of a rising trend: Suppose you have data of a city’s bicycle usage over years. A pure presentation of data would just show the numbers going up. To tell a story, you might frame it as “The Bike Boom: How Cycling Became Mainstream in City X.” Your line chart of number of cyclists from 2010 to 2025 could have annotations: a note in 2015 “Bike-share program introduced,” a highlight in 2020 “city builds 200 km of bike lanes,” and a projection or target for 2025 if available. The narrative might be: initial slow growth, a turning point (policy intervention), then rapid growth. The title, the annotations, and perhaps a concluding caption tie it all together: “By 2025, cycling is not just recreation – it’s a main mode of transport for city residents.” Such a visualization does more than report numbers; it conveys a transformation over time – a story of change with causes and effects.
Interactive storytelling: If you have the chance to use interactive visuals (say in a Shiny app or Tableau dashboard), storytelling can also be interactive where users explore data at their own pace guided by prompts. For example, an interactive might start zoomed out (global view) but allow the user to click on their country to see local trends – effectively letting them find their own “story” under the umbrella of the bigger narrative. Even then, providing hints (“Click on a country to see its trend”) ensures they engage with the intended story.
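On-chart narration like the bike-usage story above can be added in ggplot2 with annotate() and reference lines. A sketch with invented numbers, purely to show the technique:

```r
library(ggplot2)

# Hypothetical ridership data for "City X" (thousands of daily riders)
bikes <- data.frame(
  year   = 2010:2025,
  riders = c(5, 6, 7, 8, 9, 12, 15, 18, 21, 24, 30, 35, 40, 44, 48, 52)
)

ggplot(bikes, aes(year, riders)) +
  geom_line(linewidth = 1) +
  # Mark the turning point and narrate it directly on the chart
  geom_vline(xintercept = 2015, linetype = "dashed", color = "grey50") +
  annotate("text", x = 2015.3, y = 45,
           label = "Bike-share program introduced", hjust = 0) +
  labs(title = "The Bike Boom: cycling in City X, 2010-2025",
       x = NULL, y = "Daily riders (thousands)")
```

The title states the story, the dashed line marks the cause, and the annotation supplies the narrator’s voice – the chart is self-contained.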
The notion of storytelling might sound a bit abstract, but it’s increasingly recognized as what elevates a visualization. A Medium article on this principle states: “Through visualization, hidden trends and patterns emerge, telling a story of change over time or relationships that might not be evident from raw data.” In other words, it’s the visuals that can reveal the narrative the data contains. Our job as the designer is to bring that narrative to light in a compelling and truthful way.
Finally, a story has a point – so ensure your data story has a clear takeaway. Don’t assume the audience will “get it” automatically; often, it helps to explicitly state the insight either in the chart title or caption. For instance: instead of a generic title like “GDP of Country A 1990–2020,” a story title would be “Country A’s GDP has doubled every decade since 1990.” That primes the audience on what to see in the chart. It’s akin to giving the moral of the story up front.
In summary, think of each visualization as a communication, not just an analysis artifact. You want to inform or persuade the viewer of something. By applying principles 1–4 (integrity, appropriate display, simplicity, color) in service of principle 5 (storytelling), you ensure that the data’s story is told ethically, clearly, and effectively. Whether you’re creating a visualization for an IB Internal Assessment, a scientific paper, or a business meeting, framing it as a story will make it more impactful. People remember stories; if your data visualization tells one, they’re more likely to remember the data (and insight) as well.
4.2 Goals of Visualization
Now that we’ve covered principles for how to visualize, let’s step back and recall why we visualize data. What purposes do charts and graphs actually serve? Understanding the goals will reinforce why those principles matter. Broadly, data visualizations serve three main purposes (or goals): to record/monitor, to analyze/explore, and to communicate/explain. Different sources and textbooks sometimes use varying terms, but a common breakdown is Explore, Monitor, and Explain.
Record (or Monitor): One fundamental use of visualization is to document information or keep an eye on data as it updates. This could be as simple as making a chart to log experimental observations in a lab notebook, or as complex as a real-time dashboard tracking stock prices or website traffic. Historically, explorers drew maps – that’s data visualization to record geographical data. In business, a dashboard that shows key metrics (sales today, number of active users, etc.) is used to monitor the status in real-time. In this mode, the visualization itself might be straightforward (maybe even a table or simple time series updated regularly), and the goal is not deep analysis but maintaining an awareness of the data’s state. Good monitoring visuals surface the important status quickly (for example, a big KPI figure with green or red indicator to show if it’s on target). When you hear about “situational awareness” or operational dashboards, that’s this goal. It answers: “What’s happening? Is everything normal or do I need to take action?”
Analyze and Explore: Data visualization is a vital tool for analysis – this is when you don’t yet know what the data will reveal and you’re using visuals to discover patterns. Analysts and data scientists often plot data in many ways to see distributions, relationships, outliers, and trends as part of exploratory data analysis (EDA). Visualization can reveal a pattern (e.g., a cluster of points suggesting two groups in your data), or show that an assumption was wrong (e.g., a supposed trend is actually very noisy). Interactive visualizations can be especially useful here: you might filter, zoom, or highlight subsets of data dynamically to test hypotheses. The goal in exploratory visualization is insight for the analyst themselves. It’s a thinking tool. For example, an IB student doing a science project might plot various combinations of variables to see which correlates with which before writing up results. Or a statistician might use a scatter matrix plot to see pairwise relationships in a large dataset. Often, the visualizations made for analysis are not the same ones you would use in a final report – they might be messier, or just quick plots – but they are crucial for figuring out what the data tells you. In short, we visualize to explore, to find the story in the first place. Even if the end goal is a nice explanatory chart, the process likely involved many exploratory charts.
Communicate and Explain: This is the most outward-facing goal – using visualization to tell others what the data shows. When you have an insight or message (thanks to analysis), you then design a visualization to convey that to your audience (be it your teacher, your boss, or the public). Here, you are curating and perhaps simplifying what the data shows in order to highlight the points of interest. The principles we discussed (clarity, using the right chart, etc.) are mostly about this stage – making sure the final product effectively communicates. These visualizations often appear in presentations, reports, articles, or even social media infographics. A well-crafted explanatory visualization can often replace paragraphs of text – as the saying goes, “a picture is worth a thousand words.” For example, instead of writing “our market share grew from 10% to 15% over the year while two main competitors declined slightly,” you might show a bar chart or line chart that in an instant conveys that comparison. The goal is that the audience grasps the insight quickly and accurately. Additionally, an explanatory visualization can be persuasive. If you’re advocating for a change in policy, a chart showing the trend (like rising temperatures over decades) can be more convincing and memorable than just stating numbers. Many of the famous “data stories” in media (such as Gapminder animations by Hans Rosling showing development over time) serve to explain complex phenomena in an understandable way.
These three goals aren’t mutually exclusive – a single visualization can sometimes do more than one. For instance, a well-designed dashboard might simultaneously let a user monitor current status and explore by drilling down interactively, and it might include explanatory annotations to communicate insights. But it’s useful to distinguish them to clarify your primary intent.
- If your goal is record-keeping or monitoring, you’ll focus on clarity and perhaps real-time accuracy, and you might sacrifice some depth for simplicity (maybe big bold numbers or simple sparkline charts that update frequently).
- If your goal is analysis, you might not worry about aesthetics at first – you’ll make quick plots, maybe many of them, aiming to uncover truth. You’ll keep an eye on integrity (of course) but you might not label everything beautifully in initial exploration. Tools for analysis might allow more interactivity and flexibility (like using R or Python in a Jupyter notebook to quickly plot different charts, or a tool like Tableau in an analyst mode).
- If your goal is communication, you’ll apply all the principles we discussed meticulously to produce a polished, audience-ready visual, because now you know what needs to be communicated and how best to do it.
According to one source, the three main goals are essentially to explore, monitor, and explain – aligning with the above. It’s helpful to ask yourself when making a chart: Which goal am I serving right now? If you’re exploring, you might be more experimental. If you’re explaining, you’ll refine and perhaps annotate. This also ties to tools: some software or packages are better for quick exploration (e.g., using Pandas in Python to plot quick histograms, or using Excel for a quick look), while others are better for final communication (e.g., designing a detailed infographic in Adobe Illustrator, or a carefully coded ggplot in R for a report).
In an IB Diploma Programme context, for example, if you collect data for an Internal Assessment, you might first plot it in various ways to explore (Goal 2, analysis). Once you find the interesting pattern that answers your research question, you might create a refined chart with proper labels and perhaps a trendline to explain that finding in your write-up (Goal 3, communication). And perhaps you keep a log of your raw data in charts in an appendix as record (Goal 1, documentation of data).
By clarifying the goal, you also set the criteria for success of the visualization:
- A monitoring viz is successful if it alerts or confirms status at a glance (e.g., the teacher can quickly see which student needs help from a dashboard of quiz scores).
- An exploratory viz is successful if it helps you find something new or understand the data better (even if that something new is “there is no clear pattern here”).
- An explanatory viz is successful if the target audience understands the intended insight quickly and accurately, and ideally remembers it.
Most of what we focus on in reports and presentations are the explanatory visuals, but never forget the exploratory stage that comes before – visualization is a tool for thinking, not just presenting. And monitoring is a huge practical use of viz in everyday life (from your fitness tracker’s daily graphs to COVID-19 dashboards tracking cases).
To sum up: Whether you’re monitoring data streams in real-time, exploring a dataset for insights, or explaining a finding to others, data visualization is an indispensable approach. Many visualizations you encounter can be classified as serving one (or more) of these purposes. When you design your own, knowing your primary goal will guide your design choices. For example, a chart meant for exploration by you might have interactive filters, whereas the one meant for explanation in your IB Geography presentation might be static but richly annotated. Keeping the goal in mind ensures that your visualization is fit for its intended use.
(As an aside, some sources combine or add goals such as “decoration” or “enlightenment,” but those typically fall under communication – decoration isn’t really a valid goal on its own in our context, and enlightenment is essentially what good explanation or exploration should achieve. The Actian blog (2025) succinctly puts it: the goals are to help people explore data, monitor data, and explain insights. If your visualization isn’t doing at least one of these, it might be worth asking why you are making it at all!)
4.3 Loading Data
With principles and goals covered, let’s move into practice. To create visualizations (especially in a coding environment like R, which we use in this course), we need data in the right format and the right tools loaded. In previous work, we prepared a dataset called dataCanadaFullLong. Let’s quickly recall what this dataset is: it likely contains some data about Canada – perhaps employment numbers by industry over years (given the name and the context of upcoming plots). The “Long” in the name suggests it’s in long format (tidy format), which is ideal for plotting with ggplot2.
First, we need to load the data into our R environment. Assuming the data file dataCanadaFullLong.csv is available in our working directory (for example, in the data/ folder), we can use readr::read_csv to read it:
dataCanadaFullLong <- readr::read_csv("./data/dataCanadaFullLong.csv")
This command will import the CSV file into an R data frame called dataCanadaFullLong. We should check that it loaded correctly (e.g., using head(dataCanadaFullLong) to see the first few rows, though that’s not shown here).
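A few quick sanity checks along those lines can be run right after loading. Since the real CSV isn’t reproduced here, a small stand-in data frame with the assumed columns (year, isicCode, value) is used so the snippet runs on its own:

```r
# Stand-in for the real dataset; structure assumed: year, isicCode, value
dataCanadaFullLong <- data.frame(
  year     = rep(2015:2016, each = 2),
  isicCode = rep(c(1, 2), times = 2),
  value    = c(120, 45, 130, 50)
)

head(dataCanadaFullLong)            # first rows: do the columns look right?
str(dataCanadaFullLong)             # types: note isicCode is numeric (fixed below)
summary(dataCanadaFullLong$value)   # plausible range? no absurd outliers?
anyNA(dataCanadaFullLong)           # any missing values to deal with?
```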
Next, we need to ensure the data types of each column make sense for plotting. In particular, the dataset likely has an isicCode column (perhaps representing industry codes) and a year column, along with a value column (perhaps number of employees). Since isicCode is a categorical code (even if the codes are numeric, they represent categories of industry), we should treat it as a factor or character in R, rather than a numeric value. If it were left numeric, ggplot might treat it as a continuous variable, which we don’t want in this case (we don’t want, say, a numeric scale for industry codes). The snippet given is:
dataCanadaFullLong$isicCode <- as.character(dataCanadaFullLong$isicCode)
This converts the isicCode column to character type. We could also use as.factor – in many ggplot scenarios, character and factor behave similarly (ggplot will treat character columns as discrete categories). But converting to character is fine, as it signals that we won’t be doing numeric operations on those codes.
Now dataCanadaFullLong is ready. It’s in a tidy long format, meaning each row is a single observation (with columns like year, isicCode, and value). This is the preferred format for ggplot2, where you map columns to aesthetics (x, y, color, etc.). If it were in a wide format (multiple columns for different industries, for example), we’d have to reshape it. But since it’s already long, we can proceed to plotting.
(Quick aside on why “long” format is good: ggplot2 expects something like: one column for x values, one for y values, one for categories if needed, etc. If your data were wide – e.g., separate column for each industry’s employment – you’d either need to call geom for each or reshape. We’ll actually practice reshaping data later in the chapter exercises.)
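Here is a sketch of that reshaping with tidyr::pivot_longer, using a hypothetical wide table (the industry column names are invented):

```r
library(tidyr)

# Hypothetical wide data: one column per industry
wide <- data.frame(
  year          = 2015:2017,
  manufacturing = c(500, 520, 540),
  retail        = c(300, 310, 330)
)

# Reshape to the long format ggplot2 prefers: one row per (year, industry)
long <- pivot_longer(wide,
                     cols      = c(manufacturing, retail),
                     names_to  = "industry",
                     values_to = "value")
# long now has columns year, industry, value: 6 rows instead of 3
```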
Let’s apply the principles in practice with a few types of charts, using this dataset.
4.4 Bar Chart
A bar chart is a common way to visualize and compare quantities across categories. In our dataset, we have employment numbers (value) by industry (isicCode) and by year. We might want to compare how large each industry is in each year. A good approach is a grouped bar chart: for each year (group), show bars for each industry.

In ggplot2, you create bar charts with either geom_bar (which by default counts observations, useful for frequencies) or geom_col / geom_bar(stat = "identity") when you already have values to plot (here we do have a pre-summarized value: number of employees). We will use geom_bar(stat = "identity"), which tells ggplot to use the actual values in the data rather than counting rows.
Here’s the code to produce a grouped bar chart of number of employees by year and industry:
library(ggplot2)
library(ggthemes) # for additional themes and color scales
ggplot(data = dataCanadaFullLong, aes(x = year, y = value, fill = isicCode)) +
geom_bar(stat = "identity", width = 0.5, position = "dodge") +
xlab("") +
ylab("Number of employees") +
labs(fill = "Industry (ISIC code)") +
theme_minimal() +
scale_fill_brewer(palette = "Dark2", direction = -1)
Let’s break down what each part of that does:
- ggplot(data = dataCanadaFullLong, aes(x = year, y = value, fill = isicCode)): This initializes the ggplot object with our dataset and sets up default aesthetics. We map the x-axis to year, the y-axis to value (number of employees), and the fill color to isicCode. That means whenever we add geoms, they’ll use these mappings by default (unless overridden). Since we want discrete bars by year, year should ideally be a discrete variable; if year is numeric, ggplot may treat it as continuous and place the bars on a continuous scale. To be safe, convert year to a factor beforehand, use aes(x = factor(year), ...), or specify group = year. In this code, it likely works as intended.
- geom_bar(stat = "identity", width = 0.5, position = "dodge"): Here we add the bar geometries. stat = "identity" means use the given y values as heights (instead of the default, which would count rows per x). width = 0.5 sets the width of the bars (the default is 0.9, which is sometimes too thick for grouped bars); a smaller width leaves some gap between bars. position = "dodge" is key for grouped bars: it places bars for different fill categories (industries) side by side instead of stacking them. If we omitted position, geom_bar with fill would stack the bars by default (creating a stacked bar chart, which is another way to view composition but not what we want here).
- xlab("") removes the x-axis label. We do this because “year” is self-evident from the axis tick labels. It’s a style choice; often, if the x-axis is a time or category obvious from context, we omit a redundant label to reduce clutter.
- ylab("Number of employees") sets a descriptive y-axis label. Always label units or what the number represents; here it’s presumably a count of employees.
- labs(fill = "Industry (ISIC code)") changes the legend title for the fill colors. By default, ggplot would use the name of the variable (isicCode), which might not be user-friendly (maybe it’s coded as just numbers or short codes). We provide a nicer label. If, for instance, the ISIC codes correspond to sectors like “Agriculture”, “Manufacturing”, etc., we might even use those names instead of codes in the legend – but that would require mapping code to name (via a factor with labels or a lookup table). Here we just label the legend generically.
- theme_minimal() applies a clean theme with a white background and minimal grid lines. This immediately makes the chart look more modern (compared to the default gray background theme of ggplot). A minimal theme aligns with our principle of avoiding clutter.
- scale_fill_brewer(palette = "Dark2", direction = -1): This uses a ColorBrewer palette for the fill colors. “Dark2” is a palette of 8 distinct colors that are relatively colorblind-friendly and visually distinct (dark hues). We set direction = -1 perhaps to reverse the order of the colors (maybe to match a desired mapping of code to specific color). Brewer palettes are great for qualitative (categorical) data because they’re chosen for good contrast; this ensures each industry gets a distinguishable color. If we had more than 8 industry codes and tried to use Dark2, ggplot would warn and leave the extra categories without colors – not workable beyond 8 categories. In that case, another palette or a manually defined one would be needed, but let’s assume we have a manageable number of industries (or we’re only plotting a subset).
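For reference, the code-to-name mapping mentioned above might be sketched like this – the ISIC codes and sector names here are hypothetical, and a stand-in data frame is used so the snippet runs on its own:

```r
# Stand-in rows with the assumed columns; the real ISIC codes may differ
dataCanadaFullLong <- data.frame(isicCode = c("1", "3", "45", "3"),
                                 year     = c(2015, 2015, 2015, 2016),
                                 value    = c(10, 20, 30, 22))

# Recode: factor levels control legend order, labels give readable names
dataCanadaFullLong$industryName <- factor(
  dataCanadaFullLong$isicCode,
  levels = c("1", "3", "45"),
  labels = c("Agriculture", "Manufacturing", "Retail trade")
)
# Then map fill = industryName instead of fill = isicCode, and the legend
# shows sector names rather than codes.
```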
When this code runs, we’ll get a plot with the years on the x-axis and a group of bars for each year. For example, if the data covers 2015, 2016, 2017, and 2018, each of those will be a category on the x-axis. At each year, multiple bars (each colored differently) will appear side by side (because of position = "dodge"), one for each industry code in that year, with height equal to the number of employees. The legend will show which color corresponds to which industry code (with our provided title).
A few things to check or tweak:
- If year is numeric, ggplot might place the years on a continuous scale (spacing the ticks by numeric distance, say between 2015 and 2016). Since the years are consecutive, it may look fine; if not, use aes(x = factor(year), ...) to force a discrete axis, or convert year to a factor beforehand.
- The dodge position automatically offsets bars by their width. With width = 0.5, there will also be a slight gap between year groups. If we wanted different spacing we could adjust it, but the default dodge should be fine.
- The ggthemes package was loaded for additional themes and color scales, though base ggplot2 already provides scale_fill_brewer (it relies on RColorBrewer, which is installed alongside ggplot2). Perhaps ggthemes was intended for other theme options; we could also try theme_economist() or others from ggthemes for style, but theme_minimal() is usually fine.
Now, how does this relate to our principles?
- We chose a bar chart (the right display) for comparing categories at a point in time (year), which is apt.
- We keep it simple: minimal theme, clear labels.
- Colors distinguish categories but are chosen from a palette known to be distinct and generally accessible (Dark2 is a decent choice, though one might check colorblind friendliness).
- Integrity: we start bars at zero (ggplot does that by default for bar charts’ y-axis), and we’re not manipulating scales unfairly. We’re directly plotting actual values (stat=identity).
- If the story we want is within this, we might annotate the chart or highlight an industry of interest in a different color if needed. For now, it’s just showing comparisons.
(One could consider ordering the bars or facets for better storytelling. In a grouped bar, the x-axis is time which is natural order. The legend listing industries might be sorted alphabetically or by code; if we wanted to emphasize one, we might reorder factor levels to control legend order or use guides.)
This bar chart allows easy comparison within a year (comparing bar heights side by side) and across years (following a given color across adjacent year groups, though that is a bit harder in grouped bars – a line chart may be better for seeing a trend over years, which we’ll do next). Grouped bars are good for showing two categorical dimensions (year and industry) together. If the chart feels too busy (say 10 industries across 5 years = 50 bars), an alternative could be faceted charts or multiple charts. But grouped bars are a compact way to show a lot if done carefully.
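A sketch of the faceting alternative mentioned above, using a small stand-in for dataCanadaFullLong (the codes and values are invented):

```r
library(ggplot2)

# Stand-in for dataCanadaFullLong (columns year, isicCode, value as assumed)
d <- data.frame(year     = rep(2015:2018, times = 3),
                isicCode = rep(c("1", "3", "45"), each = 4),
                value    = c(10, 11, 12, 13, 30, 32, 31, 33, 20, 22, 25, 27))

p <- ggplot(d, aes(x = year, y = value)) +
  geom_col() +                # geom_col() is shorthand for geom_bar(stat = "identity")
  facet_wrap(~ isicCode) +    # one small panel per industry code
  xlab("") +
  ylab("Number of employees") +
  theme_minimal()
p
```

Each industry gets its own small panel, so no dodging or color legend is needed, at the cost of making cross-industry comparisons slightly less direct.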
(Technical note: If there are many categories or long category names, one might use a dodge with narrower bars or horizontal orientation. But with year on x and industry as fill, horizontal doesn’t apply here. If industry names were long, a direct bar chart with industry on x would require flipping axes or angled text. But we have numeric codes, which are short labels.)
4.5 Line Chart
Next, let’s visualize the data as a line chart. Line charts excel at showing trends over continuous intervals (typically time). In our case, we can plot year on the x-axis and number of employees on the y-axis, drawing a line for each industry. This will show the trajectory of each industry’s employment over time, which is hard to see in the bar chart at a glance (the bar chart grouped by year requires mentally connecting bars across groups to see a trend; a line chart makes that explicit).
Here’s code to create a line chart from the same data:
ggplot(data = dataCanadaFullLong, aes(x = year, y = value, color = isicCode)) +
geom_line(size = 1.5) +
xlab("") +
ylab("Number of employees") +
labs(color = "Industry") +
theme_minimal() +
scale_color_brewer(palette = "Dark2", direction = -1)
This is quite similar to the bar chart code, with a few changes:
- We map color = isicCode instead of fill. For line charts, the aesthetic that distinguishes groups is usually color (affecting the line color). There are also linetype for lines and shape for points, but color is the most straightforward way to separate multiple lines. We used color so each industry’s line gets a distinct color.
- We use geom_line() to draw the lines. We set size = 1.5 to make the lines a bit thicker for visibility (the default is 0.5). (Note: in ggplot2 3.4 and later, the preferred aesthetic for line thickness is linewidth; size still works for lines but triggers a deprecation warning.)
- The legend will now be for color (by default it would use the variable name; we override it to “Industry”).
- We again use theme_minimal() for a clean look.
- We use scale_color_brewer(...) to apply the same palette to the line colors as we did for fill in the bar chart, ensuring consistency (Dark2 again, reversed order). scale_color_brewer is analogous to scale_fill_brewer but for line/point colors.
The resulting plot will have year on the x-axis (likely treated as continuous, so the points for each year are connected sequentially within each industry) and number of employees on the y-axis. Each industry becomes a line of a different color. The x-axis will by default show every year or every few years depending on how many there are – if there are many, ggplot auto-selects the breaks. Since we removed the x-axis label again, the years themselves serve as the ticks. If year is numeric (say 2010, 2011, …), ggplot places the years evenly on a continuous scale, which is correct as long as they are evenly spaced. Note how missing data behaves: if a year’s value is NA, the line breaks at that point; if a year’s row is simply absent from the data, geom_line connects straight across the gap (so to show a visible gap, fill missing years with NA). If we had irregular time intervals, we’d have to consider that too, but we assume yearly data.
To make the line chart clearer, it’s often helpful to add markers at the data points. Humans can read exact values – or at least see where the observations are – if we place small symbols at each year’s data point on the line. In ggplot, that’s easy: just add geom_point() on top:
ggplot(data = dataCanadaFullLong, aes(x = year, y = value, color = isicCode)) +
geom_line(size = 1.5) +
geom_point(size = 2.5) +
xlab("") +
ylab("Number of employees") +
labs(color = "Industry") +
theme_minimal() +
scale_color_brewer(palette = "Dark2", direction = -1)
Now each line will have points (of size 2.5 here, you can adjust) at each year’s value. This serves a few purposes:
- It makes it easier to discern overlapping lines (two lines crossing might be confusing, but the points help track which goes where).
- It emphasizes the data observations themselves (the line is actually a connection between them).
- If the data had missing years, lines might skip but points would make it clear where actual data exists.
- It just looks nice and is commonly expected in multi-line charts to identify specific data points (especially if you eventually label some points or interactively hover, etc.).
Be mindful that too many lines can clutter a chart (the infamous “spaghetti plot”). If our dataset has many industries, the line chart may become hard to read with all lines overlapping. In such a case, one might use small multiples (facets) or highlight only a few significant ones. But assuming a moderate number (maybe 5-10 industries), color and some direct labeling could make it digestible. For presentation, sometimes people label lines at the end with the name instead of using a legend, to avoid having to match colors to legend. That can be effective if lines are well-separated on the right side – but that gets into advanced labeling techniques. For now, the legend works.
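For the curious, direct labeling at the line ends might be sketched like this – a stand-in data frame is used, and the offsets would need tuning for real data:

```r
library(ggplot2)

# Stand-in long data (year, isicCode, value) as assumed in this section
d <- data.frame(year     = rep(2015:2018, times = 2),
                isicCode = rep(c("1", "3"), each = 4),
                value    = c(10, 12, 15, 18, 30, 28, 27, 25))

# Take each line's right-most point to attach a label there
lastYear <- subset(d, year == max(year))

p <- ggplot(d, aes(x = year, y = value, colour = isicCode)) +
  geom_line(linewidth = 1) +
  geom_text(data = lastYear, aes(label = isicCode),
            hjust = -0.3, show.legend = FALSE) +   # label just right of the line end
  theme_minimal() +
  theme(legend.position = "none") +                # legend no longer needed
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.15)))  # room for labels
p
```

If lines end close together, the labels will collide; packages like ggrepel exist for nudging labels apart, but for well-separated lines this simple approach works.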
The line chart focuses on trends: you can easily see which industries are rising, which are falling, which are flat over time. This temporal pattern might be the story of interest. For example, maybe industry “A” steadily grew, while “B” declined after 2018, etc. If we wanted to tell a story, we might annotate certain lines (“Industry X saw a boom after 2015” with an arrow on the line).
From a principles perspective:
- Graphical integrity: The line chart properly uses position to encode values – assuming year spacing is correct, no distortion. (One check: if the years are numeric, the distance from 2010 to 2011 on x will be same as 2011 to 2012, which is correct. If we had irregular intervals, we’d want the x scale to reflect actual time differences, which ggplot’s continuous scale does. If year was treated as factor, it would space equally even if a gap, which might be slightly misleading if the gap is longer. But let’s assume annual data.)
- Chosen the right display: line for trend, appropriate.
- Kept it simple: minimal theme, no clutter. We might even consider removing legend and directly labeling lines if feasible, but with many categories legend is simpler. The background is clean. We could reduce gridlines if desired.
- Color usage: used effectively to separate categories. The palette is the same as bar chart for consistency (so if a reader sees bar and line for Industry 101, it’s the same color, aiding comprehension).
- Storytelling: If explaining, we’d highlight in text or annotation the key trend(s). As is, it’s an analytical chart one can read.
(In an IB assignment, one might present both a bar chart and line chart of similar data depending on what they discuss – maybe the bar chart emphasizes the magnitude in a particular year, whereas the line shows the trend. But one should avoid redundancy. Often, choose the chart that best supports the point you’re making.)
4.6 Bubble Chart
A bubble chart is essentially an extension of a scatter plot where a third numeric dimension is represented by the size of the points (bubbles). It’s not as commonly used for time series data (since we usually use line charts for that), but bubble charts can be useful to show three variables at once – e.g., comparing countries by GDP (x-axis), life expectancy (y-axis), and population (bubble size). In our context, we have year, industry, and value. We could conceive a bubble chart where x = year, y = value, and the size of the point also represents value. That’s a bit redundant (plotting value on y and also on size). Bubble charts typically are more useful when the third dimension is a different variable. However, for the sake of example (and perhaps the exercise intended this), we can show bubbles growing or shrinking to double-encode the value.
Why would you double-encode? It can add emphasis – a larger bubble draws the eye, so years/industries with higher values stand out not just by position but also by size. But it can also be a bit visually confusing if not done carefully, because you’re essentially repeating the information. Alternatively, maybe the dataset had another measure (like perhaps value is employment, and bubble size could be, say, revenue or something), but since not mentioned, we’ll assume we’re just trying out the aesthetic.
The code for a bubble chart might look like:
ggplot(data = dataCanadaFullLong, aes(x = year, y = value, color = isicCode)) +
geom_point(aes(size = value), alpha = 0.7) +
xlab("") +
ylab("Number of employees") +
theme_minimal() +
scale_color_brewer(palette = "Dark2", direction = -1) +
scale_size_continuous(range = c(3, 10)) +
guides(size = "none")
What’s happening here:
- We still map x = year, y = value, and color = isicCode. So points are positioned by year and value, and colored by industry.
- geom_point(aes(size = value), alpha = 0.7) draws points with the size aesthetic mapped to value, so a higher value means a larger point (bubble). We set alpha = 0.7 to make the bubbles slightly transparent. This helps when bubbles overlap, since you can see through one a bit to the others beneath; it also gives a nice visual softness.
- We use scale_size_continuous(range = c(3, 10)) to control the size mapping. By default, ggplot chooses a range of point sizes (in mm) for the size scale, but it might be too small or too large. We specify that the smallest value should correspond to size 3 and the largest to size 10 (just an example range), which is decent for bubbles. You might adjust those numbers to get aesthetically pleasing sizes. (If the range is too wide, small values practically disappear or large ones become huge circles overlapping everything.)
- guides(size = "none") (or the older, now-deprecated guides(size = FALSE)) hides the legend for size. Why hide it? Because the size encodes exactly the same data as the y-axis in this redundant case. A size legend would show a gradient of sizes from the lowest to the highest value, but we already have an axis for value; having both would be confusing, so we turn the size legend off. The color legend (for industries) we keep (we left it at the default here; we might add labs(color = "Industry") to ensure a nice legend title).
- We keep theme_minimal() and the same color scaling as before, for consistency. The alpha transparency is purely aesthetic/functional for overlap.
The resulting chart will show, for each industry and year, a bubble. All bubbles along a given year vertically align (since x is year), and all bubbles for an industry would (if you connected their centers) form the line we had earlier. But now, the higher the value, the higher the point (y-axis) and also the bigger the bubble. So trends can be seen by vertical position and the bubble size pattern. It’s a bit of a novelty way to show a time series – not as straightforward as lines. But if the data were sparse or categorical both axes, bubble charts can work (like a matrix of categories with bubble sizes).
It might be more useful to demonstrate a bubble chart in another context, but since the exercise likely wanted to illustrate usage of aes(size = ...), this does that.
Potential uses in our case: Perhaps to emphasize magnitude differences in a cluttered scenario. For instance, if two lines in the line chart cross, the bubble chart might make the larger one obviously larger in bubble size, providing another visual cue. However, one must be cautious: humans are not as precise in judging area as position. Position on the y-axis (how high the bubble is) is still the more accurate read of the value. The area/size is more for general impression (big vs small). So you wouldn’t want to rely on bubble size to read exact values – that’s what the axis is for.
One thing to note: scale_size_continuous is used to specify the size mapping. By default, scale_size maps values to bubble area (so a value twice as large gets a bubble with roughly twice the area), which is generally the right choice perceptually. Using range = c(3, 10) just sets the minimum and maximum sizes. If the data has outliers, bubble sizes might be skewed; sometimes one uses scale_size_area(max_size = 10), which ensures area is strictly proportional to the value (with zero mapped to zero area) and sets a maximum. But these are details.
Since we removed the size legend, how does someone know what size means? In this redundant case, hopefully by context or a note in caption, or just because it correlates with height. If we were encoding a different variable in size, we would definitely include a legend or annotation to explain it.
Bubble charts can become messy if too many bubbles overlap. In our time series case, if points are dense per year or values close, bubbles might overlap a lot. Transparency helps, but sometimes it’s not ideal. In exploratory data analysis, though, making a bubble chart can quickly show where clusters of high values are.
Another scenario: If instead of year on x, we had, say, industry on x and province on y, and bubble size = employees, that could show a matrix of industries vs provinces with bubble sizes indicating employment – a kind of heatmap-like bubble chart. Those are sometimes called “balloon charts”. But since our data is long format time series, the line chart is generally more useful for trend.
We included this to illustrate use of additional aesthetics. It’s a reminder that you can map any aesthetic in ggplot2 to data: color, size, shape, transparency, etc., as long as it makes sense. Principle-wise, only do so if it helps clarity. A common beginner mistake is to use too many aesthetics at once (rainbow of colors, different shapes, and sizes all in one chart – very confusing!). Here we just added size in a redundant but possibly visually interesting way.
Side note: if the dataset had another measure, e.g., suppose value was employment and there were another field such as revenue, one could map x = year, y = employment, size = revenue, and color = industry: a four-dimensional chart (x, y, color, size). That might be insightful if, say, some industries employ few people but generate high revenue (bubbles low on the y-axis but large in size). In the absence of such a field, we double-encode.
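To make that concrete, here is a small runnable sketch of such a four-aesthetic chart, using an invented toy data frame (the employment and revenue figures below are made up for illustration, not taken from the chapter's dataset):

```r
# Sketch of a four-aesthetic bubble chart (x, y, size, color) on toy data.
library(ggplot2)

toy <- data.frame(
  year       = rep(2015:2017, times = 2),
  industry   = rep(c("Tech", "Retail"), each = 3),
  employment = c(120, 135, 150, 400, 390, 380),   # thousands of workers (invented)
  revenue    = c(900, 1100, 1400, 600, 610, 605)  # millions of dollars (invented)
)

p <- ggplot(toy, aes(x = year, y = employment,
                     size = revenue, color = industry)) +
  geom_point(alpha = 0.7) +
  scale_size_area(max_size = 10) +   # bubble area proportional to revenue
  theme_minimal() +
  labs(size = "Revenue (M$)", color = "Industry")
p
```

Here the Tech bubbles sit low on the y-axis but grow in size, which is exactly the "few employees, high revenue" pattern described above.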
To not overwhelm, one could facet by industry and use bubble size for some measure per year, but that’s a different approach.
At this point, we’ve covered basic static chart types (bar, line, scatter/bubble) using ggplot2. We’ve seen how to apply themes and scales to adhere to our principles of simplicity and effective color.
4.7 Maps
Moving from abstract data to geographic data: creating maps is another important aspect of data visualization. Maps allow us to visualize spatial data, i.e., anything with a location component (countries, cities, regions). In R, maps can be made with specialized packages: sf for modern spatial data and shapefiles, maps for quick outlines, or leaflet for interactivity. Here we'll show a simple example using the base maps available via the map_data() function (which ggplot2 provides, drawing on data from the maps package).
First, we need map data. The code uses:
world <- map_data("world")
This fetches the coordinates for drawing world country polygons. world will be a data frame with columns like long, lat, group, and region. Each row is a point (a vertex of a country outline), and the groups define which set of points makes up a polygon (one per country, often split further when a country has multiple parts, such as islands).
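For a quick sanity check of that structure, you can peek at the data frame directly (this assumes the maps package is installed, since map_data() relies on it):

```r
# Inspecting the map data: one row per outline vertex; 'group' ties
# vertices into polygons and 'region' holds the country name.
library(ggplot2)

world <- map_data("world")

head(world[, c("long", "lat", "group", "region")])
length(unique(world$group))   # number of distinct polygons (country pieces)
```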
Now plotting the world map outline is straightforward with geom_polygon:
ggplot(data = world, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
theme_void()
What does this do?
- We set up ggplot with the world data and aesthetics x = long, y = lat, and group = group. Grouping is crucial here: it ensures that when drawing the polygons, ggplot knows which points belong together. If we didn't group, it might try to connect all the world's points into one giant polygon (a big mess). The group variable provided by map_data identifies each country's outline (or each separate piece of it) as its own group.
- geom_polygon(fill = "white", color = "black") draws the polygons. We fill them white (so land is white) and outline them in black. That gives a classic outline map (like a simple political map with no fill color). We could fill with another color, or even map fill to a variable: if we had data per country, we could join it in and map fill to that value to make a choropleth. But here we're just showing the base map.
- theme_void() is an excellent choice for maps: it removes axes, grid, background, and so on. We usually don't need a coordinate frame for a world map; latitude/longitude lines can be added if relevant, but a clean map is often better without them. theme_void() leaves just the plot panel with no annotations, perfect for maps and other diagram-like plots.
- The result is a flat projection of the world (by default, map_data("world") uses a simple longitude-latitude, i.e., equirectangular, projection). Countries are outlined in black on a white background. It might not be the fanciest map, but it's a start.
Basic world map: an outline map of the world generated with ggplot2 (using geom_polygon). Countries are drawn as white shapes with black borders, and the projection is the default long-lat. Notice the clean appearance due to theme_void() (no axes or gridlines).
If you run the above, you’ll see all countries. Perhaps small islands may appear as tiny specks or not at all depending on resolution. But major landmasses will be there.
Now, often we want to zoom into a specific region or subset of the world. The code suggests an example focusing on the Americas. They created a subset:
americas <- subset(world, region %in% c("USA", "Brazil", "Mexico", "Colombia",
                                        "Argentina", "Canada", "Peru", "Venezuela",
                                        "Chile", "Guatemala", "Ecuador", "Bolivia",
                                        "Cuba", "Honduras", "Paraguay", "Nicaragua",
                                        "El Salvador", "Costa Rica", "Panama",
                                        "Uruguay", "Jamaica", "Trinidad and Tobago",
                                        "Guyana", "Suriname", "Belize", "Barbados",
                                        "Saint Lucia", "Grenada",
                                        "Saint Vincent and the Grenadines",
                                        "Antigua and Barbuda", "Saint Kitts and Nevis"))
This is a long list of country names covering North, Central, and South America plus the Caribbean. The world data frame has a region column containing country names (and possibly some geopolitical quirks, e.g., French Guiana may appear as part of France). Subsetting to these names effectively filters the dataset down to just those countries.
Then they plot americas similarly:
ggplot(data = americas, aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
coord_fixed(ratio = 1.1, xlim = c(-180, -35)) +
theme_void()
Differences from before:
- We use the americas data, which is already grouped by country as needed.
- After drawing the polygons, they added coord_fixed(ratio = 1.1, xlim = c(-180, -35)). coord_fixed fixes the aspect ratio of the plot. coord_map() could be used for true map projections, but coord_fixed with the right ratio can approximate one. The ratio of 1.1 was likely chosen empirically to account for latitude-versus-longitude scaling in this region so the countries look visually correct (a 1:1 aspect works at the equator; away from it, a correction factor is needed). The fixed ratio ensures one unit on x is drawn the same length as one unit on y (times that multiplier); without it, if your plotting window isn't square, the map could stretch.
- They set xlim = c(-180, -35) to restrict the longitude range from 180°W to 35°W. That covers from the mid-Pacific to roughly the mid-Atlantic: Alaska is included on the left, while Africa and Europe (which begin around 20°W and eastward) are cut off on the right. Why -35 and not, say, -30 or 0? Probably to show only the Americas without a sliver of Africa, while still including all of South America, whose eastern tip (in Brazil) reaches about 34°W.
- They didn't specify ylim, so vertically the plot includes whatever latitudes are present in the data (from northern Canada/Greenland down to the southern tip of South America around 55°S), showing the entire Americas.
- theme_void() again gives a clean look.
The resulting map shows just the Western Hemisphere. Fixing the aspect ratio is important so that latitude spacing is not distorted; otherwise, when you specify xlim, R might auto-adjust the aspect and make countries look squashed. coord_fixed ensures one degree of latitude is plotted at about the same length as one degree of longitude (times the 1.1 factor, a slight tweak likely chosen so the map looks right and not too elongated).
We could refine it further by excluding Antarctica if it sneaks in (it's not in the region list, so we're fine) or by including more of the small Caribbean states (a number of them are already listed). Some very small ones may not appear clearly at this map scale; their outlines are simply too small to see.
One could fill countries with different colors, either driven by data or simply to differentiate neighboring countries. Here all fills are white, which is fine if the goal is just an outline map. To color by region or by a value, you'd map fill to a variable: for example, with unemployment rates by country for the Americas, you could merge that data into americas, set aes(fill = unemployment), and add scale_fill_gradient() or similar.
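As a hedged sketch of that choropleth idea, with invented unemployment numbers used purely for illustration:

```r
# Choropleth sketch: join made-up per-country values onto the map data,
# then map fill to that value. The rates below are invented.
library(ggplot2)
library(dplyr)

rates <- data.frame(
  region       = c("USA", "Brazil", "Mexico", "Canada", "Argentina"),
  unemployment = c(3.9, 11.9, 3.4, 5.7, 9.8)   # illustrative numbers only
)

world <- map_data("world")

shaded <- world %>%
  filter(region %in% rates$region) %>%
  left_join(rates, by = "region")   # attaches the value to every vertex row

p <- ggplot(shaded, aes(x = long, y = lat, group = group,
                        fill = unemployment)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  coord_fixed(ratio = 1.1) +
  theme_void()
p
```

Note the join is on the vertex-level data, so each country's value is repeated across all of its outline points; ggplot then fills each polygon by that value.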
We should also mention coord_fixed versus coord_quickmap: ggplot has a coord_quickmap() that sets the aspect ratio based on latitude without performing a true projection transform, and it's often used for quick maps. coord_fixed works when you're essentially treating the map as flat.
Now, creating maps in R can get far more advanced:
- Using geom_sf with sf objects for true projections and full spatial-object handling is the modern approach.
- Adding layers like points or paths (for data on top of maps) is straightforward once you have the base map.
- Interactive maps can be made with leaflet (an htmlwidget).
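As a minimal taste of that interactive route, here is a hedged leaflet sketch (the coordinates are an arbitrary illustrative point, roughly Ottawa):

```r
# Minimal interactive map: OpenStreetMap basemap plus one marker.
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                                  # default OpenStreetMap tiles
  setView(lng = -75.7, lat = 45.4, zoom = 3) %>%
  addMarkers(lng = -75.7, lat = 45.4, popup = "A labelled point")

m   # in RStudio/Positron this renders as an interactive widget
```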
In our context, we showed static maps which serve either as context (like showing location of something) or as an analysis (show pattern across geography). Even with a static outline, one might annotate it (e.g., mark specific cities or areas on it).
Important principles for maps:
- Integrity: Use appropriate projections and don’t distort the visual in misleading ways. For example, map projections can vastly change how big areas appear (Mercator makes high-latitude countries look huge compared to equatorial). If you are comparing areas, choose an equal-area projection. Here, we just did simple long-lat (which is not equal area, but for a simple outline maybe fine – Greenland will look big, etc.). For thematic maps, consider projection choice as part of integrity.
- Right display: A map is the right display when location is a key part of the data or message. If location isn’t relevant, don’t use a map just for decoration – that would violate integrity and simplicity.
- Keep it simple: We applied theme_void(), which is a great way to reduce clutter. We could also drop tiny islands if they clutter the view, or use lighter boundaries for less important regions, depending on the focus.
- Color strategically: In our outline maps, color is binary (land vs. ocean). We could highlight a particular country with a different fill if telling a story ("here is Country X we're focusing on"). If showing data, use a sensible color gradient or set of categories. Also mind the land/sea colors: land in a light neutral with water in blue typically looks good, though for monochrome printing it may not matter. We used white land on a white background, so the background merges with the land except for the borders; that's fine for an outline-only look.
Mapping with ggplot2 is a huge topic itself; here we covered the basics enough to integrate into our chapter.
(One more note on coord_fixed(ratio = 1.1): a common rule of thumb sets the ratio to 1/cos(lat0), where lat0 is the map's central latitude, to account for latitude scaling. A ratio of 1.1 corresponds to roughly 1/cos(25°), implying a map centered at low-to-mid latitudes; it was possibly just trial-and-error to make the shapes look undistorted. It's not exact, but it's fine for visualization.)
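The rule of thumb is easy to check in base R; the reference latitude of 25° below is our assumption, chosen because it reproduces the 1.1 ratio:

```r
# The 1/cos(latitude) aspect-ratio rule of thumb, in base R.
aspect_for <- function(lat_deg) 1 / cos(lat_deg * pi / 180)

round(aspect_for(25), 2)   # 1.1  -> matches the ratio used above
round(aspect_for(45), 2)   # 1.41 -> what you'd use centering on mid-latitudes
```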
Now we’ve gone through several examples of creating visualizations in R using ggplot2, applying our principles along the way.
Before moving to more advanced topics, it’s beneficial to practice these basics. The chapter likely includes some “Get Your Hands Dirty” exercises to reinforce these skills, which we will walk through next.
4.8 Getting Your Hands Dirty (Practice Exercises)
To truly learn data visualization, one must practice making charts and adjusting them. The following exercises use a dataset provided in the course repository (Chapter 10 on GitHub) to recreate specific figures. This hands-on approach will help cement the process of building a plot from data.
(The exercises span multiple steps, referencing chapters 6–9 for data wrangling steps and chapter 10 for plotting. We will focus on the visualization part but also ensure the data is prepared correctly.)
Step 1: Import the data. The data for this chapter's examples is in chapter10data.csv. Given the name gdp5 that appears later, the file likely contains GDP information for five entities (probably countries) over several years, since subsequent steps filter for the year 2017 and build the line chart and bar chart of Figures 10-1 and 10-2.
Use readr::read_csv
to load it:
library(readr)
gdp5 <- read_csv("chapter10data.csv")
Now, gdp5
should be a data frame. We don’t have the printout here, but likely it has columns like Country (or similar) and Year and GDP values, or maybe each column is a country (since later they mention needing to lengthen data, which suggests it might initially be wide).
Step 2: Create a line chart (Figure 10-1). The task is to reproduce a given Figure 10-1 which presumably is a line chart of GDP over time for some countries. We need to use gdp5
data.
First, we need to know what format gdp5 is in. The hint that a pivot_longer comes later suggests it might be wide: e.g., columns like Country, 2013, 2014, 2015, 2016, 2017 (a typical layout). Or it could already be long, with columns Country, Year, and GDP. In Step 3 they filter year == 2017 into gdp6, implying gdp5 has a column named year, so it is probably already long (or at least partly so). Reading ahead adds some confusion: Step 5 says to lengthen gdp2 into gdp3 with pivot_longer, which suggests earlier steps (Steps 1-4, in chapters 6-9) already created a gdp2, a trimmed or lightly processed version of the raw data. But since Step 3 filters gdp5 by year directly, the simplest reading is that gdp5 is already long (with a year column).
However, they might have done:
- Step 4 perhaps created gdp2 which is a wide version (somehow), and Step 5 pivots it to gdp3.
- But then why filter year on gdp5? Possibly gdp5 was already partially long or a different cut.
It might be the case that:
- gdp5 is already long (maybe containing data for 5 countries over multiple years).
- Then Step 3 filter year 2017 to get gdp6 for the bar chart (Figure 10-2).
- Step 5 is a separate thread from earlier chapters (maybe gdp2 came from an earlier exercise where they had a wide dataset and now they demonstrate how to pivot it; gdp3 becomes a tidy version).
- This might be an attempt to tie multiple exercises together. Possibly gdp5 is the combined cleaned data for plotting, while gdp2 was raw or something. Without the earlier context, let’s assume for making the charts, gdp5 is already tidy.
So for the line chart (Fig 10-1): We want a line (or lines) over years. Likely multiple lines, one per country, since usually GDP comparisons are among countries.
So we do:
ggplot(gdp5, aes(x = year, y = GDP, color = country)) +
geom_line(size = 1) +
xlab("Year") + ylab("GDP (in ... units)") +
theme_minimal() +
labs(color = "Country")
This is a generic structure. To match Figure 10-1 exactly, we’d need to know specifics like:
- Which countries are included (maybe 5 countries, hence gdp5).
- The axes labeling and units.
- The color scheme (we could use default or something specific).
- Perhaps they want points too or not? The instructions say “Using the imported gdp5 data, recreate the line chart shown in Figure 10-1.” They likely expect a multi-line chart with legends, etc.
We already went through an example of a multi-line chart above (for employment by industry). Same principles apply: Map year to x, GDP to y, country to color. Use geom_line (and maybe geom_point if the figure shows markers). Ensure labels and theme match the figure.
This exercise tests understanding of ggplot basics:
- Mapping aesthetics
- Adding geom_line
- Possibly customizing scales (if particular colors or breaks are needed)
- Labeling axes appropriately.
If, for example, Figure 10-1 had a title or certain colors, the student could add those. But from the given text, they only explicitly mention axes and legend.
We might not have the actual data values here, but at least conceptually: If one of the countries had much higher GDP, it will dominate the y-axis scale, etc. We trust the exercise.
So the answer might look like:
ggplot(gdp5, aes(x = year, y = gdp, color = country)) +
geom_line(size = 1.2) +
geom_point() + # if the figure has points
labs(x = "Year", y = "GDP (US$ billions)", color = "Country") +
theme_minimal()
(Adjust the theme or colors to match if needed, e.g., scale_color_brewer(palette = "Set1") for more distinct colors if the defaults don't look right.)
Step 3: Subset data for 2017. Now they instruct to filter for year 2017 and store in gdp6:
library(dplyr)
gdp6 <- gdp5 %>% filter(year == 2017)
This makes gdp6 a dataset of GDP values only for the year 2017 (for each country). This is essentially one row per country (if one entry per year originally). They want to use this for Figure 10-2, which is likely a bar chart comparing those countries in 2017.
Step 4: Create a bar chart (Figure 10-2). Now using gdp6 (only 2017 data), we can make a bar chart of GDP by country.
We can use geom_col(), which is shorthand for geom_bar(stat = "identity"):
ggplot(gdp6, aes(x = country, y = gdp, fill = country)) +
geom_col() +
xlab("Country") + ylab("GDP in 2017 (US$ billions)") +
theme_minimal() +
guides(fill = "none")
We map fill = country just to give each bar a distinct color (possibly the same colors the lines had; if we use the same palette and don't drop factor levels, ggplot may even carry them over). We then remove the legend with guides(fill = "none"), the current ggplot2 idiom (older code used guides(fill = FALSE)), because each bar is a country and the x-axis already labels them, making a legend redundant. Often we color bars just to look nice while the axis labels say what each bar is, so the legend can be omitted.
Alternatively, we could choose a single fill color for all bars (if the figure was monochrome or just one color with no legend). The figure might not have differently colored bars since we already label categories on x-axis. If that’s the case, we’d do fill="skyblue"
or something static and not use fill mapping. But since in line chart we used color for country, maybe to keep consistency or visual grouping, they want same colors in bar chart for those countries. That’s plausible if you want to reinforce which country is which color in a report.
However, if you’re printing in grayscale or need simplicity, single color bars with category labels is fine.
We’ll assume they want color variety (since they explicitly mention highcharter and stuff later, maybe they like colors).
So the bar chart shows, say, GDP of Country A, B, C, D, E in 2017 side by side. Possibly sorted by some order? The default would be alphabetical by country unless the factor levels were in a different order (or if in gdp5, country was a factor ordered by something). Maybe they want descending order? They didn’t mention reordering, but often for bar charts it’s nice to sort descending. They might not expect students to know factor reordering at this stage unless taught.
The given references had something about reordering factors (in the Data Carpentry doc or cheat sheets). But anyway.
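For readers who do want sorted bars, here is a small sketch using base R's reorder() on invented toy values (not the exercise data):

```r
# Descending-order bars: reorder() sets factor levels by -gdp so the
# tallest bar comes first. The GDP numbers below are invented.
library(ggplot2)

gdp_toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  gdp     = c(320, 150, 480, 90, 210)
)

p <- ggplot(gdp_toy, aes(x = reorder(country, -gdp), y = gdp)) +
  geom_col(fill = "steelblue") +
  labs(x = "Country", y = "GDP (toy values)") +
  theme_minimal()
p   # bars now appear largest-first: C, A, E, B, D
```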
So the key is: use gdp6 with geom_col() (i.e., geom_bar with stat = "identity").
Thus, two plots made: Figure 10-1 (line chart, likely multiple lines for multiple years for each country) and Figure 10-2 (bar chart of 2017 values).
This practice covers using filter()
to get subset data, and plotting different chart types accordingly.
At this point, a student would have practiced:
- Line plot for time series comparison.
- Bar chart for single-year comparison across categories.
- Basic ggplot2 syntax and theming.
The next steps in the book return to data manipulation: Step 5 is pivoting data, which seems to correspond to something from earlier (perhaps converting gdp2 from wide to long as gdp3). It appears under "GYHD (Continued Practice)" likely because chapters 6-9 covered the earlier steps (Steps 1-4 may have cleaned the data and saved it as gdp2 or gdp5; the naming is unclear).
Anyway, they instruct: Step 5: Lengthen the data (wide to long). They mention going from gdp2 to gdp3. We don't have context for gdp2, but presumably it is a wide version of the data (perhaps GDP of countries over years, with each year as a column, or each country as a column). The goal is to use pivot_longer from tidyr to make it long.
They provide:
gdp3 <- gdp2 %>% pivot_longer(
  cols = -country,
  names_to = "year",
  values_to = "gdp"
)
This indicates:
- gdp2 presumably had a column country plus other columns, which are years.
- cols = -country means take all columns except country and pivot them longer (so all those columns are presumably years).
- names_to = "year" means the names of those columns (e.g., "2015", "2016") become values in a new column called year.
- values_to = "gdp" means the cell values that were in those year columns go into a new column gdp.
After this, gdp3 will have three columns: country, year, gdp. Exactly the tidy format we used for plotting.
This is a critical data transformation step: converting data from a wide format (one row per country, columns for each year’s GDP) to a long format (one row per country-year combination, with year as a value and GDP as value). Why? Because ggplot2 and other tidy tools generally prefer data in long format. For example, to make the line chart earlier, we needed one column for year, one for GDP, and one to distinguish country – this is exactly what gdp3 provides. If we had gdp2 (wide), we could not directly map multiple year columns to the y-axis. We’d have to pivot or use multiple geom_line calls manually for each column (tedious). So pivot_longer simplifies that greatly.
They note “(If gdp2 had columns Country, 2015, 2016, 2017, the new gdp3 will have multiple rows for each country…)” – which is basically elaborating the result of pivot_longer. Indeed, if gdp2 looked like:
country | 2015 | 2016 | 2017 |
---|---|---|---|
USA | 18000 | 18500 | 19000 |
China | 11000 | 12000 | 13000 |
… | … | … | … |
Then gdp3 will look like:
country | year | gdp |
---|---|---|
USA | 2015 | 18000 |
USA | 2016 | 18500 |
USA | 2017 | 19000 |
China | 2015 | 11000 |
China | 2016 | 12000 |
China | 2017 | 13000 |
… | … | … |
Which is exactly what we need for a plot. After this transformation, “now gdp3 is ready for visualization tasks” as the text might say.
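To see the transformation run end to end, here is a self-contained version using the toy numbers from the tables above:

```r
# Rebuild the wide table from the example, then pivot it to long form.
library(tidyr)

gdp2 <- data.frame(
  country = c("USA", "China"),
  "2015"  = c(18000, 11000),
  "2016"  = c(18500, 12000),
  "2017"  = c(19000, 13000),
  check.names = FALSE   # keep "2015" etc. as literal column names
)

gdp3 <- pivot_longer(gdp2,
                     cols      = -country,
                     names_to  = "year",
                     values_to = "gdp")

gdp3   # 6 rows: one per country-year pair
# Note: year comes back as character; use as.integer(gdp3$year) before
# plotting it on a continuous axis.
```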
They likely want the student to realize the importance of long format.
Conclusion of practice: The student has:
- Imported data (Step 1).
- Created a multi-line time series plot (Step 2).
- Filtered data for a specific year (Step 3).
- Created a comparative bar chart (Step 4).
- Transformed data from wide to long (Step 5). (This step 5 presumably would have enabled step 2 if needed, but maybe it was not needed because gdp5 was already long? Possibly gdp5 was actually the result of such pivot with only a subset of countries or something. The naming is a bit unclear.)
Anyway, the exercise demonstrates a typical workflow: Collect/prepare data -> visualize over time -> focus on a single year -> ensure data is tidy.
Such practices help solidify the interplay between data manipulation and visualization.
Having done these R visualizations, the chapter might then transition to discussing other tools or enhancements. In fact, the user text goes on to mention Tableau in R (esquisse and shinytableau) and advanced visualizations (r2d3, htmlwidgets). So let’s cover those.
4.9 Tableau in R
Tableau is a popular visualization tool known for its drag-and-drop interface and interactivity. While R is code-driven and can create very customized visuals, sometimes users, especially beginners or those coming from a GUI background, appreciate a more interactive GUI to create plots. The good news is, R has solutions to accommodate such needs, effectively bringing a Tableau-like experience into RStudio or combining R with Tableau’s capabilities.
Two tools highlighted are esquisse and shinytableau.
esquisse: Drag-and-Drop ggplot2 Interface
esquisse is an R package that provides a graphical user interface (GUI) for building ggplot2
plots. It’s often described as a Tableau-like add-in for R. When you launch esquisse (typically via the RStudio Addins menu), it opens a gadget where you can select a data frame and then assign variables to the x-axis, y-axis, color, size, etc., choose a geom (bar, line, boxplot, etc.), and tweak some options – all with mouse clicks and drag-and-drop. As you do this, it’s constructing the ggplot code in the background, which you can then extract.
Key features of esquisse:
- Drag-and-Drop Plot Construction: You literally drag a column name onto an “X” axis drop zone, another onto “Y”, maybe one onto “Color” or “Fill”. You immediately see a preview of the plot.
- Choose Geoms Easily: There are menu options to switch between common chart types. It might even suggest a default based on data types (e.g., if X is categorical and Y is numeric, it might start with a bar chart).
- Facetting and Filtering: It provides ways to facet by a variable or apply filters to the dataset within the interface.
- Interactive Aesthetics Tuning: You can change titles, colors, themes through GUI controls.
- Code Generation: Perhaps the best part for learning – as you build the plot with the GUI, esquisse generates the corresponding ggplot2 code. You can copy this code out to refine further or save it. This helps beginners see how their actions translate to code (a great learning mechanism).
The advantage of esquisse is that it lowers the barrier for R novices or those who are not comfortable remembering ggplot syntax. It’s also a quick way to explore data visually without writing code, which can speed up insight discovery (like a quick EDA tool). Essentially, it tries to marry the ease of Tableau with the power of ggplot2.
From a workflow perspective, you might use esquisse to quickly prototype a chart, then copy the code and integrate it into your R script or R Markdown for final output, tweaking as necessary (maybe to add custom annotations or do things the GUI might not support).
The user text references that this add-in is Tableau-like and meant for a drag-and-drop interface for ggplot. Indeed, an article by Appsilon (Radečić 2023) introduced esquisse to a wider audience. They emphasized that the interface is familiar to Tableau users, and you can generate ggplot2 charts with just clicks. It’s basically bringing the “no-code” or “low-code” approach to R plotting.
One can imagine an IB student or any analyst not experienced in coding might initially use esquisse to build required charts for an assignment and gradually pick up the code. It’s also great for quick hypothesis testing: “What if I put X vs Y, colored by Z?” – do it in a few seconds with esquisse rather than writing code from scratch (which, if you know ggplot well isn’t too slow, but for complex data it might be).
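For reference, launching the gadget takes one line (sketched here on the built-in mtcars data; the call is wrapped in interactive() so it is a no-op in scripts):

```r
# Launching esquisse opens the drag-and-drop gadget (interactive only).
library(esquisse)

if (interactive()) {
  esquisser(mtcars)   # or esquisser() and pick a data frame inside the GUI
}
```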
(In the course of our discussion we recall that the user had a demonstration GIF or mention of esquisse – perhaps showing a drag and drop in action. Since we can’t show that here, the description suffices.)
shinytableau: Embedding R Shiny in Tableau
While esquisse brings Tableau-like functionality into R, shinytableau goes the other direction: it brings R’s capabilities into Tableau.
shinytableau is an R package (developed by people at RStudio, now Posit, including Joe Cheng) that allows R programmers to create Tableau Dashboard Extensions using Shiny. Tableau (since version 2020.3) supports extensions, which are essentially web apps you can embed in a dashboard for functionality not native to Tableau. shinytableau acts as a bridge: you write a Shiny app in R and then deploy it as an extension (a .trex file) that Tableau can use.
Why is this cool or useful?
- Tableau has a limited set of native chart types and calculations. If you want something advanced that Tableau can’t do out-of-the-box (for instance, a complex statistical visualization, or an interactive control that isn’t provided, or integrating with R’s machine learning packages to do something dynamic), you historically would be out of luck or have to hack around it.
- With shinytableau, you can leverage the full power of R (and its packages) inside a Tableau dashboard. For example, you could embed a ggplot2 chart that Tableau doesn’t support (like a violin plot, as their motivating example suggests). Or run an R model on data and show the results interactively.
- This allows better collaboration between data science teams (who use R/Python) and BI teams (who use Tableau). The data scientist can package their R work into a piece that fits into the existing Tableau dashboards of the company, rather than saying “please use this separate R Shiny dashboard”. So it integrates workflows.
From the documentation:
“The shinytableau package allows you to easily extend Tableau dashboards using the power of R and Shiny. In typical Shiny fashion, it’s not necessary to know web technologies like HTML, JavaScript, and CSS to create compelling Tableau extensions.”
So, R developers don’t need to become JavaScript developers to make an extension; shinytableau handles the behind-the-scenes of hooking Shiny into Tableau’s extension API.
How it works (simplified):
- You create a Shiny UI that defines what the extension will show (this might be a plot output, some controls, etc.).
- You define how it can get data from Tableau (the extension can receive data from the Tableau dashboard, e.g., data underlying a sheet, or filters).
- You wrap it with some metadata (shinytableau helps generate a .trex file which is basically a manifest telling Tableau how to render this extension and where).
- In Tableau, a user adds an “Extension” object to a dashboard, points it to your .trex (which if deployed, points to a running Shiny app or uses RStudio Connect, etc.), and then it appears as a panel or element in the dashboard.
So for example, the violin plot scenario: Tableau can’t draw violin plots natively. But R can (with ggplot2 or others). The shinytableau team provided an example where they embed a violin plot created by ggplot2 into Tableau. The Tableau user can configure which data to send to the extension (like choose a dimension and measure for the violin plot). Then the Shiny app generates that violin using R and displays it. Now you have a violin plot alongside your normal Tableau charts, expanding what insights you can show. Another example could be running k-means clustering in R on data and showing an interactive cluster plot in Tableau – something not straightforward in pure Tableau.
An IB student or teacher who uses Tableau for some tasks but wants to fold in advanced R analysis could benefit from this, though shinytableau is more likely to be used in professional environments.
The key point for our chapter: shinytableau demonstrates how R is not an island; it can work with other tools. It underscores R’s flexibility:
- If you prefer code, use R directly (ggplot).
- If you prefer GUI, use R’s esquisse or use Tableau.
- If you want the best of both, mix them with solutions like shinytableau.
This also prefigures the next section on Positron: Posit (formerly RStudio) is focused on bridging languages and platforms, and shinytableau is one such bridge, connecting R with a popular BI tool.
In summary, esquisse is for making plots in R without coding much, and shinytableau is for using R’s power within Tableau dashboards. Both lower the barriers for creating and sharing data visualizations:
- Esquisse lowers the barrier for R users (especially beginners).
- Shinytableau lowers the barrier for Tableau users to incorporate R enhancements, and for R developers to deliver results in the Tableau ecosystem.
These tools show how R can be both easier to use (with GUIs) and better integrated with enterprise tools. One might say that esquisse brings a Tableau-like interface into RStudio, while shinytableau brings R’s capabilities into Tableau. In practice, this synergy can be very powerful.
With these integrations covered, we turn to resources and more advanced possibilities for data visualization in R.
4.10 Useful Documentation
When creating visualizations, especially complex ones, you often need to fine-tune many details (legend appearance, axis breaks, annotations, etc.). It’s impossible to memorize every function or argument. That’s where documentation and community resources are invaluable.
For ggplot2, some key resources include:
- The official ggplot2 documentation site (or the reference section in R help). The tidyverse website for ggplot2 (as shown in references) provides a well-organized listing of functions and examples. For instance, want to know how to adjust themes? There’s a reference page for theme elements. Want to see all geoms? There’s an index (geom_point, geom_line, etc.).
- The ggplot2 book (“ggplot2: Elegant Graphics for Data Analysis” by Hadley Wickham) – great for understanding deeper principles and extensions.
- Cheat Sheets: RStudio provides a handy cheat sheet for Data Visualization with ggplot2. It’s basically a two-page quick reference that shows common geoms and how to use them, how to map aesthetics, how to facet, sample code for themes, etc. Many people pin this cheat sheet by their desk when working with R. It’s very helpful when you forget, say, the name of a certain scale function or the syntax for element_text in theme.
- Community examples and Q&A: The R community is vibrant. Websites like R Graph Gallery (r-graph-gallery.com) have hundreds of example plots (with code) showcasing how to achieve certain visuals. If you want to do something slightly off the beaten path, chances are someone has posted an example. Also, Stack Overflow is filled with questions and answers for specific ggplot tweaks (“How to place legend outside plot”, “How to add labels to each bar”, etc.). Searching those can save you time and teach you new tricks.
- ggplot2 extension gallery: As mentioned, ggplot2 can be extended by other packages. There is an official site tracking these extensions, which is great to find if someone has already made a package for a certain type of chart (like ggmap for maps, gganimate for animations, ggthemes for extra themes, etc.). The references even mention “tidyverse/ggplot2 README” that likely points to that extension gallery.
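As a taste of the fine-tuning these references cover, here is a hedged sketch of common theme adjustments (rotated axis labels, legend position, a built-in theme) on the built-in mtcars data:

```r
# Typical "look it up in the docs" tweaks: rotating axis text,
# moving the legend, and applying a built-in theme.
library(ggplot2)

p <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot() +
  # theme_minimal() is one of the themes listed at
  # ggplot2.tidyverse.org/reference/ggtheme.html
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")
print(p)
```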
A practical tip worth highlighting is reordering factors for plotting. By default, bar charts draw factors in the order of their levels (often alphabetical), but we usually want bars ordered by value. That requires reordering the factor levels based on the data, using fct_reorder() from the forcats package or base R’s reorder() function; for instance, reorder a country factor by GDP before plotting so that the bar chart comes out sorted.
In any case, the key advice is:
- Use documentation for fine control over ggplot elements. For example, how to rotate axis text, how to manually set colors, how to increase figure margins – all these are in docs.
- Look at examples/galleries to learn how to create less common charts or to get inspiration. For example, the R Graph Gallery might show a beautiful circular bar chart or a network diagram with ggplot2, which isn’t in the official docs because it’s not a basic chart – but someone figured it out and shared.
- The community extension gallery reveals packages like treemapify, ggridges (ridgeline plots), ggforce (for additional geoms and functionalities), etc. If you need something special (like a heatmap or venn diagram or even humorous things like ggjoy which became ggridges), likely an extension exists.
Example (reordering factors): if you want bars sorted by height, reorder the factor before plotting:
gdp6$country <- with(gdp6, reorder(country, -gdp))
This reorders country by descending GDP, so the subsequent bar chart is sorted from largest to smallest. It is a common tip for improving clarity, in line with the principles of simplicity and telling a clearer story.
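A self-contained version of this tip, with a small stand-in data frame (the chapter’s gdp6 object is assumed to have country and gdp columns):

```r
# Reordering factor levels by value so bars plot largest-first.
# gdp6 here is a made-up stand-in for the chapter's data.
gdp6 <- data.frame(country = c("A", "B", "C"),
                   gdp     = c(5, 20, 12))

# Order country levels by descending GDP (base R; fct_reorder from
# forcats achieves the same thing).
gdp6$country <- with(gdp6, reorder(country, -gdp))
levels(gdp6$country)
# → "B" "C" "A"

# A subsequent bar chart would now draw bars from largest to smallest,
# e.g. ggplot(gdp6, aes(country, gdp)) + geom_col()
```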
In summary, no one expects you to memorize all of ggplot2. Knowing that resources exist and how to use them is part of being an effective data visualizer:
- If something isn’t working, check the documentation or Google it.
- If you wonder “Can ggplot do X?”, search for “ggplot [what you want]” – likely someone has asked it or an example exists.
- Keep cheat sheets and references at hand while you work on projects.
The Data Carpentry R lesson models this habit of consulting documentation: it tells readers that “The complete list of themes is available at ggplot2.tidyverse.org/reference/ggtheme.html”.
The ggplot2 extension gallery is also worth revisiting here: it shows how far the community has pushed ggplot2, including into animation and interactivity. That leads into the next section of the chapter on advanced visuals: D3 and htmlwidgets.
4.11 Advanced Visualizations
While ggplot2 covers a vast majority of static visualization needs, sometimes you want to go beyond static charts – maybe to interactive visualizations, web-based visuals, or highly custom graphics that aren’t easy to do with existing R packages.
The chapter highlights two major pathways for advanced visuals:
- Using D3.js via R (the r2d3 package).
- Using htmlwidgets (R interfaces to various JS libraries).
D3 via r2d3
D3.js (Data-Driven Documents) is a powerful JavaScript library for creating interactive and highly customizable web visualizations. Many modern interactive charts you see on the web (like those in the New York Times or The Guardian) are often powered by D3. It allows fine-grained control, animations, and dynamic behavior – basically anything you can do in a web browser with SVG or Canvas, D3 gives you tools to bind data to it.
However, writing D3 code means writing JavaScript, which many R users might not be familiar with. The r2d3 package is created to facilitate using D3 from R:
- It lets you write D3 code but in a way that can be embedded in R contexts (like in RStudio viewer, in Shiny, or R Markdown).
- It handles the communication of data from R to the D3 script. As the README states, r2d3 provides a suite of tools for using D3 with R, including translating R objects to D3-friendly data structures and rendering D3 output in RStudio or notebooks.
- It wraps a D3 script in a function call, r2d3(), so that you can, for example, keep your D3 code in a file plot.js and then call r2d3(data = mydata, script = "plot.js") in R. This renders the D3 visualization using mydata in RStudio’s viewer or in an R Markdown HTML output.
- You can also distribute D3 visualizations as htmlwidgets or in Shiny apps using r2d3, and it integrates with R Markdown for knitting interactive documents.
The advantage: anything that can be done in D3 can be integrated into your R workflow. This means you’re not limited to ggplot2’s chart types or styles. If you dream up a custom interactive visualization (say an interactive network graph, or a map with animated transitions, or a novel visualization technique you saw in a research paper), if you’re willing to write some D3, you can implement it and still feed it with R data easily.
For example, if you find a cool D3 example on bl.ocks.org (a popular gallery of D3 examples), you could take the D3 code and, with minimal adjustments, use r2d3 to feed it data from R and display it; the r2d3 README itself suggests adapting examples from D3 galleries such as bl.ocks.org.
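A minimal, self-contained sketch of that workflow, adapted from the bar-chart example in the r2d3 README (the D3 snippet is written to a temporary file here so the example runs as-is; the exact README code may differ slightly):

```r
# Feed an R vector to a small D3 script via r2d3. The script receives
# the variables data, svg, width, and height from r2d3.
library(r2d3)

js <- "
var barHeight = Math.ceil(height / data.length);
svg.selectAll('rect')
  .data(data)
  .enter().append('rect')
    .attr('width', function(d) { return d * width; })
    .attr('height', barHeight)
    .attr('y', function(d, i) { return i * barHeight; })
    .attr('fill', 'steelblue');
"
script <- tempfile(fileext = ".js")
writeLines(js, script)

# Renders in the RStudio Viewer or an HTML R Markdown document.
w <- r2d3(data = c(0.3, 0.6, 0.8, 0.95, 0.40, 0.20), script = script)
```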
Voronoi diagrams are a good illustration of what this unlocks: they are not typical base R or ggplot2 charts, but D3 can draw them, so r2d3 is a natural way to display one from R, an example of what is possible beyond ggplot2.
An example of usage might be:
- You have a very custom requirement: e.g., an interactive timeline where hovering highlights connections, etc. Instead of waiting for someone to write an R package for it, you can write (or adapt) a D3 script.
- R sends data to it (r2d3 handles converting R data frames or vectors to JSON, etc.).
- The result appears in, say, your R Markdown output. If that output is HTML, the D3 visualization is fully interactive in the browser; in the RStudio Viewer you see it not as a static image but as an interactive widget.
Thus, r2d3 opens the door to web-native visualizations while staying in R.
It does require some knowledge of JavaScript and D3’s way of thinking (which has a learning curve). But for those willing, it’s rewarding, because D3 is extremely powerful.
Think of it like: ggplot2 gives you high-level abstraction but finite options, whereas D3 gives you low-level control but infinite possibilities (with more code needed).
In short, r2d3 lets R users harness the full power of the D3.js library for custom, interactive visualizations: it translates R objects into D3-friendly data structures and renders the results.
RStudio (now Posit) integration is also good: RStudio v1.2 added support for previewing D3 scripts as you develop them. The team has clearly worked to make R a friendly place to create D3 visualizations by bridging the gap between the two worlds.
So if you can’t find a ggplot or widget solution, and you’re up for it, r2d3 is your go-to for ultimate customization.
htmlwidgets
While r2d3 is one specific way focusing on D3, htmlwidgets is a broader ecosystem in R that allows integration of many JavaScript visualization libraries through ready-made packages.
An htmlwidget in R is basically a special object (usually created by some package’s function) that knows how to render itself as HTML/JavaScript in a web context (like RStudio viewer, Shiny, or RMarkdown’s HTML output). R provides many such widgets so that you don’t have to write JS yourself.
Some popular htmlwidgets:
- leaflet: interactive maps (pan, zoom, markers, etc.)
- plotly: interactive plots (built on plotly JS library, can even take a ggplot and make it interactive with tooltips)
- DT: interactive data tables (for viewing data frames with sorting, filtering in a webpage)
- networkD3: interactive network graphs using D3
- highcharter: interface to Highcharts (a powerful JS chart library) for interactive charts
- googleVis or googleCharts: interfaces to Google Charts
- visNetwork: another network viz package
- rgl and threejs widgets: for 3D graphics in browser
- crosstalk: a framework to link widgets for coordinated filtering (multiple htmlwidgets reacting to common inputs)
- And many more: there’s a gallery (gallery.htmlwidgets.org) showing lots of them.
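To make the ease of use concrete, here is a hedged sketch turning a static ggplot into an interactive chart with plotly’s ggplotly():

```r
# One line converts a static ggplot into an interactive htmlwidget
# with hover tooltips, zooming and panning.
library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point()

w <- ggplotly(p)  # interactive version of the same chart
```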
The great thing about htmlwidgets:
- Easy to use: you call an R function and get an interactive chart. For example, leaflet(data) %>% addTiles() %>% addCircles(...) creates an interactive map. No JavaScript coding is needed on your part; the R package converts your data and options to JavaScript under the hood.
- Self-contained: when you knit an R Markdown document to HTML or host a Shiny app, the widget’s required JS/CSS is packaged in, so it just works. Even when sharing a standalone HTML file, the widgets are embedded (or loaded from a CDN).
- Brings the best of JS libraries to R: Want a fancy JavaScript chart? Likely someone made an R htmlwidget for it if it’s popular.
A few widgets deserve a closer look:
- googleway lets an R user drive the Google Maps API directly, producing maps with legends and custom layers.
- highcharter is an R interface to Highcharts, a widely used commercial JS chart library (free for personal use) capable of very polished interactive charts, usable without writing any JS.
- D3-based wrappers such as networkD3 cover specific chart families built on D3.
One nice aspect: many htmlwidgets offer pipeable interfaces with syntax similar to their JS counterparts. For example, leaflet uses the pipe to add layers much as you would chain method calls in JS.
Also, htmlwidgets can be combined: with crosstalk, multiple widgets can share a common data filter. For example, a leaflet map and a DT table can be linked so that selecting a row in the table highlights a point on the map, all in R, with no JS glue; crosstalk does the wiring behind the scenes.
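A hedged sketch of that linked-widget pattern with crosstalk, DT, and plotly (mtcars stands in for real data):

```r
# SharedData wraps a data frame; widgets built on it share selections,
# so brushing points in the plot highlights rows in the table.
library(crosstalk)
library(DT)
library(plotly)

shared <- SharedData$new(mtcars)

linked <- bscols(
  datatable(shared, width = "100%"),
  plot_ly(shared, x = ~wt, y = ~mpg, type = "scatter", mode = "markers")
)
```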
Essentially, htmlwidgets give R users the power of the web’s interactive visualization world, within R’s comfort.
It is worth exploring the htmlwidgets gallery, which showcases many options. There is a widget for almost everything, including specialized ones such as googleway for maps, threejs for 3D, and DiagrammeR for diagrams.
One of the cited references shows a Google map with a legend, illustrating how external services like Google Maps can be incorporated into R visuals.
Takeaway: for interactive needs, check if an htmlwidget exists before trying to build from scratch. Chances are high it does if the library is famous.
One might ask: how do htmlwidgets differ from r2d3? In a sense, r2d3 is like a way to create a custom htmlwidget (with D3 specifically) by writing JS yourself. Htmlwidgets packages wrap existing JS libs with R functions for you. So if what you need aligns with an available library, just use the htmlwidget package. If you need something truly custom, use r2d3 or develop a new widget package.
Finally, mention that htmlwidgets work seamlessly in Shiny too. In Shiny apps, you can output any htmlwidget just like a plot, but it’s interactive.
The combination of ggplot2, r2d3, and htmlwidgets covers a huge range of visualization possibilities:
- Static publication-quality charts (ggplot2).
- Custom web interactive charts (r2d3 or writing new widgets).
- Ready-to-use interactive charts (htmlwidgets).
So an R user is empowered to handle almost any viz scenario. The chapter likely emphasizes that you’re not limited to static plots in R; you can go fully interactive and dynamic if needed.
We have only scratched the surface: with these tools you can create advanced visuals, from animated maps to interactive dashboards, all within R.
4.12 GYHD (Continued Practice)
The hands-on practice earlier in the chapter (steps 1–5) was partly about data transformation. Step 5 continues from where the previous chapter’s practice left off (after creating gdp2) and tidies the data, turning gdp2 (wide) into gdp3 (long), so that it can then be visualized.
After pivot_longer, gdp3 has three columns: country, year, and gdp. Where gdp2 had one column per year, gdp3 has one row per country–year combination.
Although this step is data wrangling rather than plotting, it belongs in the visualization chapter for a reason: to visualize data effectively with ggplot2, you often need to reshape it into long format first.
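A hedged reconstruction of that Step 5 reshape, with a stand-in gdp2 (the real object’s year columns may differ):

```r
# Wide table: one column per year.
library(tidyr)

gdp2 <- data.frame(country = c("A", "B"),
                   `2020`  = c(100, 200),
                   `2021`  = c(110, 210),
                   check.names = FALSE)

# Long (tidy) table: one row per country-year combination.
gdp3 <- pivot_longer(gdp2,
                     cols      = -country,
                     names_to  = "year",
                     values_to = "gdp")
# gdp3 now has columns country, year, gdp and 4 rows
```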
In conclusion, the chapter provided:
- Principles of data viz (integrity, right chart, simplicity, color, story).
- Goals of viz (explore, monitor, explain).
- Practical ggplot2 examples (bar, line, bubble, maps).
- Integration with GUI (esquisse) and other software (shinytableau).
- Advanced possibilities (D3, htmlwidgets).
- Pointers to resources and further practice.
Finally, the next section compiles the references that informed these guidelines and tools.
5 References
Agnese Jaunosane (2024). Data Visualization Principles With Good & Bad Examples – Ajelix Blog. (Introduces core principles such as clarity/simplicity, accuracy/integrity, choosing the right chart type, effective use of color, and avoiding misleading visuals, with illustrative examples)
Sahin Ahmed (2023). Essential Principles for Effective Data Visualization – The Deep Hub on Medium. (Outlines key principles including clarity and simplicity, choosing appropriate chart types, strategic color usage, and storytelling with data, with practical guidelines for each)
Data Visualization – Everything You Need to Know – Actian Blog (2025). (Describes the purpose of data visualization and highlights the three main goals: to explore, monitor, and explain data insights, explaining why each is important for decision-making)
Data Visualization with ggplot2 – Data Carpentry R Ecology Lesson (2022). (Tutorial on using ggplot2 for creating plots; includes notes on customizing themes and a reminder that the complete list of ggplot2 themes is available in the official documentation for reference)
Dario Radečić (2023). R Esquisse: How to Explore Data in R Through a Tableau-like Drag-and-Drop Interface – Appsilon Blog. (Introduces the esquisse package, a GUI add-in for RStudio that allows users to create ggplot2 charts by dragging and dropping variables, much like Tableau, thus lowering the learning curve for R visualizations)
Joe Cheng et al. (2022). Introduction to shinytableau – Posit (RStudio) shinytableau Documentation. (Explains how the shinytableau package enables R/Shiny developers to create Tableau dashboard extensions. It bridges Tableau’s extension API with Shiny, allowing Tableau users to incorporate R-powered visuals and analyses in dashboards without needing to write JavaScript)
r2d3: R Interface to D3 Visualizations – RStudio Package Documentation. (Details the r2d3 package, which provides tools to integrate D3.js visualizations into R workflows, including translating R data to JavaScript and rendering D3 scripts in R contexts. Empowers creation of highly custom interactive graphics using D3 within R)
ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics – tidyverse/ggplot2 Documentation. (The official ggplot2 documentation and user guide. Includes references to its extensive ecosystem, noting that a community-maintained extensions gallery showcases over 100 ggplot2 extension packages for specialized charts and themes)
GeeksforGeeks (2025). Mastering Tufte’s Data Visualization Principles. (Summarizes Edward Tufte’s principles such as graphical integrity and maximizing data-ink ratio. Emphasizes truthful representation of data – e.g., avoiding non-zero baselines that exaggerate differences – aligning with our discussion on integrity)
Polymer Search (2023). 10 Good and Bad Examples of Data Visualization – Polymer Blog. (Provides examples of common pitfalls like truncated axes and misleading scales, versus improved versions. Reinforces why proper scaling and context are vital for truthful visuals. For instance, demonstrates how a tiny increase can appear huge with a truncated y-axis, underscoring our points on graphical integrity)