Chapter 3 Visualization

Visualizing proteomic data helps identify patterns, outliers, and biological signals. This chapter covers the main visualization methods in NULISAseqR.

3.1 Why Visualize?

Data visualization helps you:

Identify sample clusters and outliers
Assess data quality visually
Detect batch effects
Explore biological patterns
Communicate findings effectively

3.2 Heatmaps

Heatmaps show protein expression patterns across samples and targets. Hierarchical clustering can help organize heatmaps to indicate which samples and targets are most similar to each other.

The generate_heatmap() function:

Standardizes data by protein (z-scores): centers and scales each protein’s values
Clusters similar samples and proteins together
Annotates samples with metadata (disease type, sex, age, etc.)
Uses ComplexHeatmap with automatic RColorBrewer color palettes

See complete function documentation and additional options, use ?generate_heatmap().

3.2.1 Basic Heatmap

heatmap1 <- generate_heatmap(
  data = data$merged$Data_NPQ,
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = sample_list,
  row_fontsize = 4
)

3.2.2 Heatmap with Annotations

Add sample annotations to highlight key variables:

heatmap2 <- generate_heatmap(
  data = data$merged$Data_NPQ,
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = sample_list, 
  annotate_sample_by = c("disease_type", "sex", "age"),
  row_fontsize = 4
)

3.2.3 Understanding the Heatmap

The heatmap shows:

Rows: Protein targets
Columns: Samples
Colors: Expression levels (red = high, blue = low)
Dendrograms: Hierarchical clustering
- Top: Sample relationships
- Left: Protein relationships
Annotation bars: Sample characteristics

3.2.4 Interpreting Clusters

Look for:

Sample clusters by biology: Samples with similar disease states clustering together
Protein clusters: Co-expressed protein groups
Outlier samples: Samples far from their expected group
Batch effects: Clustering by plate/batch rather than biology

3.3 Principal Component Analysis (PCA)

PCA reduces high-dimensional data to its main components of variation.

The generate_pca() function:

Standardizes data by protein (z-scores): centers and scales each protein’s values
Performs PCA using PCAtools package to identify major sources of variation
Creates biplot showing sample relationships in PC space
Annotates samples with metadata colors and shapes
Uses automatic RColorBrewer color palettes or custom colors

See complete function documentation and additional options, use ?generate_pca().

3.3.1 PCA with Multiple Visual Encodings

Color by one variable, shape by another:

pca2 <- generate_pca(
  data = data$merged$Data_NPQ,
  plot_title = "NULISAseq Detectability Study PCA",
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = sample_list, 
  annotate_sample_by = "disease_type",  # Color
  shape_by = "sex"                       # Shape
)

3.3.2 Understanding PCA Plots

Axes

PC1 (x-axis): Captures the most variation in the data
PC2 (y-axis): Captures the second most variation
% variance: Shows how much variation each PC explains

Interpretation

Tight clusters: Samples with similar expression profiles
Separation: Groups with distinct expression patterns
Outliers: Samples far from the main cluster
Batch effects: If clustering by technical variables (plate, batch), indicates unwanted variation

What to Look For

Good signs:

✓ Samples cluster by biological group
✓ PC1/PC2 explain substantial variance (>20% combined)
✓ Clear separation between conditions

Warning signs:

✗ Clustering by technical variables (batch, plate)
✗ Outliers far from their group
✗ No visible separation despite known biology

3.4 Custom Visualizations with `ggplot2`

For exploratory analysis beyond heatmaps and PCA, create custom plots with ggplot2.

# setting ggplot theme
# custom ggplot theme

custom_theme <- theme_bw() + theme(
  panel.background = element_rect(fill='white'),
  plot.background = element_rect(fill='transparent', color = NA),
  legend.background = element_rect(fill='transparent'),
  legend.key = element_rect(fill = "transparent", color = NA)
)

theme_set(custom_theme)

showtext::showtext_auto() ## able to output beta symbol and other special character in pdf

3.4.1 NPQ Distribution - Histogram

Check the overall distribution of NPQ values across all samples and proteins:

data_long %>%
  filter(SampleName %in% sample_list) %>%
  ggplot(aes(x = NPQ)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  labs(title = "Distribution of NPQ across all targets",
       x = "NPQ", y = "Count")

Check the distribution of NPQ values across all samples for each protein:

data_long %>%
  filter(SampleName %in% sample_list) %>%
  ggplot(aes(x = NPQ)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  facet_wrap(~ Target, scales = "free_y") +
  labs(title = "Distribution of NPQ Values by Target",
       x = "NPQ", y = "Count") +
  theme(
    plot.title = element_text(size = 18, face = "bold"),      
    plot.subtitle = element_text(size = 16),                   
    axis.title.x = element_text(size = 15),                    
    axis.title.y = element_text(size = 15),                    
    axis.text.x = element_text(size = 12, angle = 45, hjust = 1), 
    axis.text.y = element_text(size = 12),                     
    strip.text = element_text(size = 11)                    
  )

3.4.2 NPQ Distribution by Group - Density Plot

Compare NPQ distributions between groups to check for systematic differences:

data_long %>%
  filter(SampleName %in% sample_list) %>%
  ggplot(aes(x = NPQ, fill = disease_type)) +
  geom_density(alpha = 0.5) +
  labs(title = "NPQ Distribution by Disease Type",
       x = "NPQ", y = "Density",
       fill = "Disease Type")

3.4.3 Single Protein Across Groups

Examine how a single protein varies across conditions:

# Plot a specific protein across groups
protein_of_interest <- "IL6"

data_long %>%
  filter(Target == protein_of_interest, SampleName %in% sample_list) %>%
  ggplot(aes(x = disease_type, y = NPQ, fill = disease_type)) +
  geom_jitter(width = 0.2, alpha = 0.7, size = 1) +
  geom_boxplot(alpha = 0.3, outlier.shape = NA) +
  labs(title = paste(protein_of_interest, "Expression"),
       x = "Disease Type", y = "NPQ") +
  theme(legend.position = "none")

3.4.4 Multiple Proteins Across Groups

Compare several proteins simultaneously using facets:

# Select top 6 proteins by variance
protein_variance <- apply(data$merged$Data_NPQ[, sample_list], 1, var, na.rm = TRUE)
top_proteins <- names(sort(protein_variance, decreasing = TRUE)[1:6])

# Plot multiple proteins
data_long %>% 
  filter(Target %in% top_proteins, SampleName %in% sample_list) %>%
  ggplot(aes(x = disease_type, y = NPQ, fill = disease_type)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  facet_wrap(~Target, scales = "free_y", ncol = 3) +
  labs(title = "Top Variable Proteins Across Disease Groups",
       x = "Disease Type", y = "NPQ") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text = element_text(size = 16, face = "bold"))

3.4.5 Protein-Protein Correlation

# Select two proteins to compare
protein1 <- "IL6"
protein2 <- "TNF"

plot_data <- data_long %>%
  filter(Target %in% c(protein1, protein2), SampleName %in% sample_list) %>%
  select(SampleName, Target, NPQ, disease_type) %>%
  pivot_wider(names_from = Target, values_from = NPQ)

ggplot(plot_data, aes(x = .data[[protein1]], y = .data[[protein2]], 
                      color = disease_type)) +
  geom_point(size = 1, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = 1) +
  labs(title = paste(protein1, "vs", protein2),
       x = paste(protein1, "NPQ"),
       y = paste(protein2, "NPQ"),
       color = "Disease Type")

3.5 Saving Plots

Both generate_heatmap() and generate_pca() have built-in options to save plots directly.

Save Heatmap

# Automatically saves to file
generate_heatmap(
  data = data$merged$Data_NPQ,
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = sample_list,
  annotate_sample_by = c("disease_type", "sex", "age"),
  output_dir = "figures",              # Where to save
  plot_name = "expression_heatmap.pdf",# Filename (PDF, PNG, JPG, SVG)
  row_fontsize = 5,
  plot_width = 12,                     # Width in inches
  plot_height = 10                     # Height in inches
)

Supported formats: PDF, PNG, JPG, SVG (determined by file extension)

Save PCA

# Automatically saves to file
generate_pca(
  data = data$merged$Data_NPQ,
  sampleInfo = metadata,
  sampleName_var = "SampleName",
  sample_subset = sample_list,
  annotate_sample_by = "disease_type",
  output_dir = "figures",              # Where to save
  plot_name = "pca.png",               # Filename (PDF, PNG, JPG, SVG)
  plot_title = "NULISAseq Detectability Study PCA",
  plot_width = 8,                      # Width in inches
  plot_height = 6                      # Height in inches
)