---
title: "Integration of ulrb in a simple microbial ecology workflow"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Integration of ulrb in a simple microbial ecology workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


# Ecological analysis of microbial rare biosphere defined by ulrb

In this tutorial we will show how the output from `define_rb()` can be readily used for some common ecological analysis. In this case, we will just look at alpha diversity. 

To do so, we have several steps:

a) Load and clean OTU table (we just want prokaryotes);
b) Rarefy samples (we show this option, because it is very common);
c) Classify OTUs into rare, undetermined or abundant (with `define_rb()` function);
d) Merge OTU table with metadata information;
e) Calculate and plot alpha diversity metrics against environmental variables.

## Quick overview of N-ICE dataset

The N-ICE dataset consists of 9 samples, collected during the winter to spring transition of 2015 by fixing the vessel on drifting ice (Granskog et al., 2018). Samples were collected at various depths (from 5m to 250m) and the ice drifted across different regions, DNA was collected for 16S amplicon sequencing and metagenomes (de Sousa et al., 2019); and bioinformatic processing of reads was performed by
Mgnify platform (v5) (Mitchel et al., 2020).

For this tutorial we are using the OTU table downloaded from the link https://www.ebi.ac.uk/metagenomics/studies/MGYS00001922#analysis in 06-01-2023 and 
focus solely on the 16S amplicon data.


```{r setup}
library(ulrb)
library(dplyr)
library(tidyr)
library(vegan)
library(ggplot2)
library(purrr)
#
set.seed(123)
```


## (a) Load and clean OTU table

```{r}
# Load raw OTU table from N-ICE
data("nice_raw", package = "ulrb")
data("nice_env", package = "ulrb")

# Change name of first column
nice_clean <- rename(nice_raw, Taxonomy = "X.SampleID")

# Select 16S rRNA amplicon sequencing samples
selected_samples <- c("ERR2044662", "ERR2044663", "ERR2044664",
                      "ERR2044665", "ERR2044666", "ERR2044667",
                      "ERR2044668", "ERR2044669", "ERR2044670")

# Add a column with phylogenetic units ID (OTU in this case)
nice_clean <- mutate(nice_clean, OTU = paste0("OTU_", row_number()))

# Select relevant collumns
nice_clean <- select(nice_clean, selected_samples, OTU, Taxonomy)

# Separate Taxonomy column into each taxonomic level
nice_clean <- separate(nice_clean, Taxonomy,
                       c("Domain","Kingdom","Phylum",
                         "Class","Order","Family","Genus","Species"),
                       sep=";")

# Remove Kingdom column, because it is not used for prokaryotes
nice_clean <- select(nice_clean, -Kingdom)

# Remove eukaryotes
nice_clean <- filter(nice_clean, Domain != "sk__Eukaryota")

# Remove unclassified OTUs at phylum level
nice_clean <- filter(nice_clean, !is.na(Phylum))

# Simplify name
nice <- nice_clean

# Check if everything looks normal
head(nice)

# Change table to tidy format
# You can automatically do this with an ulrb function
nice_tidy <- prepare_tidy_data(nice, ## data to tidy 
                  sample_names = contains("ERR"), ## vector with ID samples
                  samples_in = "cols") ## samples can be in columns (cols) or rows. 

```

## (b) Rarefy samples

Now that we have a tidy table, it will be important to keep in mind how the data is organized. In this tutorial, we will be using functions like `group_by()` or `nest()`, if you are not familiarized with them, try to search for tidyverse packages, functions and mindset. We are also going to use the package vegan, which can be used for most ecology statistics. Note, however, that vegan works with matrices in very specific formats, so hopefully this sections will also help you to see how you can use tidy verse approaches to very specific data types.

The first step will be to verify how many reads each sample got

```{r, fig.width = 6, fig.height = 4.5}
## First, check how many reads each sample got
nice_tidy %>% 
  group_by(Sample) %>% ## because data is in tidy format
  summarise(TotalReads = sum(Abundance)) %>% 
  ggplot(aes(Sample, TotalReads)) + 
  geom_hline(yintercept = 40000) + 
  geom_col() + 
  theme_bw() + 
  coord_flip()
```

All samples have more than about 50 000 reads, with one sample having a little bit less. But we think we can keep all samples and rarefy down to 40 000 reads.

```{r}
nice_rarefid <- 
  nice_tidy %>% 
  group_by(Sample) %>%
  nest() %>% 
  mutate(Rarefied_reads = map(.x = data, 
                              ~as.data.frame(
                                t(
                                  rrarefy(
                                    .x$Abundance, 
                                    sample = 40000))))) %>% 
  unnest(c(data,Rarefied_reads))%>% 
  rename(Rarefied_abundance = "V1")  
```

## (b) Classify OTUs into rare, undetermined or abundant (with `define_rb()` function);

At this stage, we can simply apply the `define_rb()` function, which will include a nesting and grouping step inside of itself. For this reason, the data does not need to be grouped before the function. During unit testing, however, we noticed that this could introduce problems, so the function implicitly removes any grouping of the tidy table used by the author. This way, only the pre-specific grouping of the function is used. To have control over this, you just need to be careful with the sample column, because ulrb method assumes that the calculations are made "by sample" -- so, it will do an `ungroup()` followed by `group_by(Sample)`.

In practice, this is very simple:

```{r}
nice_classified <- define_rb(nice_tidy, simplified = TRUE)
head(nice_classified)
```

By default, the function can give lots of details, but in this case, since we are just going to look at the ecology component, and not at the machine learning aspects of this, we can set the argument *simplified* to `TRUE`. Note that the additional computational time required for the additional details should be negligible for small datasets (we don't know how it behaves for bigger datasets).

## (c) Merge OTU table with metadata information

We are almost ready to analyze ecological questions, but first we need metadata. All we have to do is merge our classified OTU table with the metadata table. To do this we will use join functions from dplyr, just note that the sample ID from the two tables that we want to merge are different, but this is easy to solve.

```{r}
nice_eco <- nice_classified %>% left_join(nice_env, by = c("Sample" = "ENA_ID"))
```


## (e) Calculate and plot diversity metrics against environmental variables.

At this point, we can calculate some common alpha diversity metrics. If we use the dplyr and ggplot functions, then it becomes really easy to make a simple overview of prokaryotic diversity, by classification.

```{r}
# Calculate diversity
nice_diversity <- nice_eco %>% 
  group_by(Sample, Classification, Month, Depth, Water.mass) %>% 
  summarise(SpeciesRichness = specnumber(Abundance),
            ShannonIndex = diversity(Abundance, index = "shannon"),
            SimpsonIndex = diversity(Abundance, index = "simpson")) %>% 
  ungroup()
```

```{r}
## Re-organize table to have all diversity indices in a single col.
nice_diversity_tidy <- nice_diversity %>% 
  pivot_longer(cols = c("SpeciesRichness", "ShannonIndex", "SimpsonIndex"),
               names_to = "Index",
               values_to = "Diversity") %>% 
# correct order
  mutate(Month = factor(Month, levels = c("March", "April", "June"))) %>% 
# edit diversity index names
  mutate(Index = case_when(Index == "ShannonIndex" ~ "Shannon Index",
                           Index == "SimpsonIndex" ~ "Simpson Index",
                           TRUE ~ "Number of OTUs")) %>%
  # order diversity index names
  mutate(Index = factor(Index, c("Number of OTUs", "Shannon Index","Simpson Index")))
```

### Alpha diversity plots

At this stage, we have everything in a single tidy data.frame, which makes ggplot plots straightforward.

- Month variation

```{r, fig.width = 6, fig.height = 4.5}
qualitative_colors <- 
  c("#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

## Plots
nice_diversity_tidy %>%
  ggplot(aes(Month, Diversity, col = Classification)) + 
  geom_point() + 
  facet_wrap(~Index, scales = "free_y")+
  scale_color_manual(values = qualitative_colors[c(5,3,6)]) + 
  theme_bw() + 
  theme(strip.background = element_blank(),
        strip.text = element_text(size = 12),
        legend.position = "top") + 
  labs(col = "Classification: ",
       title = "Alpha diversity across month")
```

- Water mass variation

```{r, fig.width = 6, fig.height = 4.5}
nice_diversity_tidy %>% 
  ggplot(aes(Water.mass, Diversity, col = Classification)) + 
  geom_point() + 
  facet_wrap(~Index, scales = "free_y")+
  scale_color_manual(values = qualitative_colors[c(5,3,6)]) + 
  theme_bw() + 
  theme(strip.background = element_blank(),
        strip.text = element_text(size = 12),
        legend.position = "top") %>%
  labs(col = "Classification: ",
       title = "Alpha diversity across water mass")
```

### Beta diversity

We can also do some basic beta diversity. Note that it might require a little bit more of data wrangling.

In this section, we will only focus on the rare biosphere.

- Months

```{r}
rare_biosphere <- 
  nice_eco %>%  
  filter(Classification == "Rare") %>% 
  select(Sample, Abundance, OTU, Depth, Month, Water.mass) %>% 
  mutate(Sample_unique = paste(Sample, Month))

rb_env <- rare_biosphere %>% 
  ungroup() %>% 
  select(Sample_unique, Depth, Water.mass, Month) %>% 
  distinct()

rb_sp_prep <- rare_biosphere %>% 
  ungroup() %>% 
  select(Sample_unique, Abundance, OTU)

rb_sp_prep %>% head() # sanity check

rb_sp <- 
  rb_sp_prep %>% 
  tidyr::pivot_wider(names_from = "OTU",
                     values_from = "Abundance",
                     values_fn = list(count= list)) %>% 
  print() %>% 
  unchop(everything())
```

```{r}
rb_sp[is.na(rb_sp)] <- 0

# Prepare aesthetics
rb_env <- rb_env %>% 
  mutate(col_month = case_when(Month == "March" ~ qualitative_colors[1],
                               Month == "April" ~ qualitative_colors[2],
                               TRUE ~ qualitative_colors[3])) %>% 
  mutate()
```

```{r, fig.height = 8, fig.width= 8}
cca_plot_rare_biosphere <- 
  cca(rb_sp[,-1], display = "sites", scale = TRUE) %>% 
  plot(display = "sites", type = "p", main = "Rare biosphere")
#
cca_plot_rare_biosphere
points(cca_plot_rare_biosphere,
       bg = rb_env$col_month,
       pch = 21, 
       col = "grey", 
       cex = 2)
#
with(rb_env,
     ordispider(cca_plot_rare_biosphere,
                Month, lty = "dashed", label = TRUE))
```

From this analysis, it seems that the community composition of the rare biosphere is different between June and the remaining months.

# Final considerations

Hopefully, you got a sense of how the ulrb package can be integrated in a more general microbial ecology workflow. Of course, many more questions could be approached, such as taxonomy and so on. This was meant to serve as an example of how ulrb can be trivially integrated in a data analysis workflow. 

# References

- de Sousa, A. G. G., Tomasino, M. P., Duarte, P., Fernández-Méndez, M., Assmy, P., Ribeiro, H., Surkont, J., Leite, R. B., Pereira-Leal, J. B., Torgo, L., & Magalhães, C. (2019). Diversity and Composition of Pelagic Prokaryotic and Protist Communities in a Thin Arctic Sea-Ice Regime. Microbial Ecology, 78(2), 388–408.

- Mitchell, A. L., Almeida, A., Beracochea, M., Boland, M., Burgin, J., Cochrane, G., Crusoe, M. R., Kale, V., Potter, S. C., Richardson, L. J., Sakharova, E., Scheremetjew, M., Korobeynikov, A., Shlemov, A., Kunyavskaya, O., Lapidus, A., & Finn, R. D. (2019). MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research, 48(D1), D570–D578.

- Granskog, M. A., Fer, I., Rinke, A., & Steen, H. (2018). Atmosphere-Ice-Ocean-Ecosystem Processes in a Thinner Arctic Sea Ice Regime: The Norwegian Young Sea ICE (N-ICE2015) Expedition. Journal of Geophysical Research: Oceans, 123(3), 1586–1594.