Flow cytometry is a well-known technique for identifying cell populations contained in a biological smaple. It is largely applied in biomedical and medical sciences for cell sorting, counting, biomarker detections and protein engineering. The technique also provides an energy efficient alternative to microscopy that has long been the standard technique for cell population identification. Cyanobacteria are bacteria phylum believe to contribute more than 50% of atmospheric oxygen via oxygen and are found almost everywhere. These bacteria are also one of the known oldest life forms known to obtain their energy via photosynthesis.
We load the package and necessary dependencies below. We also load tidyverse for some data cleaning steps that we need to carry out.
library(dplyr)
library(magrittr)
library(tidyr)
library(purrr)
library(flowCore)
library(flowDensity)
library(cyanoFilter)
To illustrate the funtions contained in this package, we use two datafiles contained by default in the package. These are just demonstration dataset, hence are not documented in the helpfiles.
metadata <- system.file("extdata", "2019-03-25_Rstarted.csv",
package = "cyanoFilter",
mustWork = TRUE)
metafile <- read.csv(metadata, skip = 7, stringsAsFactors = FALSE,
check.names = TRUE)
#columns containing dilution, $\mu l$ and id information
metafile <- metafile %>%
dplyr::select(Sample.Number,
Sample.ID,
Number.of.Events,
Dilution.Factor,
Original.Volume,
Cells.L)
Each row in the csv file corresponds to a measurement from two types of cyanobacteria cells carried out at one of three dilution levels. The columns contain information about the dilution level, the number of cells per micro-litre (\(cell/\mu l\)), number of particles measured and a unique identification code for each measurement. The Sample.ID column is structured in the format cyanobacteria_dilution. We extract the cyanobacteria part of this column into a new column and also rename the \(cell/\mu l\) column with the following code:
#extract the part of the Sample.ID that corresponds to BS4 or BS5
metafile <- metafile %>% dplyr::mutate(Sample.ID2 =
stringr::str_extract(metafile$Sample.ID, "BS*[4-5]")
)
#clean up the Cells.muL column
names(metafile)[which(stringr::str_detect(names(metafile), "Cells."))] <- "CellspML"
To determine the appropriate data file to read from a FCM datafile, the desired minimum, maximum and column containing the \(cell\mu l\) values are supplied to the goodfcs() function. The code below demonstrates the use of this function for a situation where the desired minimum and maximum for \(cell/\mu l\) is 50 and 1000 respectively.
metafile <- metafile %>% mutate(Status = cyanoFilter::goodFcs(metafile = metafile,
col_cpml = "CellspML",
mxd_cellpML = 1000,
mnd_cellpML = 50)
)
knitr::kable(metafile)
Sample.Number | Sample.ID | Number.of.Events | Dilution.Factor | Original.Volume | CellspML | Sample.ID2 | Status |
---|---|---|---|---|---|---|---|
1 | BS4_20000 | 6918 | 20000 | 10 | 62.02270 | BS4 | good |
2 | BS4_10000 | 6591 | 10000 | 10 | 116.76311 | BS4 | good |
3 | BS4_2000 | 6508 | 2000 | 10 | 517.90008 | BS4 | good |
4 | BS5_20000 | 5976 | 20000 | 10 | 48.31036 | BS5 | bad |
5 | BS5_10000 | 5844 | 10000 | 10 | 90.51666 | BS5 | good |
6 | BS5_2000 | 5829 | 2000 | 10 | 400.72498 | BS5 | good |
The function adds an extra column, Status, with entries good or bad to the metafile. Rows containing \(cell/\mu l\) values outside the desired minimum and maximum are labelled bad. Note that the Status column for the fourth row is labelled bad, because it has a \(cell/\mu l\) value outside the desired range.
Although any of the files labelled good can be read from the FCM file, the retain() function can help select either the file with the highest \(cell/\mu l\) or that with the smallest \(cell/\mu l\) value. To do this, one supplies the function with the status column, \(cell/\mu l\) column and the desired decision. The code below demonstrates this action for a case where we want to select the file with the maximum \(cell/\mu l\) from the good measurements for each unique sample ID.
broken <- metafile %>% group_by(Sample.ID2) %>% nest()
metafile$Retained <- unlist(map(broken$data, function(.x) {
retain(meta_files = .x, make_decision = "maxi",
Status = "Status",
CellspML = "CellspML")
})
)
knitr::kable(metafile)
Sample.Number | Sample.ID | Number.of.Events | Dilution.Factor | Original.Volume | CellspML | Sample.ID2 | Status | Retained |
---|---|---|---|---|---|---|---|---|
1 | BS4_20000 | 6918 | 20000 | 10 | 62.02270 | BS4 | good | No! |
2 | BS4_10000 | 6591 | 10000 | 10 | 116.76311 | BS4 | good | No! |
3 | BS4_2000 | 6508 | 2000 | 10 | 517.90008 | BS4 | good | Retain |
4 | BS5_20000 | 5976 | 20000 | 10 | 48.31036 | BS5 | bad | No! |
5 | BS5_10000 | 5844 | 10000 | 10 | 90.51666 | BS5 | good | No! |
6 | BS5_2000 | 5829 | 2000 | 10 | 400.72498 | BS5 | good | Retain |
This function adds another column, Retained, to the metafile. The third and sixth row in the metadata are with the highest \(cell/\mu l\) values, thus one can proceed to read the fourth and sixth file from the corresponding FCS file for BS4 and BS5 respectively. This implies that we are reading in only two FCS files rather than the six measured files.
To read B4_18_1.fcs file into R, we use the read.FCS() function from the flowCore package. The dataset option enables the specification of the precise file to be read. Since this datafile contains one file only, we set this option to 1. If this option is set to 2, it gives an error since text.fcs contains only one datafile.
flowfile_path <- system.file("extdata", "B4_18_1.fcs", package = "cyanoFilter",
mustWork = TRUE)
flowfile <- read.FCS(flowfile_path, alter.names = TRUE,
transformation = FALSE, emptyValue = FALSE,
dataset = 1)
flowfile
#> flowFrame object ' B4_18_1'
#> with 8729 cells and 11 observables:
#> name desc range minRange maxRange
#> $P1 FSC.HLin Forward Scatter (FSC.. 1e+05 0.00000 99999
#> $P2 SSC.HLin Side Scatter (SSC-HL.. 1e+05 -34.47928 99999
#> $P3 GRN.B.HLin Green-B Fluorescence.. 1e+05 -21.19454 99999
#> $P4 YEL.B.HLin Yellow-B Fluorescenc.. 1e+05 -10.32744 99999
#> $P5 RED.B.HLin Red-B Fluorescence (.. 1e+05 -5.34720 99999
#> $P6 NIR.B.HLin Near IR-B Fluorescen.. 1e+05 -4.30798 99999
#> $P7 RED.R.HLin Red-R Fluorescence (.. 1e+05 -25.49018 99999
#> $P8 NIR.R.HLin Near IR-R Fluorescen.. 1e+05 -16.02002 99999
#> $P9 SSC.ALin Side Scatter Area (S.. 1e+05 0.00000 99999
#> $P10 SSC.W Side Scatter Width (.. 1e+05 -111.00000 99999
#> $P11 TIME Time 1e+05 0.00000 99999
#> 368 keywords are stored in the 'description' slot
The R object flowfile contains measurements about 8729 cells across 10 channels since the time channel does not contain any information about the properties of the measured cells.
To examine the need for transformation, a visual representation of the information in the expression matrix is of great use. The ggpairsDens() function produces a panel plot of all measured channels. Each plot is also smoothed to show the cell density at every part of the plot.
We obtain Figure above by using the ggpairsDens()
function after removing all NA
values from the expression
matrix with the nona() function. There is a version of
the function, pairs_plot() that produces standard base
scatter plots also smoothed to indicate cell density.
flowfile_noneg <- noNeg(x = flowfile_nona)
flowfile_logtrans <- lnTrans(x = flowfile_noneg,
notToTransform = c("SSC.W", "TIME"))
ggpairsDens(flowfile_logtrans, notToPlot = "TIME")