Last modified: 2024-10-24 20:57:46.745159
Compiled: 2024-11-05 11:08:47.270861

1 Utilisation and prospects of bioimage datasets

In recent years, there has been a growing need for data analysis using machine learning in bioimaging research. Machine learning is an inductive, data-driven approach: models for tasks such as image segmentation and classification are built directly from the image data. The publication and sharing of bioimage datasets [1], as well as knowledge creation by attaching metadata to bioimages [2,3], are therefore important issues. At present, there is no commonly used format for sharing bioimage datasets, and the data are scattered across various repositories. As a result, each image repository manages its data (the images themselves and metadata such as image format, instruments/microscopes and biosamples) in a different format.

When images are analysed and quantified, several image pre-processing steps are usually required, depending on the analysis environment. Before supervised learning can begin, however, the user must find a repository holding a bioimage dataset that contains the original images and their corresponding supervised labels. The image data must then be downloaded, loaded into the analysis environment and converted into a format suitable for the analytical package. These steps are time-consuming and precede the main analysis. In addition, most image repositories do not publish data in a format that is convenient to read and process in R (.Rdata, etc.), so the data are not easy for R users to work with.

To support supervised learning on bioimage data, BioImageDbs provides R list objects in which the original images and their corresponding supervised labels have been converted into 4D or 5D arrays. After the data are retrieved from ExperimentHub, they can be used for deep learning with Keras/TensorFlow [4] and other machine learning methods without further pre-processing.

Many image analysis packages are also available in R; however, image analysis lacks standardisation. Using common, open datasets is an essential step towards standardising and comparing analytical methods. Providing image arrays through ExperimentHub is also intended to support applications such as (1) comparing models on data shared among R users and (2) applying predictions to new datasets through transfer learning and fine-tuning based on these arrays.

2 Fetch Bioimage Datasets from ExperimentHub

The BioImageDbs package provides the metadata for the bioimage datasets in ExperimentHub. The datasets are preprocessed into array format and stored in ExperimentHub.

First we load/update the ExperimentHub resource.

library(ExperimentHub)
eh <- ExperimentHub()

Next we list all BioImageDbs entries from ExperimentHub.

query(eh, "BioImage")
## ExperimentHub with 73 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: Satoshi Kume <satoshi.kume.1984@gmail.com>, CELL TRACKING C...
## # $species: Mus musculus, Homo sapiens, Rattus norvegicus, Drosophila melano...
## # $rdataclass: List, magick-image
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH6874"]]' 
## 
##            title                                                            
##   EH6874 | EM_id0001_Brain_CA1_hippocampus_region_5dTensor.Rds              
##   EH6875 | EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif
##   EH6876 | EM_id0002_Drosophila_brain_region_5dTensor.Rds                   
##   EH6877 | EM_id0002_Drosophila_brain_region_5dTensor_train_dataset.gif     
##   EH6878 | LM_id0001_DIC_C2DH_HeLa_4dTensor.Rds                             
##   ...      ...                                                              
##   EH6942 | EM_id0009_MurineBMMC_All_512_4dTensor_dataset.gif                
##   EH6943 | EM_id0010_HumanBlast_All_512_4dTensor.Rds                        
##   EH6944 | EM_id0010_HumanBlast_All_512_4dTensor_dataset.gif                
##   EH6945 | EM_id0011_HumanJurkat_All_512_4dTensor.Rds                       
##   EH6946 | EM_id0011_HumanJurkat_All_512_4dTensor_dataset.gif

We can check the metadata of the ExperimentHub records, which are hosted in the Bioconductor S3 bucket, with mcols().

mcols(query(eh, "BioImage"))
## DataFrame with 73 rows and 15 columns
##                         title           dataprovider                species
##                   <character>            <character>            <character>
## EH6874 EM_id0001_Brain_CA1_.. https://www.epfl.ch/..           Mus musculus
## EH6875 EM_id0001_Brain_CA1_.. https://www.epfl.ch/..           Mus musculus
## EH6876 EM_id0002_Drosophila.. the ISBI 2012 Challe.. Drosophila melanogas..
## EH6877 EM_id0002_Drosophila.. the ISBI 2012 Challe.. Drosophila melanogas..
## EH6878 LM_id0001_DIC_C2DH_H.. CELL TRACKING CHALLE..           Homo sapiens
## ...                       ...                    ...                    ...
## EH6942 EM_id0009_MurineBMMC.. Pattern Recognition ..           Mus musculus
## EH6943 EM_id0010_HumanBlast.. Pattern Recognition ..           Homo sapiens
## EH6944 EM_id0010_HumanBlast.. Pattern Recognition ..           Homo sapiens
## EH6945 EM_id0011_HumanJurka.. Pattern Recognition ..           Homo sapiens
## EH6946 EM_id0011_HumanJurka.. Pattern Recognition ..           Homo sapiens
##        taxonomyid      genome            description coordinate_1_based
##         <integer> <character>            <character>          <integer>
## EH6874      10090          NA 5D arrays with the b..                  1
## EH6875      10090          NA A animation file (.g..                  1
## EH6876       7227          NA 5D arrays with the b..                  1
## EH6877       7227          NA A animation file (.g..                  1
## EH6878       9606          NA 4D arrays with the m..                  1
## ...           ...         ...                    ...                ...
## EH6942      10090          NA A animation file (.g..                  1
## EH6943       9606          NA 4D arrays with the m..                  1
## EH6944       9606          NA A animation file (.g..                  1
## EH6945       9606          NA 4D arrays with the m..                  1
## EH6946       9606          NA A animation file (.g..                  1
##                    maintainer rdatadateadded preparerclass
##                   <character>    <character>   <character>
## EH6874 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6875 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6876 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6877 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6878 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## ...                       ...            ...           ...
## EH6942 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6943 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6944 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6945 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
## EH6946 Satoshi Kume <satosh..     2021-05-18   BioImageDbs
##                                          tags   rdataclass
##                                        <AsIs>  <character>
## EH6874     3D images,bioimage,CellCulture,...         List
## EH6875     animation,bioimage,CellCulture,... magick-image
## EH6876      3D image,bioimage,CellCulture,...         List
## EH6877     animation,bioimage,CellCulture,... magick-image
## EH6878 bioimage,cell tracking,CellCulture,...         List
## ...                                       ...          ...
## EH6942     2D images,bioimage,CellCulture,... magick-image
## EH6943     2D images,bioimage,CellCulture,...         List
## EH6944     2D images,bioimage,CellCulture,... magick-image
## EH6945     2D images,bioimage,CellCulture,...         List
## EH6946     2D images,bioimage,CellCulture,... magick-image
##                     rdatapath              sourceurl  sourcetype
##                   <character>            <character> <character>
## EH6874 BioImageDbs/v01/EM_i.. https://github.com/k..         PNG
## EH6875 BioImageDbs/v01/EM_i.. https://github.com/k..         PNG
## EH6876 BioImageDbs/v01/EM_i.. https://github.com/k..         PNG
## EH6877 BioImageDbs/v01/EM_i.. https://github.com/k..         PNG
## EH6878 BioImageDbs/v01/LM_i.. https://github.com/k..         PNG
## ...                       ...                    ...         ...
## EH6942 BioImageDbs/v02/EM_i.. https://github.com/k..         PNG
## EH6943 BioImageDbs/v02/EM_i.. https://github.com/k..         PNG
## EH6944 BioImageDbs/v02/EM_i.. https://github.com/k..         PNG
## EH6945 BioImageDbs/v02/EM_i.. https://github.com/k..         PNG
## EH6946 BioImageDbs/v02/EM_i.. https://github.com/k..         PNG
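
In the output above, the `rdataclass` column separates the array records (`List`) from the gif previews (`magick-image`). As a rough sketch of how such metadata could be filtered, the base R example below works on a small hand-built data frame that copies the first five titles and classes shown above. It is a stand-in for the real metadata table and does not contact ExperimentHub; the variable names are illustrative assumptions.

```r
# Mock metadata table (assumption: a plain data.frame standing in for the
# DataFrame returned by mcols(); values copied from the listing above).
meta <- data.frame(
  title = c(
    "EM_id0001_Brain_CA1_hippocampus_region_5dTensor.Rds",
    "EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif",
    "EM_id0002_Drosophila_brain_region_5dTensor.Rds",
    "EM_id0002_Drosophila_brain_region_5dTensor_train_dataset.gif",
    "LM_id0001_DIC_C2DH_HeLa_4dTensor.Rds"
  ),
  rdataclass = c("List", "magick-image", "List", "magick-image", "List"),
  row.names = c("EH6874", "EH6875", "EH6876", "EH6877", "EH6878"),
  stringsAsFactors = FALSE
)

# Keep only the array records, dropping the gif previews
is_array  <- meta$rdataclass == "List"
array_ids <- rownames(meta)[is_array]

# Among the array records, keep those whose title marks a 5D tensor
is_5d  <- grepl("5dTensor", meta$title, fixed = TRUE)
ids_5d <- rownames(meta)[is_array & is_5d]

print(array_ids)
print(ids_5d)
```

The resulting identifiers can then be used to pick individual records from the hub object.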

We can restrict the query to the records of a single dataset, here LM_id0001, as follows.

qr <- query(eh, c("BioImageDbs", "LM_id0001"))
qr
## ExperimentHub with 5 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: CELL TRACKING CHALLENGE (http://celltrackingchallenge.net/2...
## # $species: Homo sapiens
## # $rdataclass: List, magick-image
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH6878"]]' 
## 
##            title                                                    
##   EH6878 | LM_id0001_DIC_C2DH_HeLa_4dTensor.Rds                     
##   EH6879 | LM_id0001_DIC_C2DH_HeLa_4dTensor_train_dataset.gif       
##   EH6880 | LM_id0001_DIC_C2DH_HeLa_4dTensor_Binary.Rds              
##   EH6881 | LM_id0001_DIC_C2DH_HeLa_4dTensor_Binary_train_dataset.gif
##   EH6882 | LM_id0001_DIC_C2DH_HeLa_5dTensor.Rds
# Import data: qr[[1]] downloads (and caches) the first record, EH6878
# BioImageDbs_image_Dat <- qr[[1]]
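
Once a record has been imported, it is worth checking the array dimensions before passing anything to a model. The following is a minimal sketch in base R. `summarise_dims` is a hypothetical helper (not part of BioImageDbs), and `mock_dat` with its element names is an invented stand-in for a downloaded object; the actual layout of a record should be checked with `str()`.

```r
# Hypothetical helper: report the dimensions of every array in a nested list.
summarise_dims <- function(x) {
  # rapply() visits each non-list leaf; how = "unlist" returns a named
  # character vector with names such as "Train.Original".
  rapply(x, function(a) paste(dim(a), collapse = " x "), how = "unlist")
}

# Invented stand-in for a downloaded list object (names are assumptions)
mock_dat <- list(
  Train = list(
    Original    = array(0,  dim = c(4, 32, 32, 1)),
    GroundTruth = array(0L, dim = c(4, 32, 32, 1))
  ),
  Test = list(
    Original = array(0, dim = c(2, 32, 32, 1))
  )
)

shapes <- summarise_dims(mock_dat)
print(shapes)
```

Applied to a real record, the same call would show at a glance whether images and labels share the same batch size and spatial dimensions.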

3 5D Arrays from the ExperimentHub

The array dimensions are ordered according to the channels_last format, the default in R/Keras. The input shape of a 5D array is (batch, spatial_dim1, spatial_dim2, spatial_dim3, channels). The batch size equals the number of 3D image sets. The number of channels is 1 for grey images and 3 for RGB images.
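
To make the axis order concrete, here is a small base R sketch on a randomly filled mock array rather than a real dataset. The sizes are arbitrary, and treating the third spatial axis as the slice axis is an assumption made for illustration only.

```r
# Mock 5D array: 2 image sets of 16 x 16 x 8 voxels with 1 grey channel
# Axis order: (batch, spatial_dim1, spatial_dim2, spatial_dim3, channels)
x5 <- array(runif(2 * 16 * 16 * 8 * 1), dim = c(2, 16, 16, 8, 1))

n_stacks   <- dim(x5)[1]  # batch = number of 3D image sets
n_channels <- dim(x5)[5]  # 1 for grey images, 3 for RGB images

# One 3D image set: fixing batch and channel drops those two axes
stack1 <- x5[1, , , , 1]

# One 2D slice of that set (assumes spatial_dim3 indexes the slices)
slice3 <- x5[1, , , 3, 1]

print(dim(stack1))
print(dim(slice3))
```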

4 4D Arrays from the ExperimentHub

The array dimensions are ordered according to the channels_last format, the default in R/Keras. The input shape of a 4D array is (batch, height, width, channels). The batch size equals the number of 2D images.
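
The sketch below illustrates this layout in base R on a mock array with arbitrary sizes. It shows how to pull out one image, how to keep the batch axis when a single image has to be passed on as a batch of one, and how the axes could be permuted for a tool that expects channels first. None of this is specific to BioImageDbs.

```r
# Mock 4D array: 3 images of 8 x 8 pixels with 1 channel
# Axis order: (batch, height, width, channels)
x4 <- array(seq_len(3 * 8 * 8 * 1), dim = c(3, 8, 8, 1))

# A single image as a plain 8 x 8 matrix (batch and channel axes dropped)
img2 <- x4[2, , , 1]

# The same image kept as a batch of one, so the array stays 4D
one_batch <- x4[2, , , , drop = FALSE]

# Permute to (batch, channels, height, width) for channels_first consumers
x4_cf <- aperm(x4, c(1, 4, 2, 3))

print(dim(img2))
print(dim(one_batch))
print(dim(x4_cf))
```

Keeping `drop = FALSE` matters because R otherwise silently removes axes of length 1, which changes the shape a model receives.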

5 Visualization of gif images from the ExperimentHub

As a test, we also provide gif files of some arrays for visualization. We can display these files with the magick::image_read function.

qr <- query(eh, c("BioImageDbs", ".gif"))
qr
## ExperimentHub with 32 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: Satoshi Kume <satoshi.kume.1984@gmail.com>, CELL TRACKING C...
## # $species: Mus musculus, Homo sapiens, Rattus norvegicus, Drosophila melano...
## # $rdataclass: magick-image
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH6875"]]' 
## 
##            title                                                            
##   EH6875 | EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif
##   EH6877 | EM_id0002_Drosophila_brain_region_5dTensor_train_dataset.gif     
##   EH6879 | LM_id0001_DIC_C2DH_HeLa_4dTensor_train_dataset.gif               
##   EH6881 | LM_id0001_DIC_C2DH_HeLa_4dTensor_Binary_train_dataset.gif        
##   EH6884 | LM_id0002_PhC_C2DH_U373_4dTensor_train_dataset.gif               
##   ...      ...                                                              
##   EH6935 | EM_id0008_Human_NB4_2D_All_Nuc_512_4dTensor_dataset.gif          
##   EH6937 | EM_id0008_Human_NB4_2D_All_Nuc_1024_4dTensor_dataset.gif         
##   EH6942 | EM_id0009_MurineBMMC_All_512_4dTensor_dataset.gif                
##   EH6944 | EM_id0010_HumanBlast_All_512_4dTensor_dataset.gif                
##   EH6946 | EM_id0011_HumanJurkat_All_512_4dTensor_dataset.gif
# EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif
qr[1]
## ExperimentHub with 1 record
## # snapshotDate(): 2024-10-24
## # names(): EH6875
## # package(): BioImageDbs
## # $dataprovider: https://www.epfl.ch/labs/cvlab/data/data-em/
## # $species: Mus musculus
## # $rdataclass: magick-image
## # $rdatadateadded: 2021-05-18
## # $title: EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif
## # $description: A animation file (.gif) of the train dataset of EM_id0001_Br...
## # $taxonomyid: 10090
## # $genome: NA
## # $sourcetype: PNG
## # $sourceurl: https://github.com/kumeS/BioImageDbs
## # $sourcesize: NA
## # $tags: c("animation", "bioimage", "CellCulture", "electron
## #   microscopy", "microscope", "scanning electron microscopy",
## #   "segmentation", "Tissue") 
## # retrieve record with 'object[["EH6875"]]'
# Display the gif image
# magick::image_read(qr[[1]])