% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/epiSeeker-package.R
\docType{data}
\name{gsminfo}
\alias{gsminfo}
\alias{ucsc_release}
\title{Information Datasets}
\format{
A data frame with `n` rows (GSM samples) and 14 columns.
}
\value{
A data frame.
}
\description{
UCSC genome versions, precalculated data, and GSM information.
}
\section{Provenance}{

The `gsminfo` dataset was constructed programmatically from public
resources in the NCBI GEO and UCSC Genome Browser databases.
The data generation pipeline is implemented in
`data-raw/` (see `prepareGSMInfo()` in the package source).

Briefly, GEO metadata were retrieved using the `GEOmetadb` SQLite
database and `GEOquery`. The latest GEOmetadb SQLite file was downloaded
via `getSQLiteFile()` or, if that route was unavailable, directly from
<http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz>.
Platform (GPL) records were queried to identify platforms associated with
high-throughput sequencing experiments. For each sequencing platform, the
corresponding GSM records were obtained using `Meta(getGEO())`.
Supplementary BED-like files for each GSM were collected using
`getGSMsuppFile()` and `batchGetGSMsuppFile()`.

Additional metadata fields (title, organism, extract protocol, characteristics,
data processing description, submission date, and supplementary file URLs)
were extracted from GSM SOFT files downloaded using `GEOquery`.
Genome assembly versions for each GSM were inferred with
`getGenomicVersion()`, which matches UCSC genome labels against either
the data processing description or the supplementary file names, using the
reference table provided in the internal dataset `ucsc_release`.
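
The label matching described above can be illustrated with a minimal sketch.
It is not the package's `getGenomicVersion()`: the helper name
`infer_genome_version()`, the short `labels` vector standing in for
`ucsc_release`, and the choice to check the processing description before
the file name are all assumptions made for illustration.

```r
# Hypothetical helper, NOT the package's getGenomicVersion().
# `labels` is a stand-in for the UCSC genome labels in `ucsc_release`;
# the real layout of that table is not documented here.
infer_genome_version <- function(data_processing, supplementary_file,
                                 labels = c("hg18", "hg19", "hg38",
                                            "mm8", "mm9", "mm10")) {
  find_label <- function(text) {
    if (is.na(text)) return(NA_character_)
    # Literal, case-insensitive substring search for each label.
    found <- vapply(labels,
                    function(l) grepl(l, tolower(text), fixed = TRUE),
                    logical(1))
    hits <- labels[found]
    # If several labels occur, the first one in `labels` order wins.
    if (length(hits) == 0) NA_character_ else hits[[1]]
  }
  # Assumed order: processing description first, file name as fallback.
  version <- find_label(data_processing)
  if (is.na(version)) version <- find_label(supplementary_file)
  version
}

infer_genome_version("Reads were aligned to mm9 with bowtie", NA)
infer_genome_version(NA, "sample1_peaks_hg19.bed.gz")
```

A sample whose description and file name mention no known label yields
`NA` in this sketch, which leaves the genome version unresolved rather
than guessed.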

PubMed IDs associated with each GEO series (GSE) were obtained from the
`gse` table in GEOmetadb. All GSM-level metadata were merged, cleaned,
and converted to ASCII using `iconv()` to remove non-ASCII characters.

Finally, newly processed GSM entries were appended to any preexisting
`gsminfo` object stored in the package, deduplicated, and saved as
`gsminfo.rda` with `compress="xz"`.
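
The append, deduplicate, and save step can be sketched as follows. The toy
data frames, the object name `gsminfo_demo`, the temporary output path, and
the use of the `gsm` accession as the deduplication key are assumptions for
illustration; only `compress = "xz"` is taken from the description above.

```r
# Toy stand-ins for the stored table and the newly processed entries.
existing <- data.frame(gsm = c("GSM0000001", "GSM0000002"),
                       genomeVersion = c("hg19", "mm9"),
                       stringsAsFactors = FALSE)
new_rows <- data.frame(gsm = c("GSM0000002", "GSM0000003"),
                       genomeVersion = c("mm9", "hg38"),
                       stringsAsFactors = FALSE)

# Append the new entries to the existing table.
combined <- rbind(existing, new_rows)

# Assumption: duplicates are identified by GSM accession, keeping the
# first occurrence (the previously stored row).
combined <- combined[!duplicated(combined$gsm), , drop = FALSE]

# Save with xz compression; written to a temporary directory here so the
# sketch does not touch any package data.
gsminfo_demo <- combined
out_file <- file.path(tempdir(), "gsminfo_demo.rda")
save(gsminfo_demo, file = out_file, compress = "xz")
```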

Thus, `gsminfo` represents a curated, reproducibly constructed metadata
table summarizing GEO high-throughput sequencing samples, including organism,
platform, experimental descriptions, processing information, genome versions,
supplementary BED file locations, and associated PubMed IDs.
}

\section{Data structure}{

A data frame with one row per GSM sample and the following columns:
\describe{
  \item{`series_id`}{GEO series accession (GSE).}
  \item{`gsm`}{GEO sample accession (GSM).}
  \item{`gpl`}{GEO platform accession (GPL).}
  \item{`organism`}{Organism name (e.g., *Mus musculus*).}
  \item{`title`}{Sample title as provided in GEO.}
  \item{`characteristics`}{Experiment-specific metadata such as cell type, treatment, or antibody.}
  \item{`source_name`}{Source material for sequencing, typically cell or tissue type.}
  \item{`extract_protocol`}{Detailed wet-lab protocol for chromatin extraction, immunoprecipitation, and library preparation as reported in GEO.}
  \item{`description`}{Antibody information or additional sample description.}
  \item{`data_processing`}{Bioinformatics processing description including aligner, genome build, peak calling method, and filtering steps.}
  \item{`submission_date`}{Date when the sample was submitted to GEO.}
  \item{`supplementary_file`}{URL to supplementary processed files (e.g., BED).}
  \item{`genomeVersion`}{Genome assembly used in the processed data (e.g., mm8, hg19).}
  \item{`pubmed_id`}{PMID of the reference publication associated with the dataset.}
}
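
As a usage illustration, the columns above lend themselves to simple
filtering with base R. The sketch below works on a toy data frame with
made-up accessions and file names that mimics four of the documented
columns; with the real dataset, `toy` would be replaced by `gsminfo`.

```r
# Toy rows mimicking a subset of the documented columns (values invented).
toy <- data.frame(
  gsm = c("GSM0000001", "GSM0000002", "GSM0000003"),
  organism = c("Homo sapiens", "Mus musculus", "Homo sapiens"),
  genomeVersion = c("hg19", "mm9", "hg38"),
  supplementary_file = c("sample1_peaks.bed.gz",
                         "sample2_peaks.bed.gz",
                         "sample3_peaks.bed.gz"),
  stringsAsFactors = FALSE
)

# Select human samples processed against hg19 and keep the accession
# together with the supplementary file location.
keep <- toy$organism == "Homo sapiens" & toy$genomeVersion == "hg19"
hg19_human <- toy[keep, c("gsm", "supplementary_file")]

# Count samples per genome assembly.
counts <- table(toy$genomeVersion)
```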
}

\keyword{datasets}
