The Data Model

Organization: Represents a company or university, like "MetaSUB Consortium"
  • Has multiple projects
  • Researchers can join the organization
Project: Represents a sample group, a project, like "MetaSUB Paris"
  • Stores samples
  • Stores group-level results
Sample: Represents a single datum, e.g. one swab
  • Can belong to multiple groups
  • Stores sample-level results
  • Stores metadata
Result: Represents a result folder
  • Stores related files (e.g. reads 1 & 2) together
Field: Represents actual data, like sequencing reads
  • Tracks files in S3 (or similar)
  • Stores a version number
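
The hierarchy above can be sketched as a handful of nested record types. This is only an illustration of the shape of the model, not the GeoSeeq API: the class names, attribute names, the `build_example` helper, and the `s3://example-bucket/...` locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """Actual data, e.g. one file of sequencing reads."""
    name: str
    uri: str          # hypothetical location in S3 or similar storage
    version: int = 1  # each field stores a version number


@dataclass
class Result:
    """A result folder that keeps related files together."""
    name: str
    fields: list[Field] = field(default_factory=list)


@dataclass
class Sample:
    """A single datum, e.g. one swab."""
    name: str
    metadata: dict[str, str] = field(default_factory=dict)
    results: list[Result] = field(default_factory=list)  # sample-level results


@dataclass
class Project:
    """A sample group."""
    name: str
    samples: list[Sample] = field(default_factory=list)
    results: list[Result] = field(default_factory=list)  # group-level results


@dataclass
class Organization:
    """A company or university."""
    name: str
    projects: list[Project] = field(default_factory=list)
    members: list[str] = field(default_factory=list)  # researchers who joined


def build_example() -> Organization:
    """Assemble one organization, one project, and one sample with paired reads."""
    swab = Sample("swab-001", metadata={"surface": "handrail"})  # made-up metadata
    swab.results.append(Result("raw_reads", [
        Field("read_1", "s3://example-bucket/swab-001_R1.fastq.gz"),
        Field("read_2", "s3://example-bucket/swab-001_R2.fastq.gz"),
    ]))
    paris = Project("MetaSUB Paris", samples=[swab])
    return Organization("MetaSUB Consortium", projects=[paris])
```

Because a sample can belong to several groups, a real implementation would more likely reference samples by identifier than nest them inside a single project; the nesting here is only for readability.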

GeoSeeq employs a simple data model that can support a variety of use cases. The core of this data model is the Sample.

Figure: Full Pangea data model

GeoSeeq groups samples into projects. A project is quite literally just a group of samples (a Sample Group). A sample may belong to many different groups, which supports different analyses and sub-group analyses; the only restrictions relate to privacy. The one exception to this many-to-many membership is the Sample Library (often just called a Library in our documentation). Sample Libraries are also Sample Groups, but they have a special property: every sample must belong to exactly one Sample Library. This library is, in effect, the sample's home base.
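
The membership rules can be made concrete with a toy in-memory store. This is a minimal sketch under assumptions, not GeoSeeq's implementation: `SampleStore` and its methods are hypothetical names, and privacy restrictions are not modeled.

```python
class SampleStore:
    """Toy store enforcing 'exactly one library, any number of other groups'."""

    def __init__(self) -> None:
        self.library_of: dict[str, str] = {}   # sample name -> its single library
        self.libraries: set[str] = set()       # group names that are libraries
        self.groups: dict[str, set[str]] = {}  # group name -> member sample names

    def create_sample(self, sample: str, library: str) -> None:
        # A sample is created inside one library and can never get a second one.
        if sample in self.library_of:
            raise ValueError(f"{sample} already lives in {self.library_of[sample]}")
        self.library_of[sample] = library
        self.libraries.add(library)
        self.groups.setdefault(library, set()).add(sample)

    def add_to_group(self, sample: str, group: str) -> None:
        # Ordinary groups may overlap freely, but only existing samples can join.
        if sample not in self.library_of:
            raise KeyError(f"{sample} has no library yet")
        if group in self.libraries:
            raise ValueError("library membership is fixed when the sample is created")
        self.groups.setdefault(group, set()).add(sample)

    def groups_of(self, sample: str) -> set[str]:
        # Every group (library included) that currently contains the sample.
        return {name for name, members in self.groups.items() if sample in members}
```

With this sketch, a sample created in one library can be added to any number of ordinary groups, while a second `create_sample` call for the same sample raises `ValueError`.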

The real strength of GeoSeeq is its ability to connect data and analyses to samples. Samples contain Analysis-Results, which represent either raw data from the sample or results derived from analysis of that data. For example, the raw reads from paired-end DNA sequencing of a sample would be stored as an Analysis-Result with two Analysis-Result-Fields, one each for the forward and reverse reads. Each Field can point to a file stored in the cloud or, for results that require less storage, be stored directly in GeoSeeq.
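
As a rough illustration of the two storage options, the sketch below models a field that holds either a pointer to a remote file or a small value stored directly. The class names, the `uri` and `payload` attributes, the bucket paths, and the `read_pairs` value are assumptions made up for this example; they are not taken from the GeoSeeq API.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ResultField:
    name: str
    uri: Optional[str] = None      # pointer to a file in cloud storage
    payload: Optional[Any] = None  # small value stored directly

    def __post_init__(self) -> None:
        # A field is either remote or inline, never both and never neither.
        if (self.uri is None) == (self.payload is None):
            raise ValueError("set exactly one of uri or payload")


@dataclass
class AnalysisResult:
    module_name: str
    fields: dict[str, ResultField] = field(default_factory=dict)

    def add(self, result_field: ResultField) -> None:
        self.fields[result_field.name] = result_field


# Paired-end raw reads: one result, two fields that point at remote files.
raw_reads = AnalysisResult("raw_reads")
raw_reads.add(ResultField("read_1", uri="s3://example-bucket/swab-001_R1.fastq.gz"))
raw_reads.add(ResultField("read_2", uri="s3://example-bucket/swab-001_R2.fastq.gz"))

# A small derived result stored inline (toy value, for illustration only).
read_stats = AnalysisResult("read_stats")
read_stats.add(ResultField("summary", payload={"read_pairs": 1000}))
```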

Projects may also contain Analysis-Results, which appear on the group result tab. In this case, Analysis-Results represent anything that applies to all the samples at once. An example would be a pairwise distance matrix between all samples in a dataset.
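
To show why such a result belongs to the project rather than to any one sample, here is a small sketch that builds a pairwise distance matrix from per-sample profiles. Jaccard distance is chosen only for simplicity, and the sample names, taxon labels, and the `group_result` layout are invented toy values, not GeoSeeq data or its storage format.

```python
from itertools import combinations


def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |intersection| / |union|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(profiles: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Symmetric distance matrix keyed by (sample, sample) pairs."""
    names = sorted(profiles)
    matrix = {(name, name): 0.0 for name in names}
    for a, b in combinations(names, 2):
        distance = jaccard_distance(profiles[a], profiles[b])
        matrix[(a, b)] = distance
        matrix[(b, a)] = distance
    return matrix


# Toy per-sample profiles (sets of observed taxa).
profiles = {
    "swab-001": {"taxon_a", "taxon_b"},
    "swab-002": {"taxon_b", "taxon_c"},
    "swab-003": {"taxon_a", "taxon_b"},
}

# One result for the whole group: it needs every sample at once to compute.
group_result = {
    "module_name": "pairwise_distance",
    "matrix": pairwise_distances(profiles),
}
```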

Analysis-Results may contain multiple replicates of the same type, and each Analysis-Result may contain a list of the other Analysis-Results it was derived from. This helps to ensure the provenance of each result and supports reproducible research.
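
A minimal sketch of how replicates and a derived-from list could support provenance tracing follows. The `AnalysisResult` class, its `replicate` and `derived_from` attributes, the `lineage` helper, and the module names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class AnalysisResult:
    module_name: str              # the type of result
    replicate: str = "default"    # distinguishes replicates of the same type
    derived_from: list["AnalysisResult"] = field(default_factory=list)

    @property
    def key(self) -> tuple[str, str]:
        return (self.module_name, self.replicate)


def lineage(result: AnalysisResult) -> list[tuple[str, str]]:
    """Return the keys of every upstream result, nearest ancestors first."""
    seen: list[tuple[str, str]] = []
    queue = list(result.derived_from)
    while queue:
        parent = queue.pop(0)
        if parent.key not in seen:
            seen.append(parent.key)
            queue.extend(parent.derived_from)
    return seen


raw = AnalysisResult("raw_reads")
# Two replicates of the same result type, each derived from the raw reads.
trimmed_strict = AnalysisResult("trimmed_reads", replicate="strict", derived_from=[raw])
trimmed_lenient = AnalysisResult("trimmed_reads", replicate="lenient", derived_from=[raw])
taxa = AnalysisResult("taxonomic_profile", derived_from=[trimmed_strict])
```

Calling `lineage(taxa)` walks back through the strict trimming replicate to the raw reads, which is the kind of trail that lets a result be reproduced from its inputs.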