The SciDB DBMS: Open Source DBMS for Scientific Research
SciDB will not be a traditional database. It will not be optimized for online transaction processing (OLTP) and will support transactions only minimally. It will not need to provide strict atomicity, consistency, isolation, and durability (ACID) guarantees. It will not have a rigidly defined, difficult-to-modify schema. Instead, SciDB will be built around analytics. Storage will be write-once, read-many. Bulk loads, rather than single-row inserts, will be the primary input method. "Load-free" access to minimally structured data will be provided.
The standard relational model is often inefficient for the types of data used for complex analytics. Time series and spatial grids may be represented in relations, but only at a severe cost in both space and processing time. SciDB will be organized around multidimensional array storage, a generalization of relational tables that can provide orders of magnitude better performance.
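One source of the overhead described above can be sketched in a few lines. The Python example below is a toy illustration, not SciDB's storage format: the names relational_bytes, dense_array_bytes, and DenseGrid are hypothetical, and the 8-byte index and value widths are assumptions. It compares the payload of a grid stored as one (row, col, value) tuple per cell against a dense array whose coordinates are implied by position, and shows how a rectangular window can be read as contiguous slices.

```python
from array import array


def relational_bytes(rows, cols, index_bytes=8, value_bytes=8):
    """Payload bytes when every cell is stored as a (row, col, value) tuple."""
    return rows * cols * (2 * index_bytes + value_bytes)


def dense_array_bytes(rows, cols, value_bytes=8):
    """Payload bytes when coordinates are implied by position in the array."""
    return rows * cols * value_bytes


class DenseGrid:
    """Hypothetical row-major 2-D grid of doubles (illustration only)."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.cells = array("d", [0.0]) * (rows * cols)

    def _offset(self, row, col):
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("cell outside grid")
        return row * self.cols + col

    def set(self, row, col, value):
        self.cells[self._offset(row, col)] = value

    def get(self, row, col):
        return self.cells[self._offset(row, col)]

    def window_sum(self, row_start, row_stop, col_start, col_stop):
        """Sum a half-open rectangular window, one contiguous slice per row."""
        if not (0 <= row_start <= row_stop <= self.rows
                and 0 <= col_start <= col_stop <= self.cols):
            raise IndexError("window outside grid")
        total = 0.0
        for row in range(row_start, row_stop):
            base = row * self.cols
            total += sum(self.cells[base + col_start:base + col_stop])
        return total
```

Under these assumptions the tuple form carries three times the payload of the dense form, before counting any per-row overhead or the indexes a relational system would need to find neighboring cells. The sketch measures nothing about SciDB itself; it only shows why positional addressing saves space and keeps neighboring cells adjacent.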
To support the vast collections of data being obtained by new instruments or new simulations, SciDB will be scalable up to petabytes and beyond. This scale necessitates the use of more than one machine; SciDB will run on incrementally scalable clusters or clouds of industry-standard hardware. The system will also be scalable down to megabytes, so that researchers can use the same interface on a laptop as on a 10,000-node cloud. Computation must scale along with storage. Functions and procedures will execute in parallel, as close to the data being operated on as possible.
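The idea of running computation next to the data can be sketched as a partitioned aggregate. In the Python example below, each chunk stands in for the data held by one node, a worker reduces its chunk to a small partial result, and only those partials are combined. This is a single-machine analogy under assumed names (split_into_chunks, partial_stats, and parallel_mean are hypothetical), not a description of how SciDB distributes work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Partition a sequence into consecutive chunks (stand-ins for nodes)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_stats(chunk):
    """Reduce one chunk locally to a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def parallel_mean(values, chunk_size=4, workers=4):
    """Compute a mean by combining per-chunk partial results."""
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("mean of empty data")
    return total / count
```

The point of the design is that only one (sum, count) pair per chunk crosses the boundary between workers, so the amount of data moved grows with the number of chunks rather than with the number of cells.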
Operating on a large number of industry-standard nodes requires that reliability be engineered into the system from the very first release. SciDB will be designed to continue operating in the face of node failure, without even restarting a long-running operation in progress. Scalability also means that expensive human administrative costs cannot be allowed to grow even linearly with the size of the data. The system will accordingly be designed for automated operations with minimal administrative overhead.
Complex analytics will be simplified with SciDB because arrays and vectors are first-class objects with built-in optimized operations. Spatial operators and time-series analysis will be easy to express. Interfaces will be provided to common scientific tools such as R and, eventually, MATLAB and IDL, as well as to programming languages such as C++ and Python.
Many features important to science that have been developed in the research community but have not been incorporated into commercial databases will be standard with SciDB, including versioning, provenance tracking, and support for uncertain data with error bars. By building these features into the system, rather than patching them on with external tools, the accuracy and consistency of the data and the resulting analyses will be ensured.
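To make "uncertain data with error bars" concrete, the Python sketch below carries a standard uncertainty alongside each value and propagates it through addition and multiplication. It uses the usual first-order propagation rules and assumes the two operands have independent errors. The Uncertain class is hypothetical and is not SciDB's data type or interface.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Uncertain:
    """A measured value with a standard uncertainty (hypothetical type)."""
    value: float
    sigma: float

    def __add__(self, other):
        # Independent errors add in quadrature.
        return Uncertain(self.value + other.value,
                         math.hypot(self.sigma, other.sigma))

    def __mul__(self, other):
        # First-order propagation for a product of independent quantities.
        value = self.value * other.value
        sigma = math.hypot(other.value * self.sigma,
                           self.value * other.sigma)
        return Uncertain(value, sigma)
```

For example, Uncertain(10.0, 3.0) + Uncertain(20.0, 4.0) yields a value of 30.0 with an uncertainty of 5.0. Keeping the error bar attached to the value in this way means every derived result carries its own uncertainty, instead of relying on a separate tool to recompute it.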




