Data Quality Assessment: Biological Databases

RefSeq Processing Pipelines.
Image source: RefSeq Handbook.

Data Acquisition - RefSeq

RefSeq is a curated database of genomic, transcript and protein sequence records (Pruitt et al. 2012b). For additional information see http://www.ncbi.nlm.nih.gov/refseq/about/. Data are downloaded automatically from collaborating databases/groups to provide an initial definition of the gene and sequence associations.

Curation includes:

Automatic validation and quality assurance (QA) evaluation checks for data conflicts and completeness.
If a sequence passes the QA check, a RefSeq record is created with a status of INFERRED, PREDICTED, or PROVISIONAL. Additional annotation may be added, e.g., publications, names, symbols, and aliases.
Manual data review by NCBI staff follows to determine the optimal sequence record and to fix sequence errors. A VALIDATED status is assigned after manual curation.
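
The curation steps above can be read as a small status workflow. The following Python sketch illustrates that reading only and is not NCBI's implementation: the `Record` class, the `accession` field, the functions `passes_qa`, `create_record`, and `manually_curate`, and the toy QA rule (non-empty sequence containing only A, C, G, T) are all hypothetical. Only the four status names come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    # Status names as listed in the text above.
    INFERRED = "INFERRED"
    PREDICTED = "PREDICTED"
    PROVISIONAL = "PROVISIONAL"
    VALIDATED = "VALIDATED"


@dataclass
class Record:
    accession: str  # hypothetical identifier field
    sequence: str
    status: Status
    annotations: dict = field(default_factory=dict)


def passes_qa(sequence: str) -> bool:
    """Toy stand-in for the automatic QA check (assumption, not the real rule):
    the sequence must be non-empty and contain only A, C, G, or T."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGT")


def create_record(accession, sequence, initial_status=Status.PROVISIONAL):
    """Create a record only if the sequence passes the automatic QA check."""
    if initial_status is Status.VALIDATED:
        raise ValueError("VALIDATED is assigned only after manual curation")
    if not passes_qa(sequence):
        return None  # no record is created for a failing sequence
    return Record(accession, sequence, initial_status)


def manually_curate(record, corrected_sequence=None):
    """Model manual review: optionally fix the sequence, then mark VALIDATED."""
    if corrected_sequence is not None:
        record.sequence = corrected_sequence
    record.status = Status.VALIDATED
    return record
```

In this sketch, a sequence that fails the QA check never becomes a record, and VALIDATED can only be reached through `manually_curate`, mirroring the order of the steps described above.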
