Data Acquisition - RefSeq
RefSeq is a curated database of genomic, transcript, and protein sequence records (Pruitt et al. 2012b). For additional information see http://www.ncbi.nlm.nih.gov/refseq/about/. Data are downloaded automatically from collaborating databases and groups to provide an initial definition of the gene and its sequence associations.
Curation includes:
- Automatic validation and quality assurance (QA) evaluation checks for data conflicts and completeness.
- If a sequence passes the QA check, a RefSeq record is created with a status of INFERRED, PREDICTED, or PROVISIONAL. Additional annotation may be added, e.g., publications, names, symbols, and aliases.
- NCBI staff then review the data manually to determine the optimal sequence record and to fix sequence errors. A VALIDATED status is assigned after manual curation.
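
The curation steps above can be read as a small status workflow. The following is a minimal sketch of that reading in Python, not NCBI's actual pipeline: the class and function names (`Candidate`, `RefSeqRecord`, `passes_qa`, `create_record`, `manually_curate`) and the two QA flags are hypothetical, and only the four status names come from the text. The text does not say how one of the three initial statuses is chosen, so the sketch leaves that choice to the caller.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    # Status names taken from the curation description above.
    INFERRED = "INFERRED"
    PREDICTED = "PREDICTED"
    PROVISIONAL = "PROVISIONAL"
    VALIDATED = "VALIDATED"


# Statuses a record may receive right after passing automatic QA.
INITIAL_STATUSES = {Status.INFERRED, Status.PREDICTED, Status.PROVISIONAL}


@dataclass
class Candidate:
    """Hypothetical stand-in for data downloaded from a collaborating group."""
    sequence: str
    has_conflicts: bool = False  # assumed QA flag: conflicting source data
    is_complete: bool = True     # assumed QA flag: all required data present


@dataclass
class RefSeqRecord:
    sequence: str
    status: Status
    # Optional extra annotation, e.g. publications, names, symbols, aliases.
    annotations: dict = field(default_factory=dict)


def passes_qa(candidate: Candidate) -> bool:
    """Toy QA check: the data are complete and free of conflicts."""
    return candidate.is_complete and not candidate.has_conflicts


def create_record(candidate: Candidate, status: Status) -> Optional[RefSeqRecord]:
    """Create a record only if QA passes; the caller picks the initial status."""
    if status not in INITIAL_STATUSES:
        raise ValueError(f"{status.value} is not an initial status")
    if not passes_qa(candidate):
        return None
    return RefSeqRecord(sequence=candidate.sequence, status=status)


def manually_curate(record: RefSeqRecord,
                    corrected_sequence: Optional[str] = None) -> RefSeqRecord:
    """Model manual review: optionally fix the sequence, then mark VALIDATED."""
    if corrected_sequence is not None:
        record.sequence = corrected_sequence
    record.status = Status.VALIDATED
    return record


# Example: a candidate passes QA, gains an alias, and is later validated.
record = create_record(Candidate("ATGC"), Status.PROVISIONAL)
if record is not None:
    record.annotations["aliases"] = ["example-alias"]  # made-up annotation
    manually_curate(record, corrected_sequence="ATGCA")
```

Keeping VALIDATED out of `INITIAL_STATUSES` mirrors the text's ordering, in which that status is assigned only after manual curation.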