Databases and Data Download

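Key point: get familiar with the databases commonly used in bioinformatics analysis and with how to download data from them.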

Reference genomes

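A reference genome mainly consists of a sequence file (FASTA, fa) and an annotation file (gtf).
This article focuses on human and mouse. Taking human as the example, the common assemblies are hg19 (GRCh37) and hg38 (GRCh38). There are three providers: Ensembl, UCSC and NCBI.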

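Note: keep the files matched when downloading, i.e. take the fa and the gtf from the same provider; otherwise you create a lot of trouble later. For example, UCSC names chromosomes chr1, chr2, ... while Ensembl uses 1, 2, ..., so a UCSC fa does not line up with an Ensembl gtf.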
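
A quick way to confirm that a fa and a gtf belong together is to compare their sequence names. The following is a minimal sketch, assuming both files are already uncompressed; `check_seqnames` is a hypothetical helper name, not part of any tool. It prints every sequence name that appears in the gtf but not in the fa, so empty output means the names agree.

```shell
# Sketch: list GTF sequence names that have no matching FASTA header.
# check_seqnames is a hypothetical helper, written for illustration only.
check_seqnames() {
    fa="$1"     # uncompressed FASTA file
    gtf="$2"    # uncompressed GTF file
    tmp_fa=$(mktemp)
    tmp_gtf=$(mktemp)
    # FASTA names: header lines without '>' and without the description
    grep '^>' "$fa" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp_fa"
    # GTF names: column 1 of every non-comment line
    grep -v '^#' "$gtf" | cut -f1 | sort -u > "$tmp_gtf"
    # comm -13 keeps the lines that occur only in the second file
    comm -13 "$tmp_fa" "$tmp_gtf"
    rm -f "$tmp_fa" "$tmp_gtf"
}
```

Usage would look like `check_seqnames genome.fa genes.gtf`, where both file names are placeholders.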

Ensembl / GENCODE

Ensembl

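Official site: https://asia.ensembl.org/index.html. It is better to find the current links from the home page yourself than to use the ones below, because the release number in them changes with every new release.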

Click human in the middle of the page. On the new page, under Gene annotation on the right, use "Download FASTA files for genes, cDNAs, ncRNA, proteins" and "Download GTF or GFF3 files for genes, cDNAs, ncRNA, proteins".

# fasta: ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/
$ wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Differences between the variants:
* 'dna' - unmasked genomic DNA sequences.
* 'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
* 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base.
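
To see which variant a downloaded file actually is, one option is to count lowercase bases (soft-masked) and N bases (hard-masked, or assembly gaps). This is only a sketch: `mask_stats` is a hypothetical helper name, it expects an uncompressed FASTA, and on a whole genome it will take a while.

```shell
# Sketch: count total, lowercase (soft-masked) and N bases in a FASTA file.
# mask_stats is a hypothetical helper, written for illustration only.
mask_stats() {
    grep -v '^>' "$1" | awk '
        {
            n_total += length($0)
            line = $0
            # gsub returns how many characters it replaced
            n_lower += gsub(/[acgt]/, "", line)
            n_N     += gsub(/[Nn]/, "", line)
        }
        END { printf "total=%d soft_masked=%d N=%d\n", n_total, n_lower, n_N }
    '
}
```

A 'dna' file should report soft_masked=0, a 'dna_sm' file a large soft_masked count, and a 'dna_rm' file a much larger N count than the other two.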


# gtf: ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/
$ wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.chr.gtf.gz
# In the case of human and mouse, the GTF files found here are equivalent to the GENCODE gene set.


# mouse 
$ wget ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr.gtf.gz # the .chr file has far fewer lines, probably because it holds only the chromosomes and leaves out unplaced scaffolds.
$ wget ftp://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
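
The genome archives above are large, and an interrupted transfer can leave a truncated .gz file behind. A minimal sketch of one way to guard against that, using `wget -c` to resume and `gzip -t` to test the archive, is shown here; `gz_ok` and `fetch_gz` are hypothetical helper names.

```shell
# Sketch: download with resume support, then verify the gzip archive.
# gz_ok and fetch_gz are hypothetical helpers, written for illustration only.

# Succeeds only if the file is a complete, readable gzip archive.
gz_ok() {
    gzip -t "$1" 2>/dev/null
}

# Downloads a URL (continuing a partial file if one exists) and tests it.
fetch_gz() {
    url="$1"
    file=$(basename "$url")
    wget -c "$url" && gz_ok "$file" && echo "OK: $file"
}
```

For example, `fetch_gz ftp://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz` prints an OK line only when the download finished and the archive passes the test.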

GENCODE

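Official site: https://www.gencodegenes.org/. It is better to find the current links from the home page yourself than to use the ones below, because the release number in them changes with every new release.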

# https://www.gencodegenes.org/human/ 
## Nucleotide sequences of all transcripts on the reference chromosomes:
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.transcripts.fa.gz

## This is the main annotation file for most users:
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz
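
Because the release number appears twice in each link, it can help to build the URL from a variable instead of editing it by hand. The sketch below assumes that other releases follow the same path pattern as the two release_36 links above, which is an inference from those links and should be checked on the GENCODE site; `gencode_url` is a hypothetical helper name.

```shell
# Sketch: build a GENCODE human download URL from a release number.
# gencode_url is a hypothetical helper; the path pattern is inferred from
# the two release_36 links and is an assumption for other releases.
gencode_url() {
    release="$1"    # e.g. 36
    kind="$2"       # e.g. annotation.gtf or transcripts.fa
    echo "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_${release}/gencode.v${release}.${kind}.gz"
}
```

It would be used as `wget "$(gencode_url 36 annotation.gtf)"`.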

UCSC

# ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/

# ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
$ wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

# ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
hg38.ensGene.gtf.gz	26.5 MB	2020/1/10 8:00:00 AM
hg38.knownGene.gtf.gz	32.4 MB	2020/1/10 8:00:00 AM
hg38.ncbiRefSeq.gtf.gz	34.2 MB	2020/1/10 8:00:00 AM
hg38.refGene.gtf.gz	22.5 MB	2020/1/10 8:00:00 AM
## What is the difference between these four GTF versions? //todo
$ wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.knownGene.gtf.gz
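
One simple way to start comparing the four files is to count the distinct gene_id and transcript_id values in each. This is a rough sketch that reads a gzipped GTF and looks only at the attribute column; it does not explain why the files differ, and `gtf_summary` is a hypothetical helper name.

```shell
# Sketch: count distinct gene_id and transcript_id values in a gzipped GTF.
# gtf_summary is a hypothetical helper, written for illustration only.
gtf_summary() {
    gzip -dc "$1" | grep -v '^#' | awk -F'\t' '
        {
            # column 9 holds the attributes, e.g. gene_id "X"; transcript_id "Y";
            if (match($9, /gene_id "[^"]+"/))
                genes[substr($9, RSTART, RLENGTH)] = 1
            if (match($9, /transcript_id "[^"]+"/))
                tx[substr($9, RSTART, RLENGTH)] = 1
        }
        END {
            for (g in genes) ng++
            for (t in tx) nt++
            printf "genes=%d transcripts=%d\n", ng, nt
        }
    '
}
```

Running `gtf_summary hg38.knownGene.gtf.gz` and the same for the other three files gives numbers that can be put side by side.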

NCBI

Explanation: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

Download: https://www.ncbi.nlm.nih.gov/refseq/
$ wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz

$ wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/current/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz

GEO / SRA

$ fasterq-dump --split-files -e 10 SRR9689351 # download the SRA run and convert it to two FASTQ files (R1, R2); -e sets the number of threads
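
After the conversion it is worth checking that R1 and R2 hold the same number of reads. The sketch below assumes the two output files are named `<accession>_1.fastq` and `<accession>_2.fastq` in the current directory and that every record spans exactly four lines; `fastq_reads` and `check_pairs` are hypothetical helper names.

```shell
# Sketch: confirm that the paired FASTQ files contain equal numbers of reads.
# fastq_reads and check_pairs are hypothetical helpers, for illustration only.

# One FASTQ record is four lines, so reads = lines / 4.
fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Compares <acc>_1.fastq with <acc>_2.fastq; returns 1 on a mismatch.
check_pairs() {
    acc="$1"
    r1=$(fastq_reads "${acc}_1.fastq")
    r2=$(fastq_reads "${acc}_2.fastq")
    if [ "$r1" -eq "$r2" ]; then
        echo "$acc: $r1 read pairs"
    else
        echo "$acc: mismatch (R1=$r1, R2=$r2)" >&2
        return 1
    fi
}
```

For the run above this would be called as `check_pairs SRR9689351`.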