Databases and Data Download
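Key point: get familiar with the databases commonly used in bioinformatics analysis and with how to download data from them.
Reference genomes
A reference genome mainly consists of a sequence file (FASTA, .fa) and an annotation file (GTF, .gtf). This article focuses on human and mouse. Taking human as an example, the common assembly versions are hg19 (GRCh37) and hg38 (GRCh38). There are three providers: Ensembl, UCSC, and NCBI.
Note: download matching files, that is, take the fa and the gtf from the same provider; otherwise you will run into a lot of trouble later. For example, UCSC names chromosomes chr1, chr2, ..., while Ensembl names them 1, 2, ..., so a UCSC FASTA and an Ensembl GTF will not line up.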
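The matching requirement above can be checked mechanically. The following is a minimal sketch, not part of any provider's tooling: the function names `fasta_names`, `gtf_names`, and `missing_in_fasta` are made up for illustration, and it assumes GNU gzip, whose `-cdf` options pass uncompressed files through unchanged.

```shell
#!/bin/sh
# Sketch: check that a FASTA and a GTF use the same sequence names.
# All function names here are hypothetical helpers, not from any existing tool.

# Print the sequence names of a FASTA (plain or gzipped), one per line.
fasta_names() {
    gzip -cdf "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u
}

# Print the sequence names found in column 1 of a GTF (plain or gzipped).
gtf_names() {
    gzip -cdf "$1" | awk -F '\t' '!/^#/ && NF >= 9 { print $1 }' | sort -u
}

# Print GTF sequence names that do not occur in the FASTA.
# Empty output means every annotated sequence exists in the FASTA.
missing_in_fasta() (
    tmp_fa=$(mktemp) && tmp_gtf=$(mktemp) || exit 1
    fasta_names "$1" > "$tmp_fa"
    gtf_names "$2" > "$tmp_gtf"
    comm -13 "$tmp_fa" "$tmp_gtf"   # lines unique to the GTF list
    rm -f "$tmp_fa" "$tmp_gtf"
)
```

Usage: `missing_in_fasta genome.fa annotation.gtf.gz`. If this prints names such as chr1 while the FASTA uses 1, the two files most likely come from different providers.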
Ensembl / GENCODE
Ensembl
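Official site: https://asia.ensembl.org/index.html. It is better to navigate from the home page yourself than to rely on the links below, because new releases come out regularly and the links below are pinned to release 102.
Click Human in the middle of the page. On the right of the new page, under Gene annotation, follow "Download FASTA files for genes, cDNAs, ncRNA, proteins" and "Download GTF or GFF3 files for genes, cDNAs, ncRNA, proteins".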
# fasta: ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/
$ wget ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
$ gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Differences between the variants:
# * 'dna' - unmasked genomic DNA sequences.
# * 'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
# * 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base.

# gtf: ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/
$ wget ftp://ftp.ensembl.org/pub/release-102/gtf/homo_sapiens/Homo_sapiens.GRCh38.102.chr.gtf.gz
# In the case of human and mouse, the GTF files found here are equivalent to the GENCODE gene set.

# mouse
$ wget ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr.gtf.gz
# The ".chr" GTF has far fewer lines, probably because it only covers the assembled chromosomes and leaves out unplaced scaffolds.
$ wget ftp://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
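To tell which masking variant a downloaded FASTA actually is, one option is to count uppercase, lowercase, and N bases. This is only a rough sketch: `base_composition` is a hypothetical helper name, it assumes GNU gzip (`-cdf` passes plain files through), and N also occurs in assembly gaps, so a non-zero N count alone does not prove the file is 'dna_rm'.

```shell
#!/bin/sh
# Sketch: count unmasked, soft-masked and N bases in a FASTA (plain or gzipped).
# base_composition is a hypothetical helper name.
base_composition() {
    gzip -cdf "$1" | awk '
        !/^>/ {
            line = $0
            n     += gsub(/[Nn]/, "", line)    # hard-masked or unknown bases
            lower += gsub(/[acgt]/, "", line)  # soft-masked bases
            upper += gsub(/[ACGT]/, "", line)  # unmasked bases
        }
        END { printf "upper=%d lower=%d N=%d\n", upper, lower, n }'
}
```

Usage: `base_composition Homo_sapiens.GRCh38.dna.primary_assembly.fa`. A 'dna' file is expected to report lower=0, while a 'dna_sm' file should report a large lowercase count.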
GENCODE
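Official site: https://www.gencodegenes.org/. It is better to navigate from the home page yourself than to rely on the links below, because new releases come out regularly and the links below are pinned to release 36.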
# https://www.gencodegenes.org/human/
## Nucleotide sequences of all transcripts on the reference chromosomes:
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.transcripts.fa.gz
## This is the main annotation file for most users:
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz
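Since the release number appears twice in each GENCODE link, it can be kept in one place. The sketch below only mirrors the URL pattern of the two links above; `gencode_human_url` is a hypothetical helper, and the assumption that other releases follow the same path layout should be checked on the GENCODE site before use.

```shell
#!/bin/sh
# Sketch: build a GENCODE human download URL from a release number.
# gencode_human_url is a hypothetical helper; the path layout is copied
# from the release 36 links above and is assumed, not guaranteed, for
# other releases.
gencode_human_url() {
    release=$1   # e.g. 36
    kind=$2      # "annotation.gtf" or "transcripts.fa"
    printf 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_%s/gencode.v%s.%s.gz\n' \
        "$release" "$release" "$kind"
}
```

Usage: `wget "$(gencode_human_url 36 annotation.gtf)"`.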
UCSC
# ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/
# ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
$ wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

# ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/
#   hg38.ensGene.gtf.gz      26.5 MB   2020/1/10 8:00:00 AM
#   hg38.knownGene.gtf.gz    32.4 MB   2020/1/10 8:00:00 AM
#   hg38.ncbiRefSeq.gtf.gz   34.2 MB   2020/1/10 8:00:00 AM
#   hg38.refGene.gtf.gz      22.5 MB   2020/1/10 8:00:00 AM
## TODO: what are the differences between these four GTF versions?
$ wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.knownGene.gtf.gz
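One way to start comparing the four UCSC GTF files is to count the distinct gene and transcript IDs in each. This is a minimal sketch rather than an answer to the open question: `gtf_summary` is a hypothetical helper, it assumes GNU gzip, and it only looks at the `gene_id` and `transcript_id` attributes in column 9.

```shell
#!/bin/sh
# Sketch: count distinct gene_id and transcript_id values in a GTF
# (plain or gzipped). gtf_summary is a hypothetical helper name.
gtf_summary() {
    gzip -cdf "$1" | awk -F '\t' '
        !/^#/ && NF >= 9 {
            if (match($9, /gene_id "[^"]+"/))
                genes[substr($9, RSTART, RLENGTH)] = 1
            if (match($9, /transcript_id "[^"]+"/))
                tx[substr($9, RSTART, RLENGTH)] = 1
        }
        END {
            g = 0; t = 0
            for (k in genes) g++
            for (k in tx) t++
            printf "genes=%d transcripts=%d\n", g, t
        }'
}
```

Usage: `for f in hg38.*.gtf.gz; do printf '%s\t' "$f"; gtf_summary "$f"; done` prints one summary line per file, which makes size differences between the annotation sets easy to see.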
NCBI
Explanation: https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/
Download: https://www.ncbi.nlm.nih.gov/refseq/
$ wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gtf.gz
$ wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/current/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
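Large downloads are sometimes cut off, so it can be worth testing the .gz files before using them. The sketch below relies only on `gzip -t`, which tests the integrity of a compressed file; `check_gz` is a hypothetical helper name, and this check does not replace comparing against checksums published by the provider.

```shell
#!/bin/sh
# Sketch: report whether each downloaded .gz file passes gzip's integrity test.
# check_gz is a hypothetical helper name. Exits non-zero if any file is broken.
check_gz() (
    status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            printf 'OK\t%s\n' "$f"
        else
            printf 'BROKEN\t%s\n' "$f"
            status=1
        fi
    done
    exit "$status"
)
```

Usage: `check_gz *.gtf.gz *.fa.gz`. Any file reported as BROKEN should be downloaded again.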
GEO / SRA
$ fasterq-dump --split-files -e 10 SRR9689351   # download the SRA run and convert it to two FASTQ files (R1, R2); -e sets the number of threads
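When a GEO series contains many runs, the same command can be applied to a list of accessions. The following is a minimal sketch under stated assumptions: `fetch_runs` and the `DRY_RUN` variable are made up for illustration, the list is a plain text file with one run accession per line, and the flags are the ones used in the command above. For paired-end runs, `--split-files` is expected to write `<accession>_1.fastq` and `<accession>_2.fastq`.

```shell
#!/bin/sh
# Sketch: run fasterq-dump for every accession in a list file.
# fetch_runs and DRY_RUN are hypothetical names, not part of the SRA Toolkit.
# With DRY_RUN=1 the commands are only printed, which is handy for checking
# the list before starting long downloads.
fetch_runs() (
    list=$1            # text file, one run accession (e.g. SRR9689351) per line
    threads=${2:-10}   # value passed to -e
    while IFS= read -r acc || [ -n "$acc" ]; do
        case $acc in
            ''|'#'*) continue ;;   # skip blank lines and comment lines
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "fasterq-dump --split-files -e $threads $acc"
        else
            # stdin is redirected so the tool cannot consume the accession list
            fasterq-dump --split-files -e "$threads" "$acc" < /dev/null \
                || echo "failed: $acc" >&2
        fi
    done < "$list"
)
```

Usage: `DRY_RUN=1; fetch_runs runs.txt` prints the commands, and `DRY_RUN=0; fetch_runs runs.txt` actually runs them (this requires the SRA Toolkit to be installed).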