免费视频淫片aa毛片_日韩高清在线亚洲专区vr_日韩大片免费观看视频播放_亚洲欧美国产精品完整版

打開APP
userphoto
未登錄

開通VIP,暢享免費(fèi)電子書等14項超值服

開通VIP
生信人的20個R語言習(xí)題及其答案


1. 安裝一些R包:

數(shù)據(jù)包: ALL, CLL, pasilla, airway
軟件包:limma,DESeq2,clusterProfiler
工具包:reshape2
繪圖包:ggplot2
不同領(lǐng)域的R包使用頻率不一樣,在生物信息學(xué)領(lǐng)域,尤其需要掌握bioconductor系列包。

if(F){source("http://bioconductor.org/biocLite.R")options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")#修改鏡像,安裝會加速BiocManager::install("clusterProfiler")BiocManager::install("ComplexHeatmap")BiocManager::install("maftools")BiocManager::install("ggplot2")BiocManager::install("jmzeng1314/biotrainee")}#或者如下:source("https://bioconductor.org/biocLite.R")options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")BiocManager::install(c('ALL','CLL','pasilla','clusterProfiler')) BiocManager::install(c('airway','DESeq2','edgeR','limma'))install.packages("reshape2", "ggplot2")

2.了解ExpressionSet對象

比如CLL包里面就有data(sCLLex) ,找到它包含的元素,提取其表達(dá)矩陣(使用exprs函數(shù)),查看其大小

  1. 參考:http://www.bio-info-trainee.com/bioconductor_China/software/limma.html

  2. 參考:https://github.com/bioconductor-china/basic/blob/master/ExpressionSet.md

suppressPackageStartupMessages(library(CLL))data(sCLLex)exprSet=exprs(sCLLex)##sCLLex是依賴于CLL這個package的一個對象samples=sampleNames(sCLLex)pdata=pData(sCLLex)group_list=as.character(pdata[,2])dim(exprSet)# [1] 12625    22exprSet[1:5,1:5]#            CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL# 1000_at    5.743132  6.219412  5.523328  5.340477  5.229904# 1001_at    2.285143  2.291229  2.287986  2.295313  2.662170# 1002_f_at  3.309294  3.318466  3.354423  3.327130  3.365113# 1003_s_at  1.085264  1.117288  1.084010  1.103217  1.074243# 1004_at    7.544884  7.671801  7.474025  7.152482  6.902932

3.了解 str,head,help函數(shù),作用于 第二步提取到的表達(dá)矩陣

str(exprSet)# str: Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput).head(exprSet)

4. 安裝并了解 hgu95av2.db 包,看看 ls("package:hgu95av2.db") 后 顯示的那些變量

hgu95av2.db是一個注釋包,它為hgu95av2平臺的芯片提供注釋,這個包中有很多注釋文件,如下所示:

BiocManager::install("hgu95av2.db")library(hgu95av2.db)ls("package:hgu95av2.db")#[1] "hgu95av2"              "hgu95av2.db"           "hgu95av2_dbconn"       "hgu95av2_dbfile"       "hgu95av2_dbInfo"       "hgu95av2_dbschema"     [7] "hgu95av2ACCNUM"        "hgu95av2ALIAS2PROBE"   "hgu95av2CHR"           "hgu95av2CHRLENGTHS"    "hgu95av2CHRLOC"        "hgu95av2CHRLOCEND"    [13] "hgu95av2ENSEMBL"       "hgu95av2ENSEMBL2PROBE" "hgu95av2ENTREZID"      "hgu95av2ENZYME"        "hgu95av2ENZYME2PROBE"  "hgu95av2GENENAME"     [19] "hgu95av2GO"            "hgu95av2GO2ALLPROBES"  "hgu95av2GO2PROBE"      "hgu95av2MAP"           "hgu95av2MAPCOUNTS"     "hgu95av2OMIM"         [25] "hgu95av2ORGANISM"      "hgu95av2ORGPKG"        "hgu95av2PATH"          "hgu95av2PATH2PROBE"    "hgu95av2PFAM"          "hgu95av2PMID"         [31] "hgu95av2PMID2PROBE"    "hgu95av2PROSITE"       "hgu95av2REFSEQ"        "hgu95av2SYMBOL"        "hgu95av2UNIGENE"       "hgu95av2UNIPROT"      

5. 理解 head(toTable(hgu95av2SYMBOL)) 的用法,找到 TP53 基因?qū)?yīng)的探針I(yè)D

?hgu95av2SYMBOL?toTablesummary(hgu95av2SYMBOL)#SYMBOL map for chip hgu95av2 (object of class "ProbeAnnDbBimap")|| Lkeyname: probe_id (Ltablename: probes)|    Lkeys: "1000_at", "1001_at", ... (total=12625/mapped=11460)|| Rkeyname: symbol (Rtablename: gene_info)|    Rkeys: "A1BG", "A2M", ... (total=61050/mapped=8585)|| direction: L --> Rids <- toTable(hgu95av2SYMBOL)View(ids)library(dplyr)# 方法1:filter(ids, symbol=="TP53") #用dplyr包的篩選功能,找到 TP53 基因?qū)?yīng)的探針I(yè)D#     probe_id symbol#1   1939_at   TP53#2 1974_s_at   TP53#3  31618_at   TP53#方法2:ids[grep("^TP53$", ids$symbol),]#       probe_id symbol# 966    1939_at   TP53# 997  1974_s_at   TP53# 1420  31618_at   TP53#方法1,2雖然結(jié)果相同,但是定義的對象是不同的

hug95av2SYMBOL是一個R對象,它提供的是芯片生產(chǎn)廠家與基因縮寫之間的映射信息。這個映射的信息主要依據(jù)Entrez Gene數(shù)據(jù)庫?,F(xiàn)在我們通過mappedkeys()這個函數(shù),得到映射到基因上的探針信息。

6.理解探針與基因的對應(yīng)關(guān)系,總共多少個基因,基因最多對應(yīng)多少個探針,是哪些基因,是不是因為這些基因很長,所以在其上面設(shè)計多個探針呢?

length(unique(ids$symbol))# [1] 8585tail(sort(table(ids$symbol)))#YME1L1  GAPDH INPP4A    MYB PTGER3  STAT1 #     7      8      8      8      8      8 table(sort(table(ids$symbol)))#  1    2    3    4    5    6    7    8 # 6555 1428  451  102   22   16    6    5 
image.png

不管是Agilent芯片,還是Affymetrix芯片,上面設(shè)計的探針都非常短。最長的如Agilent芯片上的探針,往往都是60bp,但是往往一個基因的長度都好幾Kb。因此一般多個探針對應(yīng)一個基因,取最大表達(dá)值探針來作為基因的表達(dá)量。

7.第二步提取到的表達(dá)矩陣是12625個探針在22個樣本的表達(dá)量矩陣,找到那些不在 hgu95av2.db 包收錄的對應(yīng)著SYMBOL的探針。

提示:有1165個探針是沒有對應(yīng)基因名字的。

%in% 邏輯判斷

用法 a %in% table
a值是否包含于table中,為真輸出TURE,否者輸出FALSE

table(rownames(exprSet)) %in% ids$probe_id# %in% is a more intuitive interface as a binary operator, which returns a logical vector indicating if there is a match or not for its left operand.n_exprSet <- exprSet[!(rownames(exprSet) %in% ids$probe_id),]dim(n_exprSet)# [1] 1165   22View(n_exprSet)# These probes are not in the package.

8.過濾表達(dá)矩陣,刪除那1165個沒有對應(yīng)基因名字的探針。

方法1:%in% 邏輯判斷

exprSet <- exprSet[rownames(exprSet) %in% ids$probe_id, ]dim(exprSet)[1] 11460    22View(exprSet)# These probes are in the package.

方法2 mappedkeys() 映射關(guān)系

 length(hgu95av2SYMBOL)[1] 12625probe_map <- hgu95av2SYMBOLlength(probe_map)[1] 12625#全部的探針數(shù)目# [1] 12625probe_info <- mappedkeys(probe_map)length(probe_info)[1] 11460#探針與基因產(chǎn)生映射的數(shù)目gene_info <- as.list(probe_map[probe_info])# 轉(zhuǎn)化為數(shù)據(jù)表length(gene_info)[1] 11460gene_symbol <- toTable(probe_map[probe_info])# 從hgu95av2SYMBOL文件中,取出有映射關(guān)系的探針,并生成數(shù)據(jù)框給gene_symbolhead(gene_symbol)   probe_id  symbol1   1000_at   MAPK32   1001_at    TIE13 1002_f_at CYP2C194 1003_s_at   CXCR55   1004_at   CXCR56   1005_at   DUSP1

mappedkeys用法示例,幫助理解。

library(hgu95av2.db)  x <- hgu95av2GO  x  length(x)  count.mappedkeys(x)  x[1:3]  links(x[1:3])  ## Keep only the mapped keys  keys(x) <- mappedkeys(x)  length(x)  count.mappedkeys(x)  x # now it is a submap  ## The above subsetting can also be achieved with  x <- hgu95av2GO[mappedkeys(hgu95av2GO)]  ## mappedkeys() and count.mappedkeys() also work with an environment  ## or a list  z <- list(k1=NA, k2=letters[1:4], k3="x")  mappedkeys(z)  count.mappedkeys(z)  ## retrieve the set of primary keys for the ChipDb object named 'hgu95av2.db'  keys <- keys(hgu95av2.db)  head(keys)

9.整合表達(dá)矩陣,多個探針對應(yīng)一個基因的情況下,只保留在所有樣本里面平均表達(dá)量最大的那個探針。

A. 提示,理解 tapply,by,aggregate,split 函數(shù) , 首先對每個基因找到最大表達(dá)量的探針。
B. 然后根據(jù)得到探針去過濾原始表達(dá)矩陣

ids=ids[match(rownames(exprSet),ids$probe_id),]head(ids)exprSet[1:5,1:5]tmp = by(exprSet,ids$symbol,function(x) rownames(x)[which.max(rowMeans(x))] )probes = as.character(tmp)exprSet=exprSet[rownames(exprSet) %in% probes ,]dim(exprSet)View(head(exprSet))

10.把過濾后的表達(dá)矩陣更改行名為基因的symbol,因為這個時候探針和基因是一對一關(guān)系了。

rownames(exprSet)=ids[match(rownames(exprSet),ids$probe_id),2]exprSet[1:5,1:5]#  CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CEL# MAPK3    5.743132  6.219412  5.523328  5.340477  5.229904# TIE1     2.285143  2.291229  2.287986  2.295313  2.662170# CYP2C19  3.309294  3.318466  3.354423  3.327130  3.365113# CXCR5    1.085264  1.117288  1.084010  1.103217  1.074243# CXCR5    7.544884  7.671801  7.474025  7.152482  6.902932library(reshape2)exprSet_L=melt(exprSet)colnames(exprSet_L)=c('probe','sample','value')exprSet_L$group=rep(group_list,each=nrow(exprSet))head(exprSet_L)#  probe    sample    value    group#1   MAPK3 CLL11.CEL 5.743132 progres.#2    TIE1 CLL11.CEL 2.285143 progres.#3 CYP2C19 CLL11.CEL 3.309294 progres.#4   CXCR5 CLL11.CEL 1.085264 progres.#5   CXCR5 CLL11.CEL 7.544884 progres.#6   DUSP1 CLL11.CEL 5.083793 progres.View(head(exprSet))

11. 對第10步得到的表達(dá)矩陣進(jìn)行探索,先畫第一個樣本的所有基因的表達(dá)量的boxplot,hist,density , 然后畫所有樣本的 這些圖

  1. 參考:http://bio-info-trainee.com/tmp/basic_visualization_for_expression_matrix.html

  2. 理解ggplot2的繪圖語法,數(shù)據(jù)和圖形元素的映射關(guān)系

### ggplot2library(ggplot2)p=ggplot(exprSet_L,aes(x=sample,y=value,fill=group))+geom_boxplot()print(p)p=ggplot(exprSet_L,aes(x=sample,y=value,fill=group))+geom_violin()print(p)p=ggplot(exprSet_L,aes(value,fill=group))+geom_histogram(bins = 200)+facet_wrap(~sample, nrow = 4)print(p)p=ggplot(exprSet_L,aes(value,col=group))+geom_density()+facet_wrap(~sample, nrow = 4)print(p)p=ggplot(exprSet_L,aes(value,col=group))+geom_density()print(p)p=ggplot(exprSet_L,aes(x=sample,y=value,fill=group))+geom_boxplot()p=p+stat_summary(fun.y="mean",geom="point",shape=23,size=3,fill="red")p=p+theme_set(theme_set(theme_bw(base_size=20)))p=p+theme(text=element_text(face='bold'),axis.text.x=element_text(angle=30,hjust=1),axis.title=element_blank())print(p)
image.png

image.png

image.png

image.png

image.png

image.png

12.理解統(tǒng)計學(xué)指標(biāo)mean,median,max,min,sd,var,mad并計算出每個基因在所有樣本的這些統(tǒng)計學(xué)指標(biāo),最后按照mad值排序,取top 50 mad值的基因,得到列表。

注意:這個題目出的并不合規(guī),請仔細(xì)看。

g_mean <- tail(sort(apply(exprSet,1,mean)),50)g_median <- tail(sort(apply(exprSet,1,median)),50)g_max <- tail(sort(apply(exprSet,1,max)),50)g_min <- tail(sort(apply(exprSet,1,min)),50)g_sd <- tail(sort(apply(exprSet,1,sd)),50)g_var <- tail(sort(apply(exprSet,1,var)),50)g_mad <- tail(sort(apply(exprSet,1,mad)),50)g_madnames(g_mad) [1] "DUSP5"   "IGFBP4"  "H1FX"    "ENPP2"   "FLNA"    "CLEC2B"  "TSPYL2"  "ZNF266"  "S100A9"  "NR4A2"   "TGFBI"   "ARF6"    "APBB2"   "VCAN"    "RBM38"  [16] "CAPG"    "PLXNC1"  "RGS2"    "RNASE6"  "VAMP5"   "CYBB"    "GNLY"    "CCL3"    "OAS1"    "ENPP2"   "TRIB2"   "ZNF804A" "H1FX"    "IGH"     "JUND"   [31] "SLC25A1" "PCDH9"   "VIPR1"   "COBLL1"  "GUSBP11" "S100A8"  "HBB"     "FOS"     "LHFPL2"  "FCN1"    "ZAP70"   "IGLC1"   "LGALS1"  "HBB"     "FOS"    [46] "SLAMF1"  "TCF7"    "DMD"     "IGF2BP3" "FAM30A" 

13.根據(jù)第12步驟得到top 50 mad值的基因列表來取表達(dá)矩陣的子集,并且熱圖可視化子表達(dá)矩陣。試試看其它5種熱圖的包的不同效果。

## heatmaplibrary(pheatmap)choose_gene=names(tail(sort(apply(exprSet,1,mad)),50))choose_matrix=exprSet[choose_gene,]choose_matrix=t(scale(t(choose_matrix)))pheatmap(choose_matrix)
image.png

14.取不同統(tǒng)計學(xué)指標(biāo)mean,median,max,mean,sd,var,mad的各top50基因列表,使用UpSetR包來看他們之間的overlap情況。

## UpSetR# https://cran.r-project.org/web/packages/UpSetR/README.htmllibrary(UpSetR)g_all <- unique(c(names(g_mean),names(g_median),names(g_max),names(g_min),names(g_sd),names(g_var),names(g_mad) ))dat=data.frame(g_all=g_all,g_mean=ifelse(g_all %in% names(g_mean) ,1,0),g_median=ifelse(g_all %in% names(g_median) ,1,0),g_max=ifelse(g_all %in% names(g_max) ,1,0),g_min=ifelse(g_all %in% names(g_min) ,1,0),g_sd=ifelse(g_all %in% names(g_sd) ,1,0),g_var=ifelse(g_all %in% names(g_var) ,1,0),g_mad=ifelse(g_all %in% names(g_mad) ,1,0))upset(dat,nsets = 7)
image.png

15.在第二步的基礎(chǔ)上面提取CLL包里面的data(sCLLex) 數(shù)據(jù)對象的樣本的表型數(shù)據(jù)。

pdata=pData(sCLLex)group_list=as.character(pdata[,2])group_list# [1] "progres." "stable"   "progres." "progres." "progres." "progres." "stable"   "stable"   "progres." "stable"   "progres." "stable"   "progres." "stable"  # [15] "stable"   "progres." "progres." "progres." "progres." "progres." "progres." "stable"  dim(exprSet)# [1] 8585   22exprSet[1:5,1:5]#      CLL11.CEL CLL12.CEL CLL13.CEL CLL14.CEL CLL15.CELMAPK3    5.743132  6.219412  5.523328  5.340477  5.229904TIE1     2.285143  2.291229  2.287986  2.295313  2.662170CYP2C19  3.309294  3.318466  3.354423  3.327130  3.365113CXCR5    7.544884  7.671801  7.474025  7.152482  6.902932DUSP1    5.083793  7.610593  7.631311  6.518594  5.059087

16.對所有樣本的表達(dá)矩陣進(jìn)行聚類并且繪圖,然后添加樣本的臨床表型數(shù)據(jù)信息(更改樣本名)

## hclustcolnames(exprSet)=paste(group_list,1:22,sep='')# Define nodeParnodePar <- list(lab.cex = 0.6, pch = c(NA, 19),                cex = 0.7, col = "blue")hc=hclust(dist(t(exprSet)))par(mar=c(5,5,5,10))plot(as.dendrogram(hc), nodePar = nodePar, horiz = TRUE)
image.png

17.對所有樣本的表達(dá)矩陣進(jìn)行PCA分析并且繪圖,同樣要添加表型信息。

# install.packages("ggfortify")library(ggfortify)exprSet <- exprs(sCLLex)df <- as.data.frame(t(exprSet))df$group <- group_list# autoplot uses ggplot2 to draw a particular plot for an object of a particular class in a single command.autoplot(prcomp(df[,1:(ncol(df)-1)]), data=df, colour = 'group')
image.png

18.根據(jù)表達(dá)矩陣及樣本分組信息進(jìn)行批量T檢驗,得到檢驗結(jié)果表格

## t.testdat = exprSetgroup_list=as.factor(group_list)group1 = which(group_list == levels(group_list)[1])group2 = which(group_list == levels(group_list)[2])dat1 = dat[, group1]dat2 = dat[, group2]dat = cbind(dat1, dat2)pvals = apply(exprSet, 1, function(x){  t.test(as.numeric(x)~group_list)$p.value})p.adj = p.adjust(pvals, method = "BH")avg_1 = rowMeans(dat1)avg_2 = rowMeans(dat2)log2FC = avg_2-avg_1DEG_t.test = cbind(avg_1, avg_2, log2FC, pvals, p.adj)DEG_t.test=DEG_t.test[order(DEG_t.test[,4]),]DEG_t.test=as.data.frame(DEG_t.test)head(DEG_t.test)#           avg_1    avg_2     log2FC        pvals     p.adj36129_at 7.875615 8.791753  0.9161377 1.629755e-05 0.205756637676_at 6.622749 7.965007  1.3422581 4.058944e-05 0.243617733791_at 7.616197 5.786041 -1.8301554 6.965416e-05 0.243617739967_at 4.456446 2.152471 -2.3039752 8.993339e-05 0.243617734594_at 5.988866 7.058738  1.0698718 9.648226e-05 0.243617732198_at 4.157971 3.407405 -0.7505660 2.454557e-04 0.3516678

19.使用limma包對表達(dá)矩陣及樣本分組信息進(jìn)行差異分析,得到差異分析表格,重點(diǎn)看logFC和P值,畫個火山圖(就是logFC和-log10(P值)的散點(diǎn)圖)。

# DEG by limmasuppressMessages(library(limma))design <- model.matrix(~0+factor(group_list))colnames(design)=levels(factor(group_list))rownames(design)=colnames(exprSet)designcontrast.matrix<-makeContrasts(paste0(unique(group_list),collapse = "-"),levels = design)contrast.matrix##這個矩陣聲明,我們要把progres.組跟stable進(jìn)行差異分析比較##step1fit <- lmFit(exprSet,design)##step2fit2 <- contrasts.fit(fit, contrast.matrix) ##這一步很重要,大家可以自行看看效果fit2 <- eBayes(fit2) ## default no trend !!!##eBayes() with trend=TRUE##step3tempOutput = topTable(fit2, coef=1, n=Inf)nrDEG = na.omit(tempOutput)#write.csv(nrDEG2,"limma_notrend.results.csv",quote = F)head(nrDEG)## volcano plotDEG=nrDEGlogFC_cutoff <- with(DEG,mean(abs( logFC)) + 2*sd(abs( logFC)) )DEG$change = as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff,                              ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT'))this_tile <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3),                    '\nThe number of up gene is ',nrow(DEG[DEG$change =='UP',]) ,                    '\nThe number of down gene is ',nrow(DEG[DEG$change =='DOWN',]))g = ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=change)) +  geom_point(alpha=0.4, size=1.75) +  theme_set(theme_set(theme_bw(base_size=20)))+  xlab("log2 fold change") + ylab("-log10 p-value") +  ggtitle( this_tile ) + theme(plot.title = element_text(size=15,hjust = 0.5))+  scale_colour_manual(values = c('blue','black','red')) ## corresponding to the levels(res$change)print(g)
image.png

image.png

20.對T檢驗結(jié)果的P值和limma包差異分析的P值畫散點(diǎn)圖,看看哪些基因相差很大。

### different P valueshead(nrDEG)head(DEG_t.test)DEG_t.test=DEG_t.test[rownames(nrDEG),]plot(DEG_t.test[,3],nrDEG[,1])plot(DEG_t.test[,4],nrDEG[,4])plot(-log10(DEG_t.test[,4]),-log10(nrDEG[,4]))
image.png

image.png

image.png
rownames(exprSet)=ids[match(rownames(exprSet),ids$probe_id),2]exprSet[1:5,1:5]exprSet['GAPDH',]exprSet['ACTB',]exprSet['DLEU1',]library(ggplot2)library(ggpubr)my_comparisons <- list(  c("stable", "progres."))dat=data.frame(group=group_list,               sampleID= names(exprSet['DLEU1',]),               values= as.numeric(exprSet['DLEU1',]))ggboxplot(  dat, x = "group", y = "values",  color = "group",  add = "jitter")+  stat_compare_means(comparisons = my_comparisons, method = "t.test")
image.png
## heatmaplibrary(pheatmap)choose_gene=head(rownames(nrDEG),25)choose_matrix=exprSet[choose_gene,]choose_matrix=t(scale(t(choose_matrix)))pheatmap(choose_matrix)
本站僅提供存儲服務(wù),所有內(nèi)容均由用戶發(fā)布,如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容,請點(diǎn)擊舉報。
打開APP,閱讀全文并永久保存 查看更多類似文章
猜你喜歡
類似文章
箱線圖走一個
生信編程14.多個探針對應(yīng)一個基因,腫么辦?
R語言20練習(xí)題【完整版】
表達(dá)矩陣可視化大全
多個探針對應(yīng)一個基因,取平均值或者最大值
用samr包對芯片數(shù)據(jù)做差異分析
更多類似文章 >>
生活服務(wù)
分享 收藏 導(dǎo)長圖 關(guān)注 下載文章
綁定賬號成功
后續(xù)可登錄賬號暢享VIP特權(quán)!
如果VIP功能使用有故障,
可點(diǎn)擊這里聯(lián)系客服!

聯(lián)系客服