📖 R & Python & Shell: programming basics, visualization, and more

Python-Visualization-Annotating statistical significance on seaborn plots with statannotations

References:
https://github.com/trevismd/statannotations/blob/master/usage/example.ipynb
https://github.com/trevismd/statannotations/tree/master

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Supported seaborn plot types:
    # Box plots    -- sns.boxplot()
    # Bar plots    -- sns.barplot()
    # Swarm plots  -- sns.swarmplot()
    # Strip plots  -- sns.stripplot()
    # Violin plots -- sns.violinplot()
    # Supporting FacetGrid -- sns.catplot(col=..., row=...)
    from statannotations.Annotator import Annotator

1. Basic use

    # Assumption: df is seaborn's "tips" dataset, whose columns match the names used below
    df = sns.load_dataset("tips")

    x = "day"
    y = "total_bill"
    order = ['Sun', 'Thur', 'Fri', 'Sat']
    ax = sns.boxplot(data=df, x=x, y=y, order=order)

    pairs = [("Thur", "Fri"), ("Thur", "Sat"), ("Fri", "Sun")]
    annot = Annotator(ax, pairs, data=df, x=x, y=y, order=order)
    annot.configure(test='t-test_ind', text_format='star', loc='outside', verbose=2)
    # The test and the annotation can also be run as two separate steps:
    # annot.apply_test()
    # ax, test_results = annot.annotate()
    ax, test_results = annot.apply_and_annotate()

Tip: the plotting arguments passed to the seaborn function (data, x, y, order, and so on) must match the ones passed to Annotator.

...
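With text_format='star', each p-value is drawn as a star label rather than a number. The sketch below shows one way such a mapping can work. The function name `p_to_stars` is hypothetical, and the thresholds are an assumption based on the legend statannotations prints by default; check the library's own output before relying on them.

```python
def p_to_stars(p):
    """Map a p-value to a star label (thresholds are assumed defaults, not read from the library)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must be in [0, 1]")
    # (upper bound, label) pairs, checked from the strictest threshold upward
    thresholds = [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]
    for upper, label in thresholds:
        if p <= upper:
            return label
    return "ns"  # not significant at the 0.05 level


# Made-up p-values, for illustration only
for p in (0.2, 0.03, 0.004, 0.00002):
    print(f"p = {p:<8} -> {p_to_stars(p)}")
```

The loop only uses made-up p-values; in the real workflow the p-values come from the test chosen in annot.configure().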

Python-Visualization-Heatmaps and clustered heatmaps with seaborn

    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    import pandas as pd

1. sns.heatmap

sns.heatmap draws the table exactly as given: rows and columns keep their order and are not rearranged.

    # glue = sns.load_dataset("glue").pivot(index="Model", columns="Task", values="Score")
    # https://github.com/mwaskom/seaborn-data/blob/master/glue.csv
    glue = pd.read_csv("./data/glue.csv").pivot(index="Model", columns="Task", values="Score")
    glue.head()
    # Task         CoLA  MNLI  MRPC  QNLI   QQP   RTE  SST-2  STS-B
    # Model
    # BERT         60.5  86.7  89.3  92.7  72.1  70.1   94.9   87.6
    # BiLSTM       11.6  65.6  81.8  74.6  62.5  57.4   82.8   70.3
    # BiLSTM+Attn  18.6  67.6  83.9  74.3  60.1  58.4   83.0   72.8
    # BiLSTM+CoVe  18.5  65.4  78.7  70.8  60.6  52.7   81.9   64.4
    # BiLSTM+ELMo  32.1  67.2  84.7  75.5  61.1  57.4   89.3   70.3

    # 1) Basic heatmap
    plt.figure(figsize=(4, 3))
    sns.heatmap(glue)
    plt.show()

    # 2) Annotate each cell with its value
    plt.figure(figsize=(4, 3))
    sns.heatmap(glue, annot=True,
                # fmt=".1f"                 # number format for the annotations
                # annot=glue.rank(axis=1)   # or supply custom annotation content
                )

    # 3) Cell borders
    plt.figure(figsize=(4, 3))
    sns.heatmap(glue, annot=True, linewidth=.5, linecolor="white", square=True)

    # 4) Colormap and the value range mapped onto it
    plt.figure(figsize=(4, 3))
    sns.heatmap(glue, cmap="crest", vmin=50, vmax=100)
    # Finally, the colorbar can be adjusted through the cbar family of parameters

...
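To make the effect of vmin and vmax concrete, here is a minimal sketch of a linear color scale: each value is rescaled to a position between 0 and 1 along the colormap, and values outside the range land on the nearest end. The helper `normalize` is hypothetical and only illustrates the idea; it is not seaborn's internal code.

```python
def normalize(value, vmin, vmax):
    """Linearly rescale value from [vmin, vmax] to [0, 1], clipping values outside the range."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    scaled = (value - vmin) / (vmax - vmin)
    return min(1.0, max(0.0, scaled))


# Scores taken from the glue table above, mapped with vmin=50 and vmax=100
for score in (11.6, 60.5, 94.9):
    print(score, "->", round(normalize(score, vmin=50, vmax=100), 3))
```

Under these assumptions, every score below 50 gets the same color as 50, which is why a narrow vmin/vmax range highlights differences among the high scores at the cost of flattening the low ones.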

Python-Visualization-Regression scatter plots with seaborn

    import seaborn as sns
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

1. sns.regplot

Suited to showing a simple relationship between two variables.

    mpg = sns.load_dataset('mpg')

    # 1) Basic usage
    sns.regplot(data=mpg, x="weight", y="acceleration")
    # sns.regplot(data=mpg, x="weight", y="acceleration", ci=None)  # hide the confidence interval
    # sns.regplot(data=mpg, x="weight", y="acceleration", ci=99)    # custom confidence level

    # 2) Adjust how the scatter points and the regression line are drawn
    sns.regplot(
        data=mpg, x="weight", y="acceleration",
        marker='o',              # point style, e.g. 'o', 'x', '^'; passed as regplot's own
                                 # argument because a 'marker' key in scatter_kws is overridden
        scatter_kws={
            'color': 'blue',     # point color
            's': 50,             # point size (area)
            'alpha': 0.6,        # transparency
            'linewidths': 0      # point outline width
        }
    )

    sns.regplot(
        data=mpg, x="weight", y="acceleration",
        line_kws={
            'color': 'red',      # regression line color
            'linewidth': 2,      # line width
            'linestyle': '--',   # line style: '-', '--', ':', '-.', etc.
            'alpha': 0.9         # regression line transparency
        }
    )

    # 3) For variables that are discrete in nature (even though stored as float/int),
    #    add jitter to reduce overlap
    sns.regplot(data=mpg, x="cylinders", y="weight", x_jitter=.15)

    # 4) How the line is fitted
    sns.regplot(data=mpg, x="weight", y="mpg", order=2)  # fit a higher-order polynomial (here quadratic)
    # sns.regplot(data=mpg, x="horsepower", y="weight", robust=True)  # robust fit, less sensitive to outliers (needs statsmodels)
    # sns.regplot(data=mpg, x="horsepower", y="mpg", lowess=True)     # fit a smooth LOWESS curve (needs statsmodels)

2. sns.lmplot

Suited to grouped linear regression, i.e. faceted plots.

    tips = sns.load_dataset("tips")

    sns.lmplot(x="total_bill", y="tip", hue="sex", data=tips)

    sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
               facet_kws=dict(sharex=False, sharey=False),)

...
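The straight line that regplot draws by default is an ordinary least-squares fit. As a rough sketch of what that computation involves, the hypothetical helper `fit_line` below derives the slope and intercept in plain Python; the data points are made up, and this is an illustration rather than seaborn's actual implementation.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Options such as order=2, robust=True, and lowess=True replace this straight-line fit with a different model, which is why the drawn curve changes.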

050 Keyboard shortcuts

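shell

Ctrl + a : move the cursor to the start of the line
Ctrl + e : move the cursor to the end of the line
Backspace : delete the character before the cursor
Ctrl + d : delete the character after the cursor
Ctrl + k : cut from the cursor to the end of the line
Ctrl + u : cut from the start of the line to the cursor
...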

Regular expression basics

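Regular expressions are a common way to edit text efficiently when processing strings in R, shell, Python, and similar tools. Below is a short summary of the basics.

1. Matching characters (and character sets)

. : matches any character except a newline
...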
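The behavior of `.` can be checked with a small example. This sketch uses Python's standard `re` module; the sample string is made up, and other regex engines (R, grep) may treat newlines slightly differently.

```python
import re

text = "cat\ncut"

# "." stands for any single character, so "c.t" matches both words
print(re.findall(r"c.t", text))   # ['cat', 'cut']

# By default "." does not match the newline between the two words
print(re.search(r"t.c", text))    # None

# With the re.DOTALL flag, "." matches the newline as well
print(re.search(r"t.c", text, flags=re.DOTALL) is not None)   # True
```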

Building my personal blog with hugo + github

Blog: https://lishensuo.github.io/
github: https://github.com/lishensuo/lishensuo.github.io

1. Install hugo

(1) First download the release package from https://github.com/gohugoio/hugo/releases
...

Basic R configuration

1. Setting R mirrors

(1) Temporary setting (reset when R restarts)

    options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
    options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
    options()$repos
    options()$BioC_mirror

(2) Permanent setting through the .Rprofile file

linux

    # go to the home directory
    cd ~
    vi ~/.Rprofile
    # enter the following two lines
    options(repos=structure(c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/")))
    options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
    # save and quit

Windows

Step 1: open Notepad or another text editor.
Step 2: enter the default settings (same content as in the linux example above).
Step 3: save the file as "This PC > Documents"/.Rprofile.
Step 4: restart R/RStudio.

...

Parallel loops in R

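When R code involves a large number of loop iterations, running them in parallel can speed up the analysis. The approach differs between Windows and Linux/Mac. My notes follow.

    # check the platform: Windows/Linux
    Sys.info()['sysname']

I. Linux/Mac

    library(parallel)
    # detect the number of CPU cores
    detectCores()

1. Parallel lapply: mclapply(); the key setting is the mc.cores argument

    library(parallel)
    ## run 4 workers in parallel
    res = mclapply(1:10, function(x){
      # <code>
      # <code>
    }, mc.cores = 4)

    ## parallel execution does not change the order of the results
    res = mclapply(1:1000, function(x){
      print(x)
      x2 = x*x
      Sys.sleep(0.1)
      return(c(x, x2))
    }, mc.cores = 10)
    res_df = do.call(rbind, res)
    head(res_df,3)
    #      [,1] [,2]
    # [1,]    1    1
    # [2,]    2    4
    # [3,]    3    9
    tail(res_df,3)
    #         [,1]    [,2]
    #  [998,]  998  996004
    #  [999,]  999  998001
    # [1000,] 1000 1000000

2. Parallel for loops with the foreach package. Its arguments control the form in which results are returned; see the related notes or the package documentation.

    library(foreach)
    library(doParallel)

    cl=makeCluster(4)
    registerDoParallel(cl)
    # load the required packages on every worker
    clusterEvalQ(cl, library(package1))
    clusterEvalQ(cl, library(package2))

    res = foreach(i = 1:10) %dopar% {
      # <code>
      # <code>
    }
    stopCluster(cl)

II. Windows

mclapply relies on forking, which Windows does not support, so a cluster has to be created explicitly. Personally, I feel that Windows laptops may not be well suited to parallel processing.

    library(parallel)
    # detect the number of CPU cores
    detectCores()

1. Parallel lapply: parLapply()

    cl <- makeCluster(4)
    # load the required packages on every worker
    # (clusterExport() exports variables by name; packages are loaded with clusterEvalQ())
    clusterEvalQ(cl, library(package1))
    res=parLapply(cl, 1:10, function(x){
      # <code>
      # <code>
    })
    stopCluster(cl)  # the workers must be released explicitly, which is a bit of a hassle

2. Parallel for loops are written the same way as above (foreach).

Reference tutorial: https://www.biostars.org/p/273107/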

R-Data analysis-Table operations with dplyr

    library(tidyverse)
    # -- Attaching packages ----------------------------------------------------- tidyverse 1.3.1 --
    # √ ggplot2 3.3.5     √ purrr   0.3.4
    # √ tibble  3.1.2     √ dplyr   1.0.7
    # √ tidyr   1.1.3     √ stringr 1.4.0
    # √ readr   2.0.0     √ forcats 0.5.1
    # -- Conflicts -------------------------------------------------------- tidyverse_conflicts() --
    # x dplyr::filter() masks stats::filter()
    # x dplyr::lag()    masks stats::lag()

1. Filtering tables

1.1 select: choose columns

col1:col3 selects a contiguous range of columns;
...

R-Data analysis-Wide/long table reshaping with reshape2

    library(tidyverse)
    library(reshape2)

1. matrix

    set.seed(123)
    scores_mt = matrix(round(rnorm(40, mean = 80, sd=10)),
                       nrow = 10, ncol = 4,
                       dimnames = list(paste0("Stu",1:10),
                                       paste0("Subject-",LETTERS[1:4])))
    class(scores_mt)
    # [1] "matrix" "array"
    head(scores_mt)
    #      Subject-A Subject-B Subject-C Subject-D
    # Stu1        74        92        69        84
    # Stu2        78        84        78        77
    # Stu3        96        84        70        89

    ##(1) wide to long
    reshaped = melt(scores_mt, value.name = "Score")
    head(reshaped)
    #   Var1      Var2 Score
    # 1 Stu1 Subject-A    74
    # 2 Stu2 Subject-A    78
    # 3 Stu3 Subject-A    96
    ## Var1 --- rownames
    ## Var2 --- colnames

    ##(2) long to wide (restore)
    reshaped %>%
      dcast(Var1 ~ Var2, value.var = "Score") %>%
      head()
    #   Var1 Subject-A Subject-B Subject-C Subject-D
    # 1 Stu1        74        92        69        84
    # 2 Stu2        78        84        78        77
    # 3 Stu3        96        84        70        89

2. data.frame

2.1 Simple case

    scores_df = scores_mt %>%
      as.data.frame() %>%
      tibble::rownames_to_column("Name")
    class(scores_df)
    # [1] "data.frame"
    head(scores_df)
    #   Name Subject-A Subject-B Subject-C Subject-D
    # 1 Stu1        74        92        69        84
    # 2 Stu2        78        84        78        77
    # 3 Stu3        96        84        70        89

    ##(1) wide to long
    reshaped = scores_df %>%
      melt(id="Name",
           variable.name="Subject",
           value.name = "Score")
    head(reshaped)
    #   Name   Subject Score
    # 1 Stu1 Subject-A    74
    # 2 Stu2 Subject-A    78
    # 3 Stu3 Subject-A    96

    ##(2) long to wide (restore)
    reshaped %>%
      dcast(Name ~ Subject, value.var = "Score") %>%
      head()
    #    Name Subject-A Subject-B Subject-C Subject-D
    # 1  Stu1        74        92        69        84
    # 2 Stu10        76        75        93        76
    # 3  Stu2        78        84        78        77

2.2 Complex case

    scores_df_Anno = scores_df %>%
      dplyr::mutate(Class=paste0("Class",rep(c("01","02"), 5)),
                    Age=round(rnorm(10, 20, 1)), .before=2)
    head(scores_df_Anno)
    #   Name   Class Age Subject-A Subject-B Subject-C Subject-D
    # 1 Stu1 Class01  20        74        92        69        84
    # 2 Stu2 Class02  20        78        84        78        77
    # 3 Stu3 Class01  20        96        84        70        89

    ##(1) wide to long
    reshaped = scores_df_Anno %>%
      melt(id=c("Name","Class","Age"),
           variable.name="Subject",
           value.name = "Score")
    head(reshaped)
    #   Name   Class Age   Subject Score
    # 1 Stu1 Class01  20 Subject-A    74
    # 2 Stu2 Class02  20 Subject-A    78
    # 3 Stu3 Class01  20 Subject-A    96

    ##(2) long to wide (restore)
    reshaped %>%
      dcast(Name + Class + Age ~ Subject, value.var = "Score") %>%
      head()
    #    Name   Class Age Subject-A Subject-B Subject-C Subject-D
    # 1  Stu1 Class01  20        74        92        69        84
    # 2 Stu10 Class02  21        76        75        93        76
    # 3  Stu2 Class02  20        78        84        78        77

3. list

    scores_list = list(Stu1=c(1,2), Stu2=c(3,4), Stu3=c(5,6))
    # $Stu1
    # [1] 1 2
    #
    # $Stu2
    # [1] 3 4
    #
    # $Stu3
    # [1] 5 6

    melt(scores_list)
    #   value   L1
    # 1     1 Stu1
    # 2     2 Stu1
    # 3     3 Stu2
    # 4     4 Stu2
    # 5     5 Stu3
    # 6     6 Stu3