我安装的是:Ubuntu 20.04.6 LTS (GNU/Linux 5.15.90.4-microsoft-standard-WSL2 x86_64)
自带python3.8.10
hann@HannYang:~$ python3
Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
htseq软件我没用过,请参阅:
(1)安装htseq软件
conda install htseq
(2)htseq-count ,下载酵母基因注释文件gff格式,或者gtf格式,只要把FTP地址中改成gff,gtf即可。
路径:~/data/yeastproject/ref/Saccharomyces_cerevisiae.R64-1-1.47.gff3
(3)新建一个文件夹read_count,在analysis目录下
(4)htseq-count
路径:~/analysis/read_aln/3b_strain_2.srt.bam
htseq-count -f bam -r pos ~/analysis/read_aln/3b_strain_2.srt.bam ~/data/yeastproject/ref/Saccharomyces_cerevisiae.R64-1-1.47.gtf > 3b_strain_2.count.tab
输出一个3b_strain_2.count.tab文件
将两个空白对照和三个样本数据都进行一个计数处理
(5)将5个计数后的count.tab文件进行合并:
用python写脚本的方式
在read_counts目录下新建一个scripts文件夹,在此文件夹里写入一个脚本文件
vi merge_read_count.py
建好这个文本文档,准备在里面输入编程程序
(6)read_count下,在linux命令行里输入
python scripts/merge_read_count.py EV_strain_3.count.tab,EV_strain_4.count.tab,3b_strain_2.count.tab,3b_strain_3.count.tab,3b_strain_4.count.tab EV_strain_3,EV_strain_4,3b_strain_2,3b_strain_3,3b_strain_4
意思是:python这个脚本文件(.py),把五个.tab文件依次对应输出名字为这几个名字。一一对应不要搞错。
每次修改.py里的内容,python出来的结果都会不一样。
(7)写程序
vi xxxxxxxxxx.py
import sys
sample_counts = sys.argv[1]
sample_names = sys.argv[2]
count_files = sample_counts.split(",")
sample_ids = sample_names.split(",")
#print(count_files)
count_dict = {}
for count_file in count_files:
with open(count_file) as count:
for line in count:
if line.startwith("__"):
continue
line = line.rstrip("\n")
ele = line.split("\t")
#print(ele)
if ele[0] in count_dict:
count_dict[ele[0]].append(ele[1])
else:
count_dict[ele[0]] = [ele[1]]
print("gene_id\t" + "\t".join(sample_ids))
for gene_id in count_dict:
#print(count_dict[gene_id])
print(gene_id + "\t" + "\t".join(count_dict[gene_id]))
网上搬来的,如有帮助请给个采纳。谢谢