Parsing the KEGG Database with Biopython

2021.01.09

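KEGG (Kyoto Encyclopedia of Genes and Genomes) is often described as an encyclopedia of genomes. It is a comprehensive resource made up of several sub-databases, including gene and pathway. To make KEGG data easier to query, the maintainers provide an official API.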

Biopython wraps the official KEGG API in its Bio.KEGG module, so the API can be used directly from Python. KEGG API requests map to Python calls as follows:

/list/hsa:10458+ece:Z5100 -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
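To make the mapping concrete, here is a minimal sketch of how such a call could be turned into a request path. The helper name kegg_url and the base URL constant are assumptions made for illustration; this is not Biopython's own implementation, only a way to see that list arguments are joined with "+" and the remaining arguments become path segments.

```python
# Assumed base URL of the KEGG REST service (illustrative constant).
KEGG_BASE = "https://rest.kegg.jp"


def kegg_url(operation, *args):
    """Build a KEGG REST URL (hypothetical helper, not part of Biopython).

    A list or tuple argument is joined with '+', mirroring how several
    entry identifiers are combined in a single request.
    """
    parts = [operation]
    for arg in args:
        if isinstance(arg, (list, tuple)):
            arg = "+".join(arg)
        parts.append(str(arg))
    return KEGG_BASE + "/" + "/".join(parts)


print(kegg_url("list", ["hsa:10458", "ece:Z5100"]))
print(kegg_url("find", "compound", "300-310", "mol_weight"))
print(kegg_url("get", ["hsa:10458", "ece:Z5100"], "aaseq"))
```

The three printed paths end in the same /list, /find, and /get requests shown in the mapping above.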

With the REST module you can download any type of data the API supports. Taking a pathway as an example:

>>> from Bio.KEGG import REST
>>> pathway = REST.kegg_get('hsa00010')

The query returns a handle, and calling its read method gives the content as plain text:

>>> pathway = REST.kegg_get('hsa00010')
>>> res = pathway.read().split("\n")
>>> res[0]
'ENTRY hsa00010 Pathway'
>>> res[1]
'NAME Glycolysis / Gluconeogenesis - Homo sapiens (human)'
>>> res[2]
'DESCRIPTION Glycolysis is the process of converting glucose into pyruvate and generating small amounts of ATP (energy) and NADH (reducing power). It is a central pathway that produces important precursor metabolites: six-carbon compounds of glucose-6P and fructose-6P and three-carbon compounds of glycerone-P, glyceraldehyde-3P, glycerate-3P, phosphoenolpyruvate, and pyruvate [MD:M00001]. Acetyl-CoA, another important precursor metabolite, is produced by oxidative decarboxylation of pyruvate [MD:M00307]. When the enzyme genes of this pathway are examined in completely sequenced genomes, the reaction steps of three-carbon compounds from glycerone-P to pyruvate form a conserved core module [MD:M00002], which is found in almost all organisms and which sometimes contains operon structures in bacterial genomes. Gluconeogenesis is a synthesis pathway of glucose from noncarbohydrate precursors. It is essentially a reversal of glycolysis with minor variations of alternative paths [MD:M00003].'
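The returned text follows the KEGG flat-file layout, where a section keyword occupies the first 12 columns and continuation lines leave those columns blank. As a rough sketch under that assumption, the lines can be grouped by section with a small helper. The function name split_sections and the sample text are made up for illustration; in practice the input would be the string returned by read.

```python
def split_sections(flat_text):
    """Group the lines of one KEGG flat-file entry by section keyword.

    Assumes the keyword sits in the first 12 columns and that lines with
    blank leading columns continue the previous section.
    """
    sections = {}
    current = None
    for line in flat_text.rstrip().split("\n"):
        if line.startswith("///"):  # end-of-entry marker
            break
        keyword = line[:12].strip()
        if keyword:
            current = keyword
        if current is None:
            continue
        sections.setdefault(current, []).append(line[12:].strip())
    return sections


# Illustrative sample, padded so that each keyword fills 12 columns.
sample = "\n".join([
    "ENTRY".ljust(12) + "hsa00010                    Pathway",
    "NAME".ljust(12) + "Glycolysis / Gluconeogenesis - Homo sapiens (human)",
    "GENE".ljust(12) + "3101  HK3; hexokinase 3",
    " " * 12 + "3098  HK1; hexokinase 1",
    "///",
])

sections = split_sections(sample)
print(sections["NAME"][0])
print(sections["GENE"])
```

With a real entry, the same call would be split_sections(REST.kegg_get('hsa00010').read()).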

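From here, plain string parsing is enough to pull out the pathway's identifier, name, description, and other fields. Biopython also ships dedicated parsers for KEGG data, but the coverage is incomplete: at present only sub-databases such as compound, map, and enzyme are covered. Using the enzyme database as an example: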

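>>> from Bio.KEGG import REST, Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> # Save the record to a local file, then parse it with the Enzyme parser
>>> with open("ec_5.4.2.2.txt", "w") as handle:
...     _ = handle.write(request.read())
...
>>> with open("ec_5.4.2.2.txt") as handle:
...     record = list(Enzyme.parse(handle))[0]
...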
>>> record
<Bio.KEGG.Enzyme.Record object at 0x02EE7D18>
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

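Biopython does more than expose the KEGG API in Python. More importantly, Python's control flow lets you build complex filtering logic on top of it, for example finding the human genes involved in DNA repair. The basic approach is: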

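1. Use the list API to get the identifiers of all human pathways;

2. Use the get API to fetch each pathway, parse its description, and keep the pathways in which the keyword "repair" appears;

3. For each selected pathway, parse the text to extract the genes that belong to it.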

The complete code:

>>> from Bio.KEGG import REST
>>> human_pathways = REST.kegg_list("pathway", "hsa").read()
>>> repair_pathways = []
>>> for line in human_pathways.rstrip().split("\n"):
...  entry, description = line.split("\t")
...  if "repair" in description:
...   repair_pathways.append(entry)
...
>>> repair_pathways
['path:hsa03410', 'path:hsa03420', 'path:hsa03430']
>>> repair_genes = []
>>> for pathway in repair_pathways:
...  pathway_file = REST.kegg_get(pathway).read()
...  current_section = None
...  for line in pathway_file.rstrip().split("\n"):
...   section = line[:12].strip()
...   if not section == "":
...    current_section = section
...   if current_section == "GENE":
...    gene_identifiers, gene_description = line[12:].split("; ")
...    gene_id, gene_symbol = gene_identifiers.split()
...    if gene_symbol not in repair_genes:
...     repair_genes.append(gene_symbol)
...
>>> repair_genes
['OGG1', 'NTHL1', 'NEIL1', 'NEIL2', 'NEIL3', 'UNG', 'SMUG1', 'MUTYH', 'MPG', 'MBD4', 'TDG', 'APEX1', 'APEX2', 'POLB', 'POLL', 'HMGB1', 'XRCC1', 'PCNA', 'POLD1', 'POLD2', 'POLD3', 'POLD4', 'POLE', 'POLE2', 'POLE3', 'POLE4', 'LIG1', 'LIG3', 'PARP1', 'PARP2', 'PARP3', 'PARP4', 'FEN1', 'RBX1', 'CUL4B', 'CUL4A', 'DDB1', 'DDB2', 'XPC', 'RAD23B', 'RAD23A', 'CETN2', 'ERCC8', 'ERCC6', 'CDK7', 'MNAT1', 'CCNH', 'ERCC3', 'ERCC2', 'GTF2H5', 'GTF2H1', 'GTF2H2', 'GTF2H2C_2', 'GTF2H2C', 'GTF2H3', 'GTF2H4', 'ERCC5', 'BIVM-ERCC5', 'XPA', 'RPA1', 'RPA2', 'RPA3', 'RPA4', 'ERCC4', 'ERCC1', 'RFC1', 'RFC4', 'RFC2', 'RFC5', 'RFC3', 'SSBP1', 'PMS2', 'MLH1', 'MSH6', 'MSH2', 'MSH3', 'MLH3', 'EXO1']
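The loop above assumes that every line in the GENE section has the form "id  symbol; description". A line without "; " would make the unpacking fail. A more defensive variant, sketched below, skips such lines instead. The function name gene_symbols and the sample text are assumptions for illustration, and the last GENE line of the sample is deliberately malformed.

```python
def gene_symbols(flat_text):
    """Collect unique gene symbols from the GENE section of a KEGG entry.

    Lines that lack a ';' separator, or whose identifier part is not
    exactly 'id symbol', are skipped rather than raising an error.
    """
    symbols = []
    in_gene = False
    for line in flat_text.rstrip().split("\n"):
        keyword = line[:12].strip()
        if keyword:
            in_gene = keyword == "GENE"
        if not in_gene:
            continue
        head, sep, _ = line[12:].partition(";")
        identifiers = head.split()
        if not sep or len(identifiers) != 2:
            continue  # not in the expected 'id symbol; description' form
        symbol = identifiers[1]
        if symbol not in symbols:
            symbols.append(symbol)
    return symbols


# Illustrative sample: one duplicate and one line without a separator.
sample_entry = "\n".join([
    "GENE".ljust(12) + "4968  OGG1; 8-oxoguanine DNA glycosylase",
    " " * 12 + "4913  NTHL1; nth like DNA glycosylase 1",
    " " * 12 + "4968  OGG1; 8-oxoguanine DNA glycosylase",
    " " * 12 + "0000  line without a symbol separator",
    "COMPOUND".ljust(12) + "C00001  H2O",
    "///",
])

print(gene_symbols(sample_entry))
```

To reproduce the search above, the function could be called on REST.kegg_get(pathway).read() for each entry in repair_pathways, merging the returned lists.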

With Biopython, the KEGG API can be used far more efficiently: combining the API's data access with Python's logic makes it possible to meet custom analysis needs.

·end·
