Biopython – Sequence input/output

Last Updated: 22 Oct, 2020

Biopython's Bio.SeqIO module provides functions to read sequences from files and to write them back. It supports most of the sequence file formats used in bioinformatics. Whatever the input format, Bio.SeqIO presents each parsed sequence to the user in one uniform way, as a SeqRecord object.

SeqRecord

A SeqRecord object, provided by the Bio.SeqRecord module, holds a sequence together with its metadata. Its main attributes are listed below:

Attribute      Description
seq            The sequence itself, typically a Seq object.
id             Primary identifier of the sequence; a string by default.
name           Name of the sequence; a string by default.
description    Human-readable description of the sequence.
annotations    Dictionary of additional information about the sequence.
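
As a minimal sketch of the attributes listed above, a SeqRecord can also be built by hand. The sequence, identifier, and annotation values here are made up purely for illustration and do not come from any real database entry.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record: every value below is invented for illustration
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGC"),
    id="example_1",
    name="example",
    description="hypothetical example sequence",
    annotations={"molecule_type": "DNA"},
)

# Each attribute from the table is a plain Python attribute
print("seq:", record.seq)
print("id:", record.id)
print("name:", record.name)
print("description:", record.description)
print("annotations:", record.annotations)
print("length:", len(record.seq))
```

When a record comes from a file instead, the parser fills in the same attributes, so code that consumes records does not need to know which format they were read from.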

Reading Sequence:

The Bio.SeqIO module has a read() function which takes a sequence file and turns it into a single SeqRecord according to the file format. It can only parse files that contain exactly one record; if the file has no records or more than one record, a ValueError is raised. The syntax and arguments of read() are given below:

Bio.SeqIO.read(handle, format, alphabet=None)
Argument    Description
handle      Handle to the file, or the filename as a string (older versions accept only a handle).
format      File format as a lowercase string.
alphabet    Optional. Older versions used it to state the sequence type when it could not be inferred from the file (e.g. format = "fasta"). Since Biopython 1.78 it is no longer used and should be left as None.
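
The exactly-one-record rule can be checked with a short sketch. It assumes small, made-up FASTA strings wrapped in io.StringIO so that no file on disk is needed; the record names are hypothetical.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA data, for illustration only
one_record = ">seq1 first made-up record\nACGTACGT\n"
two_records = one_record + ">seq2 second made-up record\nTTGGCCAA\n"

# Exactly one record: read() returns a single SeqRecord
record = SeqIO.read(StringIO(one_record), "fasta")
print("ID:", record.id)
print("Length:", len(record))

# More than one record: read() raises ValueError
try:
    SeqIO.read(StringIO(two_records), "fasta")
    error_message = None
except ValueError as err:
    error_message = str(err)
print("read() failed with:", error_message)
```

Catching ValueError like this is one way to fall back to parse() when the number of records in a file is not known in advance.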

Python3




# Import libraries
from Bio import SeqIO
  
# Reading file
record = SeqIO.read("sequence.gb", "genbank")
  
# Showing records
print("ID: %s" % record.id)
print("Sequence length: %i" % len(record))
print("Sequence description: %s" % record.description)


Output:

Parsing Sequence:

The parse() function provided by the Bio.SeqIO module is used to read multiple records from a handle. It turns the sequence file into an iterator that yields SeqRecord objects one at a time. If the data is held in a string rather than in a file, it must first be wrapped in a handle (for example with io.StringIO) before it can be parsed. For file formats where the alphabet cannot be determined from the file (e.g. FASTA), older Biopython versions allowed it to be specified explicitly. The syntax and arguments of parse() are given below:

Bio.SeqIO.parse(handle, format, alphabet=None)
Argument    Description
handle      Handle to the file, or the filename as a string (older versions accept only a handle).
format      File format as a lowercase string.
alphabet    Optional. Older versions used it to state the sequence type when it could not be inferred from the file (e.g. format = "fasta"). Since Biopython 1.78 it is no longer used and should be left as None.
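
The two points above, parsing data held in a string and getting the records back as an iterator, can be sketched as follows. The FASTA text is made up for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text held in a Python string rather than in a file
fasta_text = ">seq1 first\nACGT\n>seq2 second\nGGCCTTAA\n>seq3\nAT\n"

# Wrap the string in a handle so that parse() can read from it
records = SeqIO.parse(StringIO(fasta_text), "fasta")

# parse() returns an iterator: records are produced one at a time
first = next(records)
print("First record:", first.id, len(first))

# The remaining records can be collected into a list if needed
remaining = list(records)
for record in remaining:
    print("Next record:", record.id, len(record))
```

Because the iterator is consumed as it is read, the records should be stored in a list if they are needed more than once.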

Python3




# Import libraries
from Bio import SeqIO
  
# Parsing file
filename = "sequence.fasta"
for record in SeqIO.parse(filename, "fasta"):
  
    # Showing records
    print("ID: %s" % record.id)
    print("Sequence length: %i" % len(record))
    print("Sequence description: %s" % record.description)


Output :

Writing Sequence:

For writing to a file, the Bio.SeqIO module has a write() function, which writes a set of sequences to the file and returns an integer giving the number of records written. If you pass a handle, make sure to close it after calling write(); otherwise the data may not be flushed to disk. The syntax and arguments of write() are given below:

Bio.SeqIO.write(sequences, handle, format)
Argument     Description
sequences    List or iterator of SeqRecord objects (or a single SeqRecord in Biopython version 1.54 or later).
handle       Handle to the file, or the filename as a string (older versions accept only a handle).
format       File format to write, as a lowercase string.
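
The return value of write() can be inspected with a small sketch. It assumes two made-up records and writes them to an in-memory io.StringIO handle instead of a file, so the FASTA text can be printed directly.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records, invented for illustration
records = [
    SeqRecord(Seq("ACGT"), id="demo1", description="made-up record"),
    SeqRecord(Seq("GGCC"), id="demo2", description="made-up record"),
]

# Write to an in-memory handle; write() returns the number of records written
buffer = StringIO()
count = SeqIO.write(records, buffer, "fasta")
fasta_text = buffer.getvalue()

print("Records written:", count)
print(fasta_text)
```

Checking the returned count against the expected number of records is a cheap way to confirm that nothing was silently skipped.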


Python3




# Import libraries
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
  
rec1 = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
                     + "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
                     + "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"),
                 id="gi|14150838|gb|AAK54648.1|AF376133_1",
                 description="chalcone synthase [Cucumis sativus]")
  
rec2 = SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
                     + "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
                     + "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
                     + "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
                     + "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
                     + "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
                     + "TGEGLEWGVLFGFGPGLTVETVVLHSVAT"),
                 id="gi|13925890|gb|AAK49457.1|",
                 description="chalcone synthase [Nicotiana tabacum]")
sequences = [rec1, rec2]
  
# Writing to file
with open("example.fasta", "w") as output_handle:
    SeqIO.write(sequences, output_handle, "fasta")
  
for record in SeqIO.parse("example.fasta", "fasta"):
    print("ID %s" % record.id)
    print("Sequence length %i" % len(record))


Output:

