Learn how to extract genes from whole genome

Learn how to extract genes from whole genome

in
Table of contents

Whether you want to investigate gene functions or explore genetic variations, extracting a gene or another subsequence from a genome assembly is a common task that you will often be doing. In this post, we will explore how we can easily extract a gene from a FASTA file by using its coordinates in a GFF file. In this particular case, we will use the publicly available genome assembly and annotation of Arabidopsis thaliana.

Setup

Software

For this, you will need Python as well as the Biopython and argparse packages. Once Python is installed on your system, you can easily get them with:

python3 -m pip install bio argparse

or with this line if you are not an administrator of your system (the --user flag is used to install the package only for your user):

python3 -m pip install --user bio argparse

Input files

To download the genome assembly and annotation of Arabidopsis thaliana, visit the NCBI datasets website and click on the Download button. Be sure to select Genome sequences (FASTA) and Annotation features (GFF) before clicking on Download at the bottom of the popup.

Popup to download the dataset

Once the file is downloaded, simply unzip it somewhere so you can follow along. Inside the data folder, you will find two folders named GCA_000001735.2 and GCF_000001735 (at the time of writing, yours may be different). The GCA one contains the Genbank assembly while the GCF one is the RefSeq assembly. There is no sequence difference between the two (take a look at this if you want to know more about it) so we will go with the RefSeq assembly.

Coding our solution

Goal of the project

Before starting to code anything, it is important to set goals so you can be prepared for what you have to code. Here we want to extract a single gene from a genome assembly so our code will have 4 inputs:

  1. the path to the assembly we want to extract the gene from
  2. the name of the sequence the gene is on
  3. the beginning of the gene
  4. the end of the gene

Example data

By looking at the GFF file (GCF_000001735.4/genomic.gff) I selected a gene at random, specifically this one:

NC_003070.9     RefSeq  gene    51953   54737   .       +       .       ID=gene-AT1G01110;Dbxref=Araport:AT1G01110,TAIR:AT1G01110,GeneID:839394;N
ame=IQD18;gbkey=Gene;gene=IQD18;gene_biotype=protein_coding;gene_synonym=IQ-domain 18,T25K16.10,T25K16_10;locus_tag=AT1G01110

This line tells us that the gene gene-AT1G01110 is located on the NC_003070.9 chromosome, it starts at coordinate 51953 and ends at 54737. Now that we know that, let’s get to work!

Command line interface

We want to make things as clean as we can so we will design a nice way for the user to interact with our program. To do this, we will use the argparse python package which provides a neat way to get parameters from the command line.

First, import the python package and declare our new program:

import argparse

# This code will not be executed if another script tries to import it
# It is good practice to include it in our main script
if __name__ == "__main__":    
    parser = argparse.ArgumentParser(description="Extracts a subsequence from a FASTA file.")

Now that we have a description set, let’s go back to the Goal of the project part and remember that we wanted four input parameters. We’ll add these parameters and set them as required. We will have a -g for the input genome, -n for the name of the sequence that contains our gene, -s for the start coordinate and -e for the end coordinate:

    parser.add_argument("-g", action="store", dest="genome", required=True, help="Path to an input genome in FASTA format")
    parser.add_argument("-n", action="store", dest="name", required=True, help="Name of the sequence the subsequence is located on")
    parser.add_argument("-s", action="store", dest="start", required=True, help="Start coordinate")
    parser.add_argument("-e", action="store", dest="end", required=True, help="End coordinate")

Finally, parse the command line parameters and save the result to a variable name args:

    args = parser.parse_args()

Sanity checks

We want to be extra careful with user-provided data and so we will check that everything looks correct before trying to extract anything. Here is what we will check:

  • does the input genome exist?
  • is it readable?
  • are the coordinates valid? Are start or end less than 0? Is end inferior to start?

Let’s first check that the file exists and is readable with the os module. We will crash if the input file is invalid. Add an import to the top of your file:

import os
import sys

and this code after the arguments parsing:

    # Check that the genome exists and is readable
    if not os.path.exists(args.genome):
        print("Input genome file does not exist")
        sys.exit(1)
    elif not os.access(args.genome, os.R_OK):
        print("You do not have the permissions to read the input genome file")
        sys.exit(1)

Then, we want to verify that coordinates are numeric and valid:

    # Check that start and end are numeric
    try:
        args.start = int(args.start)
        args.end = int(args.end) 
    except:
        print("Conversion error: '-b' and '-e' should be numeric")
        sys.exit(1)

    if args.end < args.start:
        print("End coordinate should be superior to the the start coordinate")
        sys.exit(1)
    if args.start < 0 or args.end < 0:
        print("Start and end coordinates should be superior to 0")
        sys.exit(1)

Extracting the gene

Now that we checked that everything looks okay, we can finally try to extract our gene. To do that, we will parse all sequences in the input genome. If the name of the sequence does not match, we skip it. Otherwise, we extract the sequence that goes from start to end, without forgetting to check that end is inferior to the sequence size (the user could have mistyped something!).

First, import the Biopython module:

from Bio import SeqIO

Then, open the file in read mode and parse the sequences:

    with open(args.genome) as genome:
        for record in SeqIO.parse(genome, "fasta"):

Skip unwanted sequences:

            # Skip the sequence if the ID is not the one we are looking for
            if record.id != args.name:
                continue

Check that the end is valid if the sequence is the one we want or crash otherwise:

            # Check that end is not superior to the sequence length
            if args.end > len(record.seq):
                print(f"End coordinate is superior to sequence length: {args.end} > {len(record.seq)}")
                sys.exit(1)

If everything looks good, we extract the part of the sequence that we want, print it and break out of the loop:

            gene = SeqRecord(record.seq[args.start:args.end], id="gene")
            print(gene.format("fasta"))
            break

Our program is now complete, you can test it with (the > sign is to redirect the output of the program to a file of our choice):

python3 name_of_your_script.py -n NC_003070.9 -g GCF_000001735.4_TAIR10.1_genomic.fna -s 51953 -e 54737 > AT1G01110.fasta

Full code

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

import argparse
import os
import sys

if __name__ == "__main__":    
    parser = argparse.ArgumentParser(description="Extracts a subsequence from a FASTA file.")

    parser.add_argument("-g", action="store", dest="genome", required=True, help="Path to an input genome in FASTA format")
    parser.add_argument("-n", action="store", dest="name", required=True, help="Name of the sequence the subsequence is located on")
    parser.add_argument("-s", action="store", dest="start", required=True, help="Start coordinate")
    parser.add_argument("-e", action="store", dest="end", required=True, help="End coordinate")

    args = parser.parse_args()

    # Check that the genome exists and is readable
    if not os.path.exists(args.genome):
        print("Input genome file does not exist")
        sys.exit(1)
    elif not os.access(args.genome, os.R_OK):
        print("You do not have the permissions to read the input genome file")
        sys.exit(1)

    try:
        args.start = int(args.start)
        args.end = int(args.end) 
    except:
        print("Conversion error: '-b' and '-e' should be numeric")
        sys.exit(1)

    if args.end < args.start:
        print("End coordinate should be superior to the the start coordinate")
        sys.exit(1)
    if args.start < 0 or args.end < 0:
        print("Start and end coordinates should be superior to 0")
        sys.exit(1)

    with open(args.genome) as genome:
        for record in SeqIO.parse(genome, "fasta"):
            # Skip the sequence if the ID is not the one we are looking for
            if record.id != args.name:
                continue

            # Check that end is not superior to the sequence length
            if args.end > len(record.seq):
                print(f"End coordinate is superior to sequence length: {args.end} > {len(record.seq)}")
                sys.exit(1)
                
            gene = SeqRecord(record.seq[args.start:args.end], id="gene")
            print(gene.format("fasta"))
            break