Sequence Cleaner

(Difference between revisions)
Jump to: navigation, search
(Script)
Line 41: Line 41:
  
 
"You should call the function 'sequence_cleaner', there are 3 basics parameters:
 
"You should call the function 'sequence_cleaner', there are 3 basics parameters:
"                                               #1st: your fasta file  
+
"                                 #1st: your fasta file  
"                                               #2nd: the user define the minimum length (default value 0 ( means  you dont care to minimum length)
+
"                                 #2nd: the user define the minimum length (default value 0 ( means  you dont care to minimum length)
"                                               #3rd: the user define the % of N is allows (default value 100 ( means  you dont care to 'N' in your sequences))
+
"                                 #3rd: the user define the % of N is allows (default value 100 ( means  you dont care to 'N' in your sequences))
 
"FYI If You dont care to the 2nd and the 3rd parameters you just gonna remove the duplicate sequences
 
"FYI If You dont care to the 2nd and the 3rd parameters you just gonna remove the duplicate sequences
  
 
"
 
"
 
</python>
 
</python>

Revision as of 18:15, 9 July 2011

Description

I wanna share my script using biopython to clean sequences up , you should know analyzing poor data takes CPU time and interpreting the results from poor data takes people time, so always is importat make a preprocessing.

Let me call my script as “Sequence_cleaner” and the big idea is to remove duplicate sequences, remove sequence too short ( the user define the minimum length) and remove sequences which has too much unknown nucleotides (N) ( the user define the % of N is allows ) and in the end the use can choose if he/she wanna have a file as output or print the result.

Script

from Bio import SeqIO
 
def sequence_cleaner(fasta_file,min_length=0,por_n=100):
    #create our hash table to add the sequences
    sequences={}
 
    #Using the biopython fasta parse we can read our fasta input
    for seq_record in SeqIO.parse(fasta_file, "fasta"):
        #Take the current sequence
        sequence=str(seq_record.seq).upper()
        # Check if the current sequence and be in the hash table as a "clean" sequence according with the user parameters
        if (len(sequence)>=min_length and (float(sequence.count("N"))/float(len(sequence)))*100<=por_n):
            # If the sequence passed in the test "is It clean?" and It isnt in the hash table , the sequence and Its id gonna be in the hash table
            if sequence not in sequences:
                sequences[sequence]=seq_record.id
            # if It is in the hash table , We just gonna take its ID and concact with the Id of other sequence that was exaclty the same.
            else:
                sequences[sequence]+="_"+seq_record.id
 
 
    #Write the clean sequences
 
    #Create a file in the same directory where you ran this script
    output_file=open("clear_"+fasta_file,"w+")
    #Just Read the Hash Table and write on the file as a fasta format
    for sequence in sequences:
            output_file.write(">"+sequences[sequence]+"\n"+sequence+"\n")
    output_file.close()
 
 
 
#sequence_cleaner("Aip_coral.fasta",10,10)
 
"You should call the function 'sequence_cleaner', there are 3 basics parameters:
"                                  #1st: your fasta file 
"                                  #2nd: the user define the minimum length (default value 0 ( means  you dont care to minimum length)
"                                  #3rd: the user define the % of N is allows (default value 100 ( means  you dont care to 'N' in your sequences))
"FYI If You dont care to the 2nd and the 3rd parameters you just gonna remove the duplicate sequences
 
"
Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox