Channel: Intel Developer Zone Articles

ABySS for Intel® Xeon® Processors

Purpose

This code recipe describes how to get, build, and use the ABySS de novo assembly code for the Intel® Xeon® processor.

Introduction

Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble the vast amounts of data generated by large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, ABySS (Assembly By Short Sequences), a de novo, parallel, paired-end sequence assembler, was designed and developed for short reads.

The single-node version is useful for assembling genomes up to 100 Mbases in size. There is also a parallel version of ABySS implemented using MPI and capable of assembling larger genomes. The script abyss-pe will run a more comprehensive set of tools to process paired-end data.

The most current version of ABySS source files can be downloaded from http://www.bcgsc.ca/platform/bioinfo/software/abyss.

Support Software

Building the ABySS application requires the installation of the Boost* Graph Library and the Google* Sparsehash library. The current version of the Boost library can be found at www.boost.org; the current version of the Sparsehash library can be found at https://code.google.com/p/sparsehash/.

Build and Install Boost Graph Library

  1. Download the current version of the software from www.boost.org.
  2. Unzip and untar the downloaded file into a desired directory; change to the decompressed Boost directory.
  3. Execute the bootstrap.sh shell script with the --prefix parameter for the parent directory in which to install the library files. If you have admin privileges, you may wish to put this in some global directory such as /usr/local.
    $> ./bootstrap.sh --prefix=/usr/local
  4. Install the library files using the b2 script.
    $> sudo ./b2 install

Build and Install Sparsehash library

  1. Download the current version of the software from https://code.google.com/p/sparsehash/.
  2. Unzip and untar the downloaded file into a desired directory; change to the decompressed Sparsehash directory.
  3. Use configure to set up the build environment. The compiler and root directory for installation, among other options, can be given as command line arguments. For example,
    $> ./configure CC=icc CXX=icpc --prefix=/usr/local 
  4. Use make to build the library.
    $> make
  5. Install the library.
    $> sudo make install
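
Before configuring ABySS, it can help to confirm that both libraries landed where the later configure step will look for them. The following is a minimal sketch, assuming both were installed with --prefix=/usr/local as above. The helper name check_header is hypothetical, and the two header paths are assumptions about what a typical Boost and Sparsehash installation provides.

```shell
# Report whether a header exists under PREFIX/include.
# Usage: check_header PREFIX RELATIVE_HEADER_PATH
check_header() {
    if [ -f "$1/include/$2" ]; then
        echo "found:   $1/include/$2"
    else
        echo "missing: $1/include/$2"
        return 1
    fi
}

# Assumption: both libraries were installed with --prefix=/usr/local.
PREFIX=${PREFIX:-/usr/local}

# Header paths below are assumptions about a typical installation.
check_header "$PREFIX" boost/graph/adjacency_list.hpp || true
check_header "$PREFIX" google/sparse_hash_map || true
```

If either line reports "missing", recheck the --prefix value used for that library before building ABySS.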

Build and Install the ABySS Software

  1. Download the current version of the software from http://www.bcgsc.ca/platform/bioinfo/software/abyss.
  2. Unzip and untar the downloaded file into a desired directory; change to the decompressed ABySS directory.
  3. Use configure to set up the build environment; include your compiler choice, the location of the include files for the Boost library, any compiler flags you want to use, and the root directory in which you want to install the ABySS programs and documentation.
    $> ./configure CC=icc CXX=icpc   --with-boost=/usr/local/include     \
                 CPPFLAGS=-I/usr/local/include  --prefix=/usr/local 
  4. Use make to build the ABySS programs.
    $> make
  5. Install the programs and documentation.
    $> sudo make install     
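
After installation, a quick check that the executables used in the rest of this recipe can be found on PATH avoids confusing "command not found" errors later. This is a sketch only; the helper name missing_tools is hypothetical, and it assumes the install step placed the programs in the bin directory under the chosen prefix.

```shell
# Print each named command that cannot be found on PATH;
# return non-zero if any are missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# ABYSS and abyss-pe are the two programs this recipe runs.
missing_tools ABYSS abyss-pe \
    || echo "not on PATH; add the bin directory under the install prefix"
```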

Running ABySS on a Sample Yeast Dataset

As an example, the following steps describe how to download a short yeast dataset (ERR156523) and run the ABySS genome assembly software on the dataset.

  1. Download the two paired-end data files from http://www.ebi.ac.uk/ena/data/view/ERR156523&display=html. Direct URLs for the files are:
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/ERR156523/ERR156523_1.fastq.gz
    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR156/ERR156523/ERR156523_2.fastq.gz 
    • The files can be left as compressed FASTQ files or uncompressed before being passed to ABYSS.
  2. The following command will run the ABYSS executable with the data files from the previous step. This assumes that the command is run from the directory in which the data files reside; if this is not the case, add directory information to the file name parameters to locate these files from the execution directory.
    $> ABYSS -k57  --coverage-hist=coverage.hist \
             -s err156523-bubbles.fa -o err156523-1.fa\
             ERR156523_1.fastq.gz ERR156523_2.fastq.gz
     
    • The -k parameter sets the length of the k-mers to be used in the de Bruijn graph. The default maximum is 64. A good setting is some value over half the length of the input reads.
    • The above command will generate the files coverage.hist, err156523-bubbles.fa, and err156523-1.fa.
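
The rule of thumb for -k above (some value over half the read length, up to the default maximum of 64) can be sketched as a small helper. The function names first_read_length and suggest_k are hypothetical, and the sketch assumes all reads in the file have the same length as the first one. The result is only the smallest value that satisfies the rule; it is a starting point, not a tuned choice.

```shell
# Length of the first read in a FASTQ file (plain or gzip-compressed).
# Assumes four-line FASTQ records, so the sequence is on line 2.
first_read_length() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR == 2 { print length($0); exit }'
}

# Smallest k that is over half the read length, capped at a maximum k.
# Usage: suggest_k READ_LENGTH [MAX_K]   (MAX_K defaults to 64)
suggest_k() {
    k=$(( $1 / 2 + 1 ))
    max_k=${2:-64}
    if [ "$k" -gt "$max_k" ]; then
        k=$max_k
    fi
    echo "$k"
}

# Example (not run here):
#   k=$(suggest_k "$(first_read_length ERR156523_1.fastq.gz)")
#   ABYSS -k"$k" -o err156523-1.fa ERR156523_1.fastq.gz ERR156523_2.fastq.gz
```

In practice it is common to try several values of k at or above this lower bound and compare the resulting assemblies.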

Running ABySS Paired-End Analysis

An alternate execution that will utilize the ABYSS executable and many other tools in the ABySS suite is available when you have paired-end data (as given above). This uses the abyss-pe script file.

The following command will run the abyss-pe script with the data files from the previous steps. This assumes that the command is run from the directory in which the data files reside; if this is not the case, add directory information to the file name parameters to locate these files from the execution directory.

$> abyss-pe name=err156523 k=57 \
         in='ERR156523_1.fastq.gz ERR156523_2.fastq.gz'
  • The assembled contigs output will be stored in the file err156523-contigs.fa (using the name parameter from the command line).
  • The parameter in specifies the input files. The two reads of a pair must be named with the suffixes /1 and /2 to identify the first and second read. Alternatively, a single file with the paired reads interleaved can be used.
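
To illustrate what an interleaved input file looks like, the sketch below merges two FASTQ files so that each first read is immediately followed by its mate. The function name interleave_fastq and the output file name in the usage comment are hypothetical. The sketch assumes uncompressed input with exactly four lines per record and the same number of records, in the same order, in both files.

```shell
# Interleave two FASTQ files: record 1 of file A, record 1 of file B,
# record 2 of file A, record 2 of file B, and so on.
# Usage: interleave_fastq FIRST_READS.fastq SECOND_READS.fastq
interleave_fastq() {
    awk -v mates="$2" '
        { print }                      # copy every line of the first file
        NR % 4 == 0 {                  # after each complete 4-line record,
            for (i = 0; i < 4; i++)    # emit the next record of the mate file
                if ((getline line < mates) > 0) print line
        }
    ' "$1"
}

# Example (not run here):
#   gzip -dc ERR156523_1.fastq.gz > r1.fastq
#   gzip -dc ERR156523_2.fastq.gz > r2.fastq
#   interleave_fastq r1.fastq r2.fastq > err156523-interleaved.fastq
#   abyss-pe name=err156523 k=57 in=err156523-interleaved.fastq
```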

Build and Install the ABySS Software for Distributed Memory Execution

  1. Download the current version of the software from http://www.bcgsc.ca/platform/bioinfo/software/abyss.
  2. Unzip and untar the downloaded file into a desired directory; change to the decompressed ABySS directory.
  3. Use configure to set up the build environment; include your compiler choice, the location of the include files for the Boost library, the path to the home of the MPI files, any compiler flags you want to use, and the root directory in which you want to install the ABySS programs and documentation.
    $> ./configure CC=mpiicc CXX=mpiicpc  --with-boost=/usr/local/include     \
                --with-mpi=$MPI_HOME_DIR  CPPFLAGS=-I/usr/local/include       \
                --prefix=/usr/local
    • The above will use the Intel® MPI compiler script that employs the Intel® compiler. (The Open MPI library is the version recommended by the code authors.)
  4. Use make to build the ABySS programs.
    $> make
  5. Install the programs and documentation.
    $> sudo make install
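
The configure line above passes $MPI_HOME_DIR to --with-mpi; if that variable is unset, the option silently becomes empty. The following sketch checks the value before configure is run. The helper name check_mpi_home is hypothetical, and looking for include/mpi.h directly under the MPI home is an assumption; some MPI installations keep their headers in a different subdirectory.

```shell
# Succeed only if DIR is non-empty and contains include/mpi.h.
# Usage: check_mpi_home DIR
check_mpi_home() {
    if [ -z "$1" ]; then
        echo "MPI home directory is not set" >&2
        return 1
    fi
    if [ ! -f "$1/include/mpi.h" ]; then
        # Assumption: headers live in include/ under the MPI home.
        echo "no mpi.h under $1/include" >&2
        return 1
    fi
    echo "using MPI from $1"
}

# MPI_HOME_DIR is the variable used in the configure command above.
check_mpi_home "${MPI_HOME_DIR:-}" || true
```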

Running the Distributed Memory Version of ABySS

  1. The following instructions assume that you have downloaded the ERR156523 data files or some other appropriate data set.
  2. The following command will run the ABYSS executable in a distributed memory fashion. The launch is done via the appropriate “run” command used by the MPI library that was used to build the application. The command assumes that it is run from the directory in which the data files reside; if this is not the case, add directory information to the file name parameters to locate these files from the execution directory.
    $> mpirun -np 8 ABYSS-P -k57 --coverage-hist=coverage.hist \
              -s err156523-bubbles.fa -o err156523-1.fa \
              ERR156523_1.fastq.gz ERR156523_2.fastq.gz
     
    • The executable ABYSS-P is the MPI-enabled version of the ABYSS application.
    • The above will start 8 processes (-np 8) and divide up the input reads from the data files. Each process will use its set of reads to construct the overall de Bruijn graph.

Running ABySS Paired-End Analysis on a Distributed Memory Platform

An alternate execution that will utilize the ABYSS-P executable and many other tools in the ABySS suite is available when you have paired-end data (as given above). This uses the abyss-pe script file.

The following command will run the abyss-pe script using the ABYSS-P executable. (At the time of this writing, none of the other ABySS tools were written to run on multiple processes.)

$> abyss-pe name=err156523 k=57 np=8 \
         in='ERR156523_1.fastq.gz ERR156523_2.fastq.gz'

If compiled for distributed execution, you MUST NOT launch the script with something like “mpirun -np 8 abyss-pe”. The abyss-pe driver script will launch the MPI processes.
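
One way to guard against that mistake in a job script is to refuse to start abyss-pe when the shell already appears to be running under an MPI launcher. This is a sketch under stated assumptions: the function names are hypothetical, and the detection relies on environment variables that launchers commonly export to the processes they start (OMPI_COMM_WORLD_SIZE for Open MPI, PMI_SIZE for Hydra-based launchers). Other MPI libraries may use different variables.

```shell
# Return success if this shell appears to be running under an MPI launcher.
# Assumption: the launcher exports OMPI_COMM_WORLD_SIZE or PMI_SIZE.
inside_mpi_launch() {
    [ -n "${OMPI_COMM_WORLD_SIZE:-}" ] || [ -n "${PMI_SIZE:-}" ]
}

# Run abyss-pe with the given arguments unless already under mpirun.
run_abyss_pe() {
    if inside_mpi_launch; then
        echo "refusing to run abyss-pe under mpirun; pass np=N to abyss-pe instead" >&2
        return 1
    fi
    abyss-pe "$@"
}

# Example (not run here):
#   run_abyss_pe name=err156523 k=57 np=8 \
#       in='ERR156523_1.fastq.gz ERR156523_2.fastq.gz'
```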

About the Authors

Clay Breshears
Life Sciences Software Architect

Dr. Clay Breshears is a Life Sciences Software Architect for the Intel® Health & Life Sciences group. He currently works to parallelize and optimize genomic and bioinformatics codes. Prior to this, Clay was a Courseware Architect on the Innovative Software Education team, specializing in multi-core and multithreaded programming and training and working with university faculty to incorporate parallel programming as a natural part of the curriculums in Computer Science and other computational science fields of study. During his time with ISE, Clay was the co-host of the popular weekly online show "Parallel Programming Talk."

Sunny Gogar
Software Engineer

Sunny Gogar received a Master’s degree in Electrical and Computer Engineering from the University of Florida, Gainesville and a Bachelor’s degree in Electronics and Telecommunications from the University of Mumbai, India. He is currently a software engineer with Intel Corporation's Software and Services Group. His interests include parallel programming and optimization for multi-core and many-core processor architectures.

