VECT: an automatic visual Perl programming
tool for nonprogrammers
Hui-Hsien Chou
BioTechniques 38:615-621 (April 2005)
Modern high-throughput biological research produces enormous amounts of data that must be processed by computers, but many
biologists dealing with these data are not professional programmers. Despite increased awareness of interdisciplinary training in
bioinformatics, many biologists still find it difficult to create their own computational solutions. VECT, the Visual Extraction and
Conversion Tool, has been developed to help nonprogrammers create simple bioinformatics programs without having to master a
programming language. VECT provides a unified graphical user interface for data extraction, data conversion, output composition, and
Perl code generation. Programming using VECT is achieved by visually performing the desired data extraction, conversion, and
output composition tasks using some sample user data. These tasks are then compiled by VECT into an executable Perl program, which
can be saved for later use and can carry out the same computation independently of VECT. VECT is released under the GNU General
Public License and is freely available for all major computing platforms, including Macintosh OS X, Linux, and Microsoft
Windows, at www.complex.iastate.edu.
INTRODUCTION

In the genomics and postgenomics eras, biologists frequently need to process a lot of biological data. Usually, biologists know how their data can be manually handled, but only a few of them are well versed enough in computer science to turn that knowledge into executable code. Powerful bioinformatic tools have been created to solve truly difficult and well-defined problems in computational biology. However, not all needs of biologists are as generic. Actually, most of the time biologists need some bridging programs to connect existing bioinformatic tools together to form their data processing pipeline. These generally involve data extraction, conversion, and reporting tasks that are very specific to their ongoing research. Creating these bridging programs is easy for experienced programmers, but to nonprogrammers, this work can be detrimental and slow.

The author believes this limiting factor of modern biological research can be resolved in a creative manner. In this paper, a visual programming tool, Vect (the Visual Extraction and Conversion Tool), is introduced. It allows users to manipulate their sample data inside its user interface, and then generates Perl programs to replicate these tasks (1). Anyone who needs to process textual format data can potentially benefit from using Vect.

MATERIALS AND METHODS

Vect employs a data flow programming paradigm that is different from the control flow programming paradigm more familiar to programmers. An example problem of extracting the translated protein sequences of predicted open reading frames from a GenBank® report (www.ncbi.nlm.nih.gov) is used to illustrate the difference between the two programming paradigms. Suppose both the names and sequences of the proteins must be extracted. These data are delimited by the /protein_id and /translation= tags embedded inside the CDS regions in the report. To extract them, a programmer might follow the control flow logic shown in Figure 1. A main loop scans through all input lines. Each input line is checked against the name and protein delimiter tags. For the name of a protein, its quoted string must be extracted. For each protein sequence, things are a bit more complex since it can span several input lines, so an inner loop is run to collect all its parts until the end of the sequence is seen. Subsequently, all quotation marks must be removed, and only when both the name and the sequence of a protein have been collected will output be produced in the FASTA format (www.ncbi.nlm.nih.gov/blast/html/search.html). The scanning continues until all input lines have been seen.

There is nothing wrong with the control flow programming paradigm. In fact, most programmers take it for granted. However, the data flow programming paradigm shown in Figure 1 seems to be an easier approach for nonprogrammers to follow. In this paradigm, focus is placed on how input data can be extracted and processed, disregarding the order of their arrival. For example, obtaining protein names and obtaining protein sequences are considered two unrelated processes. A user simply needs to define the steps to extract and process them separately (e.g., protein names have to be taken out of quotes, and protein sequences have to be concatenated and then also taken out of quotes). Output is produced using an
output template. Therefore a user does not need to worry whether the name or the sequence of a protein reaches the output template first; the user only needs to know that when they have both arrived, they will be output together using the template as defined.

Iowa State University, Ames, IA, USA
Vol. 38, No. 4 (2005) BioTechniques 615
RESEARCH REPORT

Although data flow programming used to refer to specialized hardware and software that have never been in widespread use (2), in Vect it is simply the programming method adopted to ease the user's programming effort. The data flow programming paradigm naturally leads to an example-driven programming style. In Vect, programming is achieved by letting users handle some sample data in its interface. This is similar to using an editor or a spreadsheet program. Vect then translates user actions into executable Perl code expressed in the control flow programming paradigm.

Example-driven programming started in the 1960s and 1970s with the RPG (3) and COBOL (4) programming languages that allowed programmers to format reports using output templates. In the 1980s, spreadsheet programs (5,6) were invented that allowed users to program number-crunching jobs by inserting formulas in some cells and copying them to the others. In the 1990s, the popularity of GUI programs ignited the development of several object-oriented interface libraries and their associated rapid application development (RAD) tools (7,8). These tools allow a user interface to be designed graphically, almost like drawing in a graphic program. The design is then converted into program code that can recreate the interface at runtime. To sum up, programming by examples is not new, but applying this concept to the data extraction, conversion, and formatting needs of biologists is new to the best of our knowledge. In the following example, the author demonstrates how this form of programming in Vect can help biologists create Perl programs.

RESULTS

Vect Programming Tutorial

It is helpful to demonstrate how Vect works from a user's perspective before each part of it is explained in detail. The protein extraction problem mentioned earlier is used as an example again. To begin with, a GenBank file is loaded directly into the Input Data panel of Vect. The first thing to grab is the protein sequence, so we use the right mouse button to click and drag over /translation=" to set it as an opening block tag. We also set the ending double quote " as the closing block tag. This defines regions in the input file where data can be selected. Since the ending double quote is not in the same position for each protein sequence, we change both the opening and closing tags to position-independent. This allows all protein sequences in the input to be identified and selected. The result is shown in Figure 2A, where pink regions are not selectable, green- and red-colored texts are the opening and closing block tags, respectively, and grey regions are the text actually selected.

The selected data are sent to the Convert Data panel by clicking the Move button. The initial data set is named Protein Parts in the Convert Data panel. Here, additional rules can be added to convert the data. Specifically, we need to add a concatenation rule to connect the broken protein parts into complete protein sequences, and then a quoted data extraction rule is needed to remove the /translation=" and " tags that are not part of the proteins. The results are shown in Figure 2B. The resulting data set Pure Proteins can be copied to the Output Data panel by clicking the Copy button. In Figure 2C, similar extraction and tag removal steps are defined for protein names. The resulting data set is named Pure Names. Note that yellow-colored texts in this figure indicate line-based selection tags.

To produce the desired output, both Pure Names and Pure Proteins are copied to the Output Data panel. This panel has both Template and Output views. To compose a correct FASTA output, we need to add a greater-than symbol (>) in front of the Pure Names and separate it from Pure Proteins by a new line (see Figure 2D). This FASTA-required symbol is not to be confused with the pair of arrows enclosing data set names. The former is static text to output, but the latter is a placeholder for data sets. This formatting is taken by Vect as a template to group each pair of data from the two sets to produce the output. The results can be checked in the Output view shown in Figure 2E. If they are correct, we can finally go to the Perl Program panel shown in Figure 2F and click the Compile button to obtain a Perl program that can reproduce the same operations. This Perl program can be saved for later use and can work on other GenBank files with similar contents.

Data Extraction

The Input Data panel of Vect allows users to define the extraction of useful data from input files. It is designed to handle semi-structured text files commonly produced by online databases. Selection can be based on fields, in which each field is a sequence of non-whitespace characters separated by characters such as tabs or spaces. To select an entire field, just click on the field. Selected data are always highlighted with a grey background color. Selection can also be based on positions relative to either a field or an entire line. By clicking and dragging over a range of characters, a position-based selection is made. If the selected characters are completely contained within a field, then the position selection is relative to the same field on each line. Otherwise, or if the shift key is pressed while dragging, the position selection is relative to the entire line.

Selections can be restricted by designated tags in the input. Tags do not select data per se, but they help define the desired data that are to be selected. Tags can be block opening, block closing, or simply line tags. A line tag allows only lines containing it to be selected. The opening and closing tags enclose a region for selection, but they do not have to be paired. If an opening tag is followed by another opening tag, the second tag defines a new selection region (i.e., it functions both as a closing tag for the previous region and as an opening tag for its own region). All tags are defined by using the right mouse button to select text, but otherwise they are selected exactly the same way as regular text (i.e., tags can also be field- or position-based).
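The field and tag rules described in this section can be illustrated with a short sketch. It is written in Python purely for illustration and is not Vect's implementation (Vect itself emits Perl). The function names select_field, filter_by_line_tag, and select_blocks are hypothetical, the sample strings are made up, and returning the text between the tags (rather than including the tags) is an assumption of this sketch.

```python
def select_field(line, index):
    """Field-based selection: fields are runs of non-whitespace characters."""
    fields = line.split()
    return fields[index] if index < len(fields) else None


def filter_by_line_tag(lines, tag):
    """Line tag: only lines containing the tag remain selectable."""
    return [ln for ln in lines if tag in ln]


def select_blocks(text, open_tag, close_tag=None):
    """Block tags: each region starts after an opening tag and ends at the
    closing tag or at the next opening tag, whichever comes first.
    Assumption of this sketch: tag text itself is excluded from the region."""
    regions = []
    pos = text.find(open_tag)
    while pos != -1:
        start = pos + len(open_tag)
        next_open = text.find(open_tag, start)
        next_close = text.find(close_tag, start) if close_tag else -1
        ends = [e for e in (next_open, next_close) if e != -1]
        end = min(ends) if ends else len(text)
        regions.append(text[start:end].strip())
        # An unpaired opening tag also closes the previous region.
        pos = next_open
    return regions


if __name__ == "__main__":
    demo = "BEGIN a b END junk BEGIN c BEGIN d END"
    print(select_blocks(demo, "BEGIN", "END"))
```

In the demo string, the second BEGIN has no matching END before the third BEGIN, so the third BEGIN closes the second region and opens its own, which is the unpaired-tag behavior described above.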
[Figure 1 graphic. Left panel, "The Control Flow Paradigm": a flowchart that scans each input line sequentially, tests for /protein_id and /translation, concatenates lines until the closing quote is seen, gets the quoted string, stores the name and the protein, and outputs them once both are collected. Right panel, "The Data Flow Paradigm": raw names (e.g., /protein_id="AAD16616.1") and raw proteins (e.g., /translation="MQLLRTL...VSLIK" spread over several lines) emerge from the input, the protein lines are concatenated, the quoted strings are extracted, and both fill in the output template to give > AAD16616.1 followed by the sequence.]
Figure 1. Comparison of the control flow and data flow programming paradigms. Vect (the Visual Extraction and Conversion Tool) presents to its users
the data flow programming paradigm shown on the right. Users can separately define how data sources are extracted, converted, and composed to produce
the output. Vect then compiles the design into a Perl program expressed in the control flow paradigm shown on the left, which actually implements the
computation.
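For comparison, the control flow logic on the left of Figure 1 might look like the following when written by hand. This is a minimal Python sketch rather than the Perl that Vect generates; the function name extract_fasta_control_flow, the accession numbers, and the sequences in SAMPLE are all made up for illustration.

```python
# Made-up GenBank-like fragment; the IDs and sequences are hypothetical.
SAMPLE = """\
     CDS             1..30
                     /protein_id="XYZ00001.1"
                     /translation="MKTAY
                     IAKQR"
     CDS             40..48
                     /protein_id="XYZ00002.1"
                     /translation="MSL"
"""


def extract_fasta_control_flow(lines):
    """Scan the lines once, tracking state, and return FASTA text."""
    records = []
    name = None
    protein = None
    it = iter(lines)
    for line in it:  # main loop over all input lines
        if "/protein_id=" in line:
            name = line.split('"')[1]  # get the quoted string
        elif "/translation=" in line:
            buf = line.split("/translation=", 1)[1].strip()
            # Inner loop: keep concatenating lines until the closing quote.
            while not (len(buf) > 1 and buf.endswith('"')):
                nxt = next(it, None)
                if nxt is None:  # truncated input, stop collecting
                    break
                buf += nxt.strip()
            protein = buf.strip('"')
        # Output only when both the name and the protein are collected.
        if name is not None and protein is not None:
            records.append(f">{name}\n{protein}")
            name = protein = None
    return "\n".join(records)


if __name__ == "__main__":
    print(extract_fasta_control_flow(SAMPLE.splitlines()))
```

The state variables and the inner loop are exactly the bookkeeping that the data flow view hides from the user.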
Figure 2. Vect (the Visual Extraction and Conversion Tool) programming: a tutorial. (A) Each protein block is defined by a green opening tag and a red
closing tag. Pink regions are not selectable. The actual selected text is shown in grey. (B) Selected protein fragments are sent to the Convert Data panel and
named Protein Parts. Two rules are added to concatenate the fragments (Quoted Proteins) and remove the quotes (Pure Proteins). (C) Similar selection is
conducted for protein names. Here the yellow line selection tags are used. (D) The Output Data panel provides a template view to compose user output. (E) The
Output Data panel also provides an output view to show the actual output. (F) Finally, a working Perl program can be obtained by clicking the Compile button
in the Perl Program panel.
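The data flow steps of the tutorial (Protein Parts, Quoted Proteins, Pure Proteins, Pure Names, and the output template) can be mirrored as independent transformations. The following Python sketch is only an illustration under stated assumptions: Vect produces Perl, not Python; all function names and the TEMPLATE string are hypothetical; and the SAMPLE input uses made-up accession numbers and sequences.

```python
import re

# Made-up GenBank-like fragment; the IDs and sequences are hypothetical.
SAMPLE = """\
     CDS             1..30
                     /protein_id="XYZ00001.1"
                     /translation="MKTAY
                     IAKQR"
     CDS             40..48
                     /protein_id="XYZ00002.1"
                     /translation="MSL"
"""


def protein_parts(text):
    """Block selection from the opening tag /translation=" to the closing
    quote, kept as per-line fragments (the "Protein Parts" data set)."""
    blocks = re.findall(r'/translation="[^"]*"', text)
    return [[part.strip() for part in block.splitlines()] for block in blocks]


def concatenate(parts):
    """Concatenation rule: join the fragments ("Quoted Proteins")."""
    return ["".join(fragments) for fragments in parts]


def unquote(items):
    """Quoted data extraction rule: keep only the text inside the quotes."""
    return [re.search(r'"([^"]*)"', item).group(1) for item in items]


def raw_names(text):
    """Line-tag selection: only lines containing /protein_id=."""
    return [ln for ln in text.splitlines() if "/protein_id=" in ln]


# Static ">" plus two placeholders, one per data set.
TEMPLATE = ">{name}\n{protein}"


def fill_template(names, proteins):
    """Pair the two data sets item by item and fill in the template."""
    return "\n".join(
        TEMPLATE.format(name=n, protein=p) for n, p in zip(names, proteins)
    )


if __name__ == "__main__":
    pure_proteins = unquote(concatenate(protein_parts(SAMPLE)))
    pure_names = unquote(raw_names(SAMPLE))
    print(fill_template(pure_names, pure_proteins))
```

Names and proteins are derived without reference to each other and meet only in fill_template, so neither pipeline needs to know the order in which the data appear in the input.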