My bioinformatics for IBIMA

High-throughput analysis of expression, SNPs,
CNVs and biomarkers
M. Gonzalo Claros
Rocío Bautista, Pedro Seoane, Hicham Benzekri, Isabel González Gayte, Rosario
Carmona, Darío Guerrero-Fernández, Rafael Larrosa, Macarena Arroyo
Noé Fernández-Pozo, David Velasco

Two-colour microarrays
BioMed Central
BMC Bioinformatics
Open Access Software
PreP+07: improvements of a user friendly tool to preprocess and
analyse microarray data
Victoria Martin-Requena1, Antonio Muñoz-Merida1, M Gonzalo Claros2 and
Oswaldo Trelles*1
Address: 1Computer Architecture department, University of Málaga, Málaga, Spain and 2Molecular Biology and Biochemistry department,
University of Málaga, Málaga, Spain
Email: Victoria Martin-Requena - vickymr@ac.uma.es; Antonio Muñoz-Merida - amunoz@uma.es; M Gonzalo Claros - claros@uma.es;
Oswaldo Trelles* - ots@ac.uma.es
* Corresponding author
Abstract
Background: Nowadays, microarray gene expression analysis is a widely used technology that
scientists handle but whose final interpretation usually requires the participation of a specialist. The
need for this participation is due to the requirement of some background in statistics that most
users lack or have a very vague notion of. Moreover, programming skills could also be essential to
analyse these data. An interactive, easy to use application seems therefore necessary to help
researchers to extract full information from data and analyse them in a simple, powerful and
confident way.
Results: PreP+07 is a standalone Windows XP application that presents a friendly interface for
spot filtration, inter- and intra-slide normalization, duplicate resolution, dye-swapping, error
removal and statistical analyses. Additionally, it contains two unique implementations of
procedures – double scan and Supervised Lowess –, a complete set of graphical representations –
MA plot, RG plot, QQ plot, PP plot, PN plot – and can deal with many data formats, such as
tabulated text, GenePix GPR and ArrayPRO. PreP+07 performance has been compared with the
equivalent functions in Bioconductor using a tomato chip with 13056 spots. The number of
differentially expressed genes considering p-values coming from the PreP+07 and Bioconductor
Limma packages were statistically identical when the data set was only normalized; however, a slight
variability was appreciated when the data was both normalized and scaled.
Conclusion: PreP+07 implementation provides a high degree of freedom in selecting and
organizing a small set of widely used data processing protocols, and can handle many data formats.
Its reliability has been proven so that a laboratory researcher can afford a statistical pre-processing
of his/her microarray results and obtain a list of differentially expressed genes using PreP+07
without any programming skills. All of this gives support to scientists that have been using previous
PreP releases since its first version in 2003.
Published: 12 January 2009
BMC Bioinformatics 2009, 10:16 doi:10.1186/1471-2105-10-16
Received: 29 August 2008
Accepted: 12 January 2009
This article is available from: http://www.biomedcentral.com/1471-2105/10/16
© 2009 Martin-Requena et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In conclusion, MADE4-2C can detect errors in signal
intensity, washing, hybridisation, fluorophore labelling,
the print tips and the quality of the printed probes.
This helps prevent the results from reflecting technical
variation rather than biological variation. It also presents
all the information in a dense but comprehensible report,
so that the researcher can evaluate the experiment properly
without advanced knowledge of microarrays.
9.2.3. Discarding failed probes
Once the user has been given information on the quality
of the original data to be analysed, MADE4-2C corrects the
background noise using normexp [184] and generates the MA
plots that show the data after background correction
(figures 2.10 and 2.11, appendix B).
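As a reminder of what an MA plot displays, the following minimal Python sketch computes the M and A values from background-corrected red and green intensities. It is only an illustration and not the MADE4-2C code, which the text describes as using limma functions; the function name `ma_values` and the sample intensities are assumptions made here.

```python
import math


def ma_values(red, green):
    """Return the (M, A) lists for paired, strictly positive intensities.

    M = log2(R) - log2(G)        (log ratio between channels)
    A = (log2(R) + log2(G)) / 2  (mean log intensity)
    """
    if len(red) != len(green):
        raise ValueError("red and green must have the same length")
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            # Logarithms need positive values after background correction.
            raise ValueError("intensities must be positive")
        log_r, log_g = math.log2(r), math.log2(g)
        m_vals.append(log_r - log_g)
        a_vals.append((log_r + log_g) / 2)
    return m_vals, a_vals


# Hypothetical intensities for three spots.
m, a = ma_values([1024, 512, 256], [256, 512, 1024])
print(m)  # [2.0, 0.0, -2.0]
print(a)  # [9.0, 9.0, 9.0]
```

Plotting M against A for every spot gives the MA plot; a cloud centred on M = 0 across the whole range of A suggests no intensity-dependent dye bias.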
Next, the probes that will be used in the experiment and
those that will be discarded are shown. A probe is always
discarded when its spot is empty according to the GAL file,
or when it contains an artefactual or poorly characterised
sequence (information taken from the BadSpots.txt file).
Two further reasons for rejection affect only some probes
on one microarray and need not affect the other replicates:
• The spot for the probe was not printed or is of low
quality, as indicated by its specific weight derived from
the flags and area fields.
• Background correction with normexp has flagged the
probe as discardable.
Tolerance to these failures is controlled by a parameter
in the configuration file (see appendix D) that sets the
number of failed replicates allowed for each probe in the
experiment being analysed. The recommendation is to remove
the probe from all microarrays as soon as one replicate
fails for any of the reasons above, although in theory the
analysis can proceed provided a probe has two or more
replicates with valid intensity values. In the case of the
experiments analysed on gene expression […] (figure 2.12,
appendix B). This filter is expected to remove no more than
15 % of the probes [184], as shown in figure 2.12 of
appendix B. Conversely, it is advisable to repeat the
experiment if more than 15 % of the probes end up being
discarded, as shown in figure 9.4.
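The discarding rule described above can be sketched as follows. This is a hedged Python illustration, not the MADE4-2C implementation: the function name `filter_probes`, its parameter names and the input layout (one failure flag per replicate) are assumptions; only the per-probe tolerance and the 15 % limit come from the text.

```python
def filter_probes(failed, max_failed_replicates=0, max_discard_fraction=0.15):
    """Split probes into kept and discarded ones.

    failed maps a probe ID to a list of booleans, one per replicate
    (True means the replicate failed for any of the reasons above).
    Returns (kept, discarded, repeat_experiment).
    """
    kept, discarded = [], []
    for probe_id, flags in failed.items():
        # sum() counts the True flags, i.e. the failed replicates.
        if sum(flags) > max_failed_replicates:
            discarded.append(probe_id)
        else:
            kept.append(probe_id)
    total = len(kept) + len(discarded)
    fraction = len(discarded) / total if total else 0.0
    return kept, discarded, fraction > max_discard_fraction


# Hypothetical experiment: 10 probes with 4 replicates, one failed replicate.
failed = {f"probe{i}": [False, False, False, False] for i in range(1, 11)}
failed["probe3"] = [False, True, False, False]
kept, discarded, repeat = filter_probes(failed)
print(discarded, repeat)  # ['probe3'] False
```

The default of zero tolerated failures mirrors the recommended setting (remove the probe as soon as one replicate fails); raising `max_failed_replicates` corresponds to the more permissive configuration.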
Figure 9.4: Example of a figure generated by
MADE4-2C to indicate that too many printed probes
have been discarded for the subsequent analysis.
9.2.4. Normalisation
Data normalisation takes the technical replicates into
account to confirm that the expression values carry no
more variability than before normalisation, and that
neither fluorophore labelling adds any bias to the data.
Although many normalisation methods have been proposed,
there is still no clear consensus that any one method is
best across the possible experimental conditions [45].
Because the normalisation method is one of the factors
that most affects the subsequent detection of
differentially expressed genes (DEGs) [187, 98, 45], and
because better results can be obtained by combining two
methods [187], MADE4-2C normalises the data independently
with several methods: Print-tip loess [207], Print-tip
loess + scale, Print-tip loess + quantile [28] (with the
normalizeBetweenArrays function of limma) and, finally,
VSN [62] and VSN + Print-tip loess [45].
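To make the between-array "quantile" step more concrete, here is a minimal Python sketch of the generic quantile normalisation idea: every array is forced to share the same distribution, namely the mean of the sorted values across arrays. It is an illustration under stated assumptions and not limma's `normalizeBetweenArrays`; the function name `quantile_normalize` is hypothetical, and ties and missing values are ignored for brevity.

```python
def quantile_normalize(arrays):
    """Quantile-normalise equal-length lists (one list per microarray)."""
    n = len(arrays[0])
    if any(len(values) != n for values in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # For each array, the probe indices ordered from lowest to highest value.
    orders = [sorted(range(n), key=values.__getitem__) for values in arrays]
    # Reference distribution: mean of the k-th smallest value of every array.
    reference = [
        sum(values[order[k]] for values, order in zip(arrays, orders)) / len(arrays)
        for k in range(n)
    ]
    normalised = []
    for order in orders:
        out = [0.0] * n
        for rank, probe_index in enumerate(order):
            # The probe with the k-th rank receives the k-th reference value.
            out[probe_index] = reference[rank]
        normalised.append(out)
    return normalised


# Two hypothetical arrays with three probes each.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step each array contains exactly the same set of values, which is why the method is usually applied after a within-array correction such as print-tip loess rather than instead of it.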
9.3. IDENTIFICATION OF A PROBLEMATIC SAMPLE
Figure 9.9: Negative correlation between replicates
detected in the experiments on Spanish fir (pinsapo)
shoots and leaves.
[…] native to Sierra Bermeja (Málaga), which were
hybridised with Pinarray1 and with a microarray of pine
sequences obtained by suppression subtractive
hybridisation, called SSH-Ma (section 8.1). The
experimental design and the data obtained by hybridising
with SSH-Ma are presented below, since this is where this
behaviour was first observed. The replicates of the
experiment are organised as follows:
• Individual 1-South, hybridised on microarray 10a,
labelling the mature wood sample with Cy3 and the
juvenile wood sample with Cy5. The microarray was
divided into two technical replicates, 10a-A and 10a-Z.
• Individual 1-North, hybridised on microarray 22a,
labelling the mature wood sample with Cy3 and the
juvenile wood sample with Cy5. The microarray was
divided into two technical replicates, 22a-A and 22a-Z.
• Individual 2-North, hybridised on microarray 23a with
a dye swap relative to the previous hybridisations,
labelling the mature wood sample with Cy5 and the
juvenile wood sample with Cy3. The microarray was
divided into two technical replicates, 23a-A and 23a-Z.
• Individual 3-South, hybridised on microarray 24a with
a dye swap relative to the first two microarrays,
labelling the […] wood […]
[Figure 9.[…]: panels "Distance" and "Correlation"; caption on distances and c[…] truncated in the source. The paragraph that follows is also cut off at the column edge; its surviving fragments refer to the distances between technical replicates (figure 9.10).]
ORIGINAL PAPER
Gene expression profiling in the stem of young maritime pine
trees: detection of ammonium stress-responsive genes in the apex
Javier Canales • Concepción Ávila • Francisco R. Cantón • David Pacheco-Villalobos •
Sara Díaz-Moreno • David Ariza • Juan J. Molina-Rueda • Rafael M. Navarro-Cerrillo •
M. Gonzalo Claros • Francisco M. Cánovas
Received: 25 May 2011 / Revised: 30 August 2011 / Accepted: 12 September 2011
© Springer-Verlag 2011
Abstract The shoots of young conifer trees represent an
interesting model to study the development and growth of
conifers from meristematic cells in the shoot apex to dif-
ferentiated tissues at the shoot base. In this work, micro-
array analysis was used to monitor contrasting patterns of
gene expression between the apex and the base of maritime
pine shoots. A group of differentially expressed genes were
selected and validated by examining their relative expres-
sion levels in different sections along the stem, from the
top to the bottom. After validation of the microarray data,
additional gene expression analyses were also performed in
the shoots of young maritime pine trees exposed to dif-
ferent levels of ammonium nutrition. Our results show that
the apex of maritime pine trees is extremely sensitive to
conditions of ammonium excess or deficiency, as revealed
by the observed changes in the expression of stress-
responsive genes. This new knowledge may be used for the
early detection of symptoms of nitrogen
nutritional stresses, thereby increasing survival and growth
rates of young trees in managed forests.
Keywords Conifers · Pine development · Nitrogen ·
Ammonium nutrition · Transcriptional regulation
Introduction
Forests are essential components of the ecosystems, and
they play a fundamental role in the regulation of terrestrial
carbon sinks. Coniferous forests dominate large ecosys-
tems in the Northern Hemisphere and include a broad
variety of woody plant species, some ranking as the largest,
tallest, and longest living organisms on Earth (Farjon
2010). Conifers are the most important group of gymno-
sperms and have evolved very efficient physiological
adaptation systems after the separation from angiosperms,
which occurred more than 300 million years ago. Conifer
trees are also of great economic importance, as they are
major sources for timber, oleoresin, and paper production.
Maritime pine (Pinus pinaster Aiton) stands are dis-
tributed in the southwestern area of the Mediterranean
region. P. pinaster dominates the forest scenario in France,
Spain and Portugal, where this is the most widely planted
species in about 4 million hectares. The maritime pine is
particularly tolerant to abiotic stresses showing relatively
high-levels of intra-specific variability (Aranda et al.
2010). The maritime pine is also the most advanced conifer
Communicated by K. Klimaszewska.
Electronic supplementary material The online version of this
article (doi:10.1007/s00468-011-0625-z) contains supplementary
material, which is available to authorized users.
J. Canales · C. Ávila · F. R. Cantón · D. Pacheco-Villalobos ·
S. Díaz-Moreno · J. J. Molina-Rueda · M. G. Claros ·
F. M. Cánovas (✉)
Departamento de Biología Molecular y Bioquímica,
Facultad de Ciencias, Instituto Andaluz de Biotecnología,
Campus Universitario de Teatinos, Universidad de Málaga,
Trees
DOI 10.1007/s00468-011-0625-z
30 s at 72°C). The fluorescence signal was captured at the
end of each extension step and melting curve analysis was
performed from 60 to 95°C. The PCR products were ver-
ified by melting point analysis at the end of each experi-
ment, and, during protocol development, by gel
electrophoresis.
The baseline calculation and starting concentration (N0)
per sample of the amplification reactions were estimated
directly from raw fluorescence data using the LinReg 11.3
program (Ruijter et al. 2009). The relative expression
levels were obtained from the ratio between the N0 of the
target gene and the normalisation factor. We used the
geometric mean of three control genes (actin, 40S ribo-
somal protein and elongation factor 1 alpha) to calculate
the normalisation factor (Vandesompele et al. 2002). Ref-
erence genes were selected based on their stable expression
in the microarrays. Furthermore, these genes were stably
expressed in all conditions and tissue portions examined as
determined by statistical analysis using Normfinder
(Andersen et al. 2004).
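The normalisation described above (relative expression as the target N0 divided by the geometric mean of the N0 values of three reference genes) can be sketched in a few lines of Python. The function names and the N0 values are hypothetical and serve only to illustrate the arithmetic; this is not the authors' script.

```python
from statistics import geometric_mean


def normalisation_factor(reference_n0):
    """Geometric mean of the starting concentrations (N0) of the reference genes."""
    return geometric_mean(reference_n0)


def relative_expression(target_n0, reference_n0):
    """Ratio between the N0 of the target gene and the normalisation factor."""
    return target_n0 / normalisation_factor(reference_n0)


# Hypothetical N0 values for the three reference genes
# (actin, 40S ribosomal protein, elongation factor 1 alpha) in one sample.
references = [1.0, 2.0, 4.0]
print(round(relative_expression(8.0, references), 3))  # 4.0
```

The geometric mean is preferred over the arithmetic mean because it dampens the influence of a single reference gene with an unusually high N0.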
Results and discussion
Differential gene expression between the apex
and the base of maritime pine shoots
The differential gene expression was analysed in maritime
pine stems using microarrays. Intact total RNA was
extracted from the apex and the basal part of the stems,
labelled with CyDye and hybridised to slides of PINAR-
RAY, a maritime pine microarray constructed in our lab-
oratory. Microarray data were lowess normalised to
account for intensity-dependent differences between
channels. After normalisation, the dye-swap replicates did
not show strong deviations from linearity, proving a low
dye bias. The comparisons between replicates showed a
high degree of reproducibility, with Pearson’s correlation
coefficients of approximately 0.98. Similar transcriptomic
analyses have been previously performed in Sitka spruce
(Friedmann et al. 2007). Microarray analyses were also
used for transcript profiling in differentiating xylem of
loblolly pine and white spruce (Yang et al. 2004; Pavy
et al. 2008).
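The reproducibility check mentioned above rests on Pearson's correlation coefficient between the log ratios of two replicates. The following Python sketch shows the calculation; the function name `pearson` and the log-ratio values are hypothetical, and the sign flip for dye-swap pairs is the usual convention rather than a detail stated in the paper.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log ratios for five spots in a replicate and in its dye swap.
replicate = [1.2, -0.4, 0.1, 2.3, -1.5]
dye_swap = [-1.1, 0.5, -0.2, -2.2, 1.4]

# In a dye swap the channels are exchanged, so the log ratios change sign
# and must be negated before comparing them with the other replicate.
r = pearson(replicate, [-m for m in dye_swap])
print(r > 0.9)  # True
```

A coefficient close to 1 after the sign correction indicates reproducible replicates; a strongly negative value without the correction is simply what a dye swap looks like.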
Genes differentially expressed at the apical and the basal
parts of the maritime pine stem were identified by bioin-
formatic analysis of hybridisation signals in the microarray,
using a cut-off t test p value 0.05 and a fold change
genes encoding photosynthetic proteins, including those
located in the thylakoid membranes involved in the
photosystems I and II, light-harvesting complexes, as well
as soluble proteins of the plastid stroma such as the small
subunit of ribulose-1,5-bisphosphate carboxylase/oxygen-
ase (Rubisco SSU; EC 4.1.1.39), were particularly abun-
dant. This part of the stem contains the shoot apical
meristem which drives stem growth and develops new
needles requiring the biosynthesis of proteins for the pho-
tosynthetic machinery. Also abundant were transcripts for
lipid transfer proteins (LPT), metallothionein-like proteins
(MT) and stress proteins such as an antimicrobial peptide
(AMP), a putative dehydrin and a late embryogenesis
abundant protein. The expression of stress-related genes
has also been reported in the apical shoot meristem of Sitka
spruce where they may be involved in the protection of
meristematic cells against mechanical wounding or insect
attack (Ralph et al. 2006). Interestingly, a number of genes
involved in lignin biosynthesis and cell wall formation
were also upregulated in the apical part of the maritime
pine stem. These included a putative cinnamoyl-CoA
reductase (EC 1.2.1.44), a serine-hydroxymethyltransferase
(EC 2.1.2.1), xyloglucan endotransglycosylases (EC
2.4.1.207), an endo-1,4-β-mannosidase (EC 3.2.1.78), a
putative proline-rich arabinogalactan and a germin-like
Fig. 1 Graphical representation of the microarray data analysis.
ammonium excess. We have previously report[ed that]
ammonium excess and deficiency trigger changes [in the]
transcriptome of maritime pine roots (Canales [et al.]
2010). The differential expression patterns of a […]
of representative genes suggested the existe[nce of]
potential links between ammonium-responsive ge[nes and]
genes involved in amino acid metabolism, particu[larly]
asparagine biosynthesis and utilisation (Canales [et al.]
2010). The results reported here indicate that th[e meta]bolic
changes observed in roots are transmitted [to the]
stem apex. This fact implies the existence of a s[ystemic]
signal that may represent a part of the respo[nse of]
maritime pine seedlings to nutritional stress by [ammo]nium.
The nature of this systemic signal is p[…]
unknown; however, we can speculate that altered […]
of organic nitrogen in the form of asparagine […]
involved. High-levels of this amino acid accumu[late in]
pine hypocotyls and a role of asparagine in nitro[gen …]allocation
has been proposed (Cañas et al. 2006). […]
asparagine is a vehicle for nitrogen transport in […]
and it is well known that there is a stress-[…]
asparagine accumulation in response to minera[l defi]ciencies,
drought or pathogen attack (Lea et al. […]
Fig. 5 Genes differentially expressed in maritime pine stems in
response to ammonium excess (E) or deficiency (D) identified by
microarray analysis. Log expression ratio values from each treatment
were represented as heatmaps
RESEARCH ARTICLE Open Access
Reprogramming of gene expression during
compression wood formation in pine: Coordinated
modulation of S-adenosylmethionine, lignin and
lignan related genes
David P Villalobos1,2, Sara M Díaz-Moreno1,3, El-Sayed S Said1, Rafael A Cañas1, Daniel Osuna1,4,
Sonia H E Van Kerckhoven1, Rocío Bautista1, Manuel Gonzalo Claros1, Francisco M Cánovas1 and
Francisco R Cantón1*
Abstract
Background: Transcript profiling of differentiating secondary xylem has allowed us to draw a general picture of the
genes involved in wood formation. However, our knowledge is still limited about the regulatory mechanisms that
coordinate and modulate the different pathways providing substrates during xylogenesis. The development of
compression wood in conifers constitutes an exceptional model for these studies. Although differential expression
of a few genes in differentiating compression wood compared to normal or opposite wood has been reported, the
broad range of features that distinguish this reaction wood suggest that the expression of a larger set of genes
would be modified.
Villalobos et al. BMC Plant Biology 2012, 12:100
http://www.biomedcentral.com/1471-2229/12/100
using the Pine Gene Index database (Additional file 3).
Sequences that matched with the same entry in the data-
base were assumed to represent the same gene. There-
fore, the final numbers of unigenes were reduced to 331
for Cx and 165 for Ox. Most of these genes showed sig-
nificant similarities to sequences in databases (293 in Cx
and 145 in Ox), although some of them were similar to
sequences with unknown function (49 in Cx and 45 in
Ox). The number of unigenes with no significant simi-
larity was low in both cases (38 in Cx and 20 in Ox).
The genes with assigned function were grouped into
functional categories using the Arabidopsis thaliana Mun-
ich Information Centre for Protein Sequences (MIPS)
database, and suppression of redundancy in MIPS funcat
assignations by decision according to their most probable
role in xylem development (Additional file 3). In keeping
with the greater number of genes identified as up-
Figure 3 Volcano plots of microarray analyses to identify genes
differentially expressed during compression and opposite
wood formation. The common logarithm of the p-value was
represented as a function of the binary logarithm of the
background-corrected and normalized opposite:compression
fluorescence ratio (log2 Fold Change) for each spot. Vertical bars
delimit the spots showing up-regulation in developing compression
xylem by at least 1.5-fold compared to developing opposite xylem
(Up-regulated in Cx) or spots showing up-regulation in developing
opposite xylem by at least 1.5-fold compared to developing
compression xylem (Up-regulated in Ox). The horizontal line delimits
the spots showing significant up-regulation under the criteria of an
adjusted p-value ≤ 0.001. Therefore, the upper left and right sectors
delimited by the horizontal and vertical lines include the spots (in
red) containing probes for genes significantly up-regulated in
developing compression or opposite xylem respectively. The
number of spots corresponding with genes significantly up-
regulated in Cx or Ox are shown in the top side of the sector. (a)
Results from the analysis of microarray 1 constructed with cDNA
clones from the composite library. (b) Results from the analysis of
microarray 2 constructed with cDNA clones from subtractive
libraries.
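The sectors of the volcano plot described in this caption correspond to a simple decision rule, sketched below in Python. The thresholds (at least 1.5-fold, adjusted p-value ≤ 0.001) and the orientation of the ratio (opposite:compression) come from the caption; the function name `classify_spot` and the example values are assumptions for illustration.

```python
import math

FOLD_CHANGE = 1.5        # minimum fold change stated in the caption
MAX_ADJUSTED_P = 0.001   # significance threshold stated in the caption


def classify_spot(log2_ratio_ox_cx, adjusted_p):
    """Assign a spot to a volcano-plot sector.

    log2_ratio_ox_cx is log2(opposite / compression), so negative values
    (left of the plot) mean higher expression in compression xylem (Cx).
    """
    threshold = math.log2(FOLD_CHANGE)
    if adjusted_p > MAX_ADJUSTED_P:
        return "not significant"
    if log2_ratio_ox_cx <= -threshold:
        return "up-regulated in Cx"
    if log2_ratio_ox_cx >= threshold:
        return "up-regulated in Ox"
    return "not significant"


# Hypothetical spots: (log2 ratio, adjusted p-value).
for spot in [(-1.0, 0.0005), (0.7, 0.0005), (2.0, 0.01), (0.3, 0.0001)]:
    print(spot, classify_spot(*spot))
# (-1.0, 0.0005) up-regulated in Cx
# (0.7, 0.0005) up-regulated in Ox
# (2.0, 0.01) not significant
# (0.3, 0.0001) not significant
```

Requiring both conditions at once is what restricts the red spots to the upper left and upper right sectors of the plot.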

Other types of microarrays
Percentage of use, across the different testing methods, of the available R packages and of the
background correction, normalization and transformation functions

                          Control type 1 (%)      Control type 2 (%)      Average
                          Dataset1    Dataset2    Dataset1    Dataset2    (%)
Package
  beadarray               16.0        11.1        15.0        12.5        13.7
  lumi                    84.0        88.9        85.0        87.5        86.3
Normalization
  loess (lumi)            11.1        18.5        12.5        17.9        15.0
  median (beadarray)       3.7         0.0         2.5         0.0         1.6
  qspline (beadarray)      2.5         1.9         2.5         1.8         2.2
  quantile (lumi)         17.3        22.2        17.5        25.0        20.5
  quantile (beadarray)     3.7         1.9         3.8         3.6         3.2
  rankinvariant            9.9         0.0        10.0         0.0         5.0
  rsn (lumi)              13.6        20.4        12.5        19.6        16.5
  rsn (beadarray)          2.5         1.9         2.5         0.0         1.7
  ssn (lumi)              13.6         0.0        13.8         0.0         6.8
  vsn (lumi)              18.5        27.8        18.8        26.8        23.0
  vsn (beadarray)          3.7         5.6         3.8         5.4         4.6
Transformation
  log2 (lumi)             29.6        29.6        30.0        28.6        29.5
  log2 (beadarray)         6.2         1.9         6.3         1.8         4.0
  vst (lumi)              27.2        25.9        27.5        25.0        26.4
  vst (beadarray)          4.9         7.4         5.0         7.1         6.1
  cubicroot                9.9        20.4         8.8        19.6        14.7
  none                    22.2        14.8        22.5        17.9        19.3
Background correction
  bgAdjust (lumi)         22.2        24.1        22.5        23.2        23.3
  bgAdjust.Affy (lumi)    14.8        14.8        15.0        14.3        14.7
  forcePositive (lumi)    23.5        27.8        23.8        26.8        26.1
  none (lumi)             23.5        22.2        23.8        23.2        23.1
  none (beadarray)        16.0        11.1        15.0        12.5        13.7
BeadArray (Illumina)
Agilent
Determining the best protocol
Preprocessing
• Background noise correction
• Data normalisation
• Mean of replicated spots
Differential expression
• Comparisons
• Estimation of mean variability by eBayes
• Filter by P and logFC
Target
Raw data
Experimental design
Differentially expressed genes
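Two of the steps in this workflow, averaging replicated spots and filtering by P and logFC, can be sketched in plain Python as follows. This is only an outline under stated assumptions: the function names and the thresholds (P ≤ 0.05, |logFC| ≥ 1) are hypothetical, and the eBayes variance moderation, which belongs to limma in R, is not reproduced here, so the p-values are taken as given.

```python
from collections import defaultdict
from statistics import fmean


def average_replicated_spots(spots):
    """Average the log2 values of spots that share a probe ID.

    spots is an iterable of (probe_id, log2_value) pairs.
    """
    grouped = defaultdict(list)
    for probe_id, value in spots:
        grouped[probe_id].append(value)
    return {probe_id: fmean(values) for probe_id, values in grouped.items()}


def filter_by_p_and_logfc(results, max_p=0.05, min_abs_logfc=1.0):
    """Keep genes that pass both thresholds.

    results maps a gene to a (logFC, p_value) pair; thresholds are illustrative.
    """
    return sorted(
        gene
        for gene, (logfc, p_value) in results.items()
        if p_value <= max_p and abs(logfc) >= min_abs_logfc
    )


print(average_replicated_spots([("g1", 8.0), ("g1", 9.0), ("g2", 5.0)]))
# {'g1': 8.5, 'g2': 5.0}

results = {"g1": (1.8, 0.01), "g2": (0.2, 0.001), "g3": (-2.5, 0.2), "g4": (-1.2, 0.03)}
print(filter_by_p_and_logfc(results))  # ['g1', 'g4']
```

Filtering on both criteria avoids reporting genes with a tiny but significant change (g2) or a large but unreliable one (g3).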
COLLABORATION:
Fernando Cardona 
Juan A. G. Ranea

Affymetrix microarrays
On Selecting the Best Pre-processing Method for
Affymetrix Genechips
J.P. Florido1, H. Pomares1, I. Rojas1, J.C. Calvo1, J.M. Urquiza1,
and M. Gonzalo Claros2
1 Department of Computer Architecture and Computer Technology, University of Granada,
Granada, Spain
{jpflorido,hector}@ugr.es, {irojas,jccalvo,jurquiza}@atc.ugr.es
2 Department of Molecular Biology and Biochemistry, University of Málaga, Málaga, Spain
claros@uma.es
Abstract. Affymetrix High Oligonucleotide expression arrays, also known as
Affymetrix GeneChips, are widely used for the high-throughput assessment of
gene expression of thousands of genes simultaneously. Although disputed by
several authors, there are non-biological variations and systematic biases that
must be removed as much as possible before an absolute expression level for
every gene is assessed. Several pre-processing methods are available in the
literature and five common ones (RMA, GCRMA, MAS5, dChip and VSN) and
two customized Loess methods are benchmarked in terms of data variability,
similarity of data distributions and correlation coefficient among replicated
slides in a variety of real examples. In addition, we examine how variant
and invariant genes influence pre-processing performance.
1 Introduction
Microarray technology is a powerful tool used for the high-throughput assessment of
gene expression of thousands of genes simultaneously which can be used to infer
metabolic pathways, to characterize protein-protein interactions or to extract target
genes for developing therapies for various diseases [1]. Several platforms are
currently available, including the commonly used high oligonucleotide-based
Affymetrix GeneChip® arrays.
As described in [1], an Affymetrix GeneChip contains probe sets of 10-20 probe
pairs representing unique genes. Each probe pair consists of two oligonucleotides of
25 bp in length, namely perfect match (PM) probes (the exact complement of an
mRNA) and the mismatch (MM) probes (which are identical to the perfect match
except that one base is changed at the center position). The MM probe is supposed to
distinguish noise caused by non-specific hybridization from the specific hybridization
signal, although some researchers recommend avoiding its use [17].
A typical microarray experiment has biological and technical sources of variation
[2]. Biological variation results from tissue heterogeneity, genetic polymorphism, and
changes in mRNA levels within cells and among individuals due to sex, age, race,
genotype-environment interactions and other “living” factors. Biological variation is
of interest to researchers as it reflects true variation among experiments. On the other
Joan Cabestany Francisco Sandoval
Alberto Prieto Juan M. Corchado (Eds.)
Bio-Inspired Systems:
Computational and
Ambient Intelligence
10th International Work-Conference
on Artificial Neural Networks, IWANN 2009
Salamanca, Spain, June 10-12, 2009
Proceedings, Part I
Effect of Pre-processing methods on Microarray-based SVM
classifiers in Affymetrix Genechips
J.P.Florido, H.Pomares, I.Rojas, J.M.Urquiza, L.J.Herrera, M.G.Claros
Abstract: Affymetrix High Oligonucleotide expression
arrays are widely used for the high-throughput assessment
of gene expression of thousands of genes simultaneously.
Although disputed by several authors, there are non-biological
variations and systematic biases that must be removed as
much as possible through the pre-processing step before an
absolute expression level for every gene is assessed. It is
important to evaluate microarray pre-processing procedures
not only for the detection of differentially expressed genes,
but also for classification, since a major use of microarrays
is the expression-based phenotype classification. Thus, in
this paper, we use several cancer microarray datasets to
assess the influence of five different pre-processing methods
in Support Vector Machine-based classification methodologies
with different kernels: linear, Radial Basis Functions (RBFs)
and polynomial.
I. Introduction
Microarray technology is a powerful tool used for the high-
throughput assessment of gene expression of thousands of
genes simultaneously which can be used to infer metabolic
pathways, to characterize protein-protein interactions or to
extract target genes for developing therapies for various dis-
eases [1]. Several platforms are currently available, including
the commonly used high oligonucleotide-based Affymetrix
GeneChip® arrays. As described in [1], an Affymetrix
GeneChip contains probe sets of 10-20 probe pairs
representing unique genes. Each probe pair consists of two
oligonucleotides of 25 bp in length, namely perfect match
(PM) probes (the exact complement of an mRNA) and the
mismatch (MM) probes (which are identical to the perfect
match except that one base is changed at the center position).
The MM probe is supposed to distinguish noise caused by
non-specific hybridization from the specific hybridization
signal, although some researchers recommend avoiding its
use [2]. A typical microarray experiment has biological
and technical sources of variation [3]. Biological variation
results from tissue heterogeneity, genetic polymorphism, and
changes in mRNA levels within cells and among individuals
quality of array data. Therefore, since those systematic non-
biological sources of variation mask real biological variation,
significant pre-processing is required and involves four steps
for Affymetrix GeneChips: background correction, normal-
ization, PM correction and summarization [4].
Assessment of the effectiveness of pre-processing has
mainly been confined to the ability to detect differentially
expressed genes [5] [6] or in terms of data variability, similarity
in data distributions and correlation among replicates [7].
However, a major use of microarrays is phenotype classi-
fication via expression-based classifiers: given a collection
of gene expression profiles for tissue samples belonging to
various cancer types, the goal is to build a classifier to
automatically determine the cancer type of a new sample
at high precision. Classifying cancer tissues based on their
gene expression profiles has the promise of providing more
reliable means to diagnose and predict various types of
cancers [8], but the accuracy of these predictions may depend
on the pre-processing method selected.
Thus, in this work, several cancer microarray data sets
are used to assess the effect of different pre-processing
methods (RMA, GCRMA, VSN, dChip and MAS5) in high-
order analytical tasks such as classification using Support
Vector Machines (SVMs) with three different kernels: Linear,
Radial Basis Functions (RBFs) and polynomial. SVMs are
usually preferred in microarray-based classification because
they outperform other paradigms, namely k-Nearest
Neighbors, backpropagation and probabilistic neural
networks, weighted voting methods and decision trees [9],
given two special aspects of microarray data: high dimen-
sionality and small sample size. Kernel methods represent
one way to cope with the curse of dimensionality [8].
Previous related work about the e↵ect of pre-processing
methods relative to classification has been focused on
cDNA microarrays using k-Nearest Neighbor classi-
fiers [10], [11], [12], Support Vector Machines [11], [12]
presenting unique genes. Each probe pair consists of two
oligonucleotides of 25 bp in length, namely perfect match
(PM) probes (the exact complement of an mRNA) and the
mismatch (MM) probes (which are identical to the perfect
match except that one base is changed at the center position).
The MM probe is supposed to distinguish noise caused by
non-specific hybridization from the specific hybridization
signal, although some researchers recommend avoiding its
use [2]. A typical microarray experiment has biological
and technical sources of variation [3]. Biological variation
results from tissue heterogeneity, genetic polymorphism, and
changes in mRNA levels within cells and among individuals
due to sex, age, race, genotype-environment interactions and
other ”living” factors. Biological variation is of interest to
researchers as it reflects true variation among experiments.
On the other hand, sample preparation, labeling, hybridiza-
tion and other steps of microarray experiment can contribute
to technical variation, which can significantly impact the
J.P.Florido, H.Pomares, I.Rojas, J.M.Urquiza and L.J.Herrera are with
the Department of Computer Architecture and Computer Technol-
ogy, CITIC-UGR, University of Granada, Spain (corresponding author:
jpflorido@ugr.es)
M.G.Claros is with the Department of Molecular Biology and Bioche-
mistry, University of Malaga, Spain
Radial Basis Functions (RBFs) and polynomial. SVMs are
usually preferred in microarray-based classification due to
its outperformance compared to other paradigms, namely, k-
Nearest Neighbors, backpropagation and probabilistic neural
networks, weighted voting methods and decision trees [9]
due to two special aspects of microarray data: high dimen-
sionality and small sample size. Kernel methods represent
one way to cope with the curse of dimensionality [8].
Previous related work about the e↵ect of pre-processing
methods relative to classification has been focused on
cDNA microarrays using k-Nearest Neighbor classi-
fiers [10], [11], [12], Support Vector Machines [11], [12]
and linear discriminant analysis, regular histogram, Gaussian
kernel, perceptron and multiple perceptron with majority
voting [12]. Instead, our study is related to A↵ymetrix
Genechips microarray technology.
Section II describes the main pre-processing methods
existing in the literature for A↵ymetrix Genechips, section
III introduces SVMs classifiers and section IV states experi-
mental results. Conclusions are drawn in section V.
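The three SVM kernels named in the introduction above (linear, RBF and polynomial) can be illustrated with a minimal sketch. The excerpt does not give the kernel parameters used in the study, so `degree`, `gamma` and `coef0` below are illustrative assumptions, and the two expression vectors are made up.

```python
import math


def linear_kernel(x, y):
    """Dot product of two expression profiles."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree; parameter values are assumptions."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two made-up expression profiles (three genes each), for illustration only.
sample_a = [1.0, 2.0, 0.5]
sample_b = [0.5, 1.0, 1.5]

for name, kernel in (("linear", linear_kernel),
                     ("polynomial", polynomial_kernel),
                     ("rbf", rbf_kernel)):
    print(name, kernel(sample_a, sample_b))
```

Each kernel returns a similarity between two samples; an SVM only needs these pairwise values, which is one reason kernel methods remain usable when genes far outnumber samples.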
II. Pre-processing Affymetrix Genechips
Instead of describing how every pre-processing method
(RMA, GCRMA, VSN, dChip and MAS5) works, they will
978-1-4244-8126-2/10/$26.00 ©2010 IEEE
VSN performs statistically better (P < 0.05) than the others.
So, these results suggest that RMA, VSN and dChip methods
are the preferred ones, which is consistent with the results
given in [7] and in terms of classification rate (Fig. 1).
Fig. 4. Means and 95% LSD intervals of the different pre-processing
methods through the mean of Spearman Coefficient quality metric
From Figs. 2 and 4 and focusing on the RMA and GCRMA
pre-processing methods, the influence of the background
correction step employed (Table I) can be observed. In this
case, there are statistical differences (P < 0.05) in terms of
data variability and Spearman correlation coefficient quality
metrics between RMA and GCRMA pre-processing methods.
These statistical differences were also present in terms of
misclassification rate (Fig. 1).
Although this work studies the effect of pre-processing
methods in terms of classification rate, it would also be
interesting to study whether the number of genes selected
in the feature selection step and the kernel method used in
the SVM classifier affect the results.
From Fig. 5, it can be observed that the accuracy of SVM
is affected by the number of genes selected by t-test. There
are no statistical differences (P > 0.05) when the number of
genes selected varies from 10 to 400. On the other hand,
when very few genes (5) are selected or the number is
large (600-2000 and the whole chip) SVM's performance
gets worse. In the first case, the data does not contain
enough discriminative information and, in the second case,
PROCEEDINGS Open Access
Gene expression pattern in swine neutrophils
after lipopolysaccharide exposure: a time course
comparison
Gema Sanz-Santos1, Ángeles Jiménez-Marín1, Rocío Bautista2, Noé Fernández2, Gonzalo M Claros2, Juan J Garrido1*
From International Symposium on Animal Genomics for Animal Health (AGAH 2010)
Paris, France. 31 May – 2 June 2010
Abstract
Background: Experimental exposure of swine neutrophils to bacterial lipopolysaccharide (LPS) represents a model
to study the innate immune response during bacterial infection. Neutrophils can effectively limit the infection by
secreting lipid mediators, antimicrobial molecules and a combination of reactive oxygen species (ROS) without new
synthesis of proteins. However, it is known that neutrophils can modify the gene expression after LPS exposure. We
performed microarray gene expression analysis in order to elucidate the less known transcriptional response of
neutrophils during infection.
Methods: Blood samples were collected from four healthy Iberian pigs and neutrophils were isolated and incubated
for 6, 9 and 18 hr in the presence or absence of lipopolysaccharide (LPS) from Salmonella enterica serovar Typhimurium.
RNA was isolated and hybridized to the Affymetrix Porcine GeneChip®. Microarray data were normalized using Robust
Microarray Analysis (RMA) and then, differential expression was obtained by an analysis of variance (ANOVA).
Results: ANOVA data analysis showed that the number of differentially expressed genes (DEG) after LPS treatment varies
with time. The highest transcriptional response occurred at 9 hr post LPS stimulation with 1494 DEG, whereas
6 and 18 hr showed 125 and 108 DEG, respectively. Three different gene expression tendencies were observed: genes in
cluster 1 showed a tendency toward up-regulation; cluster 2 genes showed a tendency for down-regulation at 9 hr;
and cluster 3 genes were up-regulated at 9 hr post LPS stimulation. Ingenuity Pathway Analysis revealed a delay of
neutrophil apoptosis at 9 hr. Many genes controlling biological functions were altered with time including those
controlling metabolism and cell organization, ubiquitination, adhesion, movement or inflammatory response.
Conclusions: LPS stimulation alters the transcriptional pattern in neutrophils and the present results show that the
robust transcriptional potential of neutrophils under infection conditions, indicating that active regulation of gene
Sanz-Santos et al. BMC Proceedings 2011, 5(Suppl 4):S11
http://www.biomedcentral.com/1753-6561/5/S4/S11
Finally, cluster 3 consists of 335 up-regulated genes.
Functions associated with these molecules are related
to cellular assembly and reorganization, cellular main-
tenance and gene expression. Canonical pathways are
related to protein ubiquitination signaling, PDGF sig-
naling and IL-3 signaling which is involved in cell sur-
vival by activation of JAK/STAT signaling and BCL2
[10]. Network 2 (Additional file 4) highlights NF-κB
interactions and covers several canonical pathways
such as acute phase response signaling and interferon
signaling.
Inhibition of spontaneous apoptosis at 9 hrs
Turnover of aging neutrophils occurs in the absence of
activation through a process known as spontaneous
Figure 2 Differentially expressed genes grouped into three different clusters. Cluster 1 contains 8 genes with an up-regulation tendency
through the time course. Cluster 2 contains 747 genes with a down-regulation tendency at 9 hr. The opposite tendency can be observed in
cluster 3, where 335 genes show up-regulation at 9 hr and down-regulation at 18 hr.
Time point   UP     DOWN
6 hours      61     64
9 hours      388    1106
18 hours     50     58
(Bar chart of UP and DOWN gene counts at 6, 9 and 18 hours; vertical axis from 0 to 1600.)
Figure 3 Differentially expressed genes in each time point. 125
and 108 genes were altered at 6 and 18 hr respectively, with a
similar number of up and down-regulated genes. Most significant
transcriptional changes were observed at 9 hr post LPS stimulation.
1106 genes were down-regulated and 388 were up-regulated.
RESEARCH Open Access
Pyroptosis and adaptive immunity mechanisms
are promptly engendered in mesenteric
lymph-nodes during pig infections with
Salmonella enterica serovar Typhimurium
Rodrigo Prado Martins1, Carmen Aguilar1, James E Graham2, Ana Carvajal3, Rocío Bautista4, M Gonzalo Claros4
and Juan J Garrido1*
Abstract
In this study, we explored the transcriptional response and the morphological changes occurring in porcine
mesenteric lymph-nodes (MLN) along a time course of 1, 2 and 6 days post infection (dpi) with Salmonella
Typhimurium. Additionally, we analysed the expression of some Salmonella effectors in tissue to complete our view
VETERINARY RESEARCH
Martins et al. Veterinary Research 2013, 44:120
http://www.veterinaryresearch.org/content/44/1/120
node in the network diagram represented a gene and its
relationship with other molecules was represented by a
line (solid and dotted lines represent direct and indirect
association respectively). Nodes with a red background
were input genes detected in this study while grey
nodes were molecules inserted by IPA based upon the
Ingenuity Knowledge Base to produce a highly connected
network. The score estimated the probability that a
collection of genes equal to or greater than the number
in a network could be achieved by chance alone. Scores
of 3 or higher were considered to have a 99.9% confi-
dence of not being generated by random chance alone.
For statistical analysis of enriched functions/pathways, an
IPA Knowledge Base was used as a reference set and the
Fisher’s exact test was employed to estimate the signifi-
cance of association. P-values below 0.05 were considered
statistically significant. For graphical representation of
the canonical pathways, the ratio indicates the percentage
of genes taking part in a pathway that could be found in
an uploaded data set and –log(p-value) means the level
of confidence of association. The threshold line repre-
sented a p-value of 0.05.
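The enrichment statistics described above (Fisher's exact test, the ratio and −log(p-value)) can be sketched as follows. This is a minimal illustration that assumes a one-sided (right-tailed) test on a hypergeometric model; the source does not state which tail IPA uses, and the gene counts are hypothetical.

```python
from math import comb, log10


def fisher_right_tail(overlap, dataset_size, pathway_size, reference_size):
    """P(X >= overlap) when dataset_size genes are drawn from the reference set.

    Assumption: one-sided test, which is the usual choice for enrichment.
    """
    max_overlap = min(dataset_size, pathway_size)
    total = comb(reference_size, dataset_size)
    tail = sum(
        comb(pathway_size, i) * comb(reference_size - pathway_size, dataset_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20 reference genes, a pathway of 5 genes,
# 4 uploaded genes of which 3 belong to the pathway.
p_value = fisher_right_tail(overlap=3, dataset_size=4, pathway_size=5, reference_size=20)
ratio = 3 / 5                 # fraction of pathway genes found in the data set
score = -log10(p_value)       # plotted value; the 0.05 threshold is -log10(0.05)

print(p_value, ratio, score, p_value < 0.05)
```

Read the same way, a value of 3 on the −log10 scale corresponds to p = 0.001, which matches the 99.9% confidence quoted for network scores of 3 or higher.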
Relative gene expression analysis by qPCR
Real-time quantitative PCR (qPCR) assays were performed
as previously described [11]. Fold change values
were calculated by the 2^(−ΔΔCq) method [17] using
beta-actin as the reference gene. Afterwards, data were
standardized as proposed by Willems et al. [18] and analyzed
by Kruskal–Wallis and Mann–Whitney tests using the
software SPSS 15.0 for Windows (SPSS Inc, Chicago, IL,
USA). Fold changes of 1 denoted no change in gene
expression. Values lower and higher than 1 denoted
down and up-regulation respectively. To be represented
in Table 1, a fold change of down-regulated genes
was calculated as −1/2^(−ΔΔCq). Primer pairs used for
amplifications can be found as supporting information
(see Additional file 1).
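The fold-change arithmetic described above can be sketched in a few lines. The Cq values are hypothetical, the function names are illustrative, and the standardization step of Willems et al. [18] is not reproduced here.

```python
def fold_change(cq_target_treated, cq_ref_treated, cq_target_control, cq_ref_control):
    """Fold change by the 2^(-ddCq) method, normalised to a reference gene."""
    delta_treated = cq_target_treated - cq_ref_treated
    delta_control = cq_target_control - cq_ref_control
    delta_delta_cq = delta_treated - delta_control
    return 2 ** (-delta_delta_cq)


def signed_fold_change(fc):
    """Table-style value: down-regulated genes are reported as -1 / fold change."""
    return fc if fc >= 1 else -1 / fc


# Hypothetical Cq values (target gene vs. reference gene such as beta-actin).
down = fold_change(24.0, 18.0, 22.0, 18.0)   # target amplifies 2 cycles later
up = fold_change(20.0, 18.0, 22.0, 18.0)     # target amplifies 2 cycles earlier

print(down, signed_fold_change(down))
print(up, signed_fold_change(up))
```

With this convention a four-fold decrease is shown as −4 rather than 0.25, so up- and down-regulation appear on a symmetric scale in the table.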
Western blot analysis
For protein extractions, MLN samples from all experi-
mental animals were separately homogenized on ice with
lysis buffer (7 M urea, 2 M thiourea, 4% w/v CHAPS,
0.5 mM PMSF) using a glass tissue-lyser and protein
lysate concentration was determined using a Bradford
Protein Assay (Bio-Rad). Subsequently, protein from in-
dividual replicates belonging to the same group was
pooled (30 µg total), electrophoretically fractionated in
12% (w/v) SDS-PAGE gels and transferred onto a PVDF
membrane (Millipore, Bedford, MA, USA). Western blot
assays were carried out as described by Martins et al.
[10] employing the following primary antibodies: 4B7/8
for swine histocompatibility class I antigen (SLAI) detec-
tion [19], 1 F12 for swine histocompatibility class II
antigen (SLAII) detection [19], anti-CTLA4 (Epitomics,
Burlingame, CA, USA) and anti-Clathrin light chain
(ab24579, Abcam, Cambridge, UK). To confirm equal
sample loading, membranes were reblotted with anti-GAPDH
monoclonal antibody (GenScript, Piscataway,
NJ, USA) and no statistical differences for GAPDH
abundance were observed between groups in all assays.
Membranes were scanned in an FLA-5100 imager
Table 1 Microarray data validation by qPCR.
              MICROARRAY                              qPCR
Gene          Fold change             BF              Fold change             p-value
              1 dpi   2 dpi   6 dpi                   1 dpi   2 dpi   6 dpi
CD180         1.7     2.6     1.5     0.0000429       1.1     1.8     1.2     0.010
CD1A          1.1     −1.4    1.2     0.00047793      −1.4    −2.5    1.2     0.013
DAB2          −1.2    −2.6    −1.2    6.62E-13        −3.1    −6.5    −2.6    0.001
EIF4H         −1.1    −1.1    −1.1    0.0000101       −1.5    −1.4    −1.8    0.021
ENPP6         1.3     2.0     −1.2    0.0000448       1.2     1.8     −1.7    0.000
F13A1         1.4     2.2     −1.1    0.00000227      1       1.7     −2.2    0.012
HLA-B (b)     1.0     −1.1    −1.2    0.00023747      −1.4    −1.4    −1.9    0.047
HLA-DRB5 (b)  1.0     −1.1    1.0     0.0000311       −1.4    −1.6    −2      0.036
HSPA1B (a)    3.3     1.4     −1.1    0.0001166       2.5     1.4     −1.3    0.025
HSPH1         2.3     1.7     −1.0    0.00000424      1.5     1.1     −2      0.003
IL16          −1.0    −1.2    −1.1    8.12E-07        1       −1.1    −1.5    0.035
LPCAT2        1.2     2.3     1.0     0.0000146       1.4     2       −1.3    0.010
PSMC2         −1.0    −1.0    −1.1    0.00105861      −1.1    −1.4    −1.8    0.036
TRAC          −1.0    −1.1    −1.1    0.00000951      −1.5    −1.8    −1.8    0.010
(a) Data from microarray analysis are mean values from two different probes.
(b) Amplified with SLA-B and SLA-DRB5 primers.

A miRNA Signature Predictive of Early Recurrence
Affymetrix miRNA microarray
A microRNA Signature Associated with Early Recurrence
in Breast Cancer
Luis G. Pérez-Rivas1., José M. Jerez2., Rosario Carmona3, Vanessa de Luque1, Luis Vicioso4,
M. Gonzalo Claros3,5, Enrique Viguera6, Bella Pajares1, Alfonso Sánchez1, Nuria Ribelles1,
Emilio Alba1, José Lozano1,5*
1 Laboratorio de Oncología Molecular, Servicio de Oncología Médica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga,
Spain, 2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain, 3 Plataforma Andaluza de Bioinformática, Universidad de
Málaga, Málaga, Spain, 4 Servicio de Anatomía Patológica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain,
5 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain, 6 Departamento de Biología Celular, Genética y Fisiología Animal, Universidad de
Málaga, Málaga, Spain
Abstract
Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern
after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years,
respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk
patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current
management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in
71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed
early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated
tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray
data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially
expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were
down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk
group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus
early-relapsing patients (AUC = 0.993, p-value < 0.05). Network analysis based on miRNA-target interactions curated by public
databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in
an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related
microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast
surgery.
Citation: Pérez-Rivas LG, Jerez JM, Carmona R, de Luque V, Vicioso L, et al. (2014) A microRNA Signature Associated with Early Recurrence in Breast Cancer. PLoS
ONE 9(3): e91884. doi:10.1371/journal.pone.0091884
Editor: Sonia Rocha, University of Dundee, United Kingdom
Received November 11, 2013; Accepted February 14, 2014; Published March 14, 2014
Copyright: © 2014 Pérez-Rivas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by a grant from the Spanish Society of Medical Oncology (SEOM, to NR) and by grants from the Spanish Ministerio de
Economía (SAF2010-20203 to J.L and TIN2010-16556 to J.J) and from the Junta de Andalucía (TIN-4026, to JJ). The funders had no role in study design, data
collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: jlozano@uma.es
. These authors contributed equally to this work.
Introduction
Breast cancer comprises a group of heterogeneous diseases that
can be classified based on both clinical and molecular features [1–
5]. Improvements in the early detection of primary tumors and the
development of novel targeted therapies, together with the
systematic use of adjuvant chemotherapy, have drastically reduced
mortality rates and increased disease-free survival (DFS) in breast
cancer. Still, about one third of patients undergoing breast tumor
excision will develop metastases, the major life-threatening event
which is strongly associated with poor outcome [6,7].
The risk of relapse after tumor resection is not constant over
time. A detailed examination of large series of long-term follow-up
years, respectively, followed by a nearly flat plateau in which the
risk of relapse tends to zero [8–10]. A causal link between tumor
surgery and the bimodal pattern of recurrence has been proposed
by some investigators (i.e. an iatrogenic effect) [11]. According to
that model, surgical removal of the primary breast tumor would
accelerate the growth of dormant metastatic foci by altering the
balance between circulating pro- and anti-angiogenic factors
[9,11–14]. Such hypothesis is supported by the fact that the two
peaks of relapse are observed regardless of factors other than surgery,
such as the axillary nodal status, the type of surgery or the
administration of adjuvant therapy. Although estrogen receptor
(ER)-negative tumors are commonly associated with a higher risk
In order to select the statistically significant and differentially
expressed miRNAs from Fig. 1, paired and multiple comparisons
among the prognosis groups A, B and C were performed. Two
different approaches, the limma and RankProd Bioconductor packages, were
employed. Only those candidates with a fold change (FC) > 2
(either up- or down-regulated) and an adjusted p-value < 0.05 were
selected (Table 2). Thus, comparison of the logFC and p-values
obtained with both limma and RankProd libraries led to the
identification of miR-149, miR-20b, miR-30a-3p, miR-342-5p,
downregulation in basal-like tumors. They also showed an inverse
relationship between the mitotic index and both miR-30a-3p and
miR-342-5p [76].
Differential expression of all six miRNAs was also determined
by RT-qPCR in the three prognosis groups (Table 2). With the
exception of miR-625, which could not be validated, miR-149,
miR-20b, miR-10a, miR-30a-3p and miR-342-5p (the "5-miRNA
signature", from now on) were all confirmed to be down-regulated
in tumors from relapsing patients (groups B or C) when compared
Table 2. Most significant deregulated miRNAs in breast tumors from relapsing patients.
                              limma F*            RankProd**           RT-qPCR***
Comparison#  miRNA            logFC    adj-pval   logFC    adj-pval    logFC    SE
B/A          hsa-miR-149      −1.410   0.0016     −1.615   <0.00001    −2.646   0.724
             hsa-miR-20b      −1.048   0.0071     −1.237   <0.00001    −1.542   0.521
             hsa-miR-30a-3p   −1.359   0.0078     −1.521   <0.00001    −1.001   0.514
             hsa-miR-625      −1.149   0.0014     −1.377   <0.00001    −0.347   0.282
             hsa-miR-10a      −1.235   0.0168     −1.547   <0.00001    −1.108   0.404
BC/A         hsa-miR-149      −1.120   0.0117     −1.329   <0.00001    −2.555   0.681
             hsa-miR-20b      −1.016   0.0076     −1.155   <0.00001    −1.470   0.536
             hsa-miR-30a-3p   −1.124   0.0256     −1.326   <0.00001    −0.994   0.458
             hsa-miR-625      −1.003   0.0049     −1.223   <0.00001    −0.266   0.237
B/AC         hsa-miR-149      −1.294   0.0052     −1.446   <0.00001    −2.340   0.698
             hsa-miR-10a      −1.397   0.0093     −1.647   <0.00001    −1.241   0.404
             hsa-miR-342-5p   −1.123   0.0159     −1.254   <0.00001    −1.194   0.627
# Group A = no recurrence, Group B = early recurrence (≤24 months after surgery), Group C = late recurrence (50–60 months after surgery).
*limma F, analysis of filtered data (sd > 70%) using limma.
**RankProd, analysis of unfiltered data using RankProduct algorithm.
***RT-qPCR, Relative miRNA expression was calculated using the ΔΔCt method. The standard error (SE) was calculated based on the theory of error propagation [107].
doi:10.1371/journal.pone.0091884.t002
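The selection rule behind Table 2 (fold change above 2 in either direction and adjusted p-value below 0.05, in both limma and RankProd) can be sketched as a simple filter. The sketch assumes the logFC values are log2 fold changes, stores the RankProd entries reported as "<0.00001" as their upper bound, and adds one clearly hypothetical miRNA that fails the filter for contrast.

```python
LOG2FC_CUTOFF = 1.0    # |log2 FC| > 1 is the same as FC > 2, up or down
ADJ_P_CUTOFF = 0.05

# B/A comparison from Table 2 as (logFC, adjusted p-value).
limma_b_vs_a = {
    "hsa-miR-149": (-1.410, 0.0016),
    "hsa-miR-20b": (-1.048, 0.0071),
    "hsa-miR-30a-3p": (-1.359, 0.0078),
    "hsa-miR-625": (-1.149, 0.0014),
    "hsa-miR-10a": (-1.235, 0.0168),
    "hypothetical-miR": (-0.400, 0.2000),   # invented entry, not from the paper
}
rankprod_b_vs_a = {
    "hsa-miR-149": (-1.615, 1e-5),          # "<0.00001" stored as its upper bound
    "hsa-miR-20b": (-1.237, 1e-5),
    "hsa-miR-30a-3p": (-1.521, 1e-5),
    "hsa-miR-625": (-1.377, 1e-5),
    "hsa-miR-10a": (-1.547, 1e-5),
    "hypothetical-miR": (-0.350, 0.3000),   # invented entry, not from the paper
}


def passes(log_fc, adj_p):
    """True when both the fold-change and the significance cutoffs are met."""
    return abs(log_fc) > LOG2FC_CUTOFF and adj_p < ADJ_P_CUTOFF


def selected_by_both(first, second):
    """Names passing the filter in both result sets, sorted alphabetically."""
    shared = first.keys() & second.keys()
    return sorted(n for n in shared if passes(*first[n]) and passes(*second[n]))


candidates = selected_by_both(limma_b_vs_a, rankprod_b_vs_a)
print(candidates)
```

Passing this filter only nominates a candidate: miR-625 meets both array criteria here but, as the text notes, could not be validated by RT-qPCR.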
PLOS ONE | www.plosone.org 6 March 2014 | Volume 9 | Issue 3 | e91884
Pérez-Rivas et al., Figure 2 (panels A and B). Heatmap of the tumor samples, labelled by prognosis group (A, B or C), for the probes
hsa-miR-10a_st, hsa-miR-149_st, hsa-miR-20b_st, hsa-miR-30a-star_st and hsa-miR-342-5p_st; bar plots of log2 fold change
(axis from −3 to 0) for miR-10a, miR-149, miR-20b, miR-30a-3p and miR-342-5p in the comparisons B vs A, BC vs A and B vs AC.
COLLABORATION:
Emilio Alba 
José M. Jerez

RNA-seq
SOFTWARE Open Access
SeqTrim: a high-throughput pipeline for
pre-processing any type of sequence read
Juan Falgueras1, Antonio J Lara2, Noé Fernández-Pozo3, Francisco R Cantón3, Guillermo Pérez-Trabado2,4,
M Gonzalo Claros2,3*
Abstract
Background: High-throughput automated sequencing has enabled an exponential growth rate of sequencing
data. This requires increasing sequence quality and reliability in order to avoid database contamination with
artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-
processing algorithms.
Results: SeqTrim has been implemented both as a Web and as a standalone command line application. Already-
published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality,
vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of
several input and output formats allows its inclusion in sequence processing workflows. Due to its specific
algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It
performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing
reads and does not lead to over-trimming.
Conclusions: SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including
next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know
what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual
sequence if desired. The recommended pipeline reveals more information about each sequence than previously
described pre-processors and can discard more sequencing or experimental artefacts.
Background
Sequencing projects and Expressed Sequence Tags
(ESTs) are essential for gene discovery, mapping, func-
tional genomics and for future efforts in genome anno-
tations, which include identification of novel genes, gene
location, polymorphisms and even intron-exon bound-
aries. The availability of high-throughput automated
sequencing has enabled an exponential growth rate of
sequence data, although not always with the desired
quality. This exponential growth is enhanced by the so
called “next-generation sequencing”, and efforts have to
be made in order to increase the quality and reliability
of sequences incorporated into databases: up to 0.4% of
sequences in nucleotide databases contain contaminant
sequences [1,2]. The situation is even worse in the EST
databases, where the vector contamination rate reaches 1.63%
of sequences [3]. Hence, improved and user friendly
bioinformatic tools are required to produce more reli-
able high-throughput pre-processing methods.
Pre-processing includes filtering of low-quality
sequences, identification of specific features (such as
poly-A or poly-T tails, terminal transferase tails, and
adaptors), removal of contaminant sequences (from vec-
tor to any other artefacts) and trimming the undesired
segments. There are some bioinformatic tools that can
accomplish individual pre-processing aspects (e.g. Trim-
Seq, TrimEST, VectorStrip, VecScreen, ESTPrep [4],
crossmatch, Figaro [5]), and other programs that cope
with the complete pre-processing pipeline such as
PreGap4 [6] or the broadly used tools Lucy [7,8] and
SeqClean [9]. Most of these require installation, are dif-
ficult to configure, environment-specific, or focused on
specific needs (like a design only for ESTs), or require a
change in implementation and design of either the pro-
gram or the protocols within the laboratory itself.
* Correspondence: claros@uma.es
2 Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071
Málaga, Spain
Falgueras et al. BMC Bioinformatics 2010, 11:38
© 2010 Falgueras et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
DEgenes Hunter - A Self-customised Gene
Expression Analysis Workflow for Non-model
Organisms
Isabel González Gayte1, Rocío Bautista Moreno2, and M. Gonzalo Claros1,2
1 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga,
29071 Málaga, Spain
2 Plataforma Andaluza de Bioinformática, Centro de Supercomputación y
Bioinnovación, Universidad de Málaga,
29071 Málaga, Spain
Abstract. Data from high-throughput RNA sequencing require the
development of more sophisticated bioinformatics tools to perform optimal
gene expression analysis. Several R libraries are well considered for differ-
ential expression analyses but according to recent comparative studies,
there is still an overall disagreement about which one is the most appro-
priate for each experiment. The applicable R libraries mainly depend on
the presence or not of a reference genome and the number of replicates
per condition. Here we present DEgenes Hunter, an RNA-seq analysis
workflow for the detection of differentially expressed genes (DEGs) in
organisms without genomic reference. The first advantage of DEgenes
Hunter over other available solutions is that it is able to decide the most
suitable algorithms to be employed according to the number of biological
replicates provided in the sample. The different workflow branches allow
its automatic self-customisation depending on the input data, when used
by users without advanced statistical and programming skills. All appli-
cable libraries served to obtain their respective DEGs and, as another
advantage, genes marked as DEGs by all R packages employed are consid-
ered ‘common DEGs’, showing the lowest false discovery rate compared
to the ‘complete DEGs’ group. A third advantage of DEgenes Hunter is
that it comes with an integrated quality control module to discard or
disregard low quality data before and after preprocessing. The ‘common
DEGs’ are finally submitted to a functional gene set enrichment analysis
(GSEA) and clustering. All results are provided as a PDF report.
Keywords: RNA-seq, R, pipeline, workflow, differential expression,
bioinformatic tool, functional analysis.
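The 'common DEGs' idea from the abstract (genes flagged by every package employed) amounts to a set intersection. The sketch below assumes that 'complete DEGs' means the union of all per-package calls, which the abstract does not spell out, and the gene identifiers are hypothetical.

```python
def common_and_complete(degs_by_package):
    """Return (common, complete) DEG sets from per-package DEG lists.

    common   = genes called by every package (intersection)
    complete = genes called by at least one package (union; an assumption)
    """
    gene_sets = [set(genes) for genes in degs_by_package.values()]
    if not gene_sets:
        return set(), set()
    return set.intersection(*gene_sets), set.union(*gene_sets)


# Hypothetical DEG calls from the packages of one workflow branch.
calls = {
    "DESeq2": ["gene1", "gene2", "gene3", "gene5"],
    "edgeR": ["gene1", "gene2", "gene4"],
    "limma": ["gene1", "gene2", "gene3"],
    "NOISeq": ["gene1", "gene2", "gene6"],
}

common_degs, complete_degs = common_and_complete(calls)
print(sorted(common_degs))
print(sorted(complete_degs))
```

Requiring agreement between methods trades sensitivity for specificity, which is consistent with the lower false discovery rate reported for the 'common DEGs' group.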
1 Introduction
Nowadays, high-throughput technologies are well considered for genetic stud-
ies. For the analysis of gene expression profiles, data are obtained from RNA
sequencing (RNA-seq) experiments. RNA-seq provides precise measurements of
F. Ortuño and I. Rojas (Eds.): IWBBIO 2015, Part II, LNCS 9044, pp. 313–321, 2015.
© Springer International Publishing Switzerland 2015
http://www.scbi.uma.es/seqtrimnext
MiSeq @ CIMES
We are working to apply it to model
organisms: grapevine, sole and humans

We always confirm with several algorithms
Input (Count Data) → Data Filtering → decision nodes on the number of replicates (1 and 3; YES/NO branches)
Differential expression branches: DESeq2, edgeR, limma and NOISeq; DESeq2 alone; DESeq2 and edgeR
Functional analysis (topGO), heatmap and clustering → Output (PDF report)
Fig. 1. DEgenes Hunter main workflow
2 Methods
GO term nodes of the topGO graph:
GO ID        Term                       p-value    Genes
GO:0003674   molecular_function         1.0000     225 / 41433
GO:0003824   catalytic activity         0.0012     128 / 19303
GO:0004347   glucose−6−phosphate ...    2.02e−11   7 / 22
GO:0004497   monooxygenase activi...    9.77e−11   15 / 294
GO:0005488   binding                    0.9677     127 / 25778
GO:0008289   lipid binding              8.45e−16   29 / 797
GO:0016491   oxidoreductase activ...    3.08e−19   50 / 2066
GO:0016853   isomerase activity         3.28e−05   11 / 440
GO:0016860   intramolecular oxido...    1.68e−08   8 / 82
GO:0016861   intramolecular oxido...    4.79e−10   8 / 53
GO:0046906   tetrapyrrole binding       6.07e−11   16 / 335
GO:0097159   organic cyclic compo...    0.9982     57 / 14111
GO:1901363   heterocyclic compoun...    0.9981     57 / 14093
Heatmap and expression-cluster panels: samples C1, C2, C3, T1, T2, T3 on the horizontal axis; Z-score expression from −1.5 to 1.5.
Fig. 2. Example analyses that can be performed with DEgenes Hunter on the ‘common
DEGs’ group. A: A GSEA analysis performed with topGO, where rectangle colour
represents the relative significance, ranging from dark red (most significant) to bright
yellow (least significant). B: A typical heatmap that can also be used as a quality
control to verify that control samples (C1, C2 and C3) and treatment samples (T1, T2
and T3) are grouped together. C: Expression clustering performed using cluster where
the genes have similar expression levels among control samples, and a clearly higher
value in treatment samples.
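The Z-score expression shown in panels B and C can be illustrated with a minimal sketch. It assumes per-gene scaling with the sample standard deviation, which is a common choice for heatmaps but is not stated on the slide; the function name and the expression values are hypothetical.

```python
from statistics import mean, stdev

def row_zscores(values):
    """Scale one gene's expression values to Z-scores across samples.

    Assumption: sample standard deviation (n - 1) is used; a gene with
    no variation is returned as all zeros instead of dividing by zero.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical normalised expression for one gene in the six samples
samples = ["C1", "C2", "C3", "T1", "T2", "T3"]
expression = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

z = row_zscores(expression)
for name, score in zip(samples, z):
    print(f"{name}\t{score:+.2f}")
```

With values like these, the control samples get negative scores and the treatment samples positive ones, which is the pattern described for panel C.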
3.2 Performance Testing
The utility of the ‘common DEGs’ group was confirmed by comparing FDR values.
Figure 3 shows that the FDR for ‘common DEGs’ is considerably lower than
for ‘complete DEGs’ and ‘non-common DEGs’ obtained separately with any of the R packages.
Since there is no clear way to set the threshold for qNOISeq [15], it is very high
in all cases.
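The FDR comparison between groups can be sketched as follows. This is only an illustration under stated assumptions: ‘common DEGs’ is taken as the intersection of the calls of all packages, ‘complete DEGs’ as their union, and ‘non-common DEGs’ as the difference between the two; the gene identifiers and calls are made up and stand in for synthetic data with known true DEGs.

```python
def fdr(called, true_degs):
    """False discovery rate: false positives divided by all calls."""
    if not called:
        return 0.0
    false_positives = len(called - true_degs)
    return false_positives / len(called)

# Hypothetical calls from three packages on synthetic data
calls = {
    "DESeq2": {"g1", "g2", "g3", "g7"},
    "edgeR": {"g1", "g2", "g3", "g8"},
    "limma": {"g1", "g2", "g4", "g9"},
}
true_degs = {"g1", "g2", "g3", "g4"}  # known because the data are synthetic

# Assumed group definitions (not taken from the slide)
common = set.intersection(*calls.values())
complete = set.union(*calls.values())
non_common = complete - common

for label, group in [("common", common), ("complete", complete),
                     ("non-common", non_common)]:
    print(f"{label}: {len(group)} genes, FDR = {fdr(group, true_degs):.2f}")
```

In this toy case the genes called by every package contain no false positives, while the union and the non-common genes do.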
100/0 50/50 0/100
Fig. 4. Venn diagrams showing the numbers of DEGs found in synthetic data when
different DEG ratios are used. 100/0 corresponds to all over-expressed/none repressed,
50/50 is the balanced ratio, and 0/100 corresponds to none over-expressed/all
repressed.

of a Pinus pinaster gene, one from photosynthetic tissue
and one from non-photosynthetic tissue (Table 1) were
analysed. Sequences were aligned with MultAlin using
identified a divergent region, and that the primers were
correctly designed and worked as predicted by the
software.
Figure 6 Use of AlignMiner for designing several specific primer pairs for PCR amplification of the different isoforms of the AtGS1
nucleotide sequence. (A) The 5’ and 3’ divergent regions obtained with Entropy that were selected for primer design including the
characteristic parameters of each region. (B) Results of the in silico “PCR amplification” with BioPHP [34] using the different primer pairs. Note that
the actual 3’ primers are complementary to the sequences shown on the right.
Guerrero et al. Algorithms for Molecular Biology 2010, 5:24
http://www.almob.org/content/5/1/24
Which region is most variable in an alignment?
SOFTWARE ARTICLE Open Access
AlignMiner: a Web-based tool for detection of
divergent regions in multiple sequence
alignments of conserved sequences
Darío Guerrero1, Rocío Bautista1, David P Villalobos2, M Gonzalo Claros1,2*
Abstract
Background: Multiple sequence alignments are used to study gene or protein function, phylogenetic relations,
genome evolution hypotheses and even gene polymorphisms. Virtually without exception, all available tools focus
on conserved segments or residues. Small divergent regions, however, are biologically important for specific
quantitative polymerase chain reaction, genotyping, molecular markers and preparation of specific antibodies, and
yet have received little attention. As a consequence, they must be selected empirically by the researcher.
AlignMiner has been developed to fill this gap in bioinformatic analyses.
Results: AlignMiner is a Web-based application for detection of conserved and divergent regions in alignments of
conserved sequences, focusing particularly on divergence. It accepts alignments (protein or nucleic acid) obtained
using any of a variety of algorithms, which does not appear to have a significant impact on the final results.
AlignMiner uses different scoring methods for assessing conserved/divergent regions, Entropy being the method
that provides the highest number of regions with the greatest length, and Weighted being the most restrictive.
Conserved/divergent regions can be generated either with respect to the consensus sequence or to one master
sequence. The resulting data are presented in a graphical interface developed in AJAX, which provides remarkable
user interaction capabilities. Users do not need to wait until execution is complete and can even inspect their
results on a different computer. Data can be downloaded onto a user disk, in standard formats. In silico and
experimental proof-of-concept cases have shown that AlignMiner can be successfully used to design specific
polymerase chain reaction primers as well as potential epitopes for antibodies. Primer design is assisted by a
module that deploys several oligonucleotide parameters for designing primers “on the fly”.
Conclusions: AlignMiner can be used to reliably detect divergent regions via several scoring methods that provide
different levels of selectivity. Its predictions have been verified by experimental means. Hence, it is expected that its
usage will save researchers’ time and ensure an objective selection of the best-possible divergent region when
closely related sequences are analysed. AlignMiner is freely available at http://www.scbi.uma.es/alignminer.
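To make the idea of entropy-based scoring concrete, here is a minimal sketch that computes the Shannon entropy of each alignment column and flags the columns above a threshold as divergent. It is not AlignMiner's actual scoring formula, which the abstract does not give; the function names, the threshold and the toy alignment are assumptions, and gaps would simply be counted as one more symbol.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * log2(n / total) for n in counts.values())

def divergent_columns(alignment, threshold):
    """Return 0-based indices of columns whose entropy exceeds threshold."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns
    return [i for i, col in enumerate(columns)
            if column_entropy(col) > threshold]

# Hypothetical alignment of four equally long sequences
alignment = ["ACGTA",
             "ACGTC",
             "ACATG",
             "ACATT"]

print(divergent_columns(alignment, threshold=0.5))
```

A fully conserved column scores 0 bits, and a column with four different nucleotides scores 2 bits, so raising the threshold makes the selection more restrictive.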
Background
Since the early days of bioinformatics, the elucidation of similarities between sequences has been an attainable goal to bioinformaticians and other scientists. In fact, multiple sequence alignments (MSAs) stand at a crossroad between computation and biology and, as a result, long-standing programs for DNA or protein MSAs are nowadays widely used, offering high quality MSAs. In recent years, by means of similarities between sequences and due to the rapid accumulation of gene and genome sequences, it has been possible to predict the function and role of a number of genes, discern protein structure and function [1], perform new phylogenetic tree reconstruction, conduct genome evolution studies [2], and design primers. Several scores for quantification of residue conservation and even detection of non-strictly-conserved residues have been developed that depend on the composition of the surrounding residue sequence [3], and new sequence aligners are able to integrate highly heterogeneous information and a very large number of sequences. Without exception, the sequence similarity of
1 Plataforma Andaluza de Bioinformática (Universidad de Málaga), Severo Ochoa, 34, 29590 Málaga, Spain
© 2010 Guerrero et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Table 2 Details of primers designed with AlignMiner to identify specifically by PCR the five A. thaliana GS1 genes as
well as the two primer pairs that identify the photosynthetic and non-photosynthetic isoforms of P. pinaster; note
that the 3’ (reverse) primer is complementary to the sequence appearing in Figures 6 and 8.
Isoform                      Primer                           Length  %GC  Tm (°C)  Amplicon size (bp)
GS1.1                        5’-GGTCTTTAGCAACCCTGA-3’         18      50   54.6     740
                             5’-ATCATCAAGGATTCCAGA-3’         18      39   48.7
GS1.2                        5’-GATCTTTGCTAACCCTGA-3’         18      44   51.3     739
                             5’-CTTTCAAGGGTTCCAGAG-3’         18      50   53.6
GS1.3                        5’-AATCTTCGATCATCCCAA-3’         18      39   50       739
                             5’-AAAGTCTAAAGCTTAGAG-3’         18      33   46
GS1.4                        5’-GATCTTCAGCCACCCCGA-3’         18      61   59.4     739
                             5’-AATGTGTCATCAACCGAG-3’         18      44   51.5
GS1.5                        5’-GATCTTTGAAGACCCTAG-3’         18      44   48.8     740
                             5’-TCTTTCATGGTTTCCAAA-3’         18      33   50.1
Photosynthetic isoform       5’-AGTGCGCATTAAGGACCCATCA-3’     22      50   61       177
                             5’-ACACACTGGCTTCCACAATAGG-3’     22      50   59.4
Non-photosynthetic isoform   5’-ACAGATGATCTAGGACATGC-3’       20      45   52       169
                             5’-CACTTATTTGCACTTGAAGG-3’       20      40   52.6
Figure 7 Correlation between the most divergent amino acid sequences and antigenicity of the AtGS1 protein MSA. (A) Similarity plot
obtained using the Entropy method; the most divergent regions are highlighted. (B) Aligned sequences for the two divergent regions
together (underlined in black) and their score in relation to other divergent regions. (C) Localisation of each divergent region in the alignment
where: (i) nucleotides in bold are the predicted epitopes for B-cells; (ii) an “e” denotes predicted solvent accessibility for this position; and (iii)
red-boxed amino acids correspond to the sequence of the matching divergent region. It is clearly seen that divergent sequences overlap with
the predicted epitopes and the solvent-accessible amino acids.
Primers able to
distinguish alleles
Specific
epitopes
http://www.scbi.uma.es/alignminer


Genome databases
Genetic and physical mapping of the QTLAR3 controlling
blight resistance in chickpea (Cicer arietinum L)
E. Madrid • P. Seoane • M. G. Claros • F. Barro • J. Rubio • J. Gil • T. Millán
Received: 14 January 2014 / Accepted: 14 February 2014 / Published online: 26 February 2014
© Springer Science+Business Media Dordrecht 2014
Abstract Physical and genetic maps of a chickpea
QTL related to Ascochyta blight resistance and
located in LG2 (QTLAR3) have been constructed.
Single-copy markers based on candidate genes located
in the Ca2 pseudomolecule were for the first time
obtained and found to be useful for refining the QTL
position. The location of the QTLAR3 peak was linked
to an ethylene insensitive 3-like gene (Ein3). The Ein3
gene explained the highest percentage of the total
phenotypic variation for resistance to blight (44.3 %)
with a confidence interval of 16.3 cM. This genomic
region was predicted to be at the Ca2 physical position
32–33 Mb, comprising 42 genes. Candidate genes
located in this region include Ein3, Avr9/Cf9 and
Argonaute 4, directly involved in disease resistance
mechanisms. However, there are other genes outside
the confidence interval that may play a role in the
blight resistance pathway. The information reported in
this paper will facilitate the development of functional
markers to be used in the screening of germplasm
collections or breeding materials, improving the
efficiency and effectiveness of conventional breeding
methods.
Keywords Ascochyta blight · Candidate genes · Physical map · Molecular markers
Introduction
Chickpea (Cicer arietinum L.) is a self-pollinated
diploid (2n = 2x = 16) annual grain legume widely
grown in arid and semi-arid areas across the six
continents. Together with other pulse crops, such as
lentil (Lens culinaris Medik.), dry pea (Pisum sativum
L.) and dry bean (Phaseolus vulgaris L.), chickpea is a
major source of protein in human diets, particularly in
low-income countries. In addition, chickpea crops
play an important role in the maintenance of soil
fertility, particularly in dry, rain-fed areas (Berrada
et al. 2007).
One of the most important factors contributing to
instability in chickpea yields is Ascochyta blight,
Electronic supplementary material: The online version of this article (doi:10.1007/s10681-014-1084-6) contains supplementary material, which is available to authorized users.
E. Madrid (✉) · F. Barro
Institute for Sustainable Agriculture, CSIC, Apdo 4084, 14080 Córdoba, Spain
e-mail: b62mahee@uco.es
P. Seoane · M. G. Claros
Departamento de Biología Molecular y Bioquímica, y Plataforma Andaluza de Bioinformática, Universidad de Málaga, 29071 Málaga, Spain
J. Rubio
Área de Mejora y Biotecnología, IFAPA Centro Alameda del Obispo, Apdo 3092, 14080 Córdoba, Spain
J. Gil · T. Millán
Departamento de Genética, Universidad de Córdoba, Campus Rabanales, Edif. C5, 14071 Córdoba, Spain
Euphytica (2014) 198:69–78
DOI 10.1007/s10681-014-1084-6

Transcriptome databases
De novo assembly of maritime pine transcriptome:
implications for forest breeding and biotechnology
Javier Canales1,†, Rocio Bautista2,†, Philippe Label3,†, Josefa Gomez-Maldonado1, Isabelle Lesur4,5,6, Noe Fernandez-Pozo2, Marina Rueda-Lopez1, Dario Guerrero-Fernandez2, Vanessa Castro-Rodríguez1, Hicham Benzekri2, Rafael A. Cañas1, María-Angeles Guevara7, Andreia Rodrigues8, Pedro Seoane2, Caroline Teyssier9, Alexandre Morel9, François Ehrenmann4,5, Gregoire Le Provost4,5, Celine Lalanne4,5, Celine Noirot10, Christophe Klopp10, Isabelle Reymond11, Angel García-Gutierrez1, Jean-François Trontin11, Marie-Anne Lelu-Walter9, Celia Miguel8, María Teresa Cervera7, Francisco R. Canton1, Christophe Plomion4,5, Luc Harvengt11, Concepcion Avila1,2, M. Gonzalo Claros1,2 and Francisco M. Canovas1,2,*
1 Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Universidad de Malaga, Malaga, Spain
2 Plataforma Andaluza de Bioinformatica, Edificio de Bioinnovacion, Parque Tecnologico de Andalucía, Malaga, Spain
3 INRA, Universite Blaise Pascal, Aubiere Cedex, France
4 INRA, Cestas, France
5 Universite de Bordeaux, Talence, France
6 HelixVenture, Merignac, France
7 Departamento de Ecología y Genetica Forestal, INIA-CIFOR, Madrid, Spain
8 Forest Biotech Lab, IBET/ITQB, Oeiras, Portugal
9 INRA, Unite Amelioration, Genetique et Physiologie Forestieres, Orleans Cedex 2, France
10 INRA de Toulouse Midi-Pyrenees, Auzeville, Castanet Tolosan cedex, France
11 FCBA, Pôle Biotechnologie et Sylviculture, Cestas, France
Received 20 July 2013;
revised 24 September 2013;
accepted 26 September 2013.
*Correspondence (Tel: +34 952131942;
fax: +34 952132376;
email: canovas@uma.es)
† These authors contributed equally to this work.
Summary
Maritime pine (Pinus pinaster Ait.) is a widely distributed conifer species in Southwestern
Europe and one of the most advanced models for conifer research. In the current work,
comprehensive characterization of the maritime pine transcriptome was performed using a
combination of two different next-generation sequencing platforms, 454 and Illumina.
De novo assembly of the transcriptome provided a catalogue of 26 020 unique transcripts in
maritime pine trees and a collection of 9641 full-length cDNAs. Quality of the transcriptome
assembly was validated by RT-PCR amplification of selected transcripts for structural and
regulatory genes. Transcription factors and enzyme-encoding transcripts were annotated.
Furthermore, the available sequencing data permitted the identification of polymorphisms and
Plant Biotechnology Journal (2014) 12, pp. 286–299 doi: 10.1111/pbi.12136
http://www.scbi.uma.es/sustainpinedb/
RESEARCH ARTICLE Open Access
De novo assembly, characterization and functional
annotation of Senegalese sole (Solea senegalensis)
and common sole (Solea solea) transcriptomes:
integration in a database and design of a
microarray
Hicham Benzekri1,2, Paula Armesto3, Xavier Cousin4,5, Mireia Rovira6, Diego Crespo6, Manuel Alejandro Merlo7, David Mazurais8, Rocío Bautista2, Darío Guerrero-Fernández2, Noe Fernandez-Pozo1, Marian Ponce3, Carlos Infante9, Jose Luis Zambonino8, Sabine Nidelet10, Marta Gut11, Laureana Rebordinos7, Josep V Planas6, Marie-Laure Bégout4, M Gonzalo Claros1,2 and Manuel Manchado3*
Abstract
Background: Senegalese sole (Solea senegalensis) and common sole (S. solea) are two economically and
evolutionarily important flatfish species both in fisheries and aquaculture. Although some genomic resources and
tools were recently described in these species, further sequencing efforts are required to establish a complete
transcriptome, and to identify new molecular markers. Moreover, the comparative analysis of transcriptomes will be
useful to understand flatfish evolution.
Results: A comprehensive characterization of the transcriptome for each species was carried out using a large set
of Illumina data (more than 1,800 million reads for each sole species) and 454 reads (more than 5 million reads
only in S. senegalensis), providing coverages ranging from 1,384x to 2,543x. After a de novo assembly, 45,063 and
38,402 different transcripts were obtained, comprising 18,738 and 22,683 full-length cDNAs in S. senegalensis and S.
solea, respectively. A reference transcriptome with the longest unique transcripts and putative non-redundant new
transcripts was established for each species. A subset of 11,953 reference transcripts was qualified as highly reliable
orthologs (97% identity) between both species. A small subset of putative species-specific, lineage-specific and
flatfish-specific transcripts were also identified. Furthermore, transcriptome data permitted the identification of single
nucleotide polymorphisms and simple-sequence repeats confirmed by FISH to be used in further genetic and expression
studies. Moreover, evidence for the retention of crystallins crybb1, crybb1-like and crybb3 in the two species of soles is
also presented. Transcriptome information was applied to the design of a microarray tool in S. senegalensis that was
successfully tested and validated by qPCR. Finally, transcriptomic data were hosted and structured at SoleaDB.
Conclusions: Transcriptomes and molecular markers identified in this study represent a valuable source for future
genomic studies in these economically important species. Orthology analysis provided new clues regarding sole
genome evolution indicating a divergent evolution of crystallins in flatfish. The design of a microarray and establishment
of a reference transcriptome will be useful for large-scale gene expression studies. Moreover, the integration of
Benzekri et al. BMC Genomics 2014, 15:952
http://www.juntadeandalucia.es/agriculturaypesca/ifapa/soleadb_ifapa/

ReprOlive and new allergens
Unigen number | QSEQID | FLN_STATUS | FLN_HIT_DEFINITION | SACC | ALLERGOME CODE | SDEFINITION
1 | olive_transcript_000475 | Complete Sure | sp=5-methyltetrahydropteroyltriglutamate--homocysteine methyltransferase; Catharanthus roseus (Madagascar periwinkle) (Vinca rosea). | E3VW74 | - | Pollen allergen MetE (Fragment) OS=Amaranthus retroflexus PE=2 SV=1
2 | olive_transcript_000659 | Complete Sure | sp=Luminal-binding protein 5; Nicotiana tabacum (Common tobacco). | Q9FSY7 | 243; 3215 | Putative luminal binding protein OS=Corylus avellana GN=BiP PE=2 SV=1
3 | olive_transcript_002489 | Complete Putative | sp=Cysteine proteinase RD19a; Arabidopsis thaliana (Mouse-ear cress). | A5HIJ3 | 1 | Cysteine protease Cp3 OS=Actinidia deliciosa PE=2 SV=1
4 | olive_transcript_003129 | Complete Sure | sp=Malate dehydrogenase, mitochondrial; Fragaria ananassa (Strawberry). | P17783 | 6159 | Malate dehydrogenase, mitochondrial OS=Citrullus lanatus GN=MMDH PE=1 SV=1
5 | olive_transcript_003931 | Complete Sure | sp=L-ascorbate peroxidase 1, cytosolic; Arabidopsis thaliana (Mouse-ear cress). | Q42661 | 2423 | L-ascorbate peroxidase OS=Capsicum annuum PE=2 SV=1
6 | olive_transcript_005675 | C_terminal Putative | sp=Glyceraldehyde-3-phosphate dehydrogenase, cytosolic; Petroselinum crispum (Parsley) (Petroselinum hortense). | C7C4X1 | 9501; 9502 | Glyceraldehyde-3-phosphate dehydrogenase OS=Triticum aestivum GN=ga3pd PE=2 SV=1
7 | olive_transcript_007323 | Complete Putative | sp=Triosephosphate isomerase, cytosolic; Petunia hybrida (Petunia). | Q9FS79 | 920; 9498 | Triosephosphate isomerase OS=Triticum aestivum GN=tpis PE=2 SV=1
8 | olive_transcript_008377 | C_terminal Sure | sp=Glyceraldehyde-3-phosphate dehydrogenase, cytosolic; Antirrhinum majus (Garden snapdragon). | C7C4X1 | 9501; 9502 | Glyceraldehyde-3-phosphate dehydrogenase OS=Triticum aestivum GN=ga3pd PE=2 SV=1
9 | olive_transcript_008559 | Complete Sure | sp=Superoxide dismutase [Mn], mitochondrial; Nicotiana plumbaginifolia (Leadwort-leaved tobacco) (Tex-Mex tobacco). | Q9FSJ2 | 380; 383 | Superoxide dismutase (Fragment) OS=Hevea brasiliensis GN=sod PE=2 SV=1
10 | olive_transcript_008909 | - | - | B9T876 | - | Minor allergen Alt a, putative OS=Ricinus communis GN=RCOM_0066700 PE=3 SV=1
11 | olive_transcript_009735 | - | - | W9RZW9 | - | Minor allergen Alt a 7 OS=Morus notabilis GN=L484_009041 PE=3 SV=1
12 | olive_transcript_010769 * | Complete Sure | sp=Probable calcium-binding protein CML13; Arabidopsis thaliana (Mouse-ear cress). | Q2KM81 | 1070; 3105 | Polcalcin OS=Artemisia vulgaris PE=2 SV=1
13 | olive_transcript_018199 | C_terminal Putative | sp=Peptidyl-prolyl cis-trans isomerase 1; Glycine max (Soybean) (Glycine hispida). | Q8L5T1 | 134 | Peptidyl-prolyl cis-trans isomerase OS=Betula pendula GN=ppiase (CyP) PE=2 SV=1
14 | olive_transcript_027589 * | C_terminal Putative | sp=Profilin; Litchi chinensis (Lychee). | Q2PQ57 | 449 | Profilin OS=Litchi chinensis PE=2 SV=1
POLLEN TRANSCRIPTOME ALLERGOME – UNIPROT ALLERGENS
New, undescribed
allergens
New profilins and
variants of
known allergens
http://reprolive.eez.csic.es/
Semantic searches
COLLABORATION:
José Aldana

AutoFlow: automating workflows
[Figure 4: bar chart of time (hours) per workflow task. Bars: Total_time, Euler_assembling_k_25, Euler_assembling_k_29, MIRA3_assembling, Euler_remove_artifacts_k_25, Euler_remove_artifacts_k_259, validate_contigs_with_mapping_k_25, validate_contigs_with_mapping_k_29, rescue_unmapped_contigs_k_25, rescue_unmapped_contigs_k_29, recover_MIRA3_debris, MIRA3_remove_artifacts, CAP3_reconciliation_k_25, CAP3_reconciliation_k_29, FLN_analysis_of_CAP3_contigs_k_25, FLN_analysis_of_CAP3_contigs_k_29, TIDs, choose_best_assembly+cp_best_assembly]
AutoFlow, a Versatile Workflow Engine Illustrated by Assembling an
Optimised de novo Transcriptome for a Non-Model Species, such as Faba
Bean (Vicia faba)
Running title: AutoFlow, a versatile workflow engine
Pedro Seoane1, Sara Ocaña2, Rosario Carmona3, Rocío Bautista3, Eva Madrid4, Ana M. Torres2, M. Gonzalo Claros1,3,*
