Pinard M-H, Gay C, Pastoret P-P, Dodet B (eds): Animal Genomics for Animal Health. Dev Biol
(Basel). Basel, Karger, 2008, vol 132, pp 293-299.
A Systematic, Data-driven Approach to the Combined Analysis of Microarray and QTL Data
C. Rennie1,2, H. Hulme2, P. Fisher2, L. Hall3, M. Agaba4, H.A. Noyes1, S.J. Kemp1,4, A. Brass2
1. School of Biological Sciences, Biosciences Building, Liverpool, UK
2. School of Computer Science / Faculty of Life Sciences, University of Manchester, UK
3. Roslin Institute, Roslin, Midlothian, Scotland, UK
4. ILRI, Nairobi, Kenya
Keywords: automated analysis, microarray, QTL, workflow
Abstract: High-throughput technologies inevitably produce vast quantities of data. This presents
challenges in terms of developing effective analysis methods, particularly where the analysis
involves combining data derived from different experimental technologies. In this investigation,
a systematic approach was applied to combine microarray gene expression data, quantitative
trait loci (QTL) data and pathway analysis resources in order to identify functional candidate
genes underlying tolerance to Trypanosoma congolense infection in cattle. We automated much
of the analysis using Taverna workflows previously developed for the study of trypanotolerance
in the mouse model.
Pathways represented by genes within the QTL regions were identified, and this list was
subsequently ranked according to which pathways were over-represented in the set of genes
that were differentially expressed (over time or between tolerant N'dama and susceptible Boran
breeds) at various timepoints after T. congolense infection. The genes within the QTL that played
a role in the highest ranked pathways were flagged as good targets for further investigation and
experimental confirmation.
INTRODUCTION
The analysis of microarray gene expression data can present difficulties due to
the vast size of the datasets. Depending on the purpose of the study, analysis may
be further complicated by the need to combine data produced using different
experimental techniques or by the underlying complexity of the phenotype being
investigated. A systematic, data-driven, semi-automated analysis pipeline was
developed for the pathway-based combined analysis of microarray and quantitative
trait locus (QTL) data as part of a study investigating the genetics underlying tolerance
to African bovine trypanosomiasis (nagana).
Nagana is transmitted by the tsetse fly, leading to loss of productivity and often
death in infected cattle. It represents a major constraint on livestock production in
Africa [1].
Some breeds of cattle, such as the Boran, are susceptible to the pathological
consequences of trypanosomiasis. Others, such as the N'dama, are more resistant to
these effects (trypanotolerant) [2]. However, the susceptible breeds have desirable
traits, such as greater size, and may be preferred by farmers. Identification of genes
that influence response to trypanosomiasis might inform new treatment approaches,
or even pave the way for creating transgenic breeds that combine the desirable traits
of susceptible and trypanotolerant cattle.
Trypanotolerance is a complex phenotype including several distinct components,
likely to involve separate genetic control mechanisms. Features include the ability
to control anaemia, control parasitaemia and maintain bodyweight. Previous studies
provide evidence of the complexity of trypanotolerance. The trypanosomiasis response
of haematopoietic chimeric twins bred from one Boran and one N'dama parent was
studied, demonstrating that control of anaemia depends on bone marrow from a
trypanotolerant background, whereas control of parasitaemia does not [3]. The
mapping study that provided QTL data used in this analysis showed that the proportion
of phenotypic variation explained by each QTL was between 6 and 20%, suggesting
that multiple genes, or complex epistatic or environmental effects, may influence
each trait [4].
A microarray gene expression time course study was carried out to investigate
gene expression differences between (trypanotolerant) N'dama and (trypanosusceptible)
Boran cattle infected with T. congolense strain IL1180. This study generated a vast
dataset. Thousands of probe sets on the array generated signals that were significantly
different between timepoints and/or between the two breeds (in T-tests or paired
T-tests with p ≤ 0.01).
A mapping study identified QTL for 16 phenotypic traits associated with
trypanotolerance in Boran and N'dama cattle [4]. The gene underlying a QTL is not
assumed to be differentially expressed. However, it is expected to connect biologically
with differentially expressed genes. The known pathways that included a gene within
one of the five QTL with the largest effect were identified and compared with the
known pathways that included a differentially expressed gene. The rationale behind
this approach was to establish the possible connections between the QTL and the
differentially expressed genes.
A systematic strategy was used to enable an objective triage of the datasets,
resulting in a shortlist of strong candidate pathways that included both a differentially
expressed gene and a gene within a trypanotolerance QTL. These pathways were
ranked according to the results of a Fisher exact test performed using the Database
for Annotation, Visualisation and Integrated Discovery (DAVID) [5]. A literature
search was carried out to determine whether the biological function of each pathway
was likely to be linked to the phenotypic trait influenced by the QTL.
Large sections of the analysis were automated by adapting Taverna workflows
originally developed for the study of trypanotolerance in the mouse model [6]. This
allowed for the entire analysis to be repeated consistently and relatively quickly and,
for example, for the incorporation of information on the bovine genome from a
different EnsEMBL build. It was also possible to adapt the analysis procedure to
examine a different species or a different phenotype for which QTL data was available.
MATERIALS AND METHODS
Microarray gene expression data was acquired using Affymetrix Bovine Genome "100 format (Midi)"
microarrays for liver samples harvested from Boran and N'dama cattle at 0, 12, 15, 18, 21, 26, 29, 32 and 35
days post-infection.
This data was analysed with dChip [7] to identify and remove outliers before normalisation using the
robust multi-array average (RMA) method. Principal components analysis (PCA) was used to check that
hybridisations grouped as expected.
T-tests were used to compare gene expression between breeds at each time point. Paired T-tests (using
data for the same individual animals at different timepoints) were used to compare gene expression for each
time point with day 0. Lists of probes that showed differential gene expression (p ≤ 0.01) between breeds or
over time were compiled.
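The two comparisons described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function names, the signal values and the choice of a pooled-variance (Student) test for the between-breed comparison are all assumptions, and the p-value is obtained by numerically integrating the t density rather than by calling a statistics package.

```python
import math
from statistics import mean, stdev, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t| (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def paired_t(day0, later):
    """Paired test: the same animals measured at day 0 and at a later time point."""
    diffs = [b - a for a, b in zip(day0, later)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, two_sided_p(t, n - 1)


def two_sample_t(group_a, group_b):
    """Pooled-variance test (an assumption): two breeds at a single time point."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a)
              + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, na + nb - 2)


# Hypothetical normalised signals for one probe set in four animals.
ALPHA = 0.01
probe_day0 = [7.1, 6.9, 7.4, 7.0]
probe_day21 = [8.0, 7.9, 8.1, 8.2]
t_stat, p_value = paired_t(probe_day0, probe_day21)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, listed: {p_value <= ALPHA}")
```

In practice the test would be repeated for every probe set and every time point, and each probe set passing the 0.01 threshold would be added to the corresponding list.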
A previous study identified trypanotolerance QTL in N'dama and Boran cattle [4]. Five QTL were selected
to be included in this analysis based on phenotypic trait, mapping resolution and strength of effect. Base pair
positions of QTL relative to the EnsEMBL bovine genome preliminary build Btau2.0 were determined manually.
Names and phenotypes for the five QTL are shown in Table 1.
Table 1: Name and phenotype for the five trypanotolerance QTL used in this analysis. For more detailed
information, please refer to the original mapping study [4].
To combine microarray and QTL data, a Taverna workflow previously developed for the study of
trypanotolerance in the mouse model was adapted [6]. The paper cited provides a full description. In brief,
lists of differentially expressed genes (over time or between breeds) were associated with Kyoto Encyclopaedia
of Genes and Genomes (KEGG) pathways. A separate process identified genes within QTL and associated
these with KEGG pathways. A third process compared these lists to produce a list of KEGG pathways that
contained both differentially expressed genes and genes from the QTL.
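The three processes outlined above can be illustrated with a small Python sketch. It is not the Taverna workflow itself: every gene name, pathway name and coordinate below is a hypothetical placeholder, and the lookups are plain dictionaries standing in for the EnsEMBL and KEGG queries that the real workflow performs.

```python
from collections import defaultdict

# Hypothetical inputs; identifiers and coordinates are illustrative only.
QTL_REGIONS = {"QTL_A": ("chr1", 1_000_000, 5_000_000)}
GENE_POSITIONS = {
    "GENE1": ("chr1", 1_200_000, 1_250_000),
    "GENE2": ("chr1", 6_000_000, 6_100_000),
    "GENE3": ("chr2", 2_000_000, 2_050_000),
}
GENE_PATHWAYS = {
    "GENE1": {"pathway_x", "pathway_y"},
    "GENE2": {"pathway_x"},
    "GENE3": {"pathway_y", "pathway_z"},
}
DIFFERENTIALLY_EXPRESSED = {"GENE3"}


def genes_in_qtl(regions, positions):
    """Return the genes whose coordinates overlap each QTL interval."""
    hits = defaultdict(set)
    for qtl, (q_chrom, q_start, q_end) in regions.items():
        for gene, (g_chrom, g_start, g_end) in positions.items():
            if g_chrom == q_chrom and g_start <= q_end and g_end >= q_start:
                hits[qtl].add(gene)
    return hits


def pathways_for(genes, gene_pathways):
    """Group a gene list by the pathways each gene is annotated to."""
    by_pathway = defaultdict(set)
    for gene in genes:
        for pathway in gene_pathways.get(gene, ()):
            by_pathway[pathway].add(gene)
    return by_pathway


def shared_pathways(qtl_genes, de_genes, gene_pathways):
    """Keep pathways holding at least one QTL gene and one differentially expressed gene."""
    qtl_side = pathways_for(qtl_genes, gene_pathways)
    de_side = pathways_for(de_genes, gene_pathways)
    return {
        pathway: {"qtl_genes": qtl_side[pathway], "de_genes": de_side[pathway]}
        for pathway in qtl_side.keys() & de_side.keys()
    }


qtl_hits = genes_in_qtl(QTL_REGIONS, GENE_POSITIONS)
all_qtl_genes = set().union(*qtl_hits.values())
candidates = shared_pathways(all_qtl_genes, DIFFERENTIALLY_EXPRESSED, GENE_PATHWAYS)
for pathway, genes in sorted(candidates.items()):
    print(pathway, sorted(genes["qtl_genes"]), sorted(genes["de_genes"]))
```

Note that the QTL gene and the differentially expressed gene in a shared pathway need not be the same gene, which matches the rationale that the gene underlying a QTL is not assumed to be differentially expressed.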
Some adaptations of the workflow were necessary. Rather than the mouse EnsEMBL build and IDs, the
bovine EnsEMBL preliminary build (Btau2.0) was used. Bovine gene IDs were retrieved for Affymetrix probes
and then mapped to human homologues (using EnsEMBL data for NCBI build 36), so that human IDs could be
used for the remainder of the analysis; the annotation available for bovine genes is very limited. Output was in
the same form as the original, comprising a list of KEGG pathways that included at least one differentially
expressed gene and at least one gene from the QTL.
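The identifier mapping described above is a two-step lookup, probe to bovine gene and bovine gene to human homologue, in which entries can drop out at either step. A minimal sketch follows; the identifiers and the two lookup tables are hypothetical stand-ins for the EnsEMBL queries, and the function name is not taken from the workflow.

```python
# Hypothetical lookup tables standing in for EnsEMBL annotation queries.
PROBE_TO_BOVINE = {
    "probe_1": "BOV_GENE_A",
    "probe_2": "BOV_GENE_B",
    "probe_3": None,  # probe set with no annotated bovine gene
}
BOVINE_TO_HUMAN = {
    "BOV_GENE_A": ["HUM_GENE_A"],
    "BOV_GENE_B": [],  # bovine gene with no human homologue on record
}


def map_probes_to_human(probes, probe_to_bovine, bovine_to_human):
    """Map probe IDs to human homologue IDs, recording probes lost on the way."""
    mapped, unmapped = {}, []
    for probe in probes:
        bovine_id = probe_to_bovine.get(probe)
        human_ids = bovine_to_human.get(bovine_id, []) if bovine_id else []
        if human_ids:
            mapped[probe] = human_ids
        else:
            unmapped.append(probe)
    return mapped, unmapped


mapped, unmapped = map_probes_to_human(
    ["probe_1", "probe_2", "probe_3"], PROBE_TO_BOVINE, BOVINE_TO_HUMAN
)
print("mapped:", mapped)
print("unmapped:", unmapped)
```

Keeping the unmapped probes in a separate list makes it easy to see how much of the expression data is lost through missing annotation when the analysis is rerun against a newer genome build.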
This list was ranked based on the p-value of each pathway in a Fisher exact test performed on the microarray
data using DAVID, indicating whether pathway genes showed more differential expression than expected by
chance. The list was annotated to add gene symbols for pathway genes in the QTL and to indicate the breeds
and timepoints in which pathway genes were differentially expressed. Further annotation was derived from
gene and pathway resources including GenBank, iHOP, GenMAPP and GeneGo: MetaCore.
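The ranking step can be illustrated with a plain one-sided Fisher exact test, computed from the hypergeometric distribution. This is a sketch of the general technique only: DAVID applies its own scoring, which may differ in detail, and the pathway names and counts below are invented for the example.

```python
from math import comb


def fisher_enrichment_p(de_in_pathway, de_total, pathway_total, background_total):
    """One-sided Fisher exact p-value.

    Probability of seeing at least de_in_pathway differentially expressed
    genes in a pathway of pathway_total genes, when de_total of the
    background_total genes are differentially expressed.
    """
    max_overlap = min(de_total, pathway_total)
    tail = sum(
        comb(pathway_total, k)
        * comb(background_total - pathway_total, de_total - k)
        for k in range(de_in_pathway, max_overlap + 1)
    )
    return tail / comb(background_total, de_total)


def rank_pathways(counts, de_total, background_total):
    """Sort pathways from most to least over-represented (smallest p first)."""
    scored = [
        (fisher_enrichment_p(k, de_total, size, background_total), name)
        for name, (k, size) in counts.items()
    ]
    return sorted(scored)


# Hypothetical counts: pathway -> (differentially expressed genes, pathway size).
counts = {"pathway_x": (3, 3), "pathway_y": (1, 5)}
ranking = rank_pathways(counts, de_total=3, background_total=10)
for p_value, name in ranking:
    print(f"{name}: p = {p_value:.4f}")
```

A small p-value means that the pathway contains more differentially expressed genes than would be expected if those genes were scattered at random across the background set.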
Figure 1 summarises the analysis protocol described above.
QTL      Phenotype
BTA2     Anaemia
BTA4     Parasitaemia
BTA7     Anaemia and parasitaemia
BTA16    Anaemia
BTA27    Anaemia
Fig. 1: Summary of the analysis procedure. Automated sections are indicated using grey shading.
RESULTS
This analysis procedure could be re-used or adapted to examine another species
or phenotype for which QTL data are available. The modular nature of the protocol
and of Taverna workflows facilitates adding or altering analysis stages. Workflow
scripts and supplementary data (e.g. files from intermediate stages) are available
from the authors.
In the bovine trypanotolerance study, the result of the analysis procedure was to
provide a short list of targets for further investigation. This result can be quantified
by assessing the numbers of genes requiring further investigation based on the
combined analysis results or on the original data.
Out of 24,128 probe sets on the array, 12,591 were significantly differentially
expressed (p ≤ 0.01) in T-tests or paired T-tests comparing expression between breeds
or over time. Of these probe sets, 8,342 were mapped to a known gene, in total
representing 7,071 unique gene symbols.
After combining the pathway lists for differentially expressed and QTL genes,
pathway genes within QTL provided a list of 127 targets. Restricting the pathway
list to those with a significant (p ≤ 0.05) score in the DAVID Fisher exact test reduced
this to 51 targets. The list could be reduced further by checking whether expression
changes are downstream of the QTL gene and whether the pathway function is related
to the QTL phenotype.
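The scale of the triage can be made concrete with a little arithmetic on the counts reported above. The sketch below only restates those counts as percentages; the variable names are ours, and comparing the target lists with the number of differentially expressed gene symbols is one possible reading of how the reduction might be quantified.

```python
# Counts as reported in the text above.
PROBE_SETS_ON_ARRAY = 24_128
DIFFERENTIAL_PROBE_SETS = 12_591
UNIQUE_GENE_SYMBOLS = 7_071
TARGETS_SHARED_PATHWAYS = 127
TARGETS_SIGNIFICANT_PATHWAYS = 51


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


summary = {
    "probe sets differentially expressed": percent(
        DIFFERENTIAL_PROBE_SETS, PROBE_SETS_ON_ARRAY),
    "targets kept by the significance filter": percent(
        TARGETS_SIGNIFICANT_PATHWAYS, TARGETS_SHARED_PATHWAYS),
    "final targets relative to differentially expressed genes": percent(
        TARGETS_SIGNIFICANT_PATHWAYS, UNIQUE_GENE_SYMBOLS),
}
for label, value in summary.items():
    print(f"{label}: {value:.2f}%")
```

On these figures, roughly half of the probe sets on the array show differential expression, whereas the final target list is under one percent of the size of the list of differentially expressed gene symbols.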
The list of pathways with a significant score (p ≤ 0.05) in the DAVID Fisher exact
test is displayed in Table 2. Pathway genes lying within each QTL are also listed.
Note that these data are based on an analysis using the EnsEMBL Btau2.0 preliminary
build. A more recent preliminary build is available, and the analysis will be repeated,
and key findings discussed, in a future publication.
DISCUSSION
When studying complex phenotypes, analysis based on biological processes
already known to be involved may be insufficient. It is possible that other key
biological pathways, or complex interactions between them, could be missed. Data-
driven approaches are useful to identify the biological processes showing strongest
variation in the results.
The aim of a pathway-based approach to analysing microarray and QTL data is
to identify biologically meaningful links between the two datasets. The gene underlying
a QTL is not necessarily differentially expressed, but may influence the expression
of other genes downstream in a known pathway. This approach allows such genes
to be identified without detailed investigation of every gene in the QTL regions or
every gene that is differentially expressed.
Automation is increasingly necessary to handle the vast quantities of data produced
by high-throughput technologies, where manual analysis of the entire dataset is not
feasible. Automated approaches are systematic, promoting consistency and reducing
bias. Consistent replication of automated analyses is relatively simple, allowing
separate studies to produce comparable results and allowing analyses to be repeated
in order to incorporate new information (e.g. from updates to genome build and gene
information available in public databases). Automated analysis can be an effective
triage process, producing a shortlist of strong targets for thorough manual investigation.
Table 2: Pathways with a significant (p ≤ 0.05) score in a Fisher exact test to determine whether the differential
expression of pathway genes is higher than expected by chance. The columns on the right give the
gene symbols for pathway genes within each of the QTL.
KEGG pathway name BTA2 BTA4 BTA7 BTA16 BTA27
Leukocyte transendothelial migration VAV1 CLDN23
Regulation of actin cytoskeleton
FN1 CHRM2 VAV1 BRAF
PIP5K3 FGF20
Cell cycle
MCM6 CDKN2D
ORC4L
Gap junction
PRKACA GNAQ
TUBB4
Focal adhesion
FN1 ZYX COL5A3 CAPN2 BRAF
VAV1
MAPK signalling pathway
CASP8 CASP2 ECSIT DUSP10 BRAF
PRKACA DUSP4
FGF20
IKBKB
Hematopoietic cell lineage
EPOR
FCER2
Huntington's disease CASP8
Glycerolipid metabolism LCT DGKI AGPAT6
Axon guidance
EFNB1 EPHA1 UNC5D
EPHB6
Glycerophospholipid metabolism DGKI ARD1A AGPAT6
Adherens junction INSP FGFR1
Neurodegenerative disorders CASP8
T cell receptor signalling pathway
CD28 VAV1 IKBKB
CTLA4
ICOS
Long-term potentiation PRKACA GNAQ BRAF
Apoptosis
CASP8 IRAK1 CAPN2 IKBKB
PRKACA
Toll-like receptor signalling pathway
CASP8 IRAK1 IKBKB
TICAM1
Wnt signalling pathway
PRKACA DKK4
SFRP1
Glutathione metabolism IDH1 GSTK1
Calcium signalling pathway
CHRM2 PTGER1 GNAQ ADRB3
PRKACA GNA14 VDAC3
ITPKB
CONCLUSION
Systematic data-driven automated approaches offer an excellent means to triage
data from high-throughput technologies, providing a shortlist of viable targets for
thorough manual analysis and experimental confirmation.
ACKNOWLEDGEMENTS
This work was wholly funded by The Wellcome Trust.
REFERENCES
1 Kristjanson PM, Swallow BM, Rowlands GJ, Kruska RL, de Leeuw PN: Measuring the costs of African
animal trypanosomosis, the potential benefits of control and returns to research. Agric Syst 1999;59:79-98.
2 Murray M, D'Ieteren G, Teale AJ: Trypanotolerance, in Maudlin I, Holmes PH, Miles MA (eds): The
Trypanosomiases. Wallingford UK, CABI Publishing, 2004, pp 461-477.
3 Naessens J, Leak SG, Kennedy D, Kemp SJ, Teale AJ: Responses of bovine chimaeras combining
trypanosomosis resistant and susceptible genotypes to experimental infection with Trypanosoma congolense.
Vet Parasitol 2003;111:125-142.
4 Hanotte O, Ronin Y, Agaba M, Nilsson P, Gelhaus A, Horstmann R et al: Mapping of quantitative trait
loci controlling trypanotolerance in a cross of tolerant West African N'dama and susceptible East African
Boran cattle. Proc Natl Acad Sci USA 2003;100(13):7443-7448.
5 Dennis GJ, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC et al: DAVID: Database for Annotation,
Visualisation, and Integrated Discovery. Genome Biol 2003;4(9):R60.
6 Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S et al: A systematic strategy for large-
scale analysis of genotype-phenotype correlations: identification of candidate genes involved in African
trypanosomiasis. Nucl Acids Res 2007;35(16):5625-5633.
7 Li C, Wong WH: Model-based analysis of oligonucleotide arrays: Expression index computation and
outlier detection. Proc Natl Acad Sci USA 2001;98:31-36.
Catriona Rennie, LF8, Kilburn Building, The University of Manchester, Oxford Rd, Manchester, M13 9PL,
UK.
E-mail: catriona.rennie@postgrad.manchester.ac.uk