Title: An Information Platform for Business Intelligence in the Aid Sector based on Open Data and Documents
Subtitle: Integrated Access to structured and unstructured data using the document-oriented database CouchDB
Commission: Nivocer B.V.
Document: Thesis Bachelor of ICT
Institution: Amsterdam University of Applied Sciences
Domain: DMCI: Domain Media, Creation and Information
Curriculum: ICT parttime
Module: Graduation Project
Period: February - July 2012
Delivery: 13 August 2012
Author: Michiel Kuijper
Student no.: 500609474
Version: Distribution
Unlocking Open Data for development aid [1]
The Millennium Development Goals define a number of things that should be properly arranged for
everyone, e.g. living conditions in the home. In Africa, for instance, many households still cook
indoors on wood. Among other things this causes respiratory and eye problems, so it is bad for
people's health. This way of cooking also leads to a lot of logging, which causes deforestation. This
in turn affects the ecosystem. This way of cooking has a social impact as well. Collecting wood costs
women and children a lot of time, which ties them to the household for long periods. As a result they
are unable to advance socially.
One of the solutions devised to tackle this is cooking on biogas. Biogas is the gas released from the
manure of cows, goats and chickens. This works because no smoke enters the house, no wood is cut,
and women and children can spend their time on other things, such as education or crafts. Cooking on
biogas requires special installations. Building these installations requires knowledge and skills.
Transferring this knowledge and these skills requires money. This money comes from development
programmes. These programmes are sponsored by the prosperous countries. These countries believe
that it is in everyone's interest that the less prosperous countries are also economically and socially
healthy.
A problem with development programmes is that there are many of them and that their objectives
often overlap. Other objectives sometimes receive too little attention because donors assume that
someone else is taking care of them. There is, in other words, a lack of overview. That is why the
cooperating organisations decided to publish their information about development programmes, as a
first step towards gaining an overview. The International Aid Transparency Initiative (IATI) arose
from this. The HvA project described in this report investigated what needs to be done to turn the
data published in the IATI standard into an overview.
[1] This page with links is available at http://www.michielkuijper.nl/aboutiatidemo
The HvA research looked at the IATI data through a business intelligence lens. IATI data is published
as XML. XML is an exchange standard for computers. To make XML readable for people it has to be
converted into tabular form. To be able to compare these tables with each other, their meaning has to
be made uniform and unambiguous. In the HvA project software was developed to read in IATI XML,
make it unambiguous and present it in comparison tables. The activities and expenditures of donors
can be viewed from different viewpoints. For example: which donors give money to which countries,
for which objectives, and which form of aid do they use for this. This enables interested parties to see
how the contributions of different donors come together for a particular viewpoint.
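The conversion from XML to tabular form can be pictured with a minimal Python sketch. It is an illustration only, not the project's actual import code: the sample file is made up and heavily simplified, and the element names (`iati-activity`, `iati-identifier`, `reporting-org`, `recipient-country`, `sector`) are assumed to follow the IATI activity schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified IATI-style sample; all values are made up
# and real IATI files carry many more elements and attributes.
SAMPLE_XML = """
<iati-activities>
  <iati-activity>
    <iati-identifier>EX-1</iati-identifier>
    <reporting-org ref="DONOR-A">Donor A</reporting-org>
    <recipient-country code="ET">Ethiopia</recipient-country>
    <sector code="X1">Energy</sector>
  </iati-activity>
  <iati-activity>
    <iati-identifier>EX-2</iati-identifier>
    <reporting-org ref="DONOR-B">Donor B</reporting-org>
    <recipient-country code="SD">Sudan</recipient-country>
  </iati-activity>
</iati-activities>
"""


def text_of(activity, tag):
    """Return the stripped text of the first child with this tag, or '' if absent."""
    node = activity.find(tag)
    return (node.text or "").strip() if node is not None else ""


def activities_to_rows(xml_text):
    """Flatten each activity element into one table row (a dict per activity)."""
    root = ET.fromstring(xml_text)
    rows = []
    for activity in root.iter("iati-activity"):
        rows.append({
            "identifier": text_of(activity, "iati-identifier"),
            "donor": text_of(activity, "reporting-org"),
            "country": text_of(activity, "recipient-country"),
            # Publishers differ in what they fill in; missing elements become ''.
            "sector": text_of(activity, "sector"),
        })
    return rows


ROWS = activities_to_rows(SAMPLE_XML)
```

Every row has the same columns even when a publisher omits an element, which is the kind of uniformity needed before tables from different donors can be compared.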
The HvA project also investigated how documents about the progress of development programmes can
be linked to the IATI XML data. This link has mutual benefits. Documents offer more background to
viewpoints, and viewpoints offer a handle for unlocking document archives. In the project it emerged,
for example, that the HvA solution can help managers in development organisations to get a sharper
picture of other organisations with which they could set up and carry out programmes together. By
looking at the development activities for Ethiopia, for example, they also see all documents that
report on those activities. These reports mention organisations that do not appear in the IATI XML
tables. This gives them a richer picture of the organisations that are active in Ethiopia. This is of
interest to, for instance, a programme manager for Ethiopia at Oxfam. At the same time it offers
structure to Oxfam's archivists, who want to understand how their internal customers want to search
these documents. The IATI standard is, as it were, applied to define structured search queries on
document archives.
The HvA project thus shows that two important organisational objectives are supported: making
expenditures comparable, a business intelligence objective, and discovering background information,
a business discovery objective.
Technically, unlocking, linking and presenting XML data and documents works as follows. Two main
problems have to be solved: the variety in publications and the processing of large volumes. To this
end a clear methodological distinction is made between unlocking and presenting. Unlocking means
that the various data and document sources are prepared, called staging, so that they can easily be
linked. Once the link has been made it also becomes easier to present the information. The core idea
is that all computations needed to make data and documents unambiguous are carried out in
advance, so that the end user does not have to wait for them.
The HvA project concentrated on staging data and documents and making them unambiguous. These
are offered in an information platform. Developers of presentation solutions can request subsets of
the information via a standard web protocol and shape them in their own way. In the project an
example presentation solution was developed to show how this works.
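As a hedged illustration of requesting a subset over HTTP, the sketch below builds the URL for a CouchDB view query. The database name `aid`, the design document `platform` and the view `spend_by_country` are hypothetical names chosen for this example; the URL layout and the `group_level`, `startkey` and `endkey` parameters follow the CouchDB view API as we understand it. No request is actually sent.

```python
import json
from urllib.parse import urlencode


def view_url(base, database, design_doc, view, **params):
    """Build the URL of a CouchDB view query without sending it."""
    # CouchDB expects view parameters such as startkey/endkey as JSON-encoded values.
    query = urlencode({name: json.dumps(value) for name, value in params.items()})
    path = f"{base}/{database}/_design/{design_doc}/_view/{view}"
    return f"{path}?{query}" if query else path


# Hypothetical names: database 'aid', design document 'platform', view 'spend_by_country'.
URL = view_url(
    "http://localhost:5984", "aid", "platform", "spend_by_country",
    group_level=1,
    startkey=["ET"],       # first key of the requested slice
    endkey=["ET", {}],     # {} sorts after other values, closing the slice
)
```

A front-end developer would fetch such a URL with any HTTP client and receive the matching rows as JSON.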
The information platform is built on CouchDB. This is a web-oriented database that enables
developers to work in the web programming language JavaScript and the accompanying data storage
standard JSON (JavaScript Object Notation). The advantage of JSON is that data storage becomes
easier, so that the search for unambiguity across varying publications goes faster and less technical
bookkeeping is needed. This speeds up solving the variety problem.
The volume problem is partly solved by staging subsets per viewpoint. This is called dimensional
modelling. Because unambiguity has been precomputed for these subsets, a presentation solution can
quickly be supplied with all information belonging to a particular viewpoint.
CouchDB specialises in quickly delivering viewpoints on large volumes of information. That is why
this technology suits the HvA project. To guarantee this speed, unambiguous staging is bound to a
specific programming pattern. This pattern has two steps: mapping information onto viewpoints, and
computing sums, averages and other measures for these viewpoints. This pattern is called
MapReduce.
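The two steps can be sketched in plain Python. This is an illustration of the pattern under stated assumptions, not CouchDB's JavaScript API and not the project's code: the documents and the field names `donor`, `country` and `budget` are hypothetical.

```python
from collections import defaultdict

# Hypothetical staged activity documents; field names are illustrative only.
DOCS = [
    {"donor": "Donor A", "country": "Ethiopia", "budget": 100.0},
    {"donor": "Donor B", "country": "Ethiopia", "budget": 300.0},
    {"donor": "Donor A", "country": "Sudan", "budget": 50.0},
]


def map_by_country(doc):
    """Map step: emit one (viewpoint key, value) pair per document."""
    yield doc["country"], doc["budget"]


def reduce_stats(values):
    """Reduce step: summarise all values emitted under one key."""
    total = sum(values)
    return {"sum": total, "count": len(values), "mean": total / len(values)}


def map_reduce(docs, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


STATS = map_reduce(DOCS, map_by_country, reduce_stats)
```

Swapping `map_by_country` for a function that emits the donor instead gives a different viewpoint on the same documents without touching the reduce step.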
The HvA project thus shows that two important technical objectives are supported: easy
experimentation with unambiguity, and fast delivery of subsets for particular viewpoints. This allows
the business objectives of gaining an overview and making discoveries from particular viewpoints to
be realised in a flexible and extensible way.
An Information Platform for Business Intelligence in the Aid Sector based on Open Data and Documents
Foreword
This document is the thesis for a Bachelor of ICT degree from the Amsterdam University of Applied
Sciences (HvA). It reports on a software development project commissioned by Nivocer B.V. The goal
of the project was to make Open Data about Development Aid programmes more accessible and
combine it with corresponding documents. This enables other parties to develop analytical
complements. A demonstrator complement was developed as an example [1].
Thanks to Rolf Kleef from Nivocer for granting me this opportunity and for engaging in many
interesting discussions. Thanks to Gerke de Boer from the HvA for guiding me in the writing of this
report.
Amsterdam, August 12 2012,
Michiel Kuijper
[1] http://www.michielkuijper.nl/iatidemo
Summary
This report describes a software development project in support of Aid Transparency, i.e. making data
and documents about investments for Development Aid programmes suitable for analysis. The trigger
for this project was the recent initiative by cooperating Aid Organisations to publish raw data in an
agreed-upon format, the IATI (International Aid Transparency Initiative) format.
Our assignment was to develop software middleware to link this data to corresponding documents in
order to support a rich picture of Aid Activities. This assignment was commissioned to us by Nivocer
B.V., an intermediary party in the Aid network, specialised in information services.
The project was managed using the Maes nine-squares model. This model allowed us to look at our
research and development using three concerns intersecting with three process levels. These concerns
are business aspects, information aspects and technical aspects. These levels are strategy, tactics and
operations. This model is used to structure the report and present our products.
The model allowed us to operationalise our high-level assignment into three concrete assignments: a
business assignment, an information assignment and a technical assignment.
• The technical assignment was to stage structured data and unstructured documents so they
can be manipulated in a uniform way.
• The business assignment was to implement a demonstrator front-end that can visualise
Information Entities about Development Aid in relationship to each other.
• The information assignment was to develop an architecture and proof-of-concept software
support to provide integrated access to Information Entities consumable by multiple types of
visualisation front-ends.
A user-facing preview of our demonstrator is provided to create a concrete frame of reference for the
more abstract remainder of this report. The demonstrator shows that data and documents are linked
at a descriptional level as well as at an analytical level.
The business assignment is contextualised using business intelligence theory. Integral Performance
Management shows that performance indicators give direction to BI projects. The BI cycle provides a
15-step divide-and-conquer approach to data preparation, fact analysis and decision making. Typical
BI user types and use cases give requirements analysis for BI applications a head start.
Our demonstrator seems to gravitate towards the typical use case “Exploration”. This use case is
representative of managers in Aid Organisations with cross-organisational responsibilities. Using the
structured analysis approach from BI to discover relevant information in unstructured documents
helps to disclose an organisation’s archives.
Our information platform links up data and documents in a business
intelligence view on organisational collaboration
In the technical assignment the raw data was profiled by looking at existing studies. These were
complemented by our own investigations, in particular around the Dutch Ministry of Foreign Affairs
as a providing organisation and Sudan as a recipient country. The variety and volume of the data and
documents are especially challenging. These two challenges interlock, causing a long development
curve.
To deal with the variety and volume challenges we have selected CouchDB, a schemaless datastore
using JSON data formats and JavaScript-based logic for data staging and information provisioning. In
the information assignment an architecture was developed using CouchDB’s strategic application
logic. CouchDB has two direct advantages for our project. Its generic schematisation allows us to
capture sources with diverse structures in a common staging area very early in our transformation
process. This avoids having to invest heavily in data typing and storage management, and it allows
for flexible experimentation and agile adaptation. This helps us to address the variety problem.
CouchDB’s transformation functionality is used to precompute inverted indexes representing the
analytical views we want to provide to BI applications. To separate analytical from descriptional
properties we have used the Kimball dimensional modelling approach. An aggregation strategy was
devised to provide a slice of a fact table depending on the viewpoint of the requesting use case. This
helps us to address the volume problem, at a micro and a macro level respectively.
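The idea of serving a slice of a fact table per viewpoint can be pictured with a small Python sketch. It imitates, in plain Python, how composite keys can be summed at different prefix lengths; the key layout (country, sector, year) and all amounts are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical fact rows as a Map function might emit them:
# key = [recipient country, sector, year], value = amount.
EMITTED = [
    (["Ethiopia", "Energy", 2011], 100.0),
    (["Ethiopia", "Energy", 2012], 150.0),
    (["Ethiopia", "Health", 2012], 40.0),
    (["Sudan", "Health", 2012], 60.0),
]


def aggregate(emitted, group_level):
    """Sum values per key prefix of length group_level (0 gives the grand total)."""
    totals = defaultdict(float)
    for key, value in emitted:
        totals[tuple(key[:group_level])] += value
    return dict(totals)


BY_COUNTRY = aggregate(EMITTED, 1)
BY_COUNTRY_SECTOR = aggregate(EMITTED, 2)
GRAND_TOTAL = aggregate(EMITTED, 0)
```

The order of the key positions decides which drill-down paths are cheap, so a different viewpoint (for example sector first) would need its own key layout.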
System models are provided to explain the structure and behavior of our staging area, information
platform and front-end application working together. Using a data flow model the data transformation
is illustrated using a running example about the Aid Activity “Africa Biogas Partnership Programme”.
The different formats in the transformation are described as well as the logic performing the
transformation.
The data uploaded to CouchDB was pre-processed using Python scripts. The documents uploaded to
CouchDB were pre-processed using the semantic annotation webservice Open Calais. The
demonstrator application was developed in the open source data visualisation framework Exhibit-
Simile.
The demonstrator proves our concept of software middleware to link data and documents in order to
provide a rich picture of Aid Activities. CouchDB can cater for loose coupling of changes in data supply and
changes in information demand. It does this by encapsulating Activity data and Activity documents as JSON
documents, so they can be transformed by MapReduce functions into JSON entity collections. These can be
provided to different BI clients in customised formats using CouchDB list functions.
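What a client-specific format means can be illustrated with a short Python sketch that turns view rows into semicolon-separated text. It is not a CouchDB list function (those are written in JavaScript inside a design document); it only mimics the transformation. The sample result is hypothetical and assumes the general `{"rows": [{"key": ..., "value": ...}]}` shape of a reduced view response.

```python
import csv
import io

# Hypothetical reduced view result; keys and values are made up.
VIEW_RESULT = {
    "rows": [
        {"key": ["Ethiopia", "Energy"], "value": 250.0},
        {"key": ["Sudan", "Health"], "value": 60.0},
    ]
}


def rows_to_csv(view_result, header):
    """Render view rows as semicolon-separated text for a spreadsheet client."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(header)
    for row in view_result["rows"]:
        # Key positions become the leading columns, the reduced value the last one.
        writer.writerow(list(row["key"]) + [row["value"]])
    return buffer.getvalue()


CSV_TEXT = rows_to_csv(VIEW_RESULT, ["country", "sector", "amount"])
```

The same rows could be wrapped differently for another client, which is the point of keeping the entity collections separate from their presentation format.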
Our approach fits the three-stage strategy recommended by Business Intelligence practitioners, i.e.
data preparation for indicators, fact-based analysis and peer-based decision making. In this project
we have covered the first stage, leaving open multiple options for supporting the second and third
stage. Next steps have been identified to bring the platform to industrial-scale quality.
There were two major deviations from the expectations captured in our initial plans: the
abandonment of the ABPP as the primary focus of the demonstrator, and the decision not to employ
our own text mining techniques for disclosing Activity documents. The ABPP focus was abandoned
because of a lack of operational data. The text mining techniques were replaced by the external
webservice Open Calais.
The core areas of competence development this project has addressed are analysis of business
processes and analysis of existing software frameworks. The secondary areas of competence
development this project has addressed are design and implementation of business processes, and
design and implementation of software combinations. The main soft skill this project has addressed
is self-education in web APIs and non-relational database technologies.
Abbreviations & Glossary
Term AidDomain BIProcess Software Description
ABPP X Africa Biogas Partnership Programme. Used in the report as a running example. See also chapter references
Activity X Aid programme record according to IATI standards
Assets X HTML, JavaScript or multimedia files held in a CouchDB design document
BI cycle X Structured process to stage data, analyse facts and support decision making
BI / business intelligence X The directed process to collect and analyse data and apply the resulting information to govern an organisation
BUZA X Dutch Ministry of Foreign Affairs. See also DGIS
CouchDB X Document-oriented datastore that uses JSON to store data, JavaScript as its query language using MapReduce and HTTP for an API.
CSO X Civil Society Organisation; Volunteering is often considered a defining characteristic of the organizations that constitute civil society, which in turn are often called Non-Governmental Organisations, or Non Profit Organisations.
CSV X Comma Separated Values - format to mark up data tables using commas or semicolons for consumption by spreadsheet software - e.g. cell1;cell2;cell3
CURL X Command-line tool to interact with websites
DAC X The Organisation for Economic Co-operation and Development's Development Assistance Committee (DAC) is a forum for selected OECD member states to discuss issues surrounding aid, development and poverty reduction in developing countries.
DGIS X Directorate-General for International Cooperation (Samenwerking). Department of the Dutch Ministry of Foreign Affairs responsible for Aid Activities
DMW X Department Environment and Water. Department of DGIS
DOCX X File format for Microsoft Word. Content, logic and style are annotated separately using XML
DOM X Document Object Model - a hierarchy of nested tags with presentation, logic or content roles that make up a webpage; e.g. <div id='myDOMelement'></div>
ETL X Extract, Transform, Load; Structured process to put data in a datawarehouse
Exhibit-Simile X Open Source light-weight data visualisation framework building on the HTML DOM model
HBO-I Hogere Beroeps Opleiding Informatica (Vocational Training Body)
Hivos X Humanist Institute for Cooperation (Humanistisch Instituut voor Ontwikkelingssamenwerking) is a Dutch organization for development co-inspired by humanist values
HTML X Hyper Text Mark-up Language; language used to build a webpage
HTTP X Hyper Text Transfer Protocol; network standard for transferring webpages
IATI X International Aid Transparency Initiative
Inmon-explorer X BI term for a user with a need to slice and dice through information and drill up and down aggregation levels
Inmon-farmer X BI term for a type of user with a strategic role in an organisation with a predictable need for digested information.
Inmon-miner X BI term for a user interested in discovering trends and anomalies in information to explain past events and predict future ones.
Inmon-tourist X BI term for a user with an operational role in an organisation with a predictable need for specific information
IPM X Integral Performance Management; Management method to connect a business strategy with business operations by means of indicators
JasperETL X ETL component of JasperSoft
JasperReports X Reporting component of JasperSoft
JasperSoft X Open Source BI suite well known for its reporting components
JavaScript X Programming language mostly used in web browsers
JSON X JavaScript Object Notation - Typical way to represent data structures used in JavaScript - e.g. {key : value}
MapReduce X CouchDB functionality to select properties from CouchDB documents and use those as look-up keys for a specified calculation. Calculations are aggregated on levels corresponding with key positions.
MDG X Millennium Development Goals - Key objectives of United Nations to govern Aid Activities
MDX X Multidimensional Expressions; a query language for OLAP databases
Mustache X HTML directives framework - see directive
NGO X Non-governmental organisation; a legally constituted organization created by natural or legal persons that operates independently from any form of government.
NPO X Not-for-profit/Non-profit organisation; an organization that uses surplus revenues to achieve its goals rather than distributing them as profit or dividends
ODA X Official Development Assistance; Term to measure aid (coined by DAC of the OECD). It is widely used by academics and journalists as a convenient indicator of international aid flow.
OECD X Organisation for Economic Cooperation and Development; an international economic organisation of 34 countries founded in 1961 to stimulate economic progress and world trade.
OLAP X On Line Analytical Processing; an approach to swiftly answer multi-dimensional analytical queries
Open Calais X Free semantic annotation webservice by Thomson-Reuters; matches up words in a text with domain concepts
Open Data X Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.
Python X Programming language often used for network programming - e.g. file manipulation and exchange
Python-Calais X Python module to abstract interaction with Open Calais webservice
ResRaps X Result Reports - Document reporting on the progress of Aid Activities for a specific sectoral purpose
Simile X The widget suite of the Exhibit-Simile framework
SNV X Netherlands Development Organisation; a non-profit, international development organisation that aims to alleviate poverty by enabling increased income and employment opportunities and increasing access to basic services
UNICODE X Computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.
URL X Uniform Resource Locator; i.e. a weblink
XML X eXtensible Mark-up Language - used to mark up content, logic or style in web applications - e.g. <tag>content</tag>
Distinguishing words in this project
aggregation X BI term for summarised results of calculations on values of facts - often the sum, average, min or max of a set of values
analytical property X BI term used to contrast with descriptional, where analytical refers to properties that are used to view summarised values for facts, while a descriptional property provides information about one individual entity
collection X Exhibit-Simile term for a set of items of a certain type
descriptional property X BI term used to contrast with analytical, where analytical refers to properties that are used to view summarised values for facts, while a descriptional property provides information about one individual entity
dimension X BI term for a viewpoint on facts; fact values can be aggregated by viewpoint
directive X Used to mark up places in webpages at design time that should be filled in with data at runtime - e.g. JSP, Mustache directive; when mark-ups are encountered by the processing engine, logic is directed to fetch and fill in data that matches the marked-up variable
document X X IATI term for a report describing the progress of an Activity; CouchDB term for a JSON object; in this report either Activity document or JSON document
element X IATI term for a node in the XML description of an Activity; corresponds to an entity
emit X CouchDB term for the output of a Map function
entity X Domain concept used by humans to reason about the domain and used by software as a data object
facet X Exhibit-Simile term for a BI dimension or viewpoint
facts X BI term for a domain entity with a measurement value associated with it
filter X Software term for selecting entities with a specific value for a property
grain X Most detailed level at which facts are available for analysis in a BI application
indicator X BI term for a type of performance measurement.
item X Exhibit-Simile term for an entity/data object
level X CouchDB term for an aggregation level of a Reduce function
list X CouchDB term to wrap the output of a MapReduce Function (a View) into a client-specific format
precompute X Software term to indicate that the result of a query is stored for immediate response
publication X IATI term for making Activity data available for third parties
signatories X IATI term for organisations who have signed a manifest to comply with the IATI standard
structured (data) X Software term to refer to data from databases or marked-up data
unstructured (data) X Software term used to refer to texts of webpages and documents
Contents of this report
Foreword
Summary
Abbreviations & Glossary
Chapter 1. Introduction
Chapter 2. Assignment – linking open data and open documents
2.1 Aid Activities
2.2 Aid Documents
2.3 Operationalisation of the Assignment
Chapter 3. Business Intelligence demonstrator: user facing preview
3.1 Activity description page
3.2 Document description page
3.3 Activities analysis page
Chapter 4. Business Intelligence in the Aid Domain
4.1 Integral Performance Management
4.2 Business Intelligence Cycle
4.3 Inmon-BI Use Cases & User Types
4.4 Organisational profiling as example
Chapter 5. Data profiling of the IATI XML sets
5.1 Variety in IATI XML sets
5.2 Volume in IATI XML sets
5.3 Dimensional modelling
5.4 Extract Transform Load investigations
Chapter 6. Technology for Variety and Volume
6.1 Document-oriented datastore CouchDB
6.2 Functions of CouchDB
6.3 Comparison to Relational Warehousing
6.4 Aggregation strategy in this project
6.5 System models
Chapter 7. Staging of Structured and Unstructured Data
7.1 Introducing the data flow
7.2 Staging Activities data
7.3 Staging Activities documents
7.4 Data Flow diagram
Chapter 8. Information Provisioning for Front-end Configuration
8.1 Information entities in JSON Activities
8.2 Information entities in JSON Reports
8.3 Providing Information Entities to a Front-end
8.4 Completing the Data flow diagram
Chapter 9. Business Intelligence demonstrator: developer configuration
9.1 Exhibit-Simile
9.2 Sourcing aggregate data from the information platform
9.3 Define a data graph in terms of collection dependencies
9.4 Configuring widget to control and view model properties
Chapter 10. Conclusions
References
Chapter 11. Phasing: planned and actual course of the project
11.1 Project Threads
11.2 Major deviations from expectations
11.3 Nine-squares model and choices made
Chapter 12. Reflection: connecting project execution to the HBO-i competences framework
12.1 HBO-i competences framework
12.2 HBO-i competences in this project
12.3 Soft skills
Appendix A. Technical references
Appendix B. Sudan Activity sets
Appendix C. Schemas landscaped
End of this report
Chapter 1. Introduction
This report describes a software development project in support of Aid Transparency for the
Millennium Development Goals. The Millennium Development Goals (MDGs) are eight international
development goals that 193 United Nations member states and at least 23 international organisations
have agreed to achieve by the year 2015 [1]. The goals are to eradicate extreme poverty and hunger, to
achieve universal primary education, to promote gender equality and empower women, to reduce
child mortality rates, to improve maternal health, to combat HIV/AIDS, malaria, and other diseases,
to ensure environmental sustainability, and to develop a global partnership for development.
In order to achieve those goals development aid donors fund aid activities, i.e. programmes, in
recipient countries and regions. Unfortunately nobody sees the bigger picture of who is spending
money on what and whether it has any effect: Aid spending and effects are not transparent.
Therefore the International Aid Transparency Initiative (IATI) promotes a common format, the IATI
standard, for sharing relevant information so that it will be easier to understand, compare and use
[MakeAidTransparent, 2011]. Our project builds on this standard.
Aid Transparency fits within the movement of Open Data. This is data produced by governments while
serving the public, paid for by our taxes. Open Data is published with limited legal restrictions in order
to enhance transparency of governance. However, merely publishing raw data is not enough. Raw
data needs to be turned into information, and information has to be combined to make decisions.
This resembles the motives of Business Intelligence within commercial organisations. Business
Intelligence (BI) is the directed process to collect and analyse data and apply the resulting information
to govern an organisation. In our project we are dealing, not with one particular organisation, but with
a network of organisations. These network participants are exchanging data with each other, but each
participant has to spend considerable effort and time on turning that data into information and
combining this into intelligence that decisions can be based on. There is a gap between the ability to
process data on the one hand and the speed, volume and variety with which data becomes available
on the other hand: the so-called information gap [Beek, 2010].
Nivocer B.V. [2] is an intermediary party in the development aid network that provides services to
address the information gap in Aid Transparency. Nivocer commissioned us with the following project:
develop software middleware that can link structured Aid Activity data with unstructured
Aid Activity documents, in order to support a rich picture of Aid Activities. In this report we
describe the solution to this project following a business intelligence approach. This project consisted
of building three main cases: a business case, an information case and a technical case. The business
case answers the question what Aid Intelligence means for the different organisational participants in
the Aid Sector. The technical case answers the question how the data and documents provided by
different participants can be integrated. The information case answers the question how data supply
and intelligence demand can be loosely coupled.
Loose coupling is an important design driver in software engineering: requirements are constantly
shifting, while scalable solutions need a stable foundation. In our project too, analytical cases breed
demand for ever more sophisticated combinations of information, while community buy-in of
standardised publishing is dependent on simplicity and stability. This is the reason that our
core focus has been on the information case, providing middleware between emerging practices of
analysis and existing practices of publishing. The business intelligence application we have developed
in this project acts as a front-end showcase to our middleware, while the data we use to feed our
middleware is taken from existing datasets and documents.
[1] http://en.wikipedia.org/wiki/Millennium_Development_Goals
[2] http://www.nivocer.com
These project goals were managed using the
model as shown in Figure 1.1 en 1.2. Figure
1.1 shows the theoretical purpose of the model
and Figure 1.2 shows its application to our
project. The model shown in Figure 1.1 is the
Maes nine-squares model of information
systems development [Maes, 2004]. The
columns in the model represent three
organisational concerns: business, information
and technology. The rows in the model
represent three process levels: strategy, tactics
and operations. The model captures a divide-
and-conquer approach to information systems development in which the squares, representing pieces
of the solution puzzle, mutually inform and constrain each other. Using the Maes nine-squares model
we were able to express our three cases as constrained and informed by existing strategies and
operations.
Figure 1.2 Application of Maes nine-squares model for our project
Figure 1.2 shows how domain analysis and requirements framing progressed from the business
column (left) to the technology column (right) and how an application was designed, implemented and
delivered from the technology column back to the business column. The nine-squares model imposed
an information system development structure on our relatively open-ended assignment. To
understand the business purpose of the assignment we looked at business intelligence in the context
of the Millennium Development Goals strategy for Development Aid Participant organisations. To
understand the technical options for our assignment we looked at strategies to deal with data of a
large volume and variety, both between documents and data and within IATI XML data-sets.
Figure 1.1 Maes nine-squares model of information systems development
The contribution of this project is visualised in Figure 1.3 and Figure 1.4. At the start of this project
the assignment was no more specific than to link up data and documents. During the project the seven
remaining pieces of the puzzle were filled in iteratively and incrementally. This process is accounted
for and reflected upon in Chapters 11 and 12. The deliverables of this project are presented looking
back through the lens of the completed nine-squares model.
Figure 1.3 Situation at the start of the project
Figure 1.4 Situation at the end of the project
The report is structured as follows:
• Chapter 2 describes the assignment in more depth, explaining the nature of Aid Activities as
represented in data and documents.
• Chapter 3 presents a preview of the user facing side of the demonstrator we have developed.
This provides the reader with a concrete frame of reference to process the more abstract
remainder of the report.
• How we arrived at our business and user requirements by building on Business Intelligence
theory is explained in Chapter 4.
• Chapter 5 reports on the data profiling of the available IATI XML sets.
• The technology we selected for dealing with a large variety and volume of data and documents
is described in Chapter 6.
• Chapter 7 illustrates the data staging process using the selected technology.
• Chapter 8 presents the provisioning techniques of our information platform.
• In Chapter 9 we explain the developer-facing side of our demonstrator. This completes the
circle back to Chapter 3.
• The conclusions of our research and development are presented in Chapter 10.
• Chapter 11 describes the choices we faced during the project and the project threads we
abandoned.
• Chapter 12 reflects on the technical and non-technical competences that were developed in
this project.
• Lastly, appendices with technical references, data profiles and landscape schemas are provided.
Chapter 2. Assignment – linking open data and open documents
This chapter describes the assignment of this project. The project commissioned by Nivocer B.V. was
to develop software middleware that can link structured Aid Activity data with unstructured Aid
Activity documents, in order to support a rich picture of Aid Activities.
The input for this project was the
following structured and unstructured
data:
1. Aid Activities data sets in XML
2. Result Reports about Aid Activities,
mostly in MS Word docx format.
2.1 Aid Activities
An Aid Activity is the basic unit of
reporting in IATI [IATI, 2011]. This is
typically an individual programme or
logical grouping of work in an Aid
Participant organisation's budget. Each
Activity is represented by an Activity
record. This record has three main parts:
1. Who is involved, where and how?
2. What are the basic management details for the project?
3. What are the financial details?
Who is involved, where and how? | Example: Africa Biogas Partnership Programme
• What is the name of the reporting organisation? | Ministry of Foreign Affairs (DGIS)
• Which organisations are funding you? | Not applicable, this is a donor
• Which organisations are you funding? | Funding HIVOS in order to enable SNV to work with countries in the region South of the Sahara
• What is the nature of the funding relationship? | Untied (No obligation to purchase from donor economy)
What are the basic management details for the project?
• What is the IATI identification code for this project? | NL-1-PPR-18384
• Project name and description | DMW ABPP; Africa Biogas Partnership Programme
• What are the documents related to this project? | Not included, but intended for Result Reports
• What are the contact details for the project? | Not included, but intended for HIVOS contact
• What other projects are related to it? | Not included, but intended for other Activities
• What are the geographic details? | Not included, but intended for geo-coordinates
• What are the start and end dates? | Start: 2008, End: 2013
• What is the current status of the project? | Implementing
• What are the expected and actual results? | Not included, but intended for MDG-related indicators
• Which sector does the project contribute to? | Power generation/renewable sources
• What are the cross-cutting themes? | Biological Diversity, Combat Desertification, Gender Equality
• Are there terms and conditions? | ODA (Official Development Assistance)
What are the financial details?
• What are the total budgets for each financial year? | Commitment of 30M euro over the course of programme
• What type of aid is this? | Project type intervention
• What are the disbursements? | Diverse chunks of the commitment money transferred
• What are the financial mechanisms used? | Aid grant excluding debt reorganisation
Figure 2.2 Open Data: conceptual properties of an IATI Aid Activity record (technically provided in
XML). Illustrated by the Aid Activity Africa Biogas Partnership Programme [SNV, 2009]
Figure 2.2 lists the properties of an Aid Activity record using an example programme [SNV, 2009].
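To illustrate how such a record could be read by software, the following Python sketch parses a heavily simplified activity fragment into the three parts listed above. The XML fragment is hypothetical: its tag names are only loosely modelled on IATI-style elements and are assumptions, not a copy of the schema specified in [IATI, 2011]. The transaction amounts are invented for illustration; the other values come from the example programme.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified activity fragment. Tag names are assumptions and
# the two transaction amounts are invented for illustration only.
SAMPLE = """
<iati-activity>
  <iati-identifier>NL-1-PPR-18384</iati-identifier>
  <reporting-org>Ministry of Foreign Affairs (DGIS)</reporting-org>
  <title>DMW ABPP; Africa Biogas Partnership Programme</title>
  <activity-status>Implementing</activity-status>
  <sector>Power generation/renewable sources</sector>
  <transaction><value>1000000</value></transaction>
  <transaction><value>2500000</value></transaction>
</iati-activity>
"""


def parse_activity(xml_text):
    """Split one activity element into the three conceptual parts."""
    root = ET.fromstring(xml_text)

    def text(tag):
        # Missing elements yield an empty string instead of an error.
        return (root.findtext(tag) or "").strip()

    return {
        "who": {"reporting_org": text("reporting-org")},
        "management": {
            "identifier": text("iati-identifier"),
            "title": text("title"),
            "status": text("activity-status"),
            "sector": text("sector"),
        },
        "financials": {
            "transactions": [float(v.text) for v in root.iter("value")],
        },
    }


record = parse_activity(SAMPLE)
print(record["management"]["identifier"])
print(sum(record["financials"]["transactions"]))
```

Returning an empty string for absent elements reflects the observation that many properties are "not included" in practice; a real loader would need to follow the official element and attribute definitions.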
Figure 2.1 This chapter describes the squares Open
Documents and Open Data from the nine-squares model
explained in chapter 1
2.2 Aid Documents
Result reports are short documents delivered by Dutch embassies describing the status of
development related to the MDGs in a particular geographic area. There is a many-to-many
relationship with Aid Activities as registered in the donor’s transactional systems. They provide the
context to obtain a richer picture about Activities. It is not the case that one Aid Activity is exactly
covered by one Result Report. Reports follow a common structure (that is under revision) and are
typically composed of the paragraphs listed in Figure 2.3.
Result Report on Sectoral Purpose in Geographic area | Africa Biogas Partnership Programme (footnote 1)
Metadata about the embassy involved, the recipient country or region, the strategic goal, authors | To develop a biogas economy in countries South of the Sahara
Context description about the situation in the recipient country or region | People now mainly use wood and fuel to cook in their houses. This causes respiratory problems, deforestation or CO2 emission, costs a lot of preparation time for women and children …
Results and lessons learned | Biogas installations allow cooking on gas, with no smoke in the house. They also free up time for women to develop themselves. In addition dung is removed from living lots improving hygiene. Biogas construction companies deliver employment boosting local economies. This all contributes to a perception of social progress.
What went less well and why? | Biogas installations need careful maintenance, which has been the cause of failure in some cases. The product needs to be accompanied with a life cycle process.
What has been learned (process) | Don’t give people installations, but teach them to build installations to create a sense of ownership. This creates sustainable economies.
Resources spent and Aid Activities involved as registered | These are typically the figures provided in the Open Data set, but not one on one. Most reports do contain references to Activity identifiers. The description for ABPP above is for illustration purposes, but reports are likely to be at the level of sectoral purposes, eg. renewable power sources.
Traffic light score about status of investment area | Ordinal scoring on a scale with values like On track, In danger to be off-track, Off-track
Figure 2.3 Open Documents: Typical paragraphs listed in a Result Report (technically provided as MS
Word or Adobe PDF document). Illustrated by the Aid Activity Africa Biogas Partnership Programme
taken from [SNV,2009]
2.3 Operationalisation of the assignment
During the feasibility phase of the project the assignment gravitated towards a conceptual
middle ground present in both structured and unstructured data: information entities. An information
entity in this project is defined as
• a domain concept that domain participants use to reason about the field,
• that captures a recurring set of properties and
• that can be manipulated by software to support discovery of quantitative and qualitative
relationships.
This has resulted in the three assignment operationalisations listed below.
1 [SNV, 2009]
The development of software middleware to prepare linking between data and documents is
operationalised in this project as follows and results in the Technical assignment:
Stage structured data and unstructured documents so they can be manipulated in a uniform way
The development of software middleware to pass on data as information is operationalised in this
project as follows and results in the Information assignment:
Deliver an architecture and proof-of-concept software support to provide integrated access to
Information Entities consumable by multiple types of visualisation front-ends
Support for a rich picture of Aid Activities is operationalised in this project as follows and results in the
Business assignment:
Implement a demonstrator front-end that can visualise Information Entities
in relationship to each other
The three operationalised assignments have led to the products Data Staging solution, Information
Platform and BI application, depicted in the middle row of our nine-squares project model. The next
chapter will present a preview of the user facing side of the BI application. This provides the reader
with a concrete frame of reference in order to process the more abstract remainder of the report.
Chapter 3. Business Intelligence demonstrator: user facing preview
This chapter gives a user facing preview of the demonstrator developed in this project. The purpose of
this preview is to create a concrete frame of reference for the more abstract remainder of this report.
The front-ends that are shown are:
• Activity description page
• Document description page
• Activities analysis page
3.1 Activity description page
The activity description page offers a
listing of the properties of an activity as
they were described in chapter 2. In
Figure 3.2 the properties for the activity
“Africa Biogas Partnership Programme”
are shown. Two aspects need to be
noticed at this stage of reading. One, the
raw XML format is presented as a human readable table. Two, the page contains a link to a
corresponding report at the bottom of the page. In addition, a link to the raw XML is included at the
top left of the page. The latter allows for inspection of technical metadata.
3.2 Document description page
The document description page (Figure 3.3) offers a raw listing of the text of a report. The original
document as formatted by the publisher can be downloaded using the link at the top left of the page.
Entities are marked up in the page, indicated in red. These entities link to a corresponding external
Wikipedia page if one exists. It is expected that these entities will be linked to the Web of Data
(footnote 1) in future versions. Important to notice is the bar with activity identifiers listed at the top.
Each identifier links back to the corresponding activity. One report can cover more than one activity,
and one activity can be covered by more than one report.
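The many-to-many linking between reports and activities can be sketched as two lookup tables built from the same list of links. This is only an illustration of the relationship, not the storage model of the demonstrator; the report names and the identifier ACT-0002 are made up.

```python
from collections import defaultdict

# Hypothetical (report, activity) links. Only NL-1-PPR-18384 is taken from the
# example programme; all other names are invented for illustration.
links = [
    ("report-renewable-energy", "NL-1-PPR-18384"),
    ("report-renewable-energy", "ACT-0002"),
    ("report-gender-equality", "NL-1-PPR-18384"),
]

# Index the same links in both directions so either page can list the other side.
activities_by_report = defaultdict(set)
reports_by_activity = defaultdict(set)
for report_id, activity_id in links:
    activities_by_report[report_id].add(activity_id)
    reports_by_activity[activity_id].add(report_id)

# Identifier bar at the top of a document description page.
print(sorted(activities_by_report["report-renewable-energy"]))
# Report links at the bottom of an activity description page.
print(sorted(reports_by_activity["NL-1-PPR-18384"]))
```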
3.3 Activities analysis page
These two pages illustrate the basic way in which activities and documents are linked in our
demonstrator application. Previously, activities and documents existed as disparate data sources. Both the activity and
document description pages can be reached from the activities analysis page, which is the main
entrance of the application. On this page the analytical use cases are implemented. Currently they
contain the raw building blocks for these use cases. Three examples are shown:
• Figure 3.5 shows a map displaying the number of activities in six countries that had transactions
in the year 2001 (selected at the left)
• Figure 3.6 shows a timeline of the same activities, stretching a bar between start and end date
• Figure 3.7 shows a table with these activities (clicking on an activity jumps to its description
page). The table also shows the aggregated transaction amount for each activity.
1 http://en.wikipedia.org/wiki/Linked_data
Figure 3.1 This chapter describes the square Business
Intelligence Application from the nine-squares model
explained in chapter 1
The analysis page shows the analytical building blocks in the middle. Several tabs are visible at the
top. Figure 3.5, Figure 3.6 and Figure 3.7 show the building blocks of three of those. In the left and
right column filters are shown that correspond with the properties of an activity. Activities with
common properties can be shown together using those filters. When a property value is selected, eg.
the year is 2001, activities with transactions in 2001 will be shown. This will also cause the other
filters to only show the property values of those activities. It is possible to select values in different
filters consecutively which will narrow down the shown set even more. Finding specific measures for
properties is then done by selecting the specific tab at the top. This makes up the BI functionality.
Note that the right column also contains filters for report and related entities which listen to selections
in the other filters. Hence it is possible to use the analysis page for discovering links between activities
and documents. Document descriptions are reached via the tab reports and related entities.
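The filter behaviour described above (a selection narrows the shown set, and the other filters then only offer values that still occur) can be sketched in a few lines of Python. This is a minimal illustration with hypothetical records and function names; it does not reproduce how the visualisation framework implements its filters.

```python
# Hypothetical activity records with three filterable properties.
activities = [
    {"id": "A1", "year": 2001, "country": "Sudan", "sector": "Health"},
    {"id": "A2", "year": 2001, "country": "Kenya", "sector": "Energy"},
    {"id": "A3", "year": 2005, "country": "Sudan", "sector": "Energy"},
]


def apply_filters(records, selections):
    """Keep the records that match every selected property value."""
    return [r for r in records
            if all(r[prop] == value for prop, value in selections.items())]


def facet_values(records, prop):
    """Values a filter still offers for the currently shown records."""
    return sorted({r[prop] for r in records})


# Selecting the year 2001 narrows the set and the values of the other filters.
shown = apply_filters(activities, {"year": 2001})
print([r["id"] for r in shown])
print(facet_values(shown, "country"))

# A consecutive selection in a second filter narrows the set further.
narrower = apply_filters(activities, {"year": 2001, "country": "Kenya"})
print([r["id"] for r in narrower])
```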
Figure 3.2 Activity detail page for the Activity "Africa Biogas Partnership Programme" (Not all
properties shown). The report link below leads to a corresponding Document description page.
Figure 3.3 Document description page. A document related to the Activity “Africa Biogas Partnership
Programme” is shown. At the top links to Activity descriptions are provided. Entities are marked up in
the page. This page can be reached from reports and related entities shown in Figure 3.4
Figure 3.4 The activities analysis page with the tab related entities shown. Note that in the right
column the report from Figure 3.3 is selected causing the other filters to show related property
values.
Figure 3.5 The activities analysis page with the tab countries map shown. Note that in the left
column the year 2001 is selected causing the other filters to show the property values of activities
that have transactions in 2001
Figure 3.6 The activities analysis page with the tab periods timeline shown. Note that in the left
column the year 2001 is selected causing the other filters to show the property values of activities
that have transactions in 2001
Figure 3.7 The activities analysis page with the tab activities shown. No filter selections are applied.
Note the aggregated transaction values in the activities table. Other tabs host aggregated transaction
values for different views. Further calculations for showing different measures, eg. percentages, and
visualisations, eg. norm-related color coding, will be added in future versions. At the time of showing
the database contained a total of 125 activities. The industrial-scale version will contain hundreds of
thousands of activities.
One front-end page is not shown at this stage of reading, which is the fact slicing page. The fact
slicing page is part of our aggregation strategy which is explained in chapter 6.
The webpage front-ends (footnote 2) shown above are implemented using the Exhibit-Simile data visualisation
framework [Huynh, 2007]. Chapter 9 will explain the developer side of these pages as they are
configured from the information provided by our information platform. The remainder of this report
describes the research and development process that has led to this implementation.
2 The demonstrator can be found at www.michielkuijper.nl/iatidemo
Chapter 4. Business Intelligence in the Aid Domain
This chapter describes the business aspects of our project. Our project follows a business intelligence
approach to frame its requirements. Business intelligence is defined as the directed process to collect
and analyse data and apply the resulting information to govern an organisation. In our project we are
not dealing with one particular organisation, but with a network of organisations.
The business intelligence approach has
provided us with three important theories
to build on. These are:
1. Integral Performance Management
2. Business Intelligence Lifecycle
3. Inmon-BI use cases & user types
The reason why we discuss these in the
context of a software development
project is that they represent best
practice frameworks for setting up a
specific type of application: a BI
application. They can be seen as
templates for the software lifecycle phase
Requirements Analysis and Design, that
can be used to build on the accumulated insight of BI practitioners.
4.1 Integral Performance Management
Integral Performance Management (IPM) is a methodology that links up a business strategy with
business processes by means of Performance Indicators [Geelen, 2005]. A business strategy is
derived from a business mission, the Why, and the business vision, the What. The business strategy is
the way to achieve this, the How.
In our project this “business” mission is to achieve the Millennium Development Goals as set out by
the United Nations. In our project the vision is to achieve Aid Effectiveness. Our project addresses one
particular thrust of the underpinning strategy: Aid Transparency. Transparency holds donor
governments to account, commits them to results, helps improve the performance of aid agencies,
decreases corruption and enables better planning and coordination amongst donor agencies (footnote 1).
IPM recommends defining a hierarchy of indicators to connect daily operations to a strategy. At the
top of this hierarchy are Key Performance Indicators that represent the external state of affairs of an
organisation. Lower in the hierarchy are Performance Indicators that represent the internal state of
affairs of an organisation. The core idea is to connect these two views in order to align internal
diagnostics to external market positioning. Then you know as an organisation what to change inside
when the market changes.
Indicators are typically calculations compared to an agreed target or norm provided by the strategy.
They should be defined in a specific, measurable, attainable, realistic and timebound fashion
(S.M.A.R.T.). Such a definition will guide decision making, information design and data selection.
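As a minimal sketch of an indicator as "a calculation compared to an agreed target", the Python function below scores a measured value against its target on an ordinal scale. The function name, the 10% warning margin and the example indicator (installations built per year) are assumptions made for illustration; the three status labels are borrowed from the traffic light scoring used in Result Reports.

```python
def indicator_status(actual, target, warning_margin=0.1):
    """Compare a measured value with its target and return an ordinal status.

    warning_margin is an assumed tolerance: values within this fraction
    below the target count as "in danger" rather than "off-track".
    """
    ratio = actual / target
    if ratio >= 1.0:
        return "On track"
    if ratio >= 1.0 - warning_margin:
        return "In danger to be off-track"
    return "Off-track"


# Hypothetical indicator: installations built in a year against a target of 1000.
for built in (1000, 950, 700):
    print(built, indicator_status(built, target=1000))
```

Making the target and the margin explicit parameters keeps the indicator specific and measurable, which is what the S.M.A.R.T. criteria ask for.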
1 http://www.aidmonitor.org/
Figure 4.1 This chapter describes the squares
Millennium Development Goals and Aid Participant
Organisation from the nine-squares model explained in
chapter 1
4.2 Business Intelligence Cycle
Indicators are part of a business intelligence cycle. This cycle is the feedback loop that connects
strategy governance to daily operations. The BI cycle is defined at two levels [Beek, 2010]. The outer
BI cycle consists of the phases Registration of Data, Processing Information and Reacting to
Knowledge. Most organisations have automated systems in place to Register transactions. In order to
be able to React the registered data has to be Processed for that purpose.
Figure 4.2 Outer- and Inner BI cycle [Beek, 2010]
This Processing step is broken down into three main phases itself: Preparing Indicators, Analysing
Facts & Distributing Decisions. This is often referred to as the inner BI cycle and forms the
methodological process by which most BI programmes are structured (Figure 4.2). This methodology
is useful as a divide-and-conquer approach as there is a strong inclination to get bogged down in the
diversity of the data and the preparation conflicts resulting from that. This is a main cause of failure
for BI programmes.
Indicators are the pivot points between the Preparation and Analysis phase. The state of affairs that
needs to be distilled from the data should be captured before preparation as this informs the manner
in which different sources should be filtered, combined and aggregated. Preparation should be done in
service of the business strategy which is operationalised in terms of indicators.
A typical sequence of activities in the Preparation phase is: Collecting data sources, Filtering out low
quality data, Combining different sources in a common format and meaning, Aggregating individual
facts according to different views, Visualising aggregations by means of graphs and charts to assess
proportions and trends, Interpreting patterns in terms of domain events.
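The first four Preparation activities (collecting, filtering, combining and aggregating) can be sketched as a small Python pipeline. The two sources, their field names and their amounts are hypothetical and only serve to show why a common format is needed before facts can be aggregated.

```python
from collections import defaultdict

# Collect: two hypothetical sources with different field names and date formats.
source_a = [
    {"activity": "A1", "date": "2001-03-15", "amount": 100.0},
    {"activity": "A1", "date": "2002-01-10", "amount": None},  # low quality
]
source_b = [{"id": "A2", "year": 2001, "value": 250.0}]

# Filter: drop low-quality facts (here: facts without an amount).
clean_a = [fact for fact in source_a if fact["amount"] is not None]

# Combine: map both sources to one common format and meaning.
facts = (
    [{"activity": f["activity"], "year": int(f["date"][:4]), "amount": f["amount"]}
     for f in clean_a]
    + [{"activity": f["id"], "year": f["year"], "amount": f["value"]}
       for f in source_b]
)

# Aggregate: total amount per year, one possible view on the individual facts.
total_per_year = defaultdict(float)
for fact in facts:
    total_per_year[fact["year"]] += fact["amount"]

print(dict(total_per_year))
```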
A typical sequence of activities in the Analysis phase is: Internalisation of the perceived patterns,
Adapting mental models and targets, Checking the data and analysis again in this new light,
Augmenting with complementary data to increase analytical scale or scope.
A typical sequence of activities in the Distribution phase is: Sharing insights with peers, Materialising
insights into new management principles, Deciding what to do by seeking consensus, Communicating
and Evangelising these decisions, Anticipating events that are likely to occur again.
4.3 Inmon-BI use cases & user types
Analytical use cases tend to resemble one of four types [Inmon, 1998]. These four types are
associated with types of BI users with corresponding use cases:
1. Farmers, mostly users with a strategic role in an organisation with a predictable need for digested
information. Eg. Board members, stakeholders, government. In the Aid sector this type of role
translates to Donor Policy Makers, Parliamentarians and Civil Society Organisation (CSO)
representatives with a strong interest in Accountability: how is tax money being spent on Aid.
2. Tourists, mostly users with an operational role in an organisation with a predictable need for
specific information. In the Aid sector this type of role translates to Recipient Line Ministry
officials and Recipient Community Council members with a strong interest in available funds
(aggregated from different donors) for specific sectoral purposes or local programmes.
3. Explorers, mostly knowledge workers in an organisation with a need to slice and dice through
information and drill up and down aggregation levels. In the Aid sector this type of role translates
to Non-Governmental Organisations (NGOs) and Civil Society Organisations (CSOs) with a strong
interest in Aid Effectiveness.
4. Miners, mostly researchers interested in discovering trends and anomalies in information to
explain past events and predict future ones. In the Aid sector this type of role translates to
Academics with a strong interest in arrangements of Aid Management to study for instance the
difference between institutional and grass-roots approaches.
IATI aims to cater for all types of users but with different horizons of implementation [AidInfo, 2010].
The main driver has been to support Recipients, ie. Inmon-Tourists, to more swiftly obtain an
overview of available funds. This can support more agile forward planning and prevent delays in policy
execution. A second driver has been to create more opportunities for CSOs and Parliamentarians, ie.
Inmon-Farmers, to hold their government to account. The result has been that most progress in
structured publications has been made in aligning financial transactional data exchange. The
information necessary for typical Inmon-explorers resides mainly in the unstructured Result Reports.
Typical Inmon-miners are currently still focused outside the IATI space on statistical data provided by
organisations such as the Organisation for Economic Co-operation and Development (OECD), who
also provide information about disease levels in developing countries for instance.
4.4 Organisational profiling as example
A good example of Inmon-based exploration is the use case of organisational profiling. Figure 4.3
shows a wireframe provided by Nivocer after the first prototype of our demonstrator was constructed.
The figure shows a number of analytical building blocks expressing information about a filtered set of
activities. In addition a listing of related documents is shown. The figure shows how an organisational
profile can be obtained by juxtaposing related activities and documents. This can serve to get a rich
picture of the workings of an affiliated organisation in, for instance, an NGO country manager’s
portfolio. This affiliated organisation might be a potential partner in new activities.
Figure 4.3 Nivocer wireframe showing analytical building blocks about a set of activities in an NGO
country manager’s portfolio. The related activities and documents offer information to get a richer
picture of the organisation’s workings.
This use case shows how analytical and descriptive information, when linked up, allow for richer
discovery routes than just analytical information alone. Organisations have large archives with
unstructured information. How to disclose these documents is not always obvious without a clear use
case. The IATI format offers analytical viewpoints that can be used to deliver these use cases. These
viewpoints can be seen as structured queries on the contents of these documents. Using the
structured analysis approach from BI to discover relevant information in unstructured documents
therefore helps to disclose an organisation’s archives.
The justification of our assignment to link up data and documents is therefore confirmed, but the use
case also inspires new ways of contextualising business intelligence. Vice versa a predominantly
hypertext based approach to discovery gains more depth by adding analytical aggregations to purely
descriptive information.
In this chapter we obtained an overview of the purpose and meaning of the available data and
documents in the context of the Business assignment. The next chapter looks at the available data
and documents from a technical perspective.
Chapter 5. Data profiling of IATI XML sets
In chapters 7, 8 and 9 a running example is used to explain the data flow from staging to application.
This running example uses representative properties from the structured data sets and unstructured
documents to abstract from the complexity of the full data sets and make clear the essence of the
transformations. This chapter describes the aspects of variety and volume we encountered during the
data profiling process of the full data sets in different stages of the project.
An index of IATI XML sets can be found at
the IATI registry (footnote 1). The registry allows a
look-up of IATI files according to
• 2 File types (1407 Activity files, 11
Organisation files)
• 192 Recipient Countries & Regions
• 60 Publishers
• 7 types of organisation of which the
bulk is Governments
IATI based publication is in its first
iteration after version 1 of the standard
was agreed in 2011. Currently IATI has
29 official signatories, organisations who
signed up to the standard, but also non-signatories have published data. Although more than only
Activities data is available we decided to focus on Activities because they make up the bulk of the
available data. A full specification of the IATI metadata in Activities can be found in [IATI, 2011];
Figure 5.2 shows a summary.
Figure 5.2 Summary of IATI metadata in Activities files
1 http://www.iatiregistry.org/
Figure 5.1 This chapter describes the square IATI XML
from the nine-squares model explained in chapter 1
5.1 Variety in IATI XML sets
In chapter 2 we have described the nature of the structured data and unstructured documents and
concluded that they have a common middle ground in Information Entities. In chapter 3 we have seen
that the analytical role of Information Entities is to provide views on data and documents. These views
allow for filtering data and documents together and therefore enhance the Inmon exploration use
case. It is therefore of importance that different data sources are combined into integrated information
entities. In a BI project data profiling is used to audit the different sources available and specify the
transformations necessary to bring them into a common format and meaning. The following data quality
aspects typically are addressed:
• Inconsistencies in spelling, eg. organisation vs organization
• Inconsistencies in innerfield logic, eg. 31-12-2012 vs 12/31/2012 for a date
• Inconsistent use of definitions, eg. different calculations of net value after taxes
• Inconsistent values in different sources, eg. different addresses of the same client
• Completeness of values, eg. missing fields in a record
• Double entries, eg. a client appears several times in one source
• Referential Integrity, eg. a client key in a transaction does not have a corresponding CRM entry
• Semantically unlikely values, eg. a shoe size of 64
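Several of these checks are simple to automate. The Python sketch below runs four of them (completeness, double entries, referential integrity and inner-field date logic) over hypothetical client and transaction records; the record layout and the agreed date format day-month-year are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records containing one example of each issue.
clients = [
    {"key": "C1", "name": "Acme"},
    {"key": "C1", "name": "Acme"},   # double entry
    {"key": "C2", "name": None},     # missing field
]
transactions = [
    {"client": "C1", "date": "31-12-2012"},
    {"client": "C9", "date": "12/31/2012"},  # unknown client, deviating date format
]


def matches(date_text, fmt):
    """True if date_text can be parsed with the given strptime format."""
    try:
        datetime.strptime(date_text, fmt)
        return True
    except ValueError:
        return False


# Completeness: records with a missing field.
incomplete = [c["key"] for c in clients if any(v is None for v in c.values())]

# Double entries: keys that appear more than once in one source.
duplicates = [key for key, n in Counter(c["key"] for c in clients).items() if n > 1]

# Referential integrity: transactions pointing to a client that does not exist.
known_keys = {c["key"] for c in clients}
orphans = [t["client"] for t in transactions if t["client"] not in known_keys]

# Inner-field logic: dates that do not follow the agreed day-month-year format.
bad_dates = [t["date"] for t in transactions if not matches(t["date"], "%d-%m-%Y")]

print(incomplete, duplicates, orphans, bad_dates)
```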
In our project many of these issues are pre-empted by the IATI standard providing a clear set of
metadata concerning domain definitions, technical definitions and publication history definitions. Many
of the consistency and integrity issues mentioned are taken care of before publication. Currently the
biggest issue is the completeness of data, and especially the variety in completeness. An investigation
by the IATI programme office [IATI, 2012] shows that only the following core data elements are being
provided by most publishers:
1. Activity period dates
2. Activity status
3. Participating organisations
4. Geography (not providing recipient country or region)
5. Sectors
6. Transactions, amount and dates, providing and receiving organisations
We also carried out our own investigation of Activity files, using the activities classified against
recipient country Sudan. We selected Sudan because its activities appeared to involve many
different donors. We found a spread of 132 data elements (including technical
metadata; see [Appendix Sudan]) with an overlap of 8 data elements across all files. That is, for
aggregated analysis only 8 data elements can be fully compared with each other across all Sudan sets.
These data elements were:
1. Activity identifier
2. Activity title
3. Activity description
4. Participating Organisation name
5. Participating Organisation role
6. Transaction value
7. Transaction type
8. Transaction code
This is the reason why the transformation flow focuses on the core data elements outlined in
chapters 7, 8 and 9. One source of variety between publishers is the set of data elements they
publish, as shown above; another is the cardinality
between the data elements. Again, during investigation of the Sudan sets, differences were found
between publishers in:
• The number of Activity records in a published Activities file
• The number of Transactions per Activity
• The number of attributed Sectors per Activity determining the weight of Transactions
ETL scripts are BI tool jobs strung together to transform different data elements into a common format.
The variety described above made it difficult to transform the data solely on the basis of its abstract schema.
Additional logic is required to determine the exact number of links between elements. This makes the
ETL jobs elaborate, complex and hard to combine into a well-coordinated script.
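The spread and overlap of data elements described in this section can be illustrated with a minimal Python sketch. It is not the profiling procedure used in the project: the two inline files are hypothetical, and only element paths are counted (attributes and technical metadata are ignored, which is a simplifying assumption).

```python
import xml.etree.ElementTree as ET

def element_paths(xml_text):
    """Return the set of element paths (e.g. 'iati-activity/title') used in one file."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    for activity in root.iter("iati-activity"):
        walk(activity, "")
    return paths

# Two tiny hypothetical publisher files, not real IATI data.
FILE_A = """<iati-activities>
  <iati-activity><iati-identifier>A-1</iati-identifier><title>Wells</title>
    <sector code="140"/></iati-activity>
</iati-activities>"""
FILE_B = """<iati-activities>
  <iati-activity><iati-identifier>B-7</iati-identifier><title>Schools</title>
    <activity-status code="2"/></iati-activity>
</iati-activities>"""

per_file = [element_paths(FILE_A), element_paths(FILE_B)]
spread = set.union(*per_file)          # every element seen in any file
overlap = set.intersection(*per_file)  # elements present in all files
print(len(spread), sorted(overlap))
```

Only the elements in the overlap can be compared across all files without further conditional logic.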
5.2 Volume in IATI XML sets
During our data profiling research we discovered that the structured data sets are published in
different sizes, ranging from sets with one activity to sets with thousands of activities. The number of
accumulated activities is in the order of tens of thousands. An analysis of the IATI registry shows that
at the time of writing some 1400 Activity files had been published in the first iteration after
publication started in 2011. OpenSpending [OpenSpending, 2011] transformed these Activity files into
weighted transactions, i.e. one line for each transaction per sector per activity. This produced
some 450,000 transactional rows. The CSV set weighs around 600 MB and takes about 3 minutes to
open on a consumer desktop. Judging from the Activity files we investigated, one Activity record
contains 10 unweighted transactions on average. This suggests that in the order of 45,000 individual Activity
records were published in the first iteration (450,000 / 10; a rough estimate, since the weighted rows
count a transaction once for every attributed sector). It should be expected that this number will grow in
the next iterations and contribute to the accumulation.
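The notion of weighted transactions (one line per transaction per sector per activity) can be sketched in Python as follows. This is an illustration only and not the OpenSpending transformation itself; the field names and the sample activity are assumptions.

```python
def weighted_rows(activity):
    """Expand one activity into one row per transaction per sector."""
    rows = []
    for tx in activity["transactions"]:
        for sector in activity["sectors"]:
            share = sector["percentage"] / 100.0
            rows.append({
                "activity": activity["id"],
                "sector": sector["code"],
                "amount": tx["value"],
                "weighted_amount": tx["value"] * share,
            })
    return rows

# Hypothetical activity with two sectors and two transactions.
activity = {
    "id": "XX-1-0001",
    "sectors": [{"code": "140", "percentage": 60},
                {"code": "230", "percentage": 40}],
    "transactions": [{"value": 1000.0}, {"value": 500.0}],
}

rows = weighted_rows(activity)
total_weighted = sum(r["weighted_amount"] for r in rows)
print(len(rows), total_weighted)
```

Two transactions attributed to two sectors yield four rows, which shows why the number of weighted rows grows faster than the number of activities or transactions.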
The combination of variety and volume posed a barrier to getting the data quality investigations started:
to investigate which types of variety were present, Activity files had to be loaded in bulk into our
Data Analysis tools, and due to the volume of the sets the performance of these tools was low. This hampered
learning which parts were variant and which parts were common. Disclosing the data sets for
aggregated analysis thus involves two problems that reinforce each other. Analysts acting on their own would face
these same problems. Our information platform lowers the barriers [Cusumano, 2010] for them by
preparing the separation of variety and volume. It does this by deploying dimensional modelling.
5.3 Dimensional modelling
In transactional systems it is best practice to represent data in a normalised format. Normalisation is
the process of organizing the fields and tables of a relational database to minimize redundancy and
dependency². This prevents maintenance problems. In analytical systems data is represented in a
dimensional format. This format is optimised for aggregated calculations. To model data for
aggregation analysis we build on Kimball’s dimensional modelling method [Kimball, 2004]. The focus
of a dimensional model is the fact table. Every row in the fact table is a registration of a measured event.
In our case this is either an activity or a transaction. The viewpoints or dimensions are determined by
asking questions such as who is involved, what is it about, where does it take place, when does it take
place and why does it take place [Linden, 2012].
2 http://en.wikipedia.org/wiki/Database_normalization
Figure 5.3 Target dimensional model in our project; Two fact tables: an Activity Fact table with grain
“An activity per policy per sector” and a transaction Fact table with grain “A transaction per sector”.
Kimball’s method promotes the clear separation of facts and dimensions. Facts are entities with
measurement values associated with them that can be studied in aggregate. Dimensions are entities
that look at facts from specific viewpoints. Often facts are analysed against two dimensions at one
time, for instance the number of Activities sponsored by a Donor in a specific period or the
accumulated Committed investment for a Recipient Country in a specific period.
Kimball’s method optimises facts for calculation by externalising dimensional weight, i.e. dimensions
are referenced by means of lightweight keys that point to entities with heavier descriptive content.
This rationale stems from the limited computational and memory capacity of data warehouse
servers, a limitation that has since been overcome by hardware developments. However, the modelling approach
is becoming relevant again due to in-browser and on-device processing requirements. In our
case too we expect browser-based and mobile-based analysis to become more popular, since analysis is
made relevant by sharing it with peers and made useful by having it available on location.
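The idea of externalising dimensional weight can be illustrated with a small Python sketch. The tables, keys and values below are hypothetical and only serve to show that aggregation works on lightweight keys, while the heavier descriptive content is looked up afterwards.

```python
from collections import defaultdict

# Dimension tables: small keys point to heavier descriptive records (hypothetical).
donors = {1: {"name": "Donor A", "country": "NL"},
          2: {"name": "Donor B", "country": "UK"}}
periods = {10: {"year": 2011}, 11: {"year": 2012}}

# Fact rows carry only dimension keys and measurement values.
facts = [
    {"donor_key": 1, "period_key": 10, "fact_count": 1, "committed": 100.0},
    {"donor_key": 1, "period_key": 10, "fact_count": 1, "committed": 250.0},
    {"donor_key": 2, "period_key": 11, "fact_count": 1, "committed": 75.0},
]

def aggregate(rows, measure):
    """Sum one measure per (donor, period) combination using keys only."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["donor_key"], row["period_key"])] += row[measure]
    return dict(totals)

counts = aggregate(facts, "fact_count")
committed = aggregate(facts, "committed")

# Descriptive labels are resolved only when the result is presented.
for (donor_key, period_key), n in sorted(counts.items()):
    print(donors[donor_key]["name"], periods[period_key]["year"], int(n),
          committed[(donor_key, period_key)])
```

The two aggregates correspond to the examples in the text: the number of Activities per Donor per period and the accumulated Committed amount per period.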
Figure 5.3 shows our target dimensional model for Aid Activities. In the explanation of the data flow
process we will use a simplified version of this model that is used to explain the representative
transformation steps involved. This simplified data model is given in Figure 7.2.
In the target dimensional model two fact items are visible: an Activities fact item and a Transaction
fact item. The Activities fact item has three measurement values: a fact count, a sector percentage
and a policy significance. The fact count is used to optimise calculation for counting the number of
Activities per Dimension. The sector percentage is included because each Activity can contribute to
one or more Sectors, as shown in Figure 5.4. The policy significance is present because each Activity
can be seen to implement one or more Policies, also shown in Figure 5.4. The inclusion of the sector
and policy weighting in the Activity fact item ensures that calculations against the Sector and
Policy dimensions respectively can be done quickly. Consequently, the grain of the Activity fact item is one Activity
per Policy per Sector.
The Transaction fact item has four measurement values: a fact count, an amount, a sector
percentage and a percentual amount. The fact count is used to optimise calculation for counting the
number of Transactions per Dimension. The amount is used to calculate the total amount per
Dimension. The sector percentage is included because a transaction belongs to an Activity and each
Activity can contribute to one or more Sectors, as shown in Figure 5.4. The inclusion of the sector
weighting in the Transaction fact item ensures that calculations against the Sector dimension can be
done quickly. The percentual amount is used to calculate the total percentual amount per Dimension.
Consequently, the grain of the Transaction fact item is one Transaction per Sector.
5.4 Extract Transform Load investigations
Figure 5.4 shows selected elements of the IATI Activity as it is published. These elements are selected
on the basis of challenges in transforming to a dimensional model. One of the challenges is that an
Activity can contain one or more instances of a dimensional type such as Sector and Policy. The
implication for the transformation is that one cannot assume that the number of dimensional elements
per activity is constant across published sets. In our investigations we encountered a set with one
sector per activity, but also a set with several sectors per activity. Within sets, too, these numbers can
vary per activity. Another challenge is the indirect dependency between the transactional values and
the sectoral attributions. Both the number of transactional elements and the number of sectoral
elements can vary together within one Activity.
Figure 5.4 Selected elements in an Activities set as published by IATI signatories. Elements are
selected on the basis of challenges in transforming to a dimensional model
The power of ETL scripts is that the transformation requirements discovered in a subset of the sources
can be transferred to the remainder of the sources. The strategy to deal with hierarchical relationships
between data elements is to decompose the elements so that each element has a fixed schema. On
the basis of this fixed schema transformation jobs can be specified and repeated. Due to the variety
described above our decomposition needed conditional logic to check the number of dimensional
elements per activity. This logic is required to determine the exact number of instances involved in
one-to-many and many-to-many relations between nodes.
Our technology choice for transformation has been CouchDB, which is described in the next chapter.
We have investigated two transformation routes before deciding on CouchDB: JasperETL and Google
Refine [SE2][SE3]. JasperETL is the ETL component of the JasperSoft BI suite. The ETL requirements
described above in combination with the storage and data typing constraints of the JasperETL solution
made ETL in JasperSoft a laborious and time-consuming process [SE2]. Therefore we concluded
JasperETL was not a suitable candidate to automate ETL for IATI Activities.
Google Refine is a data cleansing and transformation tool using object-oriented design behind the
scenes. Using Refine we were able to speed up the ETL process in comparison to JasperETL because of
the lack of tedious storage management. Refine transforms the XML node hierarchy into nested
records. The transformation strategy then entails transposing rows to columns, and filling out parent
elements over all rows [SE3]. The transposition assumes a fixed number of rows to be transposed.
Here again the variety prevented reuse of the transposition specifications. Because of the variation in
the number of elements the number of columns that have to be transposed is not the same for each
record. Therefore we concluded Google Refine was not a suitable candidate to automate ETL for IATI
Activities.
In the case of both JasperETL and Google Refine we did not investigate the use of an XSLT
stylesheet³. An XSLT stylesheet might have been able to support us in automating the conditional logic
referred to above. By that time, however, we had embarked on our experiments with CouchDB and committed
ourselves to this route. The advantage of CouchDB for dealing with the variety described above is that
it is designed to loop over hierarchical data objects. Therefore we concluded CouchDB was a suitable
candidate to automate ETL for IATI Activities.
In this chapter we have seen that the IATI XML sets are characterised by a large variety and volume.
In the next chapter we will describe how the selected technology CouchDB deals with this variety and
volume.
3 XSLT: Extensible Stylesheet Language Transformations; a declarative, XML-based language used for
the transformation of XML documents. http://en.wikipedia.org/wiki/XSLT
Chapter 6. Technology for Variety and Volume
In this chapter we describe the technology we have selected on the basis of the variety and volume of
data and documents we discovered. In chapter 5 we profiled the IATI XML sets. Overall two
main types of variety were distinguished:
1. documents vs data
2. variety between XML data sets
The volume of the data resides on two levels:
1. the number of data sets and documents
2. the size of (some of) the data sets
In this project the document-oriented datastore CouchDB has been chosen as our primary data
processing technology. Derived from this we chose Python as a pre-processing technology and
Exhibit-SIMILE as a post-processing demonstrator technology. Before we got to this stage we
experimented with and investigated different options. These investigations are described in
chapter 11. In this chapter we explain our selected data staging and information platform technology
CouchDB. Chapters 7, 8 and 9 will give detailed accounts of an exemplary data flow.
6.1 Document-oriented datastore CouchDB
CouchDB is a document-oriented store. A CouchDB document is a set of key-value pairs and
can therefore also be seen as an Object in the object-oriented sense of the word. Each value in a
document carries its own semantic declaration in the key it is associated with. The syntactic
declaration is the same for all keys: they are strings. Values are usually strings or numbers, but
can be objects or arrays of objects as well. This differs from relational database technology, where the
meaning of the values is captured in a schema and columns can be of various data types. In a
CouchDB document the schema is generic in the sense that it requires valid key-value pair
combinations of strings, objects of strings or arrays with objects of strings. The specific
implementation complies with the JavaScript Object Notation (JSON¹). CouchDB was designed to work
with web browsers. Most web browsers make intensive use of JavaScript². This design choice makes
storage in CouchDB very accessible and frees developers from tedious storage management. This
greatly accelerates staging and provisioning processes.
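The generic schema can be illustrated with a short Python sketch in which a plain dictionary stands in for a CouchDB database. The documents and their field names are hypothetical; the point is that documents with different structures can be stored side by side as JSON without declaring column types beforehand.

```python
import json

staging = {}  # stand-in for a CouchDB database: document id -> JSON text

def save(doc):
    """Store a document as JSON text under its identifier."""
    staging[doc["_id"]] = json.dumps(doc)

# Two hypothetical documents with different structures.
save({"_id": "activity-1", "type": "activity", "title": "Biogas installations",
      "sectors": [{"code": "230", "percentage": 100}]})
save({"_id": "report-1", "type": "report", "text": "Progress report text",
      "entities": ["Sudan"]})

docs = [json.loads(text) for text in staging.values()]
for doc in docs:
    print(doc["_id"], sorted(doc.keys()))
```

Each value is interpreted through its key, so no shared table definition is needed for the two document types.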
The name CouchDB is an acronym for “cluster of unreliable commodity hardware”, because it is
designed to scale out over a large set of elementary machines. This must be understood in
comparison to relational database techniques that adopt a scaling up approach: deploying more
powerful machines. The scaling out approach addresses the limitations of the scaling up approach.
These are caused by the complexity and intensity of the coordination required to distribute a relational
database across several machines. The scaling out strategy comes with a different development
approach. Tasks must be specified in simple jobs that can be farmed out in parallel over a large set of
1 http://en.wikipedia.org/wiki/JSON
2 http://en.wikipedia.org/wiki/JavaScript
Figure 6.1 This chapter describes the square technology
for Variety & Volume from the nine-squares model
explained in chapter 1
machines. An abstraction layer has been developed that shields programmers from dealing with the
coordination of these parallel tasks, often referred to as the MapReduce approach [Dean, 2006].
MapReduce functions make it easy to map documents onto inverted indexes of attributes, which can
then be aggregated using reduce operations. MapReduce therefore implements BI requirements but
uses a different combination of content and logic abstractions than a conventional relational
application. More details are explained below.
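The principle of mapping documents onto an inverted index and then reducing the indexed values can be simulated in a few lines of Python. This is a conceptual sketch only: CouchDB map and reduce functions are written inside design documents, and the function names and documents below are assumptions made for illustration.

```python
from collections import defaultdict

def map_activity(doc):
    """Emit one (key, value) row per recipient country of an activity."""
    for country in doc.get("recipient_countries", []):
        yield country, 1

def reduce_sum(values):
    """Aggregate all values emitted under the same key."""
    return sum(values)

def build_view(docs, map_fn, reduce_fn):
    """Build a sorted inverted index (key -> reduced value) from documents."""
    index = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            index[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(index.items())}

# Hypothetical documents; the third one lacks the mapped attribute.
docs = [
    {"_id": "a1", "recipient_countries": ["Sudan"]},
    {"_id": "a2", "recipient_countries": ["Sudan", "Kenya"]},
    {"_id": "a3"},
]

view = build_view(docs, map_activity, reduce_sum)
print(view)
```

Because the map function is applied to each document independently, documents with missing attributes simply emit no rows, and the work can in principle be distributed over several machines.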
The three main advantages of CouchDB for our project are that:
1. We use its generic schematisation to capture sources with diverse structures in a common staging
area very early in our transformation process. This prevents us from having to invest heavily in
data typing and storage management, and it allows for flexible experimentation and agile
adaptation. This helps us to address the variety problem.
2. We can use its MapReduce functionality to precompute inverted indexes representing the
analytical views we want to provide to BI applications. Because this allows for a
scaling out approach, it anticipates a large volume growth of IATI data sets. This helps us to
address the volume problem.
3. Because of its web-based design we can use its replication capabilities to easily share data and
logic between community participants.
6.2 Functions of CouchDB
The basic building block of CouchDB is a JSON document. This is the CouchDB equivalent of a
database record. With these documents web applications can be built using a number of functions.
These are:
• A Design document as a web application project, also called a CouchApp
• Views on JSON documents using MapReduce functions, providing property indexes to documents
• List functions for customising views in a client-specific format
• Show functions to present individual user documents using templates and directives
• Update handlers to pre-process content into JSON
• Assets such as HTML pages, JavaScript libraries and multimedia files held in a design document
6.2.1 Design documents
A design document in CouchDB acts as a web application project and holds all view, list, show and
update functions for one specific web application. A design document can easily be exchanged
between CouchDB instances. A CouchDB web application can be seen as a two-tier web application.
Relational web applications often have three tiers: database, application logic and front-end. These three
architectural components are all provided by CouchDB: JSON documents as the equivalent of the
database; update, view, list and show functions as the equivalent of the application logic; and assets and
templates as the equivalent of the front-end. Because a browser interacts only with CouchDB for both
data and logic, it is seen as a two-tier application.
6.2.2 Views
A view must be seen as the CouchDB equivalent of behind-the-scenes business logic in a web
application. A view consists of the materialised output of a MapReduce function and provides a
precomputed inverted index of properties to documents. These indexes are cached for performance.
In our information platform views are used to precompute analytical collections of information entities.
Figures 8.4 and 8.8 in chapter 8 are examples of MapReduce functions generating views.
6.2.3 List functions
The MapReduce functions in a view emit, i.e. put out, rows of key-value pairs. List functions allow the
developer to easily do something with those rows. In our platform we use the list functions to turn the
elements in these rows into front-end specific formats. A list function can be seen as the CouchDB
equivalent of a servlet that formats the response. Figures 8.10 and 8.12 in chapter 8 are examples of
list functions that present views as Exhibit-Simile data items.
6.2.4 Show functions
Show functions are used to present single JSON documents. For this, templates can be applied that
make use of directives. This means that the templates hold marked-up variables that are filled in at
runtime. They can be seen as the CouchDB equivalent of Java Server Pages. In our platform we use
the Mustache³ directive framework, which is recommended by the CouchDB community. The
demonstrator pages shown in chapter 3 are all templates with directives referring to either JSON documents
(an activity, a report or a specification document) or HTTP Request object variables (view names).
These are rendered as HTML pages that load data from list functions to fill the page with data, as
shown in Figure 9.2 in chapter 9. The precise interaction behavior is visualised in Figure 6.9 below.
6.2.5 Update handlers
An update handler pre-processes a submitted document before storing it as JSON. It can be seen as
the CouchDB equivalent of a servlet handling HTTP PUT and POST requests. In our platform we use
update handlers to transform single XML activity records into JSON documents. This is shown in
Figure 7.6 in chapter 7.
6.2.6 Assets
Assets are HTML pages, JavaScript libraries and multimedia files used in a web application.
6.3 Comparison to Relational Datawarehousing
ETL process                | Relational (e.g. PostgreSQL, MySQL)           | CouchDB
Collect data sources       | Make a local copy or retrieve network address | Make a local copy or retrieve network address
Extract data from source   | Use appropriate connector                     | Pre-processing script and/or update handlers
Transform data             | ETL jobs                                      | MapReduce/Recline⁴
Load data in warehouse     | ETL jobs                                      | Not applicable
Optimise aggregated access | Create indexes                                | MapReduce
Provide to client          | SQL/MDX⁵ + application server                 | List and show
Figure 6.2 ETL tasks mapped onto typical relational BI environment and CouchDB
A datawarehouse is a subject-oriented, integrated and time-dependent database with relatively static
data. Its technical purposes are to integrate disparate data-sources, to reconcile them, to optimise the
3 http://en.wikipedia.org/wiki/Mustache_(template_system)
4 Recline is the CouchDB version of Google Refine, a data cleansing and transformation tool. This was
not used in our project. github.com/maxogden/recline
5 MDX: Multi-dimensional expressions, a specialized syntax for querying and manipulating the
multidimensional data stored in OLAP cubes. en.wikipedia.org/wiki/MultiDimensional_eXpressions
analysis of large volumes of data in order to improve query response time, to report in a flexible way,
to build a historical record and to alleviate operational systems [Kimball, 2004]. A datamart is
comparable to a datawarehouse, but usually holds a smaller amount of data and is pre-structured for a
specific purpose. A datamart is often compared to a distribution center that is used to bring a selection
of products closer to the consumer. A datamart offers more direct possibilities to cater to specific
information needs. A data cube is very similar to a datamart, but this term is often used for
vendor-specific solutions [Kimball, 2004].
CouchDB serves very similar purposes in our project. The main differences lie not so much in the
steps that are applied in the ETL process (Figure 6.2), but in the balance of freedom between schema
and logic. In relational applications schematisation is very strict, but the logic interacting with these
schemas can be freely defined. In CouchDB schematisation is very loose and only a few logical
function types can be applied. These functions are powerful because they reside at a more abstract
development level than standard web application logic. Tedious storage management concerns have
been optimised and encapsulated in the CouchDB engine.
Figure 6.3 Conceptual difference between Relational approach (left) and CouchDB approach (right)
Figure 6.3 shows that CouchDB, as an example of the non-relational approach [Madsen, 2012],
incorporates two architectural changes: using the file system directly and separating concurrent logic
from strategic application logic. The first is a reaction to the need to deal with web page processing.
By cutting out the database layer, distribution management becomes easier. Combined with a
drop in costs of hardware and an increase in processing power of commodity machines, this makes a scaling out
approach possible.
The second architectural change is a software abstraction that shields programmers from dealing with
concurrency issues. Building on use cases for the web, application logic is framed in a high-level
Gamma Strategy pattern [Gamma, 1995]. Formally speaking, the strategy pattern defines a family of
algorithms, encapsulates each one, and makes them interchangeable. Strategy lets the algorithm vary
independently from clients that use it. In this way CouchDB updates activity data and activity
documents as JSON documents, so they can be transformed by MapReduce functions into JSON entity
aggregates. These are provided by list functions to different BI clients.
6.4 Aggregation strategy in this project
Aggregation in CouchDB is done using Reduce functions. When the index consists of an array of keys
(complex keys⁶) the associated values computed by the Reduce function can be aggregated according
to the position of the key in the array. For example, if we map Activity documents onto a view
with an array index as shown in Figure 6.4, the reduce function also precomputes the number of
activities at the aggregation level region. At this level the numbers of activities for each country are
aggregated into their sum per region. By calling Reduce with a different group level
value, behavior similar to drill-up and drill-down in BI OLAP applications can be emulated.
[region, country] -> #Activities
Figure 6.4 Example of a complex key; ie an array with two simple keys
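The effect of the group level on a complex key can be simulated in Python. The rows below are hypothetical emitted rows of the form [region, country] -> 1; the function mimics what a reduce with a given group level returns and is not CouchDB's own implementation.

```python
from collections import defaultdict

# Hypothetical rows as a map function might emit them: ([region, country], 1).
rows = [
    (["Africa", "Sudan"], 1),
    (["Africa", "Sudan"], 1),
    (["Africa", "Kenya"], 1),
    (["Asia", "Nepal"], 1),
]

def reduce_at(emitted, group_level):
    """Sum values per key prefix of length group_level, in sorted key order."""
    totals = defaultdict(int)
    for key, value in emitted:
        totals[tuple(key[:group_level])] += value
    return dict(sorted(totals.items()))

by_country = reduce_at(rows, 2)  # drill-down: #Activities per [region, country]
by_region = reduce_at(rows, 1)   # drill-up: #Activities per [region]
print(by_country)
print(by_region)
```

Moving from group level 2 to group level 1 corresponds to a drill-up from country to region on the same precomputed index.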
CouchDB’s API simplicity and speed come from its linear access to precomputed inverted indexes,
i.e. views. In these views keys (single or arrayed) are always Unicode sorted. Care has to be taken to
design the order in the key array correctly to allow access to the desired range of aggregations. A
CouchDB equivalent of a datamart is made by precomputing the desired permutations of a
dimensional key set. The general template that was adopted in this project is shown in Figure 6.5.
Key array -> value
[dimensional keys permutation, fact key, roll up dimensions] -> measurement
Example permutations
[recipient, donor, policy, sector, activity-identifier, cost type, year, month, day] -> Amount
[sector, recipient, donor, policy, activity-identifier, cost type, year, month, day] -> Amount
[policy, sector, recipient, donor, activity-identifier, cost type, year, month, day] -> Amount
[donor, policy, sector, recipient, activity-identifier, cost type, year, month, day] -> Amount
Figure 6.5 Practice adopted in this project to design views
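The key template of Figure 6.5 can be sketched in Python as follows. The four orders shown in the figure are rotations of the same dimension list, so the sketch generates them by rotation; the sample fact and its values are hypothetical.

```python
DIMENSIONS = ["recipient", "donor", "policy", "sector"]
ROLL_UP = ["cost_type", "year", "month", "day"]

def rotations(names):
    """Return every rotation of the list, one per possible leading dimension."""
    return [names[i:] + names[:i] for i in range(len(names))]

def view_key(fact, order):
    """Build [dimensional keys permutation, fact key, roll-up dimensions]."""
    return ([fact[name] for name in order]
            + [fact["activity_identifier"]]
            + [fact[name] for name in ROLL_UP])

# Hypothetical transaction fact.
fact = {
    "recipient": "SD", "donor": "NL-1", "policy": "gender", "sector": "230",
    "activity_identifier": "NL-1-12345", "cost_type": "C",
    "year": 2011, "month": 6, "day": 30, "amount": 1000.0,
}

# One view per leading (primary) dimension: key array -> Amount.
views = {order[0]: (view_key(fact, order), fact["amount"])
         for order in rotations(DIMENSIONS)}
for leading, (key, amount) in views.items():
    print(leading, key, "->", amount)
```

Each view holds the same fact under a differently ordered key, so that a range query on the first key position returns the slice for one primary dimension.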
Each permutation allows the fact set to be sliced according to a specific sorting order, of which four
are shown in Figure 6.5. Providing a fact table slice prevents flooding the client memory with
the complete fact set. Dimensions in the permutation cannot be rolled up in the aggregation because
for a specific slice the client logic needs all dimensional values to support front-end filtering (shown in
Figures 3.4, 3.5, 3.6 and 3.7 in chapter 3). The fact key serves to bring in the associated fact. The roll-up
dimensions can be used to aggregate the measurements according to a specific reduce level, e.g.
by cost type, by cost type and year, by cost type and year and month, etc. Such an aggregation
strategy allows the front-end to fetch a slice of the fact table using a primary view specification (first
key in order of keys) and a range parameter (startkey and endkey). An example is shown in Figure
6.6.
Example
/buza_iati/_design/activities/_list/item_transaction_by_provorg/fact_transaction_by_provorg?group=true&startkey=["NL-1"]&endkey=["NL-1\ufff0"]
Template
/<database>/<appqualifier>/<appname>/<listqualifier>/<listname>/<viewname>?<aggregate>&<range>
Figure 6.6 API call to request a slice of the fact-table containing all facts for a primary dimension. In
the example a slice of the fact table is requested containing all facts from the provider NL-1 (Dutch
Ministry of Foreign Affairs).
6 http://wiki.apache.org/couchdb/View_collation#Complex_keys
The structure of the URL must be understood as follows. The qualifier “_design” is a prefix to refer to
the trailing CouchDB design document, holding Update, MapReduce and List functions. The qualifier
“_list” is a prefix to refer to the trailing CouchDB list function. A list function processes the output of a
MapReduce function. The CouchDB name for the output of a MapReduce function is a View. The
reference to a View is placed after the reference to a List function.
In the example in Figure 6.6 a list named item_transaction_by_provorg, based on the view named
fact_transaction_by_provorg from the design document named activities hosted in the database
buza_iati, is requested. The list should return an aggregation over the full key array, as indicated by the parameter
group=true. The list should also return not the full set, but only the range in which the first key in the
array starts with "NL-1" and does not sort higher in Unicode order than "NL-1\ufff0". This
ensures that all facts in the view fact_transaction_by_provorg of the providing organisation NL-1 are
returned. The facts will contain keys to all dimensional items specified in the view. Within the slice
containing the primary dimension provider organisation, all other associated dimensions can be used to
filter aggregated values of the facts in the front-end.
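The request in Figure 6.6 can be assembled programmatically. The Python sketch below builds the URL from its parts and encodes the range keys as JSON; the helper name slice_url is hypothetical, and the query string is percent-encoded, which is an assumption about how a client would send it.

```python
import json
from urllib.parse import parse_qs, quote, urlencode, urlsplit

def slice_url(database, design, list_name, view_name, first_key):
    """Build a list-function URL that requests one slice of a view."""
    path = "/".join(quote(part) for part in
                    (database, "_design", design, "_list", list_name, view_name))
    params = {
        "group": "true",
        # The range covers all key arrays whose first element starts with first_key.
        "startkey": json.dumps([first_key]),
        "endkey": json.dumps([first_key + "\ufff0"]),
    }
    return "/" + path + "?" + urlencode(params)

url = slice_url("buza_iati", "activities", "item_transaction_by_provorg",
                "fact_transaction_by_provorg", "NL-1")
print(url)

# Decoding the query string again shows the JSON-encoded range keys.
query = parse_qs(urlsplit(url).query)
startkey = json.loads(query["startkey"][0])
endkey = json.loads(query["endkey"][0])
print(startkey, query["group"])
```

The high Unicode character appended to the end key acts as an upper bound for every key that begins with the requested provider identifier.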
So defining the same views on the fact table with different key orders allows us to keep a check on
volume for the front-end while still providing all related dimensions for all facts containing a primary
dimension. In the demonstrator we have hidden this complexity in the slice selection entrance for the
application shown in Figure 6.7.
Figure 6.7 (Primary) Viewpoints page in the demonstrator. Five primary viewpoints are currently
available: countries (shown), receiving organisations, sectors, funding organisations, and
transactional year. The example shown in Figure 6.6 is hosted in the tab funding organisations.
6.5 System models
Figure 6.8 and Figure 6.9 show the system models [Sommerville, 2009] that illustrate how CouchDB
works together with the pre-processing and front-end components. Figure 6.8 shows the architectural
system model and Figure 6.9 shows the behavioral system model.
Figure 6.8 Architecture diagram of Data Staging solution, Information Platform and Business
Intelligence Application working together
In the architecture model three components Data Staging, Information Platform and Business
Intelligence application are delineated. It can be seen that CouchDB acts partly as the Data Staging
area and partly as the Information Platform.
In the current set-up both IATI XML data and IATI documents are collected on a local file system. In
the future it is anticipated that these are retrieved from the IATI registry by means of webservices.
The XML data is split into individual activities using a Python script, denoted as Python splitter. The
activity is loaded into CouchDB where it is first converted to a JSON document using an external plug-
in, denoted as XML update handler.
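The splitting step can be illustrated with a minimal Python sketch. It is not the project's Python splitter; it only assumes that a published file wraps its records in iati-activity elements, and the sample file is hypothetical.

```python
import xml.etree.ElementTree as ET

def split_activities(xml_text):
    """Return each iati-activity element of a file as its own XML string."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(activity, encoding="unicode").strip()
            for activity in root.iter("iati-activity")]

# Hypothetical Activities file with two records.
SAMPLE = """<iati-activities>
  <iati-activity><iati-identifier>XX-1-0001</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-1-0002</iati-identifier></iati-activity>
</iati-activities>"""

records = split_activities(SAMPLE)
identifiers = [ET.fromstring(r).findtext("iati-identifier") for r in records]
print(len(records), identifiers)
```

Each resulting string is a self-contained XML record that can be submitted on its own for conversion to a JSON document.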
The Activity documents, i.e. Result Reports, in DOCX format are stripped of their presentation and
styling content by a Python script, denoted as Python extractor. The resulting plain text is sent to
the semantic annotation web service Open Calais using another Python script, denoted as Python Calais
connector. The output of the Calais service is a JSON version of the report with the identified
entities included as metadata. In addition, the activity identifiers are extracted from the text
using regular expressions in the Python script and added to the JSON document, which can then be
saved directly into CouchDB.
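The extraction and identifier matching described above might look roughly like the sketch below. It relies only on the fact that a DOCX file is a ZIP archive whose body text sits in word/document.xml. The identifier pattern, the sample report and the field names of the resulting JSON document are assumptions made for illustration, and the call to Open Calais is left out because it needs network access and an API key.

```python
import io
import json
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the body of a DOCX file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Assumed identifier shape for illustration only, e.g. "NL-1-EXAMPLE-001".
ACTIVITY_ID = re.compile(r"\b[A-Z]{2}-\d+-[A-Z0-9]+-\d+\b")


def docx_to_text(docx_bytes):
    """Return the plain text of a DOCX file, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def find_activity_ids(text):
    """Return the distinct identifiers found in the text, in order of appearance."""
    seen = []
    for match in ACTIVITY_ID.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def make_sample_docx():
    """Build a tiny in-memory DOCX so the sketch runs without a real report."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>'
        "<w:p><w:r><w:t>Progress on NL-1-EXAMPLE-001 and </w:t></w:r>"
        "<w:r><w:t>NL-1-EXAMPLE-002.</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>NL-1-EXAMPLE-001 is on schedule.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


text = docx_to_text(make_sample_docx())
# Hypothetical document layout; the real one also carries the Calais metadata.
report_doc = {
    "type": "document",
    "text": text,
    "activity_ids": find_activity_ids(text),
}
print(json.dumps(report_doc, indent=2))
```

Storing the matched identifiers as a list in the report document is what later allows a view to join reports to the activities they mention.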
At this stage both the activity data and documents are accessible in JSON format. From these JSON
documents views are generated with MapReduce functions. These views are materialised entities.
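To make the idea of a materialised view concrete, the sketch below emulates the map and reduce steps in plain Python. In CouchDB itself these functions are written in JavaScript and stored in a design document, so this is an emulation of the principle rather than the platform's code. The sample documents and field names are hypothetical; the group_level argument mirrors the CouchDB query parameter of the same name.

```python
from collections import defaultdict

# Hypothetical, simplified transaction documents.
DOCS = [
    {"type": "transaction", "country": "SD", "year": 2010, "value": 100.0},
    {"type": "transaction", "country": "SD", "year": 2011, "value": 50.0},
    {"type": "transaction", "country": "ET", "year": 2011, "value": 70.0},
    {"type": "activity", "country": "ET"},
]


def map_transaction(doc):
    """Map step: emit one (key, value) row per transaction document."""
    if doc.get("type") == "transaction":
        yield (doc["country"], doc["year"]), doc["value"]


def build_view(docs, map_fn, group_level):
    """Reduce step: sum the emitted values per key prefix of the given length."""
    totals = defaultdict(float)
    for doc in docs:
        for key, value in map_fn(doc):
            totals[key[:group_level]] += value
    return dict(sorted(totals.items()))


by_country = build_view(DOCS, map_transaction, group_level=1)
by_country_year = build_view(DOCS, map_transaction, group_level=2)
print(by_country)
print(by_country_year)
```

Because the key is an ordered list, one map function serves several aggregation levels: a shorter key prefix gives a coarser total.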
The front-end application is served up by CouchDB as an HTML page built from an HTML template. This
is done using a show function that populates the template with data and metadata. The page holds
references to JavaScript libraries which load the Exhibit-Simile JavaScript engine. It also calls back
to CouchDB to serve up lists of items, i.e. entities, that the Exhibit-Simile engine can consume. This
engine processes the incoming items into a local data graph which can be viewed and manipulated
using Exhibit-Simile widgets. Individual activities and reports can be retrieved from the overview
widgets. These are rendered using show functions that are directed via templates.
Note that this front-end technology can be replaced by another. On the CouchDB side only new list
functions have to be designed. The MapReduce functions that produce the entities remain the same.
This shows how the data sources and the front-ends consuming them are loosely coupled.
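The loose coupling can be illustrated with a small sketch in which the same view rows are serialised for two different clients. CouchDB list functions are JavaScript, so the Python below only mimics their role. The rows, the formatter names and the CSV client are hypothetical; the {"items": [...]} envelope with label and type properties is the data format Exhibit loads.

```python
import csv
import io
import json

# Hypothetical view rows; in the platform these come from a MapReduce view.
ROWS = [
    {"key": "SD", "value": {"label": "Sudan", "type": "Country", "total": 150.0}},
    {"key": "ET", "value": {"label": "Ethiopia", "type": "Country", "total": 70.0}},
]


def to_exhibit_json(rows):
    """Wrap the entities in the {"items": [...]} envelope that Exhibit loads."""
    return json.dumps({"items": [row["value"] for row in rows]})


def to_csv(rows):
    """Serialise the same entities as CSV for a spreadsheet-style client."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["label", "type", "total"])
    writer.writeheader()
    for row in rows:
        writer.writerow(row["value"])
    return buffer.getvalue()


# One formatter per front-end; adding a client means adding an entry here.
FORMATTERS = {"exhibit": to_exhibit_json, "csv": to_csv}


def render(rows, client):
    """Pick the formatter for the requesting client; the rows stay unchanged."""
    return FORMATTERS[client](rows)


print(render(ROWS, "exhibit"))
print(render(ROWS, "csv"))
```

Only the formatter table changes when a front-end is swapped; the rows, and therefore the views behind them, are untouched.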
Figure 6.9 shows this dynamic in a behavioral diagram. Here three engineering interactions and three
analyst interactions can be seen. The top two engineering interactions show the pre-processing and
uploading of an activity XML record and an activity DOCX document, respectively. The third
engineering interaction shows the generation of views with materialised entities. It can be seen that
the views form the interaction points between engineering and analysis.
The bottom analyst interaction shows the request of an Activity or Document description page. The
user-facing pages were shown in Figure 3.2 and Figure 3.3 in chapter 3. These HTML pages are
templates that are populated with the corresponding JSON document using Mustache directives. Once
an Activity page is loaded it calls back to CouchDB to request the related documents; a Document
page likewise requests the related activities. These have been prepared using views.
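How a template is populated from a JSON document can be sketched with a deliberately small stand-in for Mustache. It supports only plain {{name}} tags, which Mustache renders HTML-escaped and leaves empty when the key is missing; sections, partials and the other Mustache features are not covered. The template, the document and the function name are hypothetical.

```python
import html
import re

# Hypothetical template and JSON document.
TEMPLATE = "<h1>{{title}}</h1><p>Reported by {{organisation}}</p>"
DOC = {"title": "Example activity", "organisation": "Example Org & Partners"}

# Matches a plain variable tag such as {{title}} or {{ title }}.
TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, doc):
    """Replace each {{name}} tag with the HTML-escaped value from the document."""
    def substitute(match):
        return html.escape(str(doc.get(match.group(1), "")))
    return TAG.sub(substitute, template)


page = render_template(TEMPLATE, DOC)
print(page)
```

The template stays free of logic, so the same JSON document can feed a different page layout without any change on the data side.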
The description pages described above are accessed by the user from aggregated fact pages, as
shown in Figure 3.4 and Figure 3.7 in chapter 3. The middle analyst interaction shows how such a page
interacts with Exhibit-Simile and CouchDB. In the first pass the page is requested using a show
function that populates a template with specifications. These are general variables that occur
multiple times in the template, so it makes sense to store them externally. This is done in a special
JSON document. In the second pass the HTML page calls back to request a slice of the fact tables for
activities and transactions. This slice corresponds to the viewpoint selected in the top analyst
interaction, which is explained below. For this slice all corresponding dimensional items are requested
as well. Exhibit loads the items into a client-side data graph, which is controlled and viewed using
Exhibit-Simile widgets. Several examples were shown in Figure 3.4 to Figure 3.7 in chapter 3. The
facet widgets allow for filtering items. Filter selections propagate through the graph according to how
the collections are bound together. This is explained in detail in chapter 9. Exhibit tabs display the
corresponding tables with aggregated facts.
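The second pass and the facet filtering can be pictured with the following sketch. It is a simplified model under stated assumptions: the fact rows, the dimension tables and the function names are hypothetical, the slicing happens in CouchDB views in the actual platform, and the facet filtering is done client-side by Exhibit rather than by code like this.

```python
# Hypothetical fact rows and dimension items.
FACTS = [
    {"activity": "A1", "country": "SD", "sector": "energy", "value": 100.0},
    {"activity": "A2", "country": "SD", "sector": "water", "value": 50.0},
    {"activity": "A3", "country": "ET", "sector": "energy", "value": 70.0},
]
DIMENSIONS = {
    "country": {"SD": {"label": "Sudan"}, "ET": {"label": "Ethiopia"}},
    "sector": {"energy": {"label": "Energy"}, "water": {"label": "Water"}},
}


def fact_slice(facts, viewpoint, member):
    """Return the facts that belong to one member of the chosen viewpoint."""
    return [fact for fact in facts if fact[viewpoint] == member]


def dimension_items(facts, dimensions):
    """Collect only the dimensional items that the slice refers to."""
    items = {}
    for name, members in dimensions.items():
        used = {fact[name] for fact in facts}
        items[name] = {key: value for key, value in members.items() if key in used}
    return items


def apply_facet(facts, dimension, selected):
    """Keep the facts whose value for the dimension is among the selected ones."""
    return [fact for fact in facts if fact[dimension] in selected]


sudan = fact_slice(FACTS, "country", "SD")
sudan_items = dimension_items(sudan, DIMENSIONS)
sudan_energy = apply_facet(sudan, "sector", {"energy"})
energy_total = sum(fact["value"] for fact in sudan_energy)
print(sudan_items)
print(energy_total)
```

Requesting only the slice and its dimensional items keeps the client-side data graph small, which is the reason for selecting a viewpoint before the page loads its data.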
Figure 6.9 Sequence diagram of Data Staging solution, Information Platform and Business Intelligence
Application working together

JSP, Mustache directive; when mark ups are encountered by processing engine, logic is directed to fetch and fill in data that matches with marked up variable document X X IATI term for a report describing the progress of an Activity; CouchDB term for a JSON object; in this report either Activity document or JSON document element X IATI term for a node in the XML description of an Activity; corresponds to an entity emit X CouchDB term for the output of a Map function entity X Domain concept used by humans to reason about the domain and used by software as a data object facet X Exhibit-Simile term for a BI dimension or viewpoint facts X BI term for a domain entity with a measurement value associated with it filter X Software term for selecting entities with specific value for a property grain X Most detailed level at which facts are available for analysis in a BI application
  • 12. Term AidDomain BIProcess Software Description indicator X BI term for a type of performance measurement. item X Exhibit-Simile term for an entity/data object level X CouchDB term for an aggregation level of a Reduce function list X CouchDB term to wrap the output of a MapReduce Function (a View) into a client-specific format precompute X Software term to indicate that the result of a query is stored for immediate response publication X IATI term for making Activity data available for third parties signatories X IATI term for organisations who have signed a manifest to comply the IATI standard structured (data) X Software term to refer to data from databases or marked-up data unstructured (data) X Software term used to refer to texts of webpages and documents
Contents of this report
Foreword
Summary
Abbreviations & Glossary
Chapter 1. Introduction
Chapter 2. Assignment – linking open data and open documents
2.1 Aid Activities
2.2 Aid Documents
2.3 Operationalisation of the Assignment
Chapter 3. Business Intelligence demonstrator: user facing preview
3.1 Activity description page
3.2 Document description page
3.3 Activities analysis page
Chapter 4. Business Intelligence in the Aid Domain
4.1 Integral Performance Management
4.2 Business Intelligence Cycle
4.3 Inmon-BI Use Cases & User Types
4.4 Organisational profiling as example
Chapter 5. Data profiling of the IATI XML sets
5.1 Variety in IATI XML sets
5.2 Volume in IATI XML sets
5.3 Dimensional modelling
5.4 Extract Transform Load investigations
Chapter 6. Technology for Variety and Volume
6.1 Document-oriented datastore CouchDB
6.2 Functions of CouchDB
6.3 Comparison to Relational Warehousing
6.4 Aggregation strategy in this project
6.5 System models
Chapter 7. Staging of Structured and Unstructured Data
7.1 Introducing the data flow
7.2 Staging Activities data
7.3 Staging Activities documents
7.4 Data Flow diagram
Chapter 8. Information Provisioning for Front-end Configuration
8.1 Information entities in JSON Activities
8.2 Information entities in JSON Reports
8.3 Providing Information Entities to a Front-end
8.4 Completing the Data flow diagram
Chapter 9. Business Intelligence demonstrator: developer configuration
9.1 Exhibit-Simile
9.2 Sourcing aggregate data from the information platform
9.3 Define a data graph in terms of collection dependencies
9.4 Configuring widget to control and view model properties
Chapter 10. Conclusions
References
Chapter 11. Phasing: planned and actual course of the project
11.1 Project Threads
11.2 Major deviations from expectations
11.3 Nine-squares model and choices made
Chapter 12. Reflection: connecting project execution to the HBO-i competences framework
12.1 HBO-i competences framework
12.2 HBO-i competences in this project
12.3 Soft skills
Appendix A. Technical references
Appendix B. Sudan Activity sets
Appendix C. Schemas landscaped
End of this report
Chapter 1. Introduction

This report describes a software development project in support of Aid Transparency for the Millennium Development Goals. The Millennium Development Goals (MDGs) are eight international development goals that 193 United Nations member states and at least 23 international organisations have agreed to achieve by the year 2015¹. The goals are to eradicate extreme poverty and hunger, to achieve universal primary education, to promote gender equality and empower women, to reduce child mortality rates, to improve maternal health, to combat HIV/AIDS, malaria, and other diseases, to ensure environmental sustainability, and to develop a global partnership for development.

In order to achieve those goals development aid donors fund aid activities, i.e. programmes, in recipient countries and regions. Unfortunately nobody sees the bigger picture of who is spending money on what and whether it has any effect: Aid spending and effects are not transparent. Therefore the International Aid Transparency Initiative (IATI) promotes a common format, the IATI standard, for sharing relevant information so that it will be easier to understand, compare and use [MakeAidTransparent, 2011]. Our project builds on this standard.

Aid Transparency fits within the movement of Open Data. This is data produced by governments while serving the public, paid for by our taxes. Open Data is published with limited legal restrictions in order to enhance transparency of governance. However, merely publishing raw data is not enough. Raw data needs to be turned into information, and information has to be combined to make decisions. This resembles the motives of Business Intelligence within commercial organisations. Business Intelligence (BI) is the directed process to collect and analyse data and apply the resulting information to govern an organisation.
In our project we are dealing, not with one particular organisation, but with a network of organisations. These network participants are exchanging data with each other, but each participant has to spend considerable effort and time on turning that data into information and combining this into intelligence that decisions can be based on. There is a gap between the ability to process data on the one hand and the speed, volume and variety with which data becomes available on the other hand: the so-called information gap [Beek, 2010].

Nivocer B.V.² is an intermediary party in the development aid network that provides services to address the information gap in Aid Transparency. Nivocer commissioned us with the following project: develop software middleware that can link structured Aid Activity data with unstructured Aid Activity documents, in order to support a rich picture of Aid Activities. In this report we describe the solution to this project following a business intelligence approach.

This project consisted of building three main cases: a business case, an information case and a technical case. The business case answers the question what Aid Intelligence means for the different organisational participants in the Aid Sector. The technical case answers the question how the data and documents provided by different participants can be integrated. The information case answers the question how data supply and intelligence demand can be loosely coupled. Loose coupling is an important design driver in software engineering: requirements are constantly shifting, while scalable solutions need a stable foundation. In our project too, analytical cases breed demand for ever more sophisticated combinations of information, while community buy-in of standardised publishing depends on simplicity and stability.
This is why our core focus has been on the information case, providing middleware between emerging practices of analysis and existing practices of publishing. The business intelligence application we have developed in this project acts as a front-end showcase to our middleware, while the data we use to feed our middleware is taken from existing datasets and documents.

¹ http://en.wikipedia.org/wiki/Millennium_Development_Goals
² http://www.nivocer.com

These project goals were managed using the model shown in Figures 1.1 and 1.2. Figure 1.1 shows the theoretical purpose of the model and Figure 1.2 shows its application to our project. The model shown in Figure 1.1 is the Maes nine-squares model of information systems development [Maes, 2004]. The columns in the model represent three organisational concerns: business, information and technology. The rows in the model represent three process levels: strategy, tactics and operations. The model captures a divide-and-conquer approach to information systems development in which the squares, representing pieces of the solution puzzle, mutually inform and constrain each other. Using the Maes nine-squares model we were able to express our three cases as constrained and informed by existing strategies and operations.

Figure 1.1 Maes nine-squares model of information systems development

Figure 1.2 Application of Maes nine-squares model for our project

Figure 1.2 shows how domain analysis and requirements framing progressed from the business column (left) to the technology column (right) and how an application was designed, implemented and delivered from the technology column back to the business column. The nine-squares model imposed an information system development structure on our relatively open-ended assignment. To understand the business purpose of the assignment we looked at business intelligence in the context of the Millennium Development Goals strategy for Development Aid Participant organisations. To understand the technical options for our assignment we looked at strategies to deal with data of a large volume and variety, both between documents and data and within IATI XML data-sets.

The contribution of this project is visualised in Figure 1.3 and Figure 1.4. At the start of this project the assignment was no more specific than to link up data and documents. During the project the seven remaining pieces of the puzzle were filled in iteratively and incrementally. This process is accounted for and reflected upon in Chapters 11 and 12. The deliverables of this project are presented looking back through the lens of the completed nine-squares model.

Figure 1.3 Situation at the start of the project

Figure 1.4 Situation at the end of the project

The report is structured as follows:
• Chapter 2 describes the assignment in more depth, explaining the nature of Aid Activities as represented in data and documents.
• Chapter 3 presents a preview of the user facing side of the demonstrator we have developed. This provides the reader with a concrete frame of reference to process the more abstract remainder of the report.
• How we arrived at our business and user requirements by building on Business Intelligence theory is explained in Chapter 4.
• Chapter 5 reports on the data profiling of the available IATI XML sets.
• The technology we selected for dealing with a large variety and volume of data and documents is described in Chapter 6.
• Chapter 7 illustrates the data staging process using the selected technology.
• Chapter 8 presents the provisioning techniques of our information platform.
• In Chapter 9 we explain the developer facing side of our demonstrator. This completes the circle to Chapter 3.
• The conclusions of our research and development are presented in Chapter 10.
• Chapter 11 describes the choices we faced during the project and the project threads we abandoned.
• Chapter 12 reflects on the technical and non-technical competences that were developed in this project.
• Lastly, appendices with technical references, data profiles and landscape schemas are provided.
Chapter 2. Assignment – linking open data and open documents

This chapter describes the assignment of this project. The project commissioned by Nivocer B.V. was to develop software middleware that can link structured Aid Activity data with unstructured Aid Activity documents, in order to support a rich picture of Aid Activities. The input for this project was the following structured and unstructured data:
1. Aid Activities data sets in XML
2. Result Reports about Aid Activities, mostly in MS Word docx format.

2.1 Aid Activities

An Aid Activity is the basic unit of reporting in IATI [IATI, 2011]. This is typically an individual programme or logical grouping of work in an Aid Participant organisation's budget. Each Activity is represented by an Activity record. This record has three main parts:
1. Who is involved, where and how?
2. What are the basic management details for the project?
3. What are the financial details?

Who is involved, where and how? | Example: Africa Biogas Partnership Programme
• What is the name of the reporting organisation? | Ministry of Foreign Affairs (DGIS)
• Which organisations are funding you? | Not applicable, this is a donor
• Which organisations are you funding? | Funding HIVOS in order to enable SNV to work with countries in the region South of the Sahara
• What is the nature of the funding relationship? | Untied (No obligation to purchase from donor economy)

What are the basic management details for the project?
• What is the IATI identification code for this project? | NL-1-PPR-18384
• Project name and description | DMW ABPP; Africa Biogas Partnership Programme
• What are the documents related to this project? | Not included, but intended for Result Reports
• What are the contact details for the project? | Not included, but intended for HIVOS contact
• What other projects are related to it? | Not included, but intended for other Activities
• What are the geographic details? | Not included, but intended for geo-coordinates
• What are the start and end dates? | Start: 2008, End: 2013
• What is the current status of the project? | Implementing
• What are the expected and actual results? | Not included, but intended for MDG-related indicators
• Which sector does the project contribute to? | Power generation/renewable sources
• What are the cross-cutting themes? | Biological Diversity, Combat Desertification, Gender Equality
• Are there terms and conditions? | ODA (Official Development Assistance)

What are the financial details?
• What are the total budgets for each financial year? | Commitment of 30M euro over the course of programme
• What type of aid is this? | Project type intervention
• What are the disbursements? | Diverse chunks of the commitment money transferred
• What are the financial mechanisms used? | Aid grant excluding debt reorganisation

Figure 2.2 Open Data: conceptual properties of an IATI Aid Activity record (technically provided in XML). Illustrated by the Aid Activity Africa Biogas Partnership Programme [SNV, 2009]

Figure 2.2 lists the properties of an Aid Activity record using an example programme [SNV, 2009].

Figure 2.1 This chapter describes the squares Open Documents and Open Data from the nine-squares model explained in chapter 1
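To make the shape of such a record concrete, the following minimal Python sketch reads a few of the properties from Figure 2.2 out of an XML fragment. The element names (`activity`, `identifier`, `theme` and so on) are simplified assumptions chosen for illustration and are not taken from the IATI schema; only the values come from the example programme above.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: element names are hypothetical simplifications,
# not the element names defined by the IATI standard.
SAMPLE = """
<activity>
  <identifier>NL-1-PPR-18384</identifier>
  <title>Africa Biogas Partnership Programme</title>
  <reporting-org>Ministry of Foreign Affairs (DGIS)</reporting-org>
  <status>Implementing</status>
  <sector>Power generation/renewable sources</sector>
  <theme>Biological Diversity</theme>
  <theme>Combat Desertification</theme>
  <theme>Gender Equality</theme>
</activity>
"""


def activity_to_dict(xml_text):
    """Flatten the direct children of the root element into a dict.

    A tag that occurs more than once (here: theme) becomes a list of values.
    """
    root = ET.fromstring(xml_text.strip())
    record = {}
    for child in root:
        text = (child.text or "").strip()
        if child.tag not in record:
            record[child.tag] = text
        elif isinstance(record[child.tag], list):
            record[child.tag].append(text)
        else:
            record[child.tag] = [record[child.tag], text]
    return record


record = activity_to_dict(SAMPLE)
print(record["identifier"], record["theme"])
```

Repeated elements are kept as lists because properties such as cross-cutting themes can carry several values for one Activity.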
2.2 Aid Documents

Result reports are short documents delivered by Dutch embassies describing the status of development related to the MDGs in a particular geographic area. There is a many-to-many relationship with Aid Activities as registered in the donor's transactional systems. They provide the context to obtain a richer picture about Activities. It is not the case that one Aid Activity is exactly covered by one Result Report. Reports follow a common structure (that is under revision) and are typically composed of the paragraphs listed in Figure 2.3.

Result Report on Sectoral Purpose in Geographic area | Africa Biogas Partnership Programme¹
Metadata about the embassy involved, the recipient country or region, the strategic goal, authors | To develop a biogas economy in countries South of the Sahara
Context description about the situation in the recipient country or region | People now mainly use wood and fuel to cook in their houses. This causes respiratory problems, deforestation or CO2 emission, costs a lot of preparation time for women and children …
Results and lessons learned | Biogas installations allow cooking on gas, with no smoke in the house. They also free up time for women to develop themselves. In addition dung is removed from living lots, improving hygiene. Biogas construction companies deliver employment, boosting local economies. This all contributes to a perception of social progress.
What went less well and why? | Biogas installations need careful maintenance, which has been the cause of failure in some cases. The product needs to be accompanied by a life cycle process.
What has been learned (process) | Don't give people installations, but teach them to build installations to create a sense of ownership. This creates sustainable economies.
Resources spent and Aid Activities involved as registered | These are typically the figures provided in the Open Data set, but not one on one. Most reports do contain references to Activity identifiers. The description for ABPP above is for illustration purposes, but reports are likely to be at the level of sectoral purposes, eg. renewable power sources.
Traffic light score about status of investment area | Ordinal scoring on a scale with values like On track, In danger to be off-track, Off-track

Figure 2.3 Open Documents: Typical paragraphs listed in a Result Report (technically provided as MS Word or Adobe pdf document). Illustrated by the Aid Activity Africa Biogas Partnership Programme taken from [SNV, 2009]

¹ [SNV, 2009]

2.3 Operationalisation of the assignment

During the feasibility phase of the project the assignment gravitated towards a conceptual middle ground present in both structured and unstructured data: information entities. An information entity in this project is defined as
• a domain concept that domain participants use to reason about the field,
• that captures a recurring set of properties and
• that can be manipulated by software to support discovery of quantitative and qualitative relationships.
This has resulted in the three assignment operationalisations listed below.
The development of software middleware to prepare linking between data and documents is operationalised in this project as follows and results in the Technical assignment:
Stage structured data and unstructured documents so they can be manipulated in a uniform way

The development of software middleware to pass on data as information is operationalised in this project as follows and results in the Information assignment:
Deliver an architecture and proof-of-concept software support to provide integrated access to Information Entities consumable by multiple types of visualisation front-ends

Support for a rich picture of Aid Activities is operationalised in this project as follows and results in the Business assignment:
Implement a demonstrator front-end that can visualise Information Entities in relationship to each other

The three operationalised assignments have led to the products Data Staging solution, Information Platform and BI application, depicted in the middle row of our nine-squares project model. The next chapter will present a preview of the user facing side of the BI application. This provides the reader with a concrete frame of reference in order to process the more abstract remainder of the report.
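The many-to-many relationship between Result Reports and Activities described in section 2.2 rests on the Activity identifiers that most reports mention. The Python sketch below shows one way such references could be collected into a two-way index. It is an illustration under stated assumptions, not the staging code of this project: the report names, the report texts, the identifier NL-1-PPR-20001 and the regular expression are all hypothetical, and only NL-1-PPR-18384 comes from the example above.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches identifiers shaped like NL-1-PPR-18384.
ACTIVITY_ID = re.compile(r"\bNL-1-PPR-\d+\b")

# Hypothetical report texts, invented for illustration.
REPORTS = {
    "report-energy": "Covers NL-1-PPR-18384 and NL-1-PPR-20001.",
    "report-gender": "Refers to NL-1-PPR-18384 only.",
}


def link_reports(reports):
    """Return (report -> activity ids, activity id -> report names)."""
    by_report = {}
    by_activity = defaultdict(list)
    for name, text in reports.items():
        ids = sorted(set(ACTIVITY_ID.findall(text)))
        by_report[name] = ids
        for activity_id in ids:
            by_activity[activity_id].append(name)
    return by_report, dict(by_activity)


by_report, by_activity = link_reports(REPORTS)
print(by_report)
print(by_activity)
```

Keeping both directions of the index means an Activity can list its reports and a report can list its Activities without a second scan of the texts.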
Chapter 3. Business Intelligence demonstrator: user facing preview

This chapter gives a user facing preview of the demonstrator developed in this project. The purpose of this preview is to create a concrete frame of reference for the more abstract remainder of this report. The front-ends that are shown are:
• Activity description page
• Document description page
• Activities analysis page

3.1 Activity description page

The activity description page offers a listing of the properties of an activity as they were described in chapter 2. In Figure 3.2 the properties for the activity “Africa Biogas Partnership Programme” are shown. Two aspects should be noted at this stage of reading. One, the raw XML format is presented as a human readable table. Two, the page contains a link to a corresponding report at the bottom of the page. In addition, a link to the raw XML is included at the top left of the page. The latter allows for inspection of technical metadata.

3.2 Document description page

The document description page (Figure 3.3) offers a raw listing of the text of a report. The original document as formatted by the publisher can be downloaded using the link at the top left of the page. Entities are marked up in the page, indicated in red. These entities link to a corresponding external Wikipedia page if one exists. It is expected that these entities will be linked to the Web of Data¹ in future versions. It is important to notice the bar with activity identifiers listed at the top. Each identifier links back to the corresponding activity. One report can cover more than one activity, and one activity can be covered by more than one report.

3.3 Activities analysis page

These two pages illustrate the basic way in which activities and documents are linked in our demonstrator application. Previously they existed only as disparate data sources.
Both the activity and document description pages can be reached from the activities analysis page, which is the main entrance of the application. On this page the analytical use cases are implemented. Currently they contain the raw building blocks for these use cases. Three examples are shown:
• Figure 3.5 shows a map displaying the number of activities in six countries that had transactions in the year 2001 (selected at the left)
• Figure 3.6 shows a timeline of the same activities, stretching a bar between start and end date
• Figure 3.7 shows a table with these activities (clicking on an activity jumps to its description page). The table also shows the aggregated transaction amount for each activity.

¹ http://en.wikipedia.org/wiki/Linked_data

Figure 3.1 This chapter describes the square Business Intelligence Application from the nine-squares model explained in chapter 1
The analysis page shows the analytical building blocks in the middle. Several tabs are visible at the top. Figure 3.5, Figure 3.6 and Figure 3.7 show the building blocks of three of those. In the left and right column filters are shown that correspond with the properties of an activity. Activities with common properties can be shown together using those filters. When a property value is selected, eg. the year is 2001, activities with transactions in 2001 will be shown. This will also cause the other filters to only show the property values of those activities. It is possible to select values in different filters consecutively, which will narrow down the shown set even more. Finding specific measures for properties is then done by selecting the specific tab at the top. This makes up the BI functionality. Note that the right column also contains filters for reports and related entities, which listen to selections in the other filters. Hence it is possible to use the analysis page for discovering links between activities and documents. Document descriptions are reached via the tab reports and related entities.
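The filter behaviour described above, where a selection in one filter narrows both the shown activities and the values offered by the other filters, can be sketched in a few lines of Python. This is an illustration of the idea only and not the Exhibit-Simile implementation; the activities, property names and values are invented for the example.

```python
# Hypothetical activities; property names and values are invented.
ACTIVITIES = [
    {"id": "A1", "country": "Sudan", "years": [2001, 2002], "sector": "Health"},
    {"id": "A2", "country": "Kenya", "years": [2001], "sector": "Energy"},
    {"id": "A3", "country": "Kenya", "years": [2003], "sector": "Health"},
]


def _values(activity, facet):
    """Return the values of one property as a list (single values are wrapped)."""
    value = activity[facet]
    return value if isinstance(value, list) else [value]


def select(activities, selections):
    """Keep the activities that match every selected filter value."""
    return [
        a for a in activities
        if all(wanted in _values(a, facet) for facet, wanted in selections.items())
    ]


def facet_values(activities, facet):
    """List the values a filter should still offer for the shown activities."""
    return sorted({v for a in activities for v in _values(a, facet)})


shown = select(ACTIVITIES, {"years": 2001})
print([a["id"] for a in shown])          # activities with transactions in 2001
print(facet_values(shown, "country"))    # countries still offered by the filter

narrowed = select(ACTIVITIES, {"years": 2001, "country": "Kenya"})
print([a["id"] for a in narrowed])
```

Selecting a second value simply adds a condition to the same selection, which is why consecutive selections can only narrow the shown set.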
Figure 3.2 Activity detail page for the Activity "Africa Biogas Partnership Programme" (Not all properties shown). The report link below leads to a corresponding Document description page.

Figure 3.3 Document description page. A document related to the Activity “Africa Biogas Partnership Programme” is shown. At the top links to Activity descriptions are provided. Entities are marked up in the page. This page can be reached from reports and related entities shown in Figure 3.4.

Figure 3.4 The activities analysis page with the tab related entities shown. Note that in the right column the report from Figure 3.3 is selected, causing the other filters to show related property values.

Figure 3.5 The activities analysis page with the tab countries map shown. Note that in the left column the year 2001 is selected, causing the other filters to show the property values of activities that have transactions in 2001.

Figure 3.6 The activities analysis page with the tab periods timeline shown. Note that in the left column the year 2001 is selected, causing the other filters to show the property values of activities that have transactions in 2001.
Figure 3.7 The activities analysis page with the tab activities shown. No filter selections are applied. Note the aggregated transaction values in the activities table. Other tabs host aggregated transaction values for different views.

Further calculations for showing different measures, eg. percentages, and visualisations, eg. norm-related color coding, will be added in future versions. At the time of showing the database contained a total of 125 activities. The industrial-scale version will contain hundreds of thousands of activities. One front-end page is not shown at this stage of reading, which is the fact slicing page. The fact slicing page is part of our aggregation strategy, which is explained in chapter 6.

The webpage front-ends² shown above are implemented using the Exhibit-Simile data visualisation framework [Huynh, 2007]. Chapter 9 will explain the developer side of these pages as they are configured from the information provided by our information platform. The remainder of this report describes the research and development process that has led to this implementation.

² The demonstrator can be found at www.michielkuijper.nl/iatidemo
Chapter 4. Business Intelligence in the Aid Domain

This chapter describes the business aspects of our project. Our project follows a business intelligence approach to frame its requirements. Business intelligence is defined as the directed process to collect and analyse data and apply the resulting information to govern an organisation. In our project we are not dealing with one particular organisation, but with a network of organisations. The business intelligence approach has given us three important theories to build on. These are:
1. Integral Performance Management
2. Business Intelligence Cycle
3. Inmon-BI use cases & user types

The reason why we discuss these in the context of a software development project is that they represent best practice frameworks for setting up a specific type of application: a BI application. They can be seen as templates for the software lifecycle phase Requirements Analysis and Design that can be used to build on the accumulated insight of BI practitioners.

4.1 Integral Performance Management

Integral Performance Management (IPM) is a methodology that links up a business strategy with business processes by means of Performance Indicators [Geelen, 2005]. A business strategy is derived from a business mission, the Why, and the business vision, the What. The business strategy is the way to achieve this, the How. In our project this “business” mission is to achieve the Millennium Development Goals as set out by the United Nations. In our project the vision is to achieve Aid Effectiveness. Our project addresses one particular thrust of the underpinning strategy: Aid Transparency. Transparency holds donor governments to account, commits them to results, helps improve the performance of aid agencies, decreases corruption and enables better planning and coordination amongst donor agencies¹.
IPM recommends defining a hierarchy of indicators to connect daily operations to a strategy. At the top of this hierarchy are Key Performance Indicators that represent the external state of affairs of an organisation. Lower in the hierarchy are Performance Indicators that represent the internal state of affairs of an organisation. The core idea is to connect these two views in order to align internal diagnostics to external market positioning. An organisation then knows what to change internally when the market changes. Indicators are typically calculations compared to an agreed target or norm provided by the strategy. They should be defined in a specific, measurable, attainable, realistic and time-bound fashion (S.M.A.R.T.). Such a definition will guide decision making, information design and data selection.

¹ http://www.aidmonitor.org/

Figure 4.1 This chapter describes the squares Millennium Development Goals and Aid Participant Organisation from the nine-squares model explained in chapter 1
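As an illustration of an indicator as "a calculation compared to an agreed target or norm", the Python sketch below scores a measured value against its target using the traffic-light labels mentioned for Result Reports in chapter 2. The function name and the two threshold ratios (0.9 and 0.7) are assumptions made for this example; in practice the norms would have to come from the strategy itself.

```python
def indicator_status(actual, target, ok_ratio=0.9, warn_ratio=0.7):
    """Compare a measured value with its target and return a traffic-light label.

    The default thresholds are illustrative assumptions, not agreed norms.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = actual / target
    if ratio >= ok_ratio:
        return "On track"
    if ratio >= warn_ratio:
        return "In danger to be off-track"
    return "Off-track"


# Hypothetical measurements against a target of 100 units.
for measured in (95, 80, 50):
    print(measured, indicator_status(measured, 100))
```

Expressing the norm as explicit parameters keeps the indicator definition measurable and makes it easy to see which threshold produced a given score.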
4.2 Business Intelligence Cycle

Indicators are part of a business intelligence cycle. This cycle is the feedback loop that connects strategy governance to daily operations. The BI cycle is defined at two levels [Beek, 2010]. The outer BI cycle consists of the phases Registration of Data, Processing Information and Reacting to Knowledge. Most organisations have automated systems in place to Register transactions. To be able to React, the registered data has to be Processed for that purpose.

Figure 4.2 Outer- and Inner BI cycle [Beek, 2010]

The Processing step is itself broken down into three main phases: Preparing Indicators, Analysing Facts and Distributing Decisions. This is often referred to as the inner BI cycle and forms the methodological process by which most BI programmes are structured (Figure 4.2). This methodology is useful as a divide-and-conquer approach, because there is a strong inclination to get bogged down in the diversity of the data and the preparation conflicts resulting from it. This is a main cause of failure for BI programmes.

Indicators are the pivot points between the Preparation and Analysis phases. The state of affairs that needs to be distilled from the data should be captured before preparation, as this informs the manner in which different sources should be filtered, combined and aggregated. Preparation should be done in service of the business strategy, which is operationalised in terms of indicators.

A typical sequence of activities in the Preparation phase is:
• Collecting data sources
• Filtering out low quality data
• Combining different sources in a common format and meaning
• Aggregating individual facts according to different views
• Visualising aggregations by means of graphs and charts to assess proportions and trends
• Interpreting patterns in terms of domain events
A typical sequence of activities in the Analysis phase is:
• Internalising the perceived patterns
• Adapting mental models and targets
• Checking the data and analysis again in this new light
• Augmenting with complementary data to increase analytical scale or scope

A typical sequence of activities in the Distribution phase is:
• Sharing insights with peers
• Materialising insights into new management principles
• Deciding what to do by seeking consensus
• Communicating and evangelising these decisions
• Anticipating events that are likely to occur again
4.3 Inmon-BI use cases & user types

Analytical use cases tend to resemble one of four types [Inmon, 1998]. These four types are associated with types of BI users with corresponding use cases:
1. Farmers, mostly users with a strategic role in an organisation with a predictable need for digested information, e.g. board members, stakeholders, government. In the Aid sector this type of role translates to Donor Policy Makers, Parliamentarians and Civil Society Organisation (CSO) representatives with a strong interest in Accountability: how is tax money being spent on Aid.
2. Tourists, mostly users with an operational role in an organisation with a predictable need for specific information. In the Aid sector this type of role translates to Recipient Line Ministry officials and Recipient Community Council members with a strong interest in available funds (aggregated from different donors) for specific sectoral purposes or local programmes.
3. Explorers, mostly knowledge workers in an organisation with a need to slice and dice through information and drill up and down aggregation levels. In the Aid sector this type of role translates to Non-Governmental Organisations (NGOs) and Civil Society Organisations (CSOs) with a strong interest in Aid Effectiveness.
4. Miners, mostly researchers interested in discovering trends and anomalies in information to explain past events and predict future ones. In the Aid sector this type of role translates to Academics with a strong interest in arrangements of Aid Management, for instance to study the difference between institutional and grass-roots approaches.

IATI aims to cater for all types of users but with different horizons of implementation [AidInfo, 2010]. The main driver has been to support Recipients, i.e. Inmon-Tourists, in obtaining an overview of available funds more swiftly.
This can support more agile forward planning and prevent delays in policy execution. A second driver has been to create more opportunities for CSOs and Parliamentarians, i.e. Inmon-Farmers, to hold their government to account. As a result, most progress in structured publication has been made in aligning the exchange of financial transactional data. The information necessary for typical Inmon-Explorers resides mainly in the unstructured Result Reports. Typical Inmon-Miners are currently still focused outside the IATI space on statistical data provided by organisations such as the Organisation for Economic Co-operation and Development (OECD), which also provide, for instance, information about disease levels in developing countries.
4.4 Organisational profiling as example

A good example of Inmon-based exploration is the use case of organisational profiling. Figure 4.3 shows a wireframe provided by Nivocer after the first prototype of our demonstrator was constructed. The figure shows a number of analytical building blocks expressing information about a filtered set of activities. In addition, a listing of related documents is shown. The figure shows how an organisational profile can be obtained by juxtaposing related activities and documents. This can serve to get a rich picture of the workings of an affiliated organisation in, for instance, an NGO country manager’s portfolio. This affiliated organisation might be a potential partner in new activities.

Figure 4.3 Nivocer wireframe showing analytical building blocks about a set of activities in an NGO country manager’s portfolio. The related activities and documents offer information to get a richer picture of the organisation’s workings.

This use case shows how analytical and descriptive information, when linked up, allow for richer discovery routes than analytical information alone. Organisations have large archives with unstructured information. How to disclose these documents is not always obvious without a clear use case. The IATI format offers analytical viewpoints that can be used to deliver these use cases. These viewpoints can be seen as structured queries on the contents of these documents. Using the structured analysis approach from BI to discover relevant information in unstructured documents therefore helps to disclose an organisation’s archives. The justification of our assignment to link up data and documents is therefore confirmed, but the use case also inspires new ways of contextualising business intelligence. Conversely, a predominantly hypertext-based approach to discovery gains more depth by adding analytical aggregations to purely descriptive information.

In this chapter we obtained an overview of the purpose and meaning of the available data and documents in the context of the Business assignment. The next chapter looks at the available data and documents from a technical perspective.
Chapter 5. Data profiling of IATI XML sets

In chapters 7, 8 and 9 a running example is used to explain the data flow from staging to application. This running example uses representative properties from the structured data sets and unstructured documents to abstract from the complexity of the full data sets and make clear the essence of the transformations. This chapter describes the aspects of variety and volume we encountered during the data profiling process of the full data sets in different stages of the project.

An index of IATI XML sets can be found at the IATI registry¹. The registry allows a look-up of IATI files according to:
• 2 File types (1407 Activity files, 11 Organisation files)
• 192 Recipient Countries & Regions
• 60 Publishers
• 7 types of organisation, of which the bulk is Governments

IATI-based publication is in its first iteration after version 1 of the standard was agreed in 2011. Currently IATI has 29 official signatories, organisations who signed up to the standard, but non-signatories have also published data. Although more than only Activities data is available, we decided to focus on Activities because they make up the bulk of the available data. A full specification of the IATI metadata in Activities can be found in [IATI, 2011]; Figure 5.2 shows a summary.

Figure 5.2 Summary of IATI metadata in Activities files

¹ http://www.iatiregistry.org/

Figure 5.1 This chapter describes the square IATI XML from the nine-squares model explained in chapter 1
5.1 Variety in IATI XML sets

In chapter 2 we have described the nature of the structured data and unstructured documents and concluded that they have a common middle ground in Information Entities. In chapter 3 we have seen that the analytical role of Information Entities is to provide views on data and documents. These views allow for filtering data and documents together and therefore enhance the Inmon exploration use case. It is therefore important that different data sources are combined into integrated information entities.

In a BI project, data profiling is used to audit the different sources available and to specify the transformations necessary to bring them into a common format and meaning. The following data quality aspects are typically addressed:
• Inconsistencies in spelling, e.g. organisation vs organization
• Inconsistencies in inner-field logic, e.g. 31-12-2012 vs 12/31/2012 for a date
• Inconsistent use of definitions, e.g. different calculations of net value after taxes
• Inconsistent values in different sources, e.g. different addresses of the same client
• Completeness of values, e.g. missing fields in a record
• Double entries, e.g. a client appears several times in one source
• Referential integrity, e.g. a client key in a transaction does not have a corresponding CRM entry
• Semantically unlikely values, e.g. a shoe size of 64

In our project many of these issues are pre-empted by the IATI standard, which provides a clear set of metadata concerning domain definitions, technical definitions and publication history definitions. Many of the consistency and integrity issues mentioned are taken care of before publication. Currently the biggest issue is the completeness of data, and especially the variety in completeness.
An investigation by the IATI programme office [IATI, 2012] shows that only the following core data elements are being provided by most publishers:
1. Activity period dates
2. Activity status
3. Participating organisations
4. Geography (not providing recipient country or region)
5. Sectors
6. Transactions, amount and dates, providing and receiving organisations

We also did our own investigation of Activity files. Here we used the activities classified against recipient country Sudan. We selected Sudan because it appeared to have activities associated with it involving many different donors. We found a spread of 132 data elements (including technical metadata; see Appendix Sudan) with an overlap of 8 data elements across all files. That is, for aggregated analysis only 8 data elements can be fully compared to each other across all Sudan sets. These data elements were:
1. Activity identifier
2. Activity title
3. Activity description
4. Participating Organisation name
5. Participating Organisation role
6. Transaction value
7. Transaction type
8. Transaction code
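The spread and overlap figures above can be obtained with simple set operations once the data elements used by each file have been listed. The sketch below illustrates the idea in Python; the file names and element inventories are hypothetical and far smaller than the Sudan sets.

```python
# Hypothetical inventories: the data elements found in each published file.
elements_per_file = {
    "publisher_a.xml": {"iati-identifier", "title", "description",
                        "sector", "transaction/value"},
    "publisher_b.xml": {"iati-identifier", "title", "description",
                        "transaction/value", "location"},
    "publisher_c.xml": {"iati-identifier", "title",
                        "transaction/value", "activity-status"},
}


def spread(inventories):
    """All data elements that occur in at least one file."""
    return set().union(*inventories.values())


def overlap(inventories):
    """Data elements that occur in every file and can be compared across all."""
    return set.intersection(*inventories.values())


print(len(spread(elements_per_file)), "distinct elements across all files")
print(sorted(overlap(elements_per_file)), "occur in every file")
```

Only the elements in the overlap support aggregated analysis across all publishers; everything else in the spread is available for some files only.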
The limited overlap has been the reason to focus, in the transformation flow outlined in chapters 7, 8 and 9, on the core data elements. One source of variety between publishers comes from the data elements they publish, as shown above; another comes from the cardinality between the data elements. During investigation of the Sudan sets, differences were found between publishers in:
• The number of Activity records in a published Activities file
• The number of Transactions per Activity
• The number of attributed Sectors per Activity, determining the weight of Transactions

ETL scripts are BI tool jobs strung together to transform different data elements to a common format. The variety described here has made it difficult to transform the data solely on the basis of its abstract schema. Additional logic is required to determine the exact number of links between elements. This makes the ETL jobs quite elaborate, complex and hard to combine in a well-coordinated script.

5.2 Volume in IATI XML sets

During our data profiling research we discovered that the structured data sets are published in different sizes, ranging from sets with one activity to sets with thousands of activities. The number of accumulated activities is in the order of tens of thousands. An analysis of the IATI registry shows that at the time of writing some 1400 Activity files have been published in the first iteration after publication started in 2011. OpenSpending [OpenSpending, 2011] transformed these Activity files to weighted transactions, i.e. one line for each transaction per sector per activity. This has produced some 450,000 transactional rows. The CSV set weighs around 600 MB and takes about 3 minutes to open on a consumer desktop. Estimated from the Activity files we have investigated, one Activity record contains 10 unweighted transactions on average.
Dividing the 450,000 weighted rows by this average suggests that in the order of 45,000 individual Activity records have been published in the first iteration. Because weighted rows repeat a transaction for each attributed sector, this figure is an upper estimate rather than a minimum. It should be expected that this number will grow in the next iterations and contribute to the accumulation.

The combination of variety and volume posed a barrier to getting the data quality investigations started: to investigate which types of variety were present, Activity files had to be loaded in bulk in our data analysis tools. Due to the volume of the sets the performance of these tools was low. This hampered learning which parts were variant and which parts were common. Disclosing the data sets for aggregated analysis thus faces two problems that reinforce each other. Analysts acting on their own would face these same problems. Our information platform lowers the barriers [Cusumano, 2010] for them by preparing the separation of variety and volume. It does this by deploying dimensional modelling.

5.3 Dimensional modelling

In transactional systems it is best practice to represent data in a normalised format. Normalisation is the process of organising the fields and tables of a relational database to minimise redundancy and dependency². This prevents maintenance problems. In analytical systems data is represented in a dimensional format. This format is optimised for aggregated calculations. To model data for aggregation analysis we build on Kimball’s dimensional modelling method [Kimball, 2004]. The focus of a dimensional model is the fact table. Every line in the fact table is a registration of a measured event. In our case this is either an activity or a transaction. The viewpoints or dimensions are determined by asking questions such as who is involved, what is it about, where does it take place, when does it take place and why does it take place [Linden, 2012].

² http://en.wikipedia.org/wiki/Database_normalization
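To make the distinction between facts and dimensions concrete, the sketch below models a very small transaction fact table in Python. It is only an illustration under stated assumptions: the donors, countries, keys and amounts are hypothetical, and the structure is much simpler than the target model of the project.

```python
from collections import defaultdict

# Dimension tables answer "who" and "where"; keys point to descriptive content.
donors = {1: {"name": "Donor A"}, 2: {"name": "Donor B"}}
countries = {10: {"name": "Sudan"}, 11: {"name": "Kenya"}}

# Fact table: one row per measured event (here a transaction), holding only
# dimension keys and measurement values.
transaction_facts = [
    {"donor_key": 1, "country_key": 10, "amount": 1000.0},
    {"donor_key": 1, "country_key": 11, "amount": 500.0},
    {"donor_key": 2, "country_key": 10, "amount": 250.0},
]


def total_by(facts, key_field, dimension):
    """Sum the amount measure per member of one dimension."""
    totals = defaultdict(float)
    for fact in facts:
        member = dimension[fact[key_field]]["name"]
        totals[member] += fact["amount"]
    return dict(totals)


print(total_by(transaction_facts, "country_key", countries))
print(total_by(transaction_facts, "donor_key", donors))
```

The same fact rows serve every viewpoint: changing the dimension passed to `total_by` changes the question being asked without touching the facts.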
Figure 5.3 Target dimensional model in our project. Two fact tables: an Activity fact table with grain “An activity per policy per sector” and a Transaction fact table with grain “A transaction per sector”.

Kimball’s method promotes the clear separation of facts and dimensions. Facts are entities with measurement values associated with them that can be studied in aggregate. Dimensions are entities that look at facts from specific viewpoints. Often facts are analysed against two dimensions at one time, for instance the number of Activities sponsored by a Donor in a specific period or the accumulated Committed investment for a Recipient Country in a specific period.

Kimball’s method optimises facts for calculation by externalising dimensional weight: dimensions are referenced by means of lightweight keys that point to entities with heavier descriptive content. This rationale stems from a limitation in computational and memory capacity on data warehouse servers, which has since been overcome by new hardware developments. However, the modelling approach is now becoming relevant again due to in-browser and on-device processing requirements. In our case too we expect browser-based and mobile-based analysis to become more popular, since analysis is made relevant by sharing it with peers and made useful by having it available on location.

Figure 5.3 shows our target dimensional model for Aid Activities. In the explanation of the data flow process we will use a simplified version of this model to explain the representative transformation steps involved. This simplified data model is given in Figure 7.2. In the target dimensional model two fact items are visible: an Activities fact item and a Transaction fact item. The Activities fact item has three measurement values: a fact count, a sector percentage and a policy significance.
The fact count is used to optimise calculation for counting the number of Activities per Dimension. The sector percentage is included because each Activity can contribute to one or more Sectors, as shown in Figure 5.4. The policy significance is present because each Activity can be seen to implement one or more Policies, also shown in Figure 5.4. The inclusion of the sector and policy weighting in the Activity fact item ensures that calculations against the Sector and Policy dimensions respectively can be done quickly. Consequently, the grain of the Activity fact item is one Activity per Policy per Sector.

The Transaction fact item has four measurement values: a fact count, an amount, a sector percentage and a percentual amount (the amount weighted by the sector percentage). The fact count is used to optimise calculation for counting the number of Transactions per Dimension. The amount is used to calculate the total amount per Dimension. The sector percentage is included because a Transaction belongs to an Activity and each Activity can contribute to one or more Sectors, as shown in Figure 5.4. The inclusion of the sector weighting in the Transaction fact item ensures that calculations against the Sector dimension can be done quickly. The percentual amount is used to calculate the total percentual amount per Dimension. Consequently, the grain of the Transaction fact item is one Transaction per Sector.

5.4 Extract Transform Load investigations

Figure 5.4 shows selected elements of the IATI Activity as it is published. These elements are selected on the basis of the challenges they pose in transforming to a dimensional model. One of the challenges is that an Activity can contain one or more instances of a dimensional type such as Sector and Policy. The implication for the transformation is that one cannot assume that the number of dimensional elements per activity is constant across published sets. In our investigations we encountered a set with one sector per activity, but also a set with several sectors per activity. Within sets these numbers can also vary per activity. Another challenge is the indirect dependency between the transactional values and the sectoral attributions.
Both the number of transactional elements and the number of sectoral elements can vary together within one Activity.

Figure 5.4 Selected elements in an Activities set as published by IATI signatories. Elements are selected on the basis of challenges in transforming to a dimensional model

The power of ETL scripts is that the transformation requirements discovered in a subset of the sources can be transferred to the remainder of the sources. The strategy to deal with hierarchical relationships between data elements is to decompose the elements so that each element has a fixed schema. On the basis of this fixed schema, transformation jobs can be specified and repeated. Due to the variety described above our decomposition needed conditional logic to check the number of dimensional elements per activity. This logic is required to determine the exact number of instances involved in one-to-many and many-to-many relations between nodes.

Our technology choice for transformation has been CouchDB, which is described in the next chapter. We investigated two transformation routes before deciding on CouchDB: JasperETL and Google Refine [SE2][SE3].

JasperETL is the ETL component of the JasperSoft BI suite. The ETL requirements described above, in combination with the storage and data typing constraints of the JasperETL solution, made ETL in JasperSoft a laborious and time-consuming process [SE2]. Therefore we concluded JasperETL was not a suitable candidate to automate ETL for IATI Activities.

Google Refine is a data cleansing and transformation tool using object-oriented design behind the scenes. Using Refine we were able to speed up the ETL process in comparison to JasperETL because it does not require tedious storage management. Refine transforms the XML node hierarchy into nested records. The transformation strategy then entails transposing rows to columns and filling out parent elements over all rows [SE3]. The transposition assumes a fixed number of rows to be transposed. Here again the variety prevented reuse of the transposition specifications: because the number of elements varies, the number of columns that have to be transposed is not the same for each record. Therefore we concluded Google Refine was not a suitable candidate to automate ETL for IATI Activities.

In the case of both JasperETL and Google Refine we did not investigate the usage of an XSLT stylesheet³.
An XSLT stylesheet might have been able to support us in automating the conditional logic referred to above. By that time, however, we had embarked on our experiments with CouchDB and committed ourselves to this route. The advantage of CouchDB for dealing with the variety described above is that it is designed to loop over hierarchical data objects. Therefore we concluded CouchDB was a suitable candidate to automate ETL for IATI Activities.

In this chapter we have seen that the IATI XML sets are characterised by a large variety and volume. In the next chapter we will describe how the selected technology CouchDB deals with this variety and volume.

³ XSLT: Extensible Stylesheet Language Transformations; a declarative, XML-based language used for the transformation of XML documents. http://en.wikipedia.org/wiki/XSLT
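The decomposition discussed in section 5.4 can be sketched as a loop over a nested activity record that produces one fixed-schema row per transaction per sector. The Python fragment below is an illustration only, not the project's transformation code: the field names are hypothetical, and splitting the weight evenly when a sector carries no percentage is an assumption made for this sketch.

```python
def weighted_transaction_rows(activity):
    """Return one fixed-schema row per transaction per sector of an activity."""
    # Conditional logic for variety: an activity without sectors is treated
    # as having a single unspecified sector (assumption of this sketch).
    sectors = activity.get("sectors") or [{"code": None}]
    # Assumption: sectors without an explicit percentage share the weight evenly.
    even_share = 100.0 / len(sectors)
    rows = []
    for transaction in activity.get("transactions", []):
        for sector in sectors:
            percentage = sector.get("percentage", even_share)
            rows.append({
                "activity_id": activity["id"],
                "sector_code": sector["code"],
                "amount": transaction["value"],
                "sector_percentage": percentage,
                "weighted_amount": transaction["value"] * percentage / 100.0,
            })
    return rows


# Hypothetical activities with differing numbers of sectors and transactions.
activities = [
    {"id": "A-1",
     "sectors": [{"code": "111", "percentage": 60.0},
                 {"code": "122", "percentage": 40.0}],
     "transactions": [{"value": 1000.0}, {"value": 500.0}]},
    {"id": "A-2",
     "sectors": [{"code": "140"}],
     "transactions": [{"value": 200.0}]},
]

rows = [row for activity in activities
        for row in weighted_transaction_rows(activity)]
for row in rows:
    print(row)
```

Because the loop adapts to however many sectors and transactions an activity holds, the same function applies to every published set, whereas a transposition with a fixed number of columns does not. Summing `weighted_amount` over all rows gives the same total as summing the original transaction values, as long as the percentages of each activity add up to 100.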
Chapter 6. Technology for Variety and Volume

In this chapter we describe the technology we have selected on the basis of the variety and volume of data and documents we discovered. In chapter 5 we have profiled the IATI XML sets. Overall two main types of variety were distinguished:
1. documents vs data
2. variety between XML data sets

The volume of the data resides on two levels:
1. the number of data sets and documents
2. the size of (some of) the data sets

In this project the document-oriented datastore CouchDB has been chosen as our primary data processing technology. Derived from this we chose Python as a pre-processing technology and Exhibit-SIMILE as a post-processing demonstrator technology. Before we got to this stage we experimented with and investigated different options. These investigations are described in chapter 11. In this chapter we explain our selected data staging and information platform technology CouchDB. Chapters 7, 8 and 9 will give detailed accounts of an exemplary data flow.

6.1 Document-oriented datastore CouchDB

CouchDB is a document-oriented store. A CouchDB document stands for a set of key-value pairs and can therefore also be seen as an Object in the object-oriented sense of the word. Each value in a document carries its own semantic declaration in the key it is associated with. The syntactic declaration is the same for all keys: they are strings. Values are usually also strings or numbers, but can be objects or arrays of objects as well. This differs from relational database technology, where the meaning of the values is captured in a schema and columns can be of various data types. In a CouchDB document the schema is generic in the sense that it requires valid key-value pair combinations of strings, objects of strings or arrays with objects of strings.
The specific implementation complies with the JavaScript Object Notation (JSON¹). CouchDB was designed to work with web browsers, most of which make intensive use of JavaScript². This design choice makes storage in CouchDB very accessible and frees developers from tedious storage management. This greatly accelerates staging and provisioning processes.

The name CouchDB is an acronym for “cluster of unreliable commodity hardware”, because it is designed to scale out over a large set of elementary machines. This must be understood in comparison to relational database techniques that adopt a scaling up approach: deploying more powerful machines. The scaling out approach addresses the limitations of the scaling up approach. These are caused by the complexity and intensity of the coordination required to distribute a relational database across several machines. The scaling out strategy comes with a different development approach. Tasks must be specified in simple jobs that can be farmed out in parallel over a large set of machines. An abstraction layer has been developed that shields programmers from dealing with the coordination of these parallel tasks, often referred to as the MapReduce approach [Dean, 2006]. MapReduce functions make it easy to map documents onto inverted indexes of attributes, which can then be aggregated using reduce operations. MapReduce therefore implements BI requirements but uses a different combination of content and logic abstractions than a conventional relational application. More details are explained below.

The three main advantages of CouchDB for our project are that:
1. We use its generic schematisation to capture sources with diverse structures in a common staging area very early in our transformation process. This prevents us from having to invest heavily in data typing and storage management, and allows for flexible experimentation and agile adaptation. This helps us to address the variety problem.
2. We can use its MapReduce functionality to precompute inverted indexes representing the analytical views we want to provide to BI applications. Because this allows for a scaling out approach, it anticipates a large volume growth of IATI data sets. This helps us to address the volume problem.
3. Because of its web-based design we can use its replication capabilities to easily share data and logic between community participants.

¹ http://en.wikipedia.org/wiki/JSON
² http://en.wikipedia.org/wiki/JavaScript

Figure 6.1 This chapter describes the square technology for Variety & Volume from the nine-squares model explained in chapter 1

6.2 Functions of CouchDB

The basic building block of CouchDB is a JSON document. This is the CouchDB equivalent of a database record. With these documents web applications can be built using a number of functions.
These are:
• A Design document as a web application project, also called a CouchApp
• Views on JSON documents using MapReduce functions, providing property indexes to documents
• List functions for customising views in a client-specific format
• Show functions to present individual user documents using templates and directives
• Update handlers to pre-process content into JSON
• Assets such as HTML pages, JavaScript libraries and multimedia files held in a design document

6.2.1 Design documents

A design document in CouchDB acts as a web application project and holds all view, list, show and update functions for one specific web application. A design document can easily be exchanged between CouchDB instances. A CouchDB web application can be seen as a two-tier web application. Relational web applications often have three tiers: database, application logic and front-end. These three architectural components are all provided by CouchDB: JSON documents as the equivalent of the database; update, view, list and show functions as the equivalent of the application logic; assets and templates as the equivalent of the front-end. Because a browser interacts only with CouchDB for both data and logic, it is seen as a two-tier application.

6.2.2 Views

A view must be seen as the CouchDB equivalent of behind-the-scenes business logic in a web application. A view consists of the materialised output of a MapReduce function and provides a precomputed inverted index of properties to documents. These indexes are cached for performance. In our information platform views are used to precompute analytical collections of information entities. Figures 8.4 and 8.8 in chapter 8 are examples of MapReduce functions generating views.
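The idea of a view as an inverted index with an optional aggregation can be illustrated outside CouchDB. The sketch below simulates the map and reduce steps in plain Python; it does not use the CouchDB API, and the documents, field names and function names are hypothetical. The project's actual view functions are the ones shown in chapter 8.

```python
from collections import defaultdict

# Hypothetical JSON-like documents in a staging database.
documents = [
    {"_id": "A-1", "type": "activity", "sectors": ["Health", "Education"]},
    {"_id": "A-2", "type": "activity", "sectors": ["Health"]},
    {"_id": "R-1", "type": "report"},
]


def map_activities_by_sector(doc):
    """Map step: emit one (key, value) pair per sector of an activity."""
    if doc.get("type") == "activity":
        for sector in doc.get("sectors", []):
            yield sector, 1


def reduce_count(values):
    """Reduce step: aggregate the values emitted for one key."""
    return sum(values)


def build_view(docs, map_fn, reduce_fn):
    """Build an inverted index (key -> document ids) and its reduced form."""
    index = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            index[key].append((doc["_id"], value))
    reduced = {key: reduce_fn([value for _, value in entries])
               for key, entries in index.items()}
    return dict(index), reduced


index, counts = build_view(documents, map_activities_by_sector, reduce_count)
print(index)
print(counts)
```

The map step only ever looks at one document, which is what allows the work to be distributed; the reduce step then aggregates per key, here counting activities per sector.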
6.2.3 List functions
The MapReduce functions in a view emit, i.e. output, rows of key-value pairs. List functions allow the developer to post-process those rows. In our platform we use list functions to turn the elements in these rows into front-end specific formats. A list function can be seen as the CouchDB equivalent of a servlet that formats the response. Figures 8.10 and 8.12 in chapter 8 are examples of list functions that present views as Exhibit-Simile data items.

6.2.4 Show functions
Show functions are used to present single JSON documents. For this, templates can be applied that make use of directives. This means that the templates hold marked-up variables that are filled in at runtime. They can be seen as the CouchDB equivalent of Java Server Pages. In our platform we use the Mustache³ directive framework, which is recommended by the CouchDB community. The demonstrator pages shown in chapter 3 are all templates with directives that refer to either JSON documents (an activity, a report or a specification document) or HTTP Request object variables (view names). These are rendered as HTML pages that load data from list functions to fill the page with data, as shown in Figure 9.2 in chapter 9. The precise interaction behaviour is visualised in Figure 6.9 below.

6.2.5 Update handlers
An update handler pre-processes a submitted document before storing it as JSON. It can be seen as the CouchDB equivalent of a servlet handling HTTP PUT and POST requests. In our platform we use update handlers to transform single XML activity records into JSON documents. This is shown in Figure 7.6 in chapter 7.

6.2.6 Assets
Assets are HTML pages, JavaScript libraries and multimedia files used in the web application.

6.3 Comparison to Relational Data Warehousing

ETL process                | Relational (e.g. PostgreSQL, MySQL)           | CouchDB
Collect data sources       | Make a local copy or retrieve network address | Make a local copy or retrieve network address
Extract data from source   | Use appropriate connector                     | Pre-processing script and/or update handlers
Transform data             | ETL jobs                                      | MapReduce/Recline⁴
Load data in warehouse     | ETL jobs                                      | Not applicable
Optimise aggregated access | Create indexes                                | MapReduce
Provide to client          | SQL/MDX⁵ + application server                 | List and show

Figure 6.2 ETL tasks mapped onto typical relational BI environment and CouchDB

3 http://en.wikipedia.org/wiki/Mustache_(template_system)
4 Recline is the CouchDB version of Google Refine, a data cleansing and transformation tool. This was not used in our project. github.com/maxogden/recline
5 MDX: Multi-dimensional expressions, a specialized syntax for querying and manipulating the multidimensional data stored in OLAP cubes. en.wikipedia.org/wiki/MultiDimensional_eXpressions

A data warehouse is a subject-oriented, integrated and time-dependent database with relatively static data. Its technical purposes are to integrate disparate data sources, to reconcile them, to optimise the analysis of large volumes of data in order to improve query response time, to report in a flexible way, to build a historical record and to relieve operational systems [Kimball, 2004]. A data mart is comparable to a data warehouse, but usually holds a smaller amount of data and is pre-structured for a specific purpose. A data mart is often compared to a distribution centre that is used to bring a selection of products closer to the consumer. A data mart offers more direct possibilities to cater to specific information needs. A data cube is very similar to a data mart, but this term is often used for vendor-specific solutions [Kimball, 2004].

CouchDB serves very similar purposes in our project. The main differences lie not so much in the steps that are applied in the ETL process (Figure 6.2), but in the balance of freedom between schema and logic. In relational applications schematisation is very strict, but the logic interacting with these schemas can be freely defined. In CouchDB schematisation is very loose and only a few logical function types can be applied. These functions are powerful because they reside at a more abstract development level than standard web application logic. Tedious storage management concerns have been optimised and encapsulated in the CouchDB engine.

Figure 6.3 Conceptual difference between Relational approach (left) and CouchDB approach (right)

Figure 6.3 shows that CouchDB, as an example of the non-relational approach [Madsen, 2012], incorporates two architectural changes: using the file system directly and separating concurrent logic from strategic application logic. The first is a reaction to the need to deal with web page processing. By cutting out the database layer, distribution management becomes easier.
Combined with a drop in hardware costs and an increase in the processing power of commodity machines, this makes a scale-out approach possible. The second architectural change is a software abstraction that shields programmers from dealing with concurrency issues. Building on use cases for the web, application logic is framed in a high-level Strategy pattern [Gamma, 1995]. Formally speaking, the Strategy pattern defines a family of algorithms, encapsulates each one, and makes them interchangeable. Strategy lets the algorithm vary independently from the clients that use it. In this way CouchDB updates activity data and activity documents as JSON documents, so that they can be transformed by MapReduce functions into JSON entity aggregates. List functions provide these aggregates to different BI clients.
6.4 Aggregation strategy in this project
Aggregation in CouchDB is done using Reduce functions. When the index consists of an array of keys (complex keys⁶), the associated values computed by the Reduce function can be aggregated according to the position of the key in the array. For example, if we map Activity documents onto a view with an array index as shown in Figure 6.4, the Reduce function also precomputes the number of activities at the aggregation level region. At this level the numbers of activities per country are summed into totals per region. By calling Reduce with a different group level value, behaviour similar to drill-up and drill-down in BI OLAP applications can be emulated.

[region, country] -> #Activities
Figure 6.4 Example of a complex key, i.e. an array with two simple keys

CouchDB's API simplicity and speed come from its linear access to precomputed inverted indexes, i.e. views. In these views keys (single or arrayed) are always sorted in Unicode order. Care has to be taken to design the order in the key array correctly, so that the desired range of aggregations can be accessed. A CouchDB equivalent of a data mart is made by precomputing the desired permutations of a dimensional key set. The general template that was adopted in this project is shown in Figure 6.5.
Key array -> value
[dimensional keys permutation, fact key, roll-up dimensions] -> measurement

Example permutations
[recipient, donor, policy, sector, activity-identifier, cost type, year, month, day] -> Amount
[sector, recipient, donor, policy, activity-identifier, cost type, year, month, day] -> Amount
[policy, sector, recipient, donor, activity-identifier, cost type, year, month, day] -> Amount
[donor, policy, sector, recipient, activity-identifier, cost type, year, month, day] -> Amount
Figure 6.5 Practice adopted in this project to design views

Each permutation allows the fact set to be sliced according to a specific sorting order; four of these orders are shown in Figure 6.5. Providing a slice of the fact table prevents flooding the client memory with the complete fact set. Dimensions in the permutation cannot be rolled up in the aggregation, because for a specific slice the client logic needs all dimensional values to support front-end filtering (shown in Figures 3.4 to 3.7 in chapter 3). The fact key serves to bring in the associated fact. The roll-up dimensions can be used to aggregate the measurements according to a specific reduce level, e.g. by cost type, by cost type and year, by cost type and year and month, etc. Such an aggregation strategy allows the front-end to fetch a slice of the fact table using a primary view specification (the first key in the order of keys) and a range parameter (startkey and endkey). An example is shown in Figure 6.6.

Example
/buza_iati/_design/activities/_list/item_transaction_by_provorg/fact_transaction_by_provorg?group=true&startkey=["NL-1"]&endkey=["NL-1\ufff0"]
Template
/<database>/<appqualifier>/<appname>/<listqualifier>/<listname>/<viewname>?<aggregate>&<range>
Figure 6.6 API call to request a slice of the fact table containing all facts for a primary dimension. In the example a slice of the fact table is requested containing all facts from the provider NL-1 (Dutch Ministry of Foreign Affairs).
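The roll-up over the trailing key positions can be sketched in Python as follows. This is an emulation of the principle only, under the assumption of a shortened key array [recipient, donor, activity-identifier, cost type, year, month]; in CouchDB itself the same effect is obtained by querying a view that has a reduce function with the `group_level` parameter. The function name `reduce_sum` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each row is (key array, amount); the values are made up for illustration.
Row = Tuple[list, float]

rows: List[Row] = [
    (["KE", "NL-1", "act-1", "commitment", 2011, 3], 100.0),
    (["KE", "NL-1", "act-1", "commitment", 2011, 7], 50.0),
    (["KE", "NL-1", "act-1", "disbursement", 2012, 1], 25.0),
    (["TZ", "NL-1", "act-2", "commitment", 2011, 5], 10.0),
]


def reduce_sum(view_rows: List[Row], group_level: int) -> Dict[tuple, float]:
    """Sum the amounts of all rows that share the first `group_level` key parts."""
    totals: Dict[tuple, float] = defaultdict(float)
    for key, amount in view_rows:
        totals[tuple(key[:group_level])] += amount
    return dict(sorted(totals.items()))


by_recipient = reduce_sum(rows, group_level=1)   # drill-up to the primary dimension
by_cost_type = reduce_sum(rows, group_level=4)   # keep all dimensions, roll up the dates
by_year = reduce_sum(rows, group_level=5)        # drill-down one level further
```

Raising the group level corresponds to drilling down, lowering it to drilling up. The dimensional keys at the front of the array stay intact at level 4 and higher, which is what allows the front-end to keep filtering on them.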
6 http://wiki.apache.org/couchdb/View_collation#Complex_keys
The structure of the URL must be understood as follows. The qualifier "_design" is a prefix that refers to the CouchDB design document named after it, which holds the Update, MapReduce and List functions. The qualifier "_list" is a prefix that refers to the CouchDB list function named after it. A list function processes the output of a MapReduce function. The CouchDB name for the output of a MapReduce function is a View. The reference to a View is placed after the reference to a List function. In the example in Figure 6.6 a list named item_transaction_by_provorg is requested, based on the view named fact_transaction_by_provorg from the design document named activities hosted in the database buza_iati. The list should return values aggregated per complete key, as indicated by the parameter group=true. In addition the list should return not the full set, but only the range in which the first key in the array starts with "NL-1" and does not sort after "NL-1\ufff0" in Unicode order. This ensures that all facts in the view fact_transaction_by_provorg of the providing organisation NL-1 are returned. The facts will contain keys to all dimensional items specified in the view. Within the slice containing the primary dimension provider organisation, all other associated dimensions can be used to filter aggregated values of the facts in the front-end. So defining the same views on the fact table with different key orders allows us to keep the volume for the front-end in check while still providing all related dimensions for all facts containing a primary dimension. In the demonstrator we have hidden this complexity in the slice selection entrance for the application shown in Figure 6.7.

Figure 6.7 (Primary) Viewpoints page in the demonstrator. Five primary viewpoints are currently available: countries (shown), receiving organisations, sectors, funding organisations, and transactional year.
The example shown in Figure 6.6 is hosted in the tab funding organisations.
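As an illustration of the URL template in Figure 6.6, the following Python sketch assembles such a request path from its parts. The helper name `list_slice_url` is hypothetical. Unlike the readable form printed in the figure, the sketch percent-encodes the JSON range parameters, and `json.dumps` writes the upper-bound sentinel as the escape sequence `\ufff0`, which a JSON parser reads back as a single character.

```python
import json
from urllib.parse import urlencode


def list_slice_url(database: str, design_doc: str, list_name: str,
                   view_name: str, primary_key: str) -> str:
    """Build the path and query string for one slice of a view, served via a list.

    The slice covers every key array whose first element starts with
    `primary_key`; the high Unicode character U+FFF0 acts as the upper bound.
    """
    params = {
        "group": "true",
        "startkey": json.dumps([primary_key]),
        "endkey": json.dumps([primary_key + "\ufff0"]),
    }
    path = f"/{database}/_design/{design_doc}/_list/{list_name}/{view_name}"
    return f"{path}?{urlencode(params)}"


url = list_slice_url(
    database="buza_iati",
    design_doc="activities",
    list_name="item_transaction_by_provorg",
    view_name="fact_transaction_by_provorg",
    primary_key="NL-1",
)
```

Keeping the range construction in one helper means a front-end only has to supply the primary dimension value selected by the user.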
6.5 System models
Figure 6.8 and Figure 6.9 show the system models [Sommerville, 2009] that illustrate how CouchDB works together with the pre-processing and front-end components. Figure 6.8 shows the architectural system model and Figure 6.9 shows the behavioural system model.

Figure 6.8 Architecture diagram of Data Staging solution, Information Platform and Business Intelligence Application working together

In the architecture model three components are delineated: Data Staging, Information Platform and Business Intelligence application. It can be seen that CouchDB acts partly as the Data Staging area and partly as the Information Platform. In the current set-up both IATI XML data and IATI documents are collected on a local file system. In the future it is anticipated that these will be retrieved from the IATI registry by means of web services. The XML data is split into individual activities using a Python script, denoted as Python splitter. Each activity is loaded into CouchDB, where it is first converted to a JSON document using an external plug-in, denoted as XML update handler. The activity documents, i.e. Result Reports, in DOCX format are stripped of their presentation and styling content by a Python script, denoted as Python extractor. The resulting plain text is sent to the semantic annotation web service Open Calais, using another Python script, denoted as Python Calais connector. The output of the Calais service is a JSON version of the report with identified entities included as metadata. In addition the activity identifiers are extracted from the text using regular expressions in the Python script. These are added to the JSON document. This document can be saved directly into CouchDB.
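The splitting step can be sketched as follows. This is not the project's actual Python splitter but a minimal illustration, assuming an IATI file whose root element `iati-activities` contains `iati-activity` elements that each carry an `iati-identifier`; the function name `split_activities` and the sample identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Made-up sample input for illustration only.
SAMPLE = """<iati-activities>
  <iati-activity>
    <iati-identifier>NL-1-0001</iati-identifier>
  </iati-activity>
  <iati-activity>
    <iati-identifier>NL-1-0002</iati-identifier>
  </iati-activity>
</iati-activities>"""


def split_activities(xml_text: str) -> List[Tuple[str, str]]:
    """Split an IATI file into (identifier, single-activity XML) pairs."""
    root = ET.fromstring(xml_text)
    records = []
    for activity in root.iter("iati-activity"):
        identifier = (activity.findtext("iati-identifier") or "").strip()
        # tostring also serialises trailing whitespace after the element, so strip it.
        fragment = ET.tostring(activity, encoding="unicode").strip()
        records.append((identifier, fragment))
    return records


records = split_activities(SAMPLE)
```

Each fragment could then be submitted to CouchDB as a separate record, with the identifier available as a candidate document id.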
At this stage both the activity data and documents are accessible in JSON format. From these JSON documents views are generated with MapReduce functions. These views are materialised entities. The front-end application is served up by CouchDB as an HTML page built from an HTML template. This is done using a show function that populates the template with data and metadata. The page holds references to JavaScript libraries which load the Exhibit-Simile JavaScript engine. It also calls back to CouchDB to serve up lists with items, i.e. entities, that the Exhibit-Simile engine can consume. This engine processes the incoming items into a local data graph which can be viewed and manipulated using Exhibit-Simile widgets. Individual activities and reports can be retrieved from the overview widgets. These are rendered using show functions that are directed via templates. Note that this front-end technology can be replaced by another. On the CouchDB side only new list functions have to be designed. The MapReduce functions that produce the entities remain the same. This shows how the data sources and the front-ends consuming them are loosely coupled.

Figure 6.9 shows this dynamic in a behavioural diagram. Here three engineering interactions and three analyst interactions can be seen. The top two engineering interactions show the pre-processing and uploading of an activity XML record and an activity DOCX document respectively. The third engineering interaction shows the generation of views with materialised entities. It can be seen that the views form the interaction points between engineering and analysis. The bottom analyst interaction shows the request of an Activity or Document description page. The user-facing pages were shown in Figure 3.2 and Figure 3.3 in chapter 3.
These HTML pages are templates that are populated with the corresponding JSON document using Mustache directives. Once the page is loaded it calls back to CouchDB to request related documents and related activities respectively. These have been prepared using views. The description pages described above are accessed by the user from aggregated fact pages, as was shown in Figure 3.4 and Figure 3.7 in chapter 3. The middle analyst interaction shows how such a page interacts with Exhibit-Simile and CouchDB. In the first pass the page is requested using a show function that populates a template with specifications. These are general variables that occur multiple times in the template, so it makes sense to store them externally. This is done in a special JSON document. In the second pass the HTML page calls back to request a slice of the fact tables for activities and transactions. This slice corresponds to the viewpoint selected in the top analyst interaction, which is explained below. For this slice all corresponding dimensional items are requested as well. Exhibit loads the items into a client-side data graph, which is controlled and viewed using Exhibit-Simile widgets. Several examples were shown in Figure 3.4 to Figure 3.7 in chapter 3. The facet widgets allow for filtering items. Filter selections propagate through the graph according to how the collections are bound together. This is explained in detail in chapter 9. Exhibit tabs display the corresponding tables with aggregated facts.
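The role of a list function in the second pass, turning view rows into items that Exhibit can load, can be sketched in Python. The project's list functions are JavaScript functions inside CouchDB (see chapter 8); the sketch below only emulates the transformation. It assumes view rows with a `key` array and a `value`, and an Exhibit data file of the form `{"items": [...]}` in which every item has a `label`. The key layout, the property names and the function name `rows_to_exhibit_json` are hypothetical.

```python
import json
from typing import Any, Dict, List


def rows_to_exhibit_json(rows: List[Dict[str, Any]]) -> str:
    """Convert view rows into an Exhibit-style JSON document with one item per row."""
    items = []
    for row in rows:
        # Assumed key layout for this illustration.
        provider, sector, activity_id, year = row["key"]
        items.append({
            "label": f"{activity_id}-{year}",
            "type": "Transaction",
            "provider": provider,
            "sector": sector,
            "activity": activity_id,
            "year": year,
            "amount": row["value"],
        })
    return json.dumps({"items": items})


# Made-up rows in the shape of a grouped view result.
sample_rows = [
    {"key": ["NL-1", "health", "act-1", 2011], "value": 150.0},
    {"key": ["NL-1", "energy", "act-2", 2012], "value": 25.0},
]

payload = json.loads(rows_to_exhibit_json(sample_rows))
```

Because only this formatting step knows about the front-end, replacing Exhibit with another client would leave the underlying views untouched.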
Figure 6.9 Sequence diagram of Data Staging solution, Information Platform and Business Intelligence Application working together