Big data presents organizations with a massive opportunity to generate new sources of value, providing evidence and content for the design of new products, new processes, and more efficient operations. To generate this value, data-driven organizations are leveraging sources they were never able to analyze before, such as machine-generated data, log files, and transcripts, in addition to traditional relational files. But simply having access to more data than ever before does not automatically lead to usable business insights. The data still needs to be wrangled—explored, structured, and cleansed—before it can be analyzed effectively.
The Royal Bank of Scotland serves over 30 million customers worldwide, and ensuring that they receive top-notch service is crucial for RBS’s business. Making use of big data—particularly unstructured and semistructured data from online customer web chats—allows RBS to understand the customer experience at all points of interaction with the bank and leverage that data to personalize and enhance their customer engagement processes.
Dan Jermyn and Connor Carreras discuss how data wrangling has enabled RBS to easily extract insights from unstructured data stored in Hadoop and dramatically decrease the time required to prepare data for analytics. Dan and Connor then describe specific big data wrangling techniques that can be used to support organizational analytics goals and explain how successfully leveraging big data can transform the customer experience.
10. Data Wrangling within the RBS Hadoop Data Lake
[Diagram; stage labels: INGESTION, ACCESS]
DATA SOURCES
Transactional Data: banking, credit cards, lending, wealth, mortgages, ledgers, trades, payments
Interaction Data: social, webchat
Data lake zones: Raw Data Zone, Discovery Zone, Shared Zone
BUSINESS OPERATIONS: Analytics, Reporting, Data Product Models
11. Big Data Wrangling with Trifacta
1. In-line Profiling: Surfaces distributional issues or quality problems as you work.
2. Sampling & Scalability: Enables real-time profiling and responsive wrangling that easily scales to many TB of data.
3. Structured Transformation Previews: Visualize the result of every transformation as you work. Reduced number of iterations.
4. Transformation Suggestions: Targeted suggestions based on prior behavior, metadata, and user interactions allow you to quickly transform your data…technical skills not required.
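The idea behind profiling on a sample can be illustrated with a short Python sketch. This is not Trifacta's implementation: the function name `profile_sample`, the row format, and the sample rows are assumptions made up for illustration. It draws a random sample and reports, per column, how many values are missing, how many distinct values appear, and the most common value.

```python
import random
from collections import Counter

def profile_sample(rows, sample_size=1000, seed=0):
    """Profile a random sample of row dictionaries, column by column.

    Hypothetical helper for illustration only; not a Trifacta API.
    """
    rng = random.Random(seed)
    # Profile everything when the dataset is small, otherwise a fixed-size sample.
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    columns = sorted({key for row in sample for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in sample]
        # Treat None and the empty string as missing.
        present = [v for v in values if v is not None and v != ""]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top": Counter(present).most_common(1)[0][0] if present else None,
        }
    return report

# Made-up rows standing in for webchat metadata.
rows = [
    {"channel": "webchat", "wait_seconds": 12},
    {"channel": "webchat", "wait_seconds": None},
    {"channel": "", "wait_seconds": 40},
]
report = profile_sample(rows)
print(report["channel"])
```

Because the profile is computed on a bounded sample, its cost stays roughly constant as the full dataset grows, which is what keeps this kind of feedback responsive.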
19. See a Demo at Booth #309
Download Trifacta Wrangler for Free
trifacta.com/start-wrangling
Editor's Notes
Before diving into an explanation of how RBS has used data wrangling to unlock the value of their webchat data, I want to share an interesting statistic that Forrester Research has compiled. <quote>
Why? Because it’s difficult and time-consuming to analyze. You might have experienced this firsthand at your organization.
Big data throws a wrench into the traditional data analytics process of formulating a question, analyzing the curated dataset in the enterprise data warehouse, and discovering insights from that data. Now your analyst is trying to look at non-traditional data sources that have not been curated internally and given a schema in the EDW. Since non-tabular, semistructured, and unstructured data doesn’t fit easily into relational databases, Excel or Tableau, you need to perform a significant amount of work upfront to make the data ready for analysis. And depending on its complexity, making that data ready for analysis requires someone with a very technical skillset—often someone who knows how to code.
<CLICK>
That’s where data wrangling comes in.
Brief discussion of what data wrangling is and why it’s valuable.
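As a rough illustration of the "structuring" step, the Python sketch below flattens one nested webchat record into tabular rows, one per message. The record layout, field names, and sample values are hypothetical; the real RBS webchat format is not described in these notes.

```python
import json

# A made-up semistructured webchat record (hypothetical layout).
raw = '''{"chat_id": "c-001", "customer": {"segment": "retail"},
 "messages": [
   {"ts": "2016-03-01T10:00:00", "from": "customer", "text": " Hi, my card was declined "},
   {"ts": "2016-03-01T10:00:30", "from": "agent", "text": "Sorry to hear that."}
 ]}'''

def flatten_chat(record):
    """Turn one nested chat record into a list of flat rows, one per message."""
    rows = []
    for position, msg in enumerate(record.get("messages", []), start=1):
        rows.append({
            "chat_id": record["chat_id"],
            "segment": record.get("customer", {}).get("segment"),
            "position": position,
            "sender": msg.get("from"),
            "timestamp": msg.get("ts"),
            # Cleansing step: collapse runs of whitespace and trim the ends.
            "text": " ".join(msg.get("text", "").split()),
        })
    return rows

rows = flatten_chat(json.loads(raw))
for row in rows:
    print(row)
```

Once each message is a flat row with the same columns, the data fits the relational and spreadsheet-style tools mentioned above.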
At RBS, Trifacta is deployed in the discovery zone. Analysts use Trifacta to access either the raw data ingested into the raw data zone or data already cleansed by their colleagues and landed in the discovery zone. They can then publish their wrangled data back into the discovery zone for reuse by a larger team or, when they are ready for their data product to be consumed by the downstream business analytics tools, they can publish the wrangled data into the shared zone.
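The zone rules described here can be written down as a small Python sketch. The zone names come from the notes; the rule table and function names are illustrative assumptions, not an RBS or Trifacta configuration.

```python
# Zones an analyst reads from when wrangling (per the notes: raw or discovery).
READABLE_ZONES = {"raw", "discovery"}
# Zones where wrangled output may be published (per the notes: discovery or shared).
PUBLISHABLE_ZONES = {"discovery", "shared"}

def can_read(zone):
    """Return True if wrangling may take its input from this zone."""
    return zone in READABLE_ZONES

def can_publish(zone):
    """Return True if wrangled output may land in this zone."""
    return zone in PUBLISHABLE_ZONES

def publish_target(ready_for_downstream):
    """Pick the landing zone: shared for finished data products, else discovery."""
    return "shared" if ready_for_downstream else "discovery"

print(can_read("raw"), can_publish("raw"))   # raw is read-only in this sketch
print(publish_target(ready_for_downstream=True))
```

In this sketch the raw zone is never a publish target, so ingested data stays untouched while wrangled copies move toward the shared zone.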
The simplified IT data workflow looks something like this:
Data sources are extracted or provided (by third parties) in batch and in real time.
Data fed by ETL and ESB technology lands in the Hadoop data lake (the raw zone) and in the data product zone (real-time interactions, e.g., next best offer, fraud detection).
Data flows from the refinery zone (exploration/discovery/hypothesis) to the optimized zone (where data is readied for consumption).
Data is consumed by data integration technology to feed other systems, and by business insight tools for reporting or statistical modeling.
The refinery zone is mostly Hadoop.
The ODS and EDWH are often RDBMSs and appliances.
The data product zone can be NoSQL (Not Only SQL: MongoDB, Hadoop) or a traditional RDBMS.
Four key features of Trifacta have empowered RBS’s analysts to quickly and easily wrangle their webchat data.
<read/elaborate on slide text>
You’re going to see each of these features in action right now.