EBSCO Information Services (EBSCO) is the leading provider of electronic journals, magazines, eBooks, audiobooks, and online research content for libraries, including hundreds of research databases, historical archives, point-of-care medical reference, and corporate learning tools serving millions of end users at tens of thousands of institutions worldwide. The EBSCO platform serves the needs of researchers at all levels in academic institutions, schools, public libraries, hospitals, medical institutions, corporations, and government institutions. Data is our business, and delivering new products quickly is our competitive advantage. We build hundreds of data products, and accelerating the analysis and transformation of new datasets translates directly to revenue and competitiveness. Because our data is so varied, using MongoDB to store data flexibly and JSON Studio to analyze it allows us to deliver products to market faster. In this session we describe the process that helped us expedite delivery of new datasets, and give real examples of how data is used, analyzed, and processed.
2. About Our Parent Company – EBSCO Industries
Twenty-three (23) diverse businesses including:
– Subscription provider for more than 300,000 titles from more than 78,000 publishers
worldwide, publisher of full text databases and other online products
– Steel joist and metal roof deck manufacturing
– Insurance brokerage; national promotional
products supplier: Vitronic, Crown
– Real estate development: Alys Beach,
Mt Laurel
– Manufacturer of fishing lures & hunting products: Rebel, Yum, Booyah, Summit,
Moultrie, Knight & Hale, and more
EBSCO has nearly 5,000 employees - 1,000 outside the US
Among Forbes Top 200 Privately Held Companies
3. • EBSCO Information Services: Largest business unit in EBSCO
Industries – over 3,200 employees
• The most highly used online research platform for libraries
worldwide
• Largest subscription service agency in the world
• Over 150,000 library customers worldwide
• Relationships with more than 95,000 publishers internationally
4. Product Lines
Approaching 400 research and educational databases via
EBSCOhost with over 2.6 billion records, and 100 million
searches per day
5. Discovery Service Product Line
Leading discovery service provider to libraries
and institutions worldwide
7. EBSCO Discovery Service (EDS)
• Single “search box” experience with a single result list
• Search entire collection of electronic resources
– Library Catalog
– Research Databases
– Subscribed resources from various providers
• Fast, unparalleled relevancy ranking and quality
• Installed at 8,000+ institutions (mainly academic) in over 100
countries
8. EBSCOhost Integrated Search (mid-2000s)
• Federated Search
– API calls to each provider of electronic resources that the library
has subscribed to
• Remote Calls
• Sometimes screen scraping
– Relevance ranking problem
• Slow
9. EDS Today
• Full text and metadata
used for construction of
the search index
• Access to full text
controlled by the
Provider
• Fast; relevancy-ranking
problem addressed
10. Challenges
• Discovery Service users require a common search
experience across all data sources independent of quality of
data!
• Data comes in various formats (XML, MARC, CSV, etc.)
• Large data sets (10k to 156+ million documents)
• Data sources must be analyzed before indexing
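One way to get heterogeneous formats into a single flexible store is to normalize each incoming record to a JSON-like document before loading. The sketch below is illustrative only: the XML record and field names are invented, but the `@attribute`/`#text` conventions mirror the document shapes shown later in this deck.

```python
# Minimal sketch: flatten an XML record into a JSON-like dict using the
# @attribute / #text conventions seen in the sample documents.
import xml.etree.ElementTree as ET

xml_record = """<record>
  <title>Sample Article</title>
  <identifier type="issn">1234-5678</identifier>
</record>"""

def xml_to_doc(xml_text):
    root = ET.fromstring(xml_text)
    doc = {}
    for child in root:
        entry = {"#text": child.text}
        # XML attributes become "@"-prefixed keys alongside the text value.
        entry.update({f"@{k}": v for k, v in child.attrib.items()})
        doc[child.tag] = entry
    return doc

print(xml_to_doc(xml_record))
```

A generic loader built this way never needs a per-source format spec: whatever fields the source has simply become document fields, and analysis happens after ingest.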
11. 2015 Goals for EBSCO Discovery Service
• Analyze and map one new data source a day
• Ingest additional 500 high-value data sources
• Analyze and map important data fields
• Identify and resolve variances with data
(2009-2014: the throughput varied from 4 to 10 data sources a month)
12. A Day in the Life of a Database Designer
[Workflow diagram] Data → Designer → Format Spec → Developer →
Format Loader → QA → Product → Product Released
13. What changed for the Designers?
[Workflow diagram: Data, JSON transform, Designer (using JSON Studio;
produces Mapping Rules and Reports), Developer, Generic Loader, QA,
Product, Product Released]
14. Why did we use MongoDB?
• Simple to setup and scale
• Documents stored as a whole
• Flexible schema
• Simple way to discover paths/structure of the documents
• Ingest of 156 million document collection was a breeze
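Path/structure discovery is the key point above: with a flexible schema you learn what fields exist by walking the documents themselves. A minimal sketch of the idea (the sample record reuses field names from the slide examples later in this deck):

```python
# Recursively enumerate the dotted field paths present in a JSON-like
# document; array elements contribute their fields under the array's path.
def paths(doc, prefix=""):
    out = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            out.add(path)
            out |= paths(value, path)
    elif isinstance(doc, list):
        for item in doc:
            out |= paths(item, prefix)
    return out

record = {"artinfo": {"pubtype": "Journal Articles", "tig": {"atl": "A title"}}}
print(sorted(paths(record)))
# ['artinfo', 'artinfo.pubtype', 'artinfo.tig', 'artinfo.tig.atl']
```

Run over a sample of a new data source, the union of these path sets is effectively the source's observed schema, which is what a designer needs before mapping.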
15. Why JSON Studio
• Ease of data exploration
• Wizard based aggregation pipeline builder
• Document structure discovery with ad hoc query capabilities
• Data visualization capabilities
• Ability to collaborate query strategy between team members
• We learned about JSON Studio at MongoDB World 2014!
16. SQL or JSON-Native?
• K / V stores
– Ad hoc querying is complex
– Steep learning curve for designers
– Complex loading process
• SQL / Columnar stores
– Ad hoc querying became complex quickly
– Complex loading process
• JSON was a good match for our documents
17. Example – Field Appearance
• Answer questions such as:
– What document types or record types
have both ISSN and ISBN, if any?
– How many documents with a doc-type
of “electronic media” do not have an
associated URL?
– ..
• Compare complexity:
– “Real” SQL:
– JSON-enabled SQL:
• select "metadata.dc:type", "metadata.dc:identifier"
from ebsco
where not exists (
select 1
from json_array_elements_text("metadata.dc:identifier") as jsondata
where jsondata.value like '%URI%'
) and 'Electronic Media' = ……
– Native JSON analysis in JSON Studio
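For contrast with the SQL above, the same "electronic media without a URL" question is a short filter when the documents are treated natively as JSON. The sketch below uses invented sample documents; the `metadata.dc:type` / `metadata.dc:identifier` field names and the URI test come from the slide's SQL.

```python
# Hypothetical documents shaped like the slide's metadata fields.
docs = [
    {"metadata": {"dc:type": "Electronic Media",
                  "dc:identifier": ["URI:http://example.org/a"]}},
    {"metadata": {"dc:type": "Electronic Media",
                  "dc:identifier": ["ISSN:1234-5678"]}},
    {"metadata": {"dc:type": "Opinion Papers", "dc:identifier": []}},
]

def has_url(doc):
    # A document "has a URL" if any identifier contains a URI,
    # matching the '%URI%' test in the SQL version.
    return any("URI" in i for i in doc["metadata"].get("dc:identifier", []))

missing = [d for d in docs
           if d["metadata"]["dc:type"] == "Electronic Media" and not has_url(d)]
print(len(missing))  # 1
```

In MongoDB terms this is a single `$match` over the doc-type plus a regex test on the identifier array, with no subquery or array-to-rows conversion needed.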
18. Example – Coverage Reports
• E.g. Min/max pub dates by title, ISSN, ISBN and source
• Librarians build their own aggregations and reports, publish
reports, download to Excel, etc.
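A coverage report of this kind is a group-by with min/max accumulators. The pure-Python sketch below uses invented titles, ISSNs, and dates to show the shape of the computation:

```python
# Min/max pub dates grouped by (title, issn) -- sample values are invented.
from collections import defaultdict

docs = [
    {"title": "Journal of Examples", "issn": "1111-2222", "pubdate": 1998},
    {"title": "Journal of Examples", "issn": "1111-2222", "pubdate": 2004},
    {"title": "Review of Samples",   "issn": "3333-4444", "pubdate": 2010},
]

coverage = defaultdict(lambda: (None, None))
for d in docs:
    key = (d["title"], d["issn"])
    lo, hi = coverage[key]
    year = d["pubdate"]
    coverage[key] = (year if lo is None else min(lo, year),
                     year if hi is None else max(hi, year))

for (title, issn), (lo, hi) in sorted(coverage.items()):
    print(f"{title} ({issn}): {lo}-{hi}")
```

As a MongoDB aggregation this is one `$group` stage with `$min`/`$max` accumulators over the same compound key, which is the kind of pipeline JSON Studio's wizard builds.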
20. Deep Analysis Initiative
• Deep inspection and analysis of very large and very
heterogeneous data sets
• Attributes needed:
– Support for very large and diverse data sets
– Very large number of fields – some across all data sets, some not
– Sophisticated, in-place analytics
• Data is too large to bring into app for operations
– Need for high-performing analytic queries
21. Why we became a SonarW Design Partner
• We like/use MongoDB – SonarW is
MongoDB compatible.
• SonarW is a data warehouse for
MongoDB data
• Many new aggregation operators that
are useful in our analysis
• As a design partner we could
influence and ensure that our
workloads were supported well
• What is SonarW
– Data warehouse for JSON data
– MongoDB compatible
– Columnar database
– Everything runs in parallel
– Optimized for very large data
sets and complex queries /
analytic workloads
22. Example 1 – Frequency Analysis
• Example:
– Per pubtype (which can sit in an array or
not), compute how many have title group
information and what percentage of them
have an article title
– <10s for 20M documents (each doc
having hundreds of fields and sometimes
two levels of arrays)
"artinfo": {
"@vendorFT": "Y",
…
"pubtype": "Journal Articles",
…
"tig": {
…
"atl": "Breaking the Language Barrier…"
},
"urlIP": "http://www.eric.ed.gov/…"
},
{
"_id": {
"pubtype": "Review"
},
"count": 103983,
"tig_atl": 103983,
"pct": 100.0
}
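The frequency analysis above can be sketched in pure Python to show the logic; the `artinfo`/`pubtype`/`tig`/`atl` field names come from the slide's sample record, while the three documents here are invented (the real runs were over 20M documents):

```python
# Per-pubtype counts and the percentage with an article title (tig.atl).
from collections import Counter

docs = [
    {"artinfo": {"pubtype": "Review", "tig": {"atl": "Title A"}}},
    {"artinfo": {"pubtype": "Review", "tig": {"atl": "Title B"}}},
    {"artinfo": {"pubtype": "Journal Articles", "tig": {}}},
]

count, with_atl = Counter(), Counter()
for d in docs:
    pubtype = d["artinfo"]["pubtype"]
    count[pubtype] += 1
    if d["artinfo"].get("tig", {}).get("atl"):
        with_atl[pubtype] += 1

report = [{"_id": {"pubtype": pt},
           "count": n,
           "tig_atl": with_atl[pt],
           "pct": 100.0 * with_atl[pt] / n}
          for pt, n in count.items()]
print(report)
```

Each output document has the same shape as the `{"_id": {"pubtype": "Review"}, "count": …, "tig_atl": …, "pct": …}` result shown on the slide.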
23. Example 2 – Working with Arrays
• Data is full of arrays – sometimes multiple levels deep
• Arrays sometimes represent “foreign patterns” – e.g. NV pairs
• We can work with them because:
– $unwind does not stop in docs
– Fast $unwind and fast $unwind+$unwind
– Fast in-place conversion of array to document
• Aggregation speed is very important
– 20M docs
– 2 Levels of arrays – 50x and 5x
– Aggregations run on 5 Billion documents
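To make the $unwind numbers concrete, here is a minimal pure-Python sketch of what a single-level $unwind does: one output document per element of the named array (MongoDB chains $unwind stages for nested arrays, which is how 20M docs with 50x and 5x arrays become billions of unwound documents). The sample document is invented, borrowing the `df`/`@tag` names from the slide examples:

```python
# One output document per element of the array at `field`.
def unwind(docs, field):
    for d in docs:
        value = d.get(field, [])
        for elem in (value if isinstance(value, list) else [value]):
            out = dict(d)          # shallow copy of the parent document
            out[field] = elem      # array replaced by a single element
            yield out

docs = [{"_id": 1, "df": [{"@tag": "13"}, {"@tag": "001"}]}]
flat = list(unwind(docs, "df"))
print(len(flat))  # 2
```

With a 50x outer array and a 5x inner array, chaining two such stages multiplies each document by up to 250, which is why $unwind throughput dominates these workloads.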
24. Business Impact
Datasets per month: ~2x by mid-2015 and ~4.5x by end-2015 vs. pre-2015
PLUS: 50% improvement in quality of mapping
[Bar chart: datasets per month (0–35) for Pre 2015, Mid 2015, End 2015]
26. Questions on Schema are also Important
• E.g. “What fields present in the whole dataset occur only for Opinion Papers doc-types?”
• Combination of two approaches:
– Using JSON Studio’s Schema Analyzer
– Using pipelines built with JSON Studio
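The set logic behind that schema question can be sketched directly: collect the field set per doc-type, then subtract everything seen elsewhere. The documents and the `thesis` field below are invented for illustration; "Opinion Papers" is the doc-type named on the slide.

```python
# Which top-level fields occur only in "Opinion Papers" documents?
docs = [
    {"doctype": "Opinion Papers", "title": "A", "thesis": "..."},
    {"doctype": "Journal Articles", "title": "B"},
]

fields_by_type = {}
for d in docs:
    fields_by_type.setdefault(d["doctype"], set()).update(d.keys())

# Fields seen in any other doc-type.
others = set().union(*(fields for dtype, fields in fields_by_type.items()
                       if dtype != "Opinion Papers"))
only_opinion = fields_by_type["Opinion Papers"] - others
print(sorted(only_opinion))  # ['thesis']
```

In practice the per-type field sets come from a schema analyzer or an aggregation pipeline rather than a Python loop, but the subtraction at the end is the same.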
27. Example 2 – Working with Arrays
• Data is full of arrays – sometimes multiple levels deep
• Arrays sometimes represent “foreign patterns” – e.g. NV pairs
• We can work with them because:
– $unwind does not stop in docs
– Fast $unwind and fast $unwind+$unwind
– Fast in-place conversion of array to document
"df" : [
{
"@tag" : "13",
"sf" : {
"@code" : "a",
"#text" : "ED024700"
}
},
{
"@tag" : "001",
"sf" : {
"@code" : "a",
"#text" : "New"
}
},
..
"_13" : {
"a" : {
"#text" : "ED024700"
}
},
"_001" : {
"a" : {
"#text" : "New"
}
}, …
{
"@name" : "DateEntry",
"@label" : "Entry Date",
"#text" : "1969"
}
28. Example 2 – Cont.
• What percentage of the documents have
more than 10 @Code values for tag 20650
and a value for #text in tag 28 but the #text
value for tag 204 does not match regex …
• <5 seconds on a collection of 10 Million docs
• $unwind becomes a query on 1.5 Billion
documents (df-50x; sf-3x)
"record": {
"@uid": "11288235",
"timestamp": "2015-03-16T17:49:47Z",
"df": [
{
"@tag": "13",
"sf": {
"@code": "a",
"#text": "11288235"
}
},
{
"@tag": "10245",
"sf": [
{
"@code": "a",
"#text": "Once Upon a California Christmas."
},
{
"@code": "b",
"#text": "English"
},
{
"@code": "c",
"#text": "EN"
}
]
},