EBSCO Information Services (EBSCO) is the leading provider of electronic journals, magazines, eBooks, audiobooks, and online research content for libraries, including hundreds of research databases, historical archives, point-of-care medical reference, and corporate learning tools serving millions of end users at tens of thousands of institutions worldwide. The EBSCO platform serves the needs of researchers at all levels in academic institutions, schools, public libraries, hospitals, medical institutions, corporations, and government institutions. Data is our business, and delivering new products quickly is our competitive advantage. We build hundreds of data products, and accelerating the analysis and transformation of new datasets translates directly to revenue and competitiveness. Because our data is so varied, using MongoDB to store data flexibly and JSON Studio to analyze it allows us to deliver products to market faster. In this session we describe the process that helped us expedite delivery of new datasets, and give real examples of how data is used, analyzed, and processed.
2. About Our Parent Company – EBSCO Industries
Twenty-three (23) diverse businesses including:
– Subscription provider for more than 300,000 titles from more than 78,000 publishers
worldwide, publisher of full text databases and other online products
– Steel joist and metal roof deck manufacturing
– Insurance brokerage; national promotional
products supplier: Vitronic, Crown
– Real estate development: Alys Beach,
Mt Laurel
– Manufacturer of fishing lures & hunting products: Rebel, Yum, Booyah, Summit,
Moultrie, Knight & Hale, and more
EBSCO has nearly 5,000 employees - 1,000 outside the US
Among Forbes Top 200 Privately Held Companies
3. • EBSCO Information Services: Largest business unit in EBSCO
Industries – over 3,200 employees
• The most highly used online research platform for libraries
worldwide
• Largest subscription service agency in the world
• Over 150,000 library customers worldwide
• Relationships with more than 95,000 publishers internationally
4. Product Lines
Approaching 400 research and educational databases via
EBSCOhost with over 2.6 billion records, and 100 million
searches per day
5. Discovery Service Product Line
Leading discovery service provider to libraries
and institutions worldwide
7. EBSCO Discovery Service (EDS)
• Single “search box” experience with a single result list
• Search entire collection of electronic resources
– Library Catalog
– Research Databases
– Subscribed resources from various providers
• Fast, unparalleled relevancy ranking and quality
• Installed at 8,000+ institutions (mainly academic) in over 100
countries
8. EBSCOhost Integrated Search (mid-2000s)
• Federated Search
– API calls to each provider of electronic resources that the library
has subscribed to
• Remote Calls
• Sometimes screen scraping
– Relevance ranking problem
• Slow
9. EDS Today
• Full text and metadata
used for construction of
the search index
• Access to full text
controlled by the
Provider
• Fast; relevancy-ranking
problem addressed
10. Challenges
• Discovery Service users require a common search
experience across all data sources independent of quality of
data!
• Data comes in various formats (XML, MARC, CSV, etc.)
• Large data sets (10k to 156+ million documents)
• Data sources must be analyzed before indexing
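One way to get heterogeneous formats into a single flexible store is to normalize each incoming record to a JSON-like document before loading. The sketch below is illustrative only: the XML record and field names are invented, but the `@attribute`/`#text` conventions mirror the document shapes shown later in this deck.

```python
# Minimal sketch: flatten an XML record into a JSON-like dict using the
# @attribute / #text conventions seen in the sample documents.
import xml.etree.ElementTree as ET

xml_record = """<record>
  <title>Sample Article</title>
  <identifier type="issn">1234-5678</identifier>
</record>"""

def xml_to_doc(xml_text):
    root = ET.fromstring(xml_text)
    doc = {}
    for child in root:
        entry = {"#text": child.text}
        # XML attributes become "@"-prefixed keys alongside the text value.
        entry.update({f"@{k}": v for k, v in child.attrib.items()})
        doc[child.tag] = entry
    return doc

print(xml_to_doc(xml_record))
```

A generic loader built this way never needs a per-source format spec: whatever fields the source has simply become document fields, and analysis happens after ingest.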
11. 2015 Goals for EBSCO Discovery Service
• Analyze and map one new data source a day
• Ingest additional 500 high-value data sources
• Analyze and map important data fields
• Identify and resolve variances with data
(2009-2014: the throughput varied from 4 to 10 data sources a month)
12. A Day in the Life of a Database Designer
[Workflow diagram] Data → Designer → Format Spec → Developer →
Format Loader → QA → Product → Product Released
13. What changed for the Designers?
[Workflow diagram: Data, JSON transform, Designer (using JSON Studio;
produces Mapping Rules and Reports), Developer, Generic Loader, QA,
Product, Product Released]
14. Why did we use MongoDB?
• Simple to setup and scale
• Documents stored as a whole
• Flexible schema
• Simple way to discover paths/structure of the documents
• Ingest of 156 million document collection was a breeze
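Path/structure discovery is the key point above: with a flexible schema you learn what fields exist by walking the documents themselves. A minimal sketch of the idea (the sample record reuses field names from the slide examples later in this deck):

```python
# Recursively enumerate the dotted field paths present in a JSON-like
# document; array elements contribute their fields under the array's path.
def paths(doc, prefix=""):
    out = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            out.add(path)
            out |= paths(value, path)
    elif isinstance(doc, list):
        for item in doc:
            out |= paths(item, prefix)
    return out

record = {"artinfo": {"pubtype": "Journal Articles", "tig": {"atl": "A title"}}}
print(sorted(paths(record)))
# ['artinfo', 'artinfo.pubtype', 'artinfo.tig', 'artinfo.tig.atl']
```

Run over a sample of a new data source, the union of these path sets is effectively the source's observed schema, which is what a designer needs before mapping.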
15. Why JSON Studio
• Ease of data exploration
• Wizard based aggregation pipeline builder
• Document structure discovery with ad hoc query capabilities
• Data visualization capabilities
• Ability to collaborate query strategy between team members
• We learned about JSON Studio at MongoDB World 2014!
16. SQL or JSON-Native?
• K / V stores
– Ad hoc querying is complex
– Steep learning curve for designers
– Complex loading process
• SQL / Columnar stores
– Ad hoc querying became complex quickly
– Complex loading process
• JSON was a good match for our documents
17. Example – Field Appearance
• Answer questions such as:
– What document types or record types
have both ISSN and ISBN, if any?
– How many documents with a doc-type
of “electronic media” do not have an
associated URL?
– ..
• Compare complexity:
– “Real” SQL:
– JSON-enabled SQL:
• select "metadata.dc:type", "metadata.dc:identifier"
from ebsco
where not exists (
select 1
from json_array_elements_text("metadata.dc:identifier") as jsondata
where jsondata.value like '%URI%'
) and 'Electronic Media' = ……
– Native JSON analysis in JSON Studio
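For contrast with the SQL above, the same "electronic media without a URL" question is a short filter when the documents are treated natively as JSON. The sketch below uses invented sample documents; the `metadata.dc:type` / `metadata.dc:identifier` field names and the URI test come from the slide's SQL.

```python
# Hypothetical documents shaped like the slide's metadata fields.
docs = [
    {"metadata": {"dc:type": "Electronic Media",
                  "dc:identifier": ["URI:http://example.org/a"]}},
    {"metadata": {"dc:type": "Electronic Media",
                  "dc:identifier": ["ISSN:1234-5678"]}},
    {"metadata": {"dc:type": "Opinion Papers", "dc:identifier": []}},
]

def has_url(doc):
    # A document "has a URL" if any identifier contains a URI,
    # matching the '%URI%' test in the SQL version.
    return any("URI" in i for i in doc["metadata"].get("dc:identifier", []))

missing = [d for d in docs
           if d["metadata"]["dc:type"] == "Electronic Media" and not has_url(d)]
print(len(missing))  # 1
```

In MongoDB terms this is a single `$match` over the doc-type plus a regex test on the identifier array, with no subquery or array-to-rows conversion needed.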
18. Example – Coverage Reports
• E.g. Min/max pub dates by title, ISSN, ISBN and source
• Librarians build their own aggregations and reports, publish
reports, download to Excel, etc.
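A coverage report of this kind is a group-by with min/max accumulators. The pure-Python sketch below uses invented titles, ISSNs, and dates to show the shape of the computation:

```python
# Min/max pub dates grouped by (title, issn) -- sample values are invented.
from collections import defaultdict

docs = [
    {"title": "Journal of Examples", "issn": "1111-2222", "pubdate": 1998},
    {"title": "Journal of Examples", "issn": "1111-2222", "pubdate": 2004},
    {"title": "Review of Samples",   "issn": "3333-4444", "pubdate": 2010},
]

coverage = defaultdict(lambda: (None, None))
for d in docs:
    key = (d["title"], d["issn"])
    lo, hi = coverage[key]
    year = d["pubdate"]
    coverage[key] = (year if lo is None else min(lo, year),
                     year if hi is None else max(hi, year))

for (title, issn), (lo, hi) in sorted(coverage.items()):
    print(f"{title} ({issn}): {lo}-{hi}")
```

As a MongoDB aggregation this is one `$group` stage with `$min`/`$max` accumulators over the same compound key, which is the kind of pipeline JSON Studio's wizard builds.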
20. Deep Analysis Initiative
• Deep inspection and analysis of very large and very
heterogeneous data sets
• Attributes needed:
– Support for very large and diverse data sets
– Very large number of fields – some across all data sets, some not
– Sophisticated, in-place analytics
• Data is too large to bring into app for operations
– Need for high-performing analytic queries
21. Why we became a SonarW Design Partner
• We like/use MongoDB – SonarW is
MongoDB compatible.
• SonarW is a data warehouse for
MongoDB data
• Many new aggregation operators that
are useful in our analysis
• As a design partner we could
influence and ensure that our
workloads were supported well
• What is SonarW
– Data warehouse for JSON data
– MongoDB compatible
– Columnar database
– Everything runs in parallel
– Optimized for very large data
sets and complex queries /
analytic workloads
22. Example 1 – Frequency Analysis
• Example:
– Per pubtype (which can sit in an array or
not), compute how many have title group
information and what percentage of them
have an article title
– <10s for 20M documents (each doc
having hundreds of fields and sometimes
two levels of arrays)
"artinfo": {
"@vendorFT": "Y",
…
"pubtype": "Journal Articles",
…
"tig": {
…
"atl": "Breaking the Language Barrier…"
},
"urlIP": "http://www.eric.ed.gov/…"
},
{
"_id": {
"pubtype": "Review"
},
"count": 103983,
"tig_atl": 103983,
"pct": 100.0
}
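The frequency analysis above can be sketched in pure Python to show the logic; the `artinfo`/`pubtype`/`tig`/`atl` field names come from the slide's sample record, while the three documents here are invented (the real runs were over 20M documents):

```python
# Per-pubtype counts and the percentage with an article title (tig.atl).
from collections import Counter

docs = [
    {"artinfo": {"pubtype": "Review", "tig": {"atl": "Title A"}}},
    {"artinfo": {"pubtype": "Review", "tig": {"atl": "Title B"}}},
    {"artinfo": {"pubtype": "Journal Articles", "tig": {}}},
]

count, with_atl = Counter(), Counter()
for d in docs:
    pubtype = d["artinfo"]["pubtype"]
    count[pubtype] += 1
    if d["artinfo"].get("tig", {}).get("atl"):
        with_atl[pubtype] += 1

report = [{"_id": {"pubtype": pt},
           "count": n,
           "tig_atl": with_atl[pt],
           "pct": 100.0 * with_atl[pt] / n}
          for pt, n in count.items()]
print(report)
```

Each output document has the same shape as the `{"_id": {"pubtype": "Review"}, "count": …, "tig_atl": …, "pct": …}` result shown on the slide.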
23. Example 2 – Working with Arrays
• Data is full of arrays – sometimes multiple levels deep
• Arrays sometimes represent “foreign patterns” – e.g. NV pairs
• We can work with them because:
– $unwind does not stop in docs
– Fast $unwind and fast $unwind+$unwind
– Fast in-place conversion of array to document
• Aggregation speed is very important
– 20M docs
– 2 Levels of arrays – 50x and 5x
– Aggregations run on 5 Billion documents
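To make the $unwind numbers concrete, here is a minimal pure-Python sketch of what a single-level $unwind does: one output document per element of the named array (MongoDB chains $unwind stages for nested arrays, which is how 20M docs with 50x and 5x arrays become billions of unwound documents). The sample document is invented, borrowing the `df`/`@tag` names from the slide examples:

```python
# One output document per element of the array at `field`.
def unwind(docs, field):
    for d in docs:
        value = d.get(field, [])
        for elem in (value if isinstance(value, list) else [value]):
            out = dict(d)          # shallow copy of the parent document
            out[field] = elem      # array replaced by a single element
            yield out

docs = [{"_id": 1, "df": [{"@tag": "13"}, {"@tag": "001"}]}]
flat = list(unwind(docs, "df"))
print(len(flat))  # 2
```

With a 50x outer array and a 5x inner array, chaining two such stages multiplies each document by up to 250, which is why $unwind throughput dominates these workloads.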
24. Business Impact
Datasets per month: ~2x by mid-2015 and ~4.5x by end-2015 vs. pre-2015
PLUS: 50% improvement in quality of mapping
[Bar chart: datasets per month (0–35) for Pre 2015, Mid 2015, End 2015]
26. Questions on Schema are also Important
• E.g. “What fields present in the whole dataset occur only for Opinion Papers doc-types?”
• Combination of two approaches:
– Using JSON Studio’s Schema Analyzer
– Using pipelines built with JSON Studio
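The set logic behind that schema question can be sketched directly: collect the field set per doc-type, then subtract everything seen elsewhere. The documents and the `thesis` field below are invented for illustration; "Opinion Papers" is the doc-type named on the slide.

```python
# Which top-level fields occur only in "Opinion Papers" documents?
docs = [
    {"doctype": "Opinion Papers", "title": "A", "thesis": "..."},
    {"doctype": "Journal Articles", "title": "B"},
]

fields_by_type = {}
for d in docs:
    fields_by_type.setdefault(d["doctype"], set()).update(d.keys())

# Fields seen in any other doc-type.
others = set().union(*(fields for dtype, fields in fields_by_type.items()
                       if dtype != "Opinion Papers"))
only_opinion = fields_by_type["Opinion Papers"] - others
print(sorted(only_opinion))  # ['thesis']
```

In practice the per-type field sets come from a schema analyzer or an aggregation pipeline rather than a Python loop, but the subtraction at the end is the same.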
27. Example 2 – Working with Arrays
• Data is full of arrays – sometimes multiple levels deep
• Arrays sometimes represent “foreign patterns” – e.g. NV pairs
• We can work with them because:
– $unwind does not stop in docs
– Fast $unwind and fast $unwind+$unwind
– Fast in-place conversion of array to document
"df" : [
{
"@tag" : "13",
"sf" : {
"@code" : "a",
"#text" : "ED024700"
}
},
{
"@tag" : "001",
"sf" : {
"@code" : "a",
"#text" : "New"
}
},
..
"_13" : {
"a" : {
"#text" : "ED024700"
}
},
"_001" : {
"a" : {
"#text" : "New"
}
}, …
{
"@name" : "DateEntry",
"@label" : "Entry Date",
"#text" : "1969"
}
28. Example 2 – Cont.
• What percentage of the documents have
more than 10 @Code values for tag 20650
and a value for #text in tag 28 but the #text
value for tag 204 does not match regex …
• <5 seconds on a collection of 10 Million docs
• $unwind becomes a query on 1.5 Billion
documents (df-50x; sf-3x)
"record": {
"@uid": "11288235",
"timestamp": "2015-03-16T17:49:47Z",
"df": [
{
"@tag": "13",
"sf": {
"@code": "a",
"#text": "11288235"
}
},
{
"@tag": "10245",
"sf": [
{
"@code": "a",
"#text": "Once Upon a California Christmas."
},
{
"@code": "b",
"#text": "English"
},
{
"@code": "c",
"#text": "EN"
}
]
},