Text mining is projected to dominate data mining, and the reason is evident: more text is available than numeric data. Microsoft introduced a technology called Semantic Search in SQL Server 2012, and it carries over to SQL Server 2014. This session's detailed description and demos cover the enterprise implementation of the Tag Index and the Document Similarity Index, and compare what semantic search is with what Delve does. The demos include a web-based Silverlight application and content documents from Wikipedia. We'll also look at strategy tips for how best to leverage the semantic technology with existing Microsoft data mining.
6. Introduction
SQL Server has new Programmability Enhancements
Statistical Semantic Search
FileTables
Full-Text Search Improvements
These combined technologies make SQL Server 2014 a strong contender in text mining
7. Outline
Why Microsoft is competitive for data mining
Definitions: what is text mining?
History: how Microsoft’s semantic search was born
What is inside semantic search
Logical model
Demos
Performance
Microsoft Resources
8. Why Microsoft is Competitive for Data Mining
Based on 2012 and 2013 Surveys
9. Gartner 2013
Magic Quadrant for Business Intelligence and Analytics Platforms
Retrieved from http://www.gartner.com/technology/reprints.do?id=1-1DZLPEH&ct=130207&st=sb (February 5, 2013)
10. Gartner 2013
Magic Quadrant for Data Warehouse Database Management Systems
Retrieved from http://www.gartner.com/technology/reprints.do?id=1-1DU2VD4&ct=130131&st=sb (January 31, 2013)
13. Definition
Data mining is the automated or semi-automated process of discovering patterns in data
Text mining is the automated or semi-automated process of discovering patterns from textual data
Machine learning is the development and optimization of algorithms for automated or semi-automated pattern discovery
18. History
July 2008
Microsoft purchases Powerset for US$100 Million
Google dismisses semantic search
http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
19. History
March 2009
Google announces “snippets” as relevant to search
The media picks this story up as “semantic search”
http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
20. History
February 2012
Google announces Knowledge Graph, an explicit application of semantic search
http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
21. History
April 2012
Microsoft purchases 800+ patents from AOL for US$1 Billion
Among the patents are semantic search and metadata querying patents, older than Google
http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
22. What is inside Semantic Search
Text Mining introduced for SQL Server 2012
23. Future: Most data is Text
Two Research Types
• Quantitative research = data mining
• Qualitative research = text mining
The future is combining both
24. Statistical Semantic Search
Comprises some aspects of text mining
Identifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
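The idea of scoring key phrases and then scoring similar documents can be sketched in a few lines of Python. This is an illustration only: it uses plain TF-IDF weights and cosine similarity as a stand-in, and it is not the algorithm SQL Server uses internally. The function names (`tfidf_vectors`, `key_terms`, `cosine`) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weighted by TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def key_terms(vector, k=3):
    """Top-k terms by weight (ties broken alphabetically)."""
    return sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample corpus
docs = [
    "sql server full text search index",
    "sql server semantic search index of key phrases",
    "wikipedia article about football teams",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # two documents sharing several terms
sim_02 = cosine(vectors[0], vectors[2])  # two documents sharing no terms
print(key_terms(vectors[0], 2), sim_01, sim_02)
```

Terms that appear in every document get a weight of zero, so the score favors terms that distinguish a document from the rest of the corpus. That is the sense of "statistically relevant" this sketch assumes.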
25. FileTables
Built on existing SQL Server FILESTREAM technology
Files and documents
Stored in special tables in SQL Server
Accessed as if they were stored in the file system
26. Full-Text Search Enhancements
Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
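To make the proximity idea behind NEAR concrete, here is a minimal Python sketch. In T-SQL the feature is written inside a CONTAINS predicate, in the form NEAR((term1, term2), distance). The sketch below does not call SQL Server, and it assumes one simple definition of distance: the number of words sitting between the two terms. The function name `near` and the sample sentence are hypothetical.

```python
import re


def near(text, first, second, max_distance):
    """Return True if `first` and `second` occur with at most
    `max_distance` other words between them (assumed definition)."""
    tokens = re.findall(r"\w+", text.lower())
    positions_first = [i for i, t in enumerate(tokens) if t == first.lower()]
    positions_second = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        abs(i - j) - 1 <= max_distance
        for i in positions_first
        for j in positions_second
        if i != j
    )


sentence = "Semantic search in SQL Server"
print(near(sentence, "semantic", "server", 3))
print(near(sentence, "semantic", "server", 2))
```

Three words separate "semantic" and "server" in the sample sentence, so the first call matches and the second does not.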
29. (iFilter Required)
Documents
Full-Text Keyword Index “FTI”
iFilters
Semantic Document Similarity Index “DSI”
Semantic Database
Semantic Key Phrase Index (Tag Index “TI”)
30. Languages Currently Supported
Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)
31. Phases of Semantic Indexing
Full-Text Keyword Index “FTI”
Semantic Key Phrase Index (Tag Index “TI”)
Semantic Document Similarity Index “DSI”
http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
35. Integrated Full Text Search (iFTS)
Improved Performance and Scale:
Scale up to 350M documents for storage and search
iFTS query performance 7-10 times faster than in SQL Server 2008
Worst-case iFTS query response times less than 3 sec for corpus
Similar to or better than the main database search competitors
(2012, Michael Rys, Microsoft)
36. Linear Scale of FTI/TI/DSI
First known linearly scaling end-to-end Search and Semantic product in the industry
Time in Seconds vs. Number of Documents
(2011, K. Mukerjee, T. Porter, S. Gherman, Microsoft)
39. Text Mining References
Video
http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
http://www.microsoftpdc.com/2009/SVR32
Semantic Search (Books Online): explains the demo
http://msdn.microsoft.com/en-us/library/gg492075.aspx
Paper
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
40. Software
SQL Server Enterprise (includes database engine, Analysis Services, SSMS and SSDT)
http://www.microsoft.com/sqlserver/en/us/get-sql-server/try-it.aspx
Microsoft Office 365 Home
http://office.microsoft.com/en-us/try
43. Connect
Data Mining Resources and blog http://marktab.net
LinkedIn
Twitter @marktabnet
44. Conclusion
SQL Server provides data mining and semantic search
The core technology allows document similarity matching
The results can be combined with SQL Server Data Mining (such as Association Analysis)