Data Mining for Libraries:
What are the Possibilities?
Elaine M. Lasda Bergman, MLS
Twitter: @ElaineLibrarian
elasdabergman@albany.edu
Subject Librarian for Social Welfare
University at Albany, SUNY
SUNYLA Midwinter Conference
January 30, 2015
What is Data Mining?
http://pixabay.com/en/helmet-mine-mining-headgear-155632/
Knowledge Discovery In Databases
(KDD)
Input data → Data Preprocessing → Data Mining → Postprocessing → Information
Adapted from Tan, et al. (2006), p.3
A note about data collection
• It’s the kicker: GIGO
• Cleaning
• Preprocessing
What is Weka?
http://www.cs.waikato.ac.nz/ml/weka/
Weka for Prediction
Mackenzie, Ian: https://www.flickr.com/photos/madmack/165933656/
Decision Tree From Weka
Likelihood of graduate students using library resources, based on survey questions
Did student use email/IM reference?
  Yes: Did student receive instruction?
    0 sessions
    1-2 sessions: Time between grad/undergrad
      None: 45% yes
      1-5 years: 100% yes
      5+ years: 100% yes
    3+ sessions
  No: Student’s residency status
    On campus full time
    Off campus full time
    Part time
Weka for Classification
http://www.geograph.org.uk/photo/971476
Animal Clusters
Weka for Association Analysis
http://analytics-arena.blogspot.com/2012/12/the-famous-beer-diaper-planogram.html
Association Rules
(Anomaly Detection)
https://www.flickr.com/photos/fonalite/2780198933/
How Can Libraries Use Data Mining?
http://dlg.galileo.usg.edu/dahlonega/dahlonega_logo.jpg
Circling Back:
It All Starts With Data Collection
http://www.navigatingthetension.com/2012/02/circle-wagons.html
Questions?
Me:
Elaine Lasda Bergman, Subject Librarian for Social Welfare, University at Albany
email: elasdabergman@albany.edu
Twitter: @ElaineLibrarian
Resources used:
Tan, P. et al. (2006). Introduction to Data Mining. Boston: Pearson Education, Inc.
Newton, et al. (2012). Your Statistical Consultant: Answers to Your Data Analysis Questions. Thousand Oaks: SAGE Publications.
Two good Weka Tutorials:
http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
http://www.uh.edu/~smiertsc/4397cis/WEKA_Data_Mining_Tool.pdf
Data Mining for the Masses:
https://rapidminer.com/wp-content/uploads/2013/10/DataMiningForTheMasses.pdf

Editor's Notes

  1. Data mining is a way to make sense of large datasets. It borrows theoretical underpinnings from statistics as well as computer science, allowing us to generate new insights and knowledge. Data mining is useful in many capacities, and it is increasingly easy to generate useful models to predict, classify, and describe new information. The results of data mining analytics can be used in administrative decision making, understanding user behavior, and identifying appropriate resources and services to meet the needs of customers or library patrons. We don’t have time today to get into the nitty-gritty of how all of these algorithms and models can be implemented, but I want to show you some of the possibilities that data mining techniques afford libraries and librarians.
  2. Data mining is part of a larger process known as “Knowledge Discovery in Databases” or KDD. First, we have a dataset, or input data. Then we preprocess it, which means making it ready for analysis. This could mean throwing out incomplete data or converting the data into the proper format for analysis. Then comes the data mining, which we will focus on in a minute. After the data has been mined and analyzed and our conclusions identified, we create visualizations and present the results in a way that is understandable and makes sense. This is known as postprocessing. Finally, after all of these treatments, we have moved from raw data to usable, actionable information and new insights.
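The KDD stages described in this note can be sketched as three small functions chained together. This is only an illustration under stated assumptions: the function names, the field names, and the four sample records are hypothetical, and the "mining" step is a plain frequency count standing in for a real algorithm.

```python
from collections import Counter

def preprocess(rows):
    """Drop incomplete records and normalize text formatting."""
    return [
        {key: value.strip().lower() for key, value in row.items()}
        for row in rows
        if all(value is not None and value.strip() for value in row.values())
    ]

def mine(rows, field):
    """Stand-in for the data mining step: count the values of one field."""
    return Counter(row[field] for row in rows)

def postprocess(counts):
    """Turn raw counts into readable percentages."""
    total = sum(counts.values())
    return [f"{value}: {n / total:.0%}" for value, n in counts.most_common()]

# Hypothetical input data; the third record is incomplete and gets dropped.
raw = [
    {"status": "Part time ", "uses_library": "Yes"},
    {"status": "part time", "uses_library": "no"},
    {"status": "On campus", "uses_library": None},
    {"status": "on campus", "uses_library": "yes"},
]

report = postprocess(mine(preprocess(raw), "uses_library"))
print(report)
```

Each stage takes the previous stage's output, which mirrors the left-to-right flow on the KDD slide.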
  3. Before I get into the meat and potatoes of data mining, I would like to spend a moment on the importance of data quality. It is of paramount importance not only to collect data that are as complete and accurate as possible, but also to scrutinize the dataset for errors and omissions, typographical errors, and formatting problems. An example would be the value zero in a dataset. This could mean the mathematical concept of zero (numerical), no value entered (null), or even “No” to a yes/no type of question (non-mathematical or nominal). We’ve heard for years the acronym GIGO: garbage in, garbage out. It is vital that we take the time to get rid of the garbage in the preprocessing and formatting stages of our knowledge discovery process. The data I am using for the library-related examples in this presentation come from a user survey conducted last fall of graduate students at the University at Albany’s Downtown Campus.
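One way to keep the three meanings of zero apart during preprocessing is to decode each cell according to the kind of column it sits in. The sketch below assumes a hypothetical coding scheme (blank cells mean no answer, yes/no questions are coded 1/0); the function name and the column kinds are illustrative and do not come from the survey itself.

```python
from typing import Optional, Union

def decode_cell(raw: str, kind: str) -> Optional[Union[int, str]]:
    """Interpret one raw survey cell according to its column kind."""
    text = raw.strip()
    if text == "":
        return None                      # null: nothing was entered
    if kind == "count":
        return int(text)                 # numerical: "0" really means zero
    if kind == "yes_no":
        # nominal: assumed coding where 0 = No and 1 = Yes
        return {"0": "No", "1": "Yes"}[text]
    raise ValueError(f"unknown column kind: {kind}")

print(decode_cell("0", "count"))
print(decode_cell("", "count"))
print(decode_cell("0", "yes_no"))
```

An unexpected code in a yes/no column raises a `KeyError` rather than being silently kept, which surfaces the "garbage" before it reaches the analysis.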
  4. So, moving on to tools for data mining. For my current project with the grad student library user survey, I am using an open-source analytics tool known as Weka. It is free to download at the link above. I will focus on this tool because it is the one with which I am most familiar, but there are other similar tools, including RapidMiner and scikit-learn, and the statistical application R also has many data mining capabilities. A researcher with a reasonably solid background in statistics will find the basic functionality in Weka easy to grasp. I recommend the book Your Statistical Consultant as a reference, as well as Data Mining for the Masses. At the end of this presentation, there will be links to help you locate these books. (Data Mining for the Masses is a free PDF; the other costs money.) Next we will talk about the major types of data mining models and how they can be used. The main types of data mining tasks are prediction, classification, and association. My aim is to show you some possibilities and pique your interest to learn more!
  5. Prediction is exactly what it sounds like: we hope to reliably determine the value or outcome of a variable (known as the target variable), based on the values of other variables in the dataset (known as the explanatory variables). There are several ways to do predictive analysis, but the one I am going to show you today is known as a decision tree algorithm. A decision tree asks a series of questions in a hierarchical format, not unlike a flow chart. Decision trees are easy to interpret. They are resistant to data “noise,” which is a term for outliers, less relevant variables, and so forth. The tricky part with decision trees remains the structure and preprocessing of the data, and there is a risk of “overfitting” your model. Overfitting occurs when the decision tree algorithm achieves a high accuracy rate on your training set but does not work as well on test data or new data. On the next slide, I will show you a decision tree built from a training set to predict the likelihood that a survey respondent answered that they frequently use library resources, based on answers to certain demographic questions.
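To give a feel for how a tree learner decides which question to ask first, here is a sketch of information gain, the entropy-based measure used by classic tree algorithms such as ID3. Weka's J48 uses a related criterion, so treat this as the general idea rather than what Weka computes. The four survey rows and their field names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """How much splitting on `attribute` reduces uncertainty about `target`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder

# Hypothetical training rows: explanatory variables plus the target variable.
rows = [
    {"email_ref": "yes", "residency": "on",  "uses_library": "yes"},
    {"email_ref": "yes", "residency": "off", "uses_library": "yes"},
    {"email_ref": "no",  "residency": "on",  "uses_library": "no"},
    {"email_ref": "no",  "residency": "off", "uses_library": "no"},
]

gains = {a: information_gain(rows, a, "uses_library")
         for a in ("email_ref", "residency")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data the email reference question separates the two outcomes perfectly, so it would be asked first; residency tells us nothing and would not be chosen.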
  6. Ok, so this is what my decision tree looks like. I have a simplified version of a few of the branches on the next slide so you can see how this works.
  7. Here is a piece of the decision tree made prettier (sort of) through Microsoft Shapes. Hopefully it will make more sense to you. What we are looking at are the questions the decision tree asks in order to predict the likelihood that a student at my library uses library resources. The first question it asks is: did the student use email or instant message reference? There is a branch for “YES” and a branch for “NO”. Let’s follow the right side for a minute. If they did NOT use email or IM reference, the next question it asks is about the student’s residency and full time/part time status, and there are 3 options for this variable. Be aware that each of those options has more branches below it in the real tree I showed you on the last slide, so the probabilities are not shown here. Going back up to the top, let’s follow what happens if they answered “YES” to using electronic reference. The next question the tree asks is: did the student attend any library workshops? And we have the values: none, 1-2 sessions, and 3 or more sessions. If the student took one or two sessions, the next criterion that matters is how much time the student took between undergraduate and graduate studies. I should add that the other numbers of sessions also have lower branches; we are simplifying by following the shortest trail of branches. The interesting feature of the time between grad and undergrad is that those who took any sort of break are practically guaranteed to use library resources, provided they received instruction and used electronic reference. Those who did not take a break were less likely to, despite these library interventions. Hmmmm….
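The simplified tree on this slide can be written down as a nested structure and walked one question at a time. In this sketch the key names and answer labels are hypothetical; the percentages are the ones shown on the slide, and branches whose lower levels were omitted from the slide are left as `None` rather than guessed.

```python
# Simplified tree from the slide; None marks branches not shown in full.
TREE = {
    "question": "used_email_im_reference",
    "branches": {
        "yes": {
            "question": "instruction_sessions",
            "branches": {
                "0": None,
                "1-2": {
                    "question": "years_between_degrees",
                    "branches": {"none": 0.45, "1-5": 1.00, "5+": 1.00},
                },
                "3+": None,
            },
        },
        "no": {
            "question": "residency_status",
            "branches": {
                "on campus full time": None,
                "off campus full time": None,
                "part time": None,
            },
        },
    },
}

def predict(tree, answers):
    """Follow the branch matching each answer until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][answers[node["question"]]]
    return node  # share of "yes" at the leaf, or None if not shown

student = {
    "used_email_im_reference": "yes",
    "instruction_sessions": "1-2",
    "years_between_degrees": "none",
}
print(predict(TREE, student))
```

Reading a prediction off the tree is just this walk: answer the top question, take the matching branch, and repeat until there are no more questions.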
  8. Classification is a way of identifying similarities or patterns in a dataset based on comparable variable attributes in each case. There are a number of ways to do this, but I would like to show you clustering. Clustering is a very visual way of determining patterns in your data. Cases with lots of similar values in their variables are grouped closer together; those with different values are grouped farther apart. This means you can inspect the clusters and determine which values the cases in each one share. The shared values give you the pattern of each cluster, which in turn is a way of classifying your data. Unfortunately, my own data did not respond well to clustering, which I will discuss in a minute. For now I will show you a classic clustering example from zoology: predicting animal genus based on physical characteristics of the creature.
  9. I know this is hard to see, but there is a purple cluster, a blue cluster, a brown cluster, a yellow cluster, and a green cluster. Each of these represents a grouping the clustering algorithm determined based on characteristics of the animals. For example, worms and snakes have no legs and lay eggs, whereas seals, porpoises, and dolphins are aquatic mammals. The tricky part of cluster analysis is that, unlike the decision tree, it IS very sensitive to “noise” in your data. Notice that the platypus, which is a mammal, is classified with the turtle type of creatures. As mentioned, in my case clustering the graduate student survey data was not particularly successful. This is because, first, my dataset is probably too small and, second, I may have asked the wrong questions or combinations of questions to generate clusters. One thing I intend to do is go back to the preprocessing stage and see if there are ways to group responses to variables that reduce data “noise” and give us some sort of pattern.
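The idea of grouping cases by how many attribute values they share can be sketched in a few lines. This uses a very simple "leader" style of clustering chosen for brevity, not the algorithm Weka runs, and the trait encoding below is a hypothetical toy, not the dataset behind the slide.

```python
def hamming(a, b):
    """Number of traits on which two cases differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster(cases, max_distance):
    """Put each case in the first cluster whose first member is close enough."""
    clusters = []
    for name, traits in cases.items():
        for group in clusters:
            if hamming(traits, cases[group[0]]) <= max_distance:
                group.append(name)
                break
        else:
            clusters.append([name])  # no cluster was close enough: start a new one
    return clusters

# Assumed traits: (has_legs, lays_eggs, aquatic, produces_milk)
animals = {
    "worm":     (0, 1, 0, 0),
    "snake":    (0, 1, 0, 0),
    "dolphin":  (0, 0, 1, 1),
    "porpoise": (0, 0, 1, 1),
    "platypus": (1, 1, 1, 1),
}

print(cluster(animals, max_distance=1))
print(cluster(animals, max_distance=2))
```

With the stricter threshold the platypus sits alone; loosening it by one pulls the platypus in with the aquatic mammals. Small choices like this threshold, or which traits are recorded, can move an unusual case between groups, which is the sensitivity this note describes.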
  10. Association can be used for classification purposes as well. Association, however, is based on “rules” rather than “clusters”. Association rules are if…then rules that show patterns of association among variables. This allows for complex comparisons and generates some interesting associations. The famous example of association “rules” is the urban myth that, due to what is known as “market basket analysis,” Walmart (or whatever big box store) puts its beer and diapers in the same aisle. So the rule would go: if customers buy diapers, then they are likely to also purchase beer. The myth goes that this is because the young husbands get sent out to buy diapers and pick up some beer for themselves while they are out. Similar techniques are also part of how Netflix and Amazon determine what to recommend to you. Association rules are easy to interpret and describe, and they handle skewed data very well (for example, my survey results were 70% women, 30% men).
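An if…then rule is usually scored with two standard measures: support (how often the items appear together) and rule confidence (how often the "then" part holds when the "if" part does). The sketch below computes both for the diapers-and-beer rule; the four shopping baskets are invented for illustration, and "confidence" here is the rule metric, not a survey variable.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
print(f"if diapers then beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here two of the four baskets contain both items, and two of the three diaper baskets also contain beer, so the rule holds about two times out of three.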
  11. Here is what the association rules look like in Weka. I understand that the variable names and values are not particularly descriptive on this screenshot; you need my survey “codebook” to explain what the variables mean and what each value signifies. This run of my association rules algorithm shows that if students are “somewhat” confident in finding the information they need (confidence=4), they are likely off campus, full time students (residency=2). This is interesting because the survey gave 4 options: extremely confident, very confident, somewhat confident, and not confident at all. Zero respondents indicated that they were not confident at all. So our least confident students are “somewhat confident,” and they are most frequently full time commuters, as opposed to part timers or full time on campus students. Hmmmm…..
  12. There is actually a fourth data mining task, known as anomaly detection. This is the opposite of something like classification or prediction: it identifies the outliers which DON’T fit your model. Practical uses for anomaly detection include detecting credit card fraud (your credit card was just used in Bali, and you are in Rochester) and email spam filtering. I don’t have a good example of this one, because the goal of my survey was to look for patterns and trends, and also because it works best with so-called “Big Data,” but you can see where this is a useful application in a business context.
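A very small illustration of the idea, assuming a simple z-score rule: flag any value that lies more than a chosen number of standard deviations from the mean. The daily checkout counts and the threshold of 2.0 are made up for this sketch; real fraud or spam detection relies on far richer models.

```python
import statistics

def flag_outliers(values, threshold):
    """Return the values whose z-score exceeds the threshold in absolute terms."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily checkout counts with one unusual day.
daily_checkouts = [12, 15, 14, 13, 16, 15, 90]
print(flag_outliers(daily_checkouts, threshold=2.0))
```

Everything that fits the usual pattern is ignored, and only the day that does not fit is reported, which is the reverse of what the prediction and classification examples do.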
  13. So what are some things we can do with data mining techniques to provide better user services, work processes, and administrative decisions? Like me, you could take a user survey and use the data to predict and associate certain characteristics with library resource use (and try to classify!). You could try to determine the likelihood that a book will go missing by considering various circulation statistics as the explanatory variables (times circulated, publication year, call number range, etc.). You could cluster the patterns of library use of subject groupings (call number ranges) by explanatory variables such as counts of interlibrary loans, purchases on demand, circulation of books, and journal article downloads. Or you could determine which academic majors or faculty departments are most associated with the use of various services (reference, borrowing a Kindle, checking out more than 5 books a semester).
  14. I hope this talk has given you some ideas about the possibilities of what we can learn from mining library data for interesting insights, patterns, and information. Some things to consider: all of this interesting analysis is predicated on GOOD DATA, and getting “good” data may be more challenging than the analysis itself. First, privacy concerns may keep us from mining data about our library users; for example, we don’t make circulation data available. But such data in the aggregate can be used to great effect, provided extreme care is taken to protect our users’ identities. Second, we may not currently be collecting the data we need to appropriately tell our stories; we may have to change what information we collect and how we collect it to get the “good stuff.” Do a data inventory of your library! What is missing to help you achieve your strategic goals? And, even if you have data that you’re ready to mine, you may not be ready to do the mining yourself. But now that you know what possibilities exist, why not ask around on campus for help? Computer science and statistics students may welcome the opportunity.