Innovaccer is a California-based research acceleration firm assisting hundreds of researchers from Harvard, Stanford, Wharton, MIT, and other institutions.
These slides describe our capabilities in assisting research endeavors. To get in touch with us, please write to info@innovaccer.com
To learn more, please visit our website: www.innovaccer.com
3. Data Mining
Mining longitudinal and linked datasets from the web and other archives
4. Web Data Mining
Indian Patent Data
The task at hand was to write code to harvest Indian patent data from the Indian Patent Office website on an ongoing basis.
Code was prepared to extract information on all patents filed between a start date and an end date defined by the user.
A data quality tool was then written to clean the harvested pool of data.
A scheduling script was prepared to run the harvesting code followed by the data quality code every week (as the Indian Patent Office publishes patents weekly).
This master scheduler runs every week and updates all fields of new patents in the master database.
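The weekly scheduling logic can be sketched as follows. This is a minimal illustration, assuming a Friday publication day; `harvest` and `clean` are hypothetical stand-ins for the harvesting and data-quality code.

```python
from datetime import date, timedelta

def weekly_window(today):
    """Return (start, end) covering the most recent completed
    Friday-to-Friday publication week (assumed weekly cadence)."""
    offset = (today.weekday() - 4) % 7  # days since the last Friday (weekday 4)
    end = today - timedelta(days=offset)
    start = end - timedelta(days=7)
    return start, end

def run_weekly_job(today, harvest, clean):
    """Run the harvesting code, then the data-quality code,
    over the latest weekly window (both callables are hypothetical)."""
    start, end = weekly_window(today)
    records = harvest(start, end)   # e.g. scrape patents published in [start, end)
    return clean(records)           # data-quality pass over the harvested pool
```

In production, a cron-style scheduler would invoke `run_weekly_job` once a week and append the cleaned records to the master database.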
5. The task at hand was to write code to harvest the latest earnings conference calls of all NASDAQ-listed firms and parse each call at the individual speaker level.
Earnings conference calls were harvested daily from NASDAQ, Thomson StreetEvents, and other sources.
Automation code was written to parse those conference calls at the individual speaker level, capturing metadata such as speaker name, designation, company, and speech text.
A scheduling script was prepared to run the harvesting code followed by the parsing code every day (from NASDAQ).
This master scheduler runs every day and updates all fields of new conference calls in the master database.
Web Data Mining
Earnings Conference Calls
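Speaker-level parsing of this kind can be sketched with a regular expression over an assumed "Name - Title, Company" header format; real transcripts vary by source and need more robust rules.

```python
import re

# Hypothetical header format: "Name - Title, Company"
SPEAKER_RE = re.compile(r'^(?P<name>[A-Z][\w. ]+) - (?P<title>[^,]+), (?P<company>.+)$')

def parse_transcript(lines):
    """Split a raw transcript into speaker turns, each carrying the
    metadata fields (name, title, company) plus the speech text."""
    turns, current = [], None
    for line in lines:
        m = SPEAKER_RE.match(line.strip())
        if m:
            current = {**m.groupdict(), 'text': []}
            turns.append(current)
        elif current is not None:
            current['text'].append(line.strip())
    for t in turns:
        t['text'] = ' '.join(t['text']).strip()
    return turns
```

For example, a two-speaker snippet parses into two turns, each with name, title, company, and the concatenated speech text.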
6. The task at hand was to create a longitudinal data-set containing details of CEO compensation.
DEF 14A proxy filings on SEC EDGAR contain, for each company, details of how and how much the CEO was compensated.
Automation code was written to parse those proxy filings and pull out total compensation as well as its break-up, including basic salary, bonus, EPS, etc.
A team of financial domain analysts manually studied the proxy filings and created dummy variables for the various compensation strategies used in practice to compensate CEOs and key executives.
Finally, the data-set was cleaned, standardized, and sent to the quality team for vetting.
Archive Data Mining
CEO Compensation
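As an illustration of the extraction step, a simplified parser over a flattened Summary Compensation Table row might look like this. The column layout is assumed; real DEF 14A filings are HTML tables and require a full HTML parser.

```python
import re

MONEY = r'\$?([\d,]+)'

def parse_summary_row(row):
    """Extract year, salary, bonus, and total from one (simplified)
    Summary Compensation Table row, assuming the total is the last
    dollar figure on the line."""
    nums = [int(n.replace(',', '')) for n in re.findall(MONEY, row)]
    year, salary, bonus, total = nums[0], nums[1], nums[2], nums[-1]
    return {'year': year, 'salary': salary, 'bonus': bonus, 'total': total}
```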
8. The task at hand was to design, implement, and deploy statistical models to optimize school transportation and logistics costs for real cities based on real data.
A professor was appointed by an NPO and a government to prepare statistical models aimed at city planning for optimizing school costs.
Our statisticians used clustering techniques, optimization techniques, and multi-processing to minimize the total transportation and operational cost of schools for a city.
Various constraints and difficulties were handled, such as bus capacity, total riding time, mixed loading, land and sea routes, different types of buses, classroom sizes, and many more.
A flexible graphical user interface was delivered to the client to view output on maps and spreadsheets, manually change routes, upload input data, etc.
Statistical Modeling - Optimization
Logistics Optimization
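The clustering-with-constraints idea can be sketched as a greedy nearest-school assignment under capacity limits. This is a toy version; the production models also handled riding time, mixed loading, bus types, and land/sea routes.

```python
def assign_students(students, schools, capacity):
    """Assign each student to the nearest school with remaining capacity,
    minimizing straight-line travel distance (greedy sketch)."""
    load = {s: 0 for s in schools}
    assignment = {}
    for sid, (x, y) in students.items():
        # rank schools by squared Euclidean distance to the student
        ranked = sorted(schools, key=lambda s: (schools[s][0] - x) ** 2 + (schools[s][1] - y) ** 2)
        for s in ranked:
            if load[s] < capacity[s]:
                assignment[sid] = s
                load[s] += 1
                break
    return assignment
```

A real optimizer would solve this jointly (e.g. as an integer program) rather than greedily, but the capacity-aware assignment step is the same in spirit.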
9. Using Hidden Markov Models to uncover hidden layers of consumer behavior
Under the hypothesis that consumers pass through sequential hidden phases of behavior towards a product ecosystem over time, our task was to validate that such hidden layers exist and to understand their transitions over time.
Using data on the adoption and rejection of various kinds of products in a product ecosystem, drawn from a panel of more than 10,000 products and 100,000 consumers, we prepared a hidden Markov model with a maximum of 8 states and 20 time periods.
Computation time for this analysis was of exponential order, and each iteration of the optimization had to store gigabytes of data. A Hadoop layer on R was therefore used, running HDFS across five machines, to reduce the time and memory load on any single machine.
Statistical Modeling
Hidden Markov Model
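The core HMM computation can be illustrated with the standard forward algorithm in plain Python. The two-state "trial"/"loyal" labels below are a toy assumption; the actual model had up to 8 states over 20 time periods and ran on a five-machine HDFS cluster.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: likelihood of an observation sequence under an HMM.
    alpha[t][s] = P(obs[:t+1], state at t = s); the answer is the sum of the
    final column."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: emit_p[s][o] * sum(alpha[-1][r] * trans_p[r][s] for r in states)
                      for s in states})
    return sum(alpha[-1].values())
```

A useful sanity check: summed over all possible observation sequences of a fixed length, the forward likelihoods total 1.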
10. Data Analysis – Predictive Analysis
Predicting Hospital Readmission Rates
Predicting and benchmarking readmission rates using readmission data along with performance and causal factors.
With the implementation of the Affordable Care Act (Obamacare), hospitals now have a pressing need to significantly reduce their readmission rates or pay heavy penalties. A comprehensive solution requires collating all readmission data available from various sources and then identifying the causal factors affecting readmission.
AIC and BIC analysis was carried out for each disease to remove variables with insignificant dependence on readmission rates, leaving the key variables for each disease. Multiple linear regression was then run with readmission rate as the response variable and the remaining variables as explanatory variables; the results were consistent with domain intuition.
We further developed a BI tool using the resulting model to predict readmission rates and their dependence on various KPIs.
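The AIC screening step can be illustrated on a single-predictor toy model, with the Gaussian AIC computed from the residual sum of squares. This is a sketch of the variable-selection idea only; the real analysis compared many candidate variables per disease.

```python
import math

def ols_aic(xs, ys):
    """Fit y = a + b*x by least squares and return the Gaussian AIC:
    n*ln(RSS/n) + 2k, with k = 3 parameters (intercept, slope, variance).
    Lower AIC means the variable explains readmission better."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return n * math.log(rss / n + 1e-12) + 2 * 3
```

Comparing the AIC of a model using a genuinely predictive variable against one using noise shows the screening criterion at work: the informative variable yields a markedly lower AIC.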
11. Big Data Analysis
Implementing distributed computing methods and cutting-edge statistical models to analyze very large volumes of data
12. Data Analysis – Contextual Analysis
Sentiment Analysis
Navigating through millions of tweets to understand the sentiments of people
Millions of tweets from various people were extracted using data harvesting techniques.
Using Naïve Bayes analysis and a training set of 100,000 tweets, a supervised classifier was built to decide the sentiment of any tweet. This analysis involved prior removal of stop words and substitution of elongated chat spellings with their standard English counterparts.
Sentiments were then analyzed across various control groups.
Completing this project within a record two-month timeframe would not have been possible without big data technologies.
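A minimal version of such a classifier, including the two cleaning steps described above, can be sketched in pure Python. The stop-word list and training examples here are illustrative; the production system trained on 100,000 labeled tweets.

```python
import math
import re
from collections import Counter

STOP = {'the', 'a', 'is', 'to', 'i'}  # illustrative stop-word list

def tokenize(tweet):
    """Lowercase, drop stop words, and collapse elongated chat spellings
    ('soooo' -> 'so') -- the cleaning steps described above."""
    words = re.findall(r'[a-z]+', tweet.lower())
    return [re.sub(r'(.)\1{2,}', r'\1', w) for w in words if w not in STOP]

class NaiveBayes:
    def fit(self, tweets, labels):
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.priors}
        for t, c in zip(tweets, labels):
            self.counts[c].update(tokenize(t))
        self.vocab = {w for c in self.counts for w in self.counts[c]}
        return self

    def predict(self, tweet):
        def logp(c):
            # log prior + Laplace-smoothed log likelihood of each token
            total = sum(self.counts[c].values())
            return (math.log(self.priors[c]) +
                    sum(math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
                        for w in tokenize(tweet)))
        return max(self.priors, key=logp)
```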
13. Data Analysis – Big Data Analysis
Analyzing high-frequency trading data
Querying high-frequency data from the limit order book for all NYSE-traded securities
The project involved analyzing high-frequency datasets obtained from NYSE OpenBook and OpenBook Ultra.
One of the biggest challenges was the sheer scale of the data: the total size was 30 TB and growing. Even simple analyses involved database queries that took several hours at a time.
We distributed the computation using algorithms based on MapReduce technology and implemented them in Python. This enabled near real-time querying.
This setup was later used for analyzing price variation and its impact on market liquidity. The distributed computing approach saved a great deal of analysis time.
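The map/reduce split can be illustrated in-process on a toy order-book extract, here computing a volume-weighted average price per symbol. In production the same two phases ran as distributed jobs over the 30 TB store; the trade-row format below is assumed.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: emit (symbol, (price*size, size)) pairs from raw trade rows."""
    return [(sym, (price * size, size)) for sym, price, size in chunk]

def reduce_phase(pairs):
    """Shuffle + reduce: aggregate per symbol and divide notional by
    total size to get the volume-weighted average price (VWAP)."""
    groups = defaultdict(lambda: [0.0, 0])
    for sym, (notional, size) in pairs:
        groups[sym][0] += notional
        groups[sym][1] += size
    return {sym: notional / size for sym, (notional, size) in groups.items()}

def vwap(chunks):
    """Run the two phases over independently mappable chunks."""
    return reduce_phase(chain.from_iterable(map_phase(c) for c in chunks))
```

Because each chunk is mapped independently, the map phase parallelizes trivially across workers; only the reduce needs the grouped data.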
14. Data Analysis – Prescriptive Analysis
Personality Analysis
The personalities of key executives of various companies were examined using text transcripts of their public speeches
With access to a parsed database of speeches of key executives of NASDAQ-traded firms, we performed personality analysis of over 30,000 people (including chief executives and analysts) and 10 million words, using psycho-linguistic word databases and SVM techniques.
Each person was given a score from 1 to 8 on each of the Big Five personality traits.
These traits were correlated with the following quarters' conference calls, and patterns were observed between the personalities of chief executives and their effect on financial (especially stock) performance in the following quarter.
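The psycho-linguistic feature extraction can be sketched with a tiny illustrative lexicon. The word-to-trait mapping below is invented for illustration; the real work used established psycho-linguistic databases, and rates of this kind were the features fed to the SVM.

```python
# Toy psycho-linguistic lexicon (illustrative only).
LEXICON = {
    'confident': 'extraversion', 'excited': 'extraversion',
    'worried': 'neuroticism', 'risk': 'neuroticism',
    'plan': 'conscientiousness', 'deliver': 'conscientiousness',
}

def trait_profile(speech):
    """Count lexicon hits per trait and normalize to a rate per 100 words."""
    words = speech.lower().split()
    counts = {}
    for w in words:
        t = LEXICON.get(w.strip('.,'))
        if t:
            counts[t] = counts.get(t, 0) + 1
    return {t: 100 * n / len(words) for t, n in counts.items()}
```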
16. Merging company names across financial and patent data-sets
The task was to merge a US financial data-set containing 15,000 unique standardized company names with a US patent data-set containing 800,000+ unstandardized unique company names.
After thorough fine-tuning of our hybrid model for company name matching, we reduced those 800,000+ unstandardized names to 200,000+ standardized names and mapped all 15,000 company names in the finance data-set to the patent data-set.
The results were spectacular; notably, we found 105 distinct misspellings of IBM, which we grouped under one standardized name.
Data Standardization
Merging Patent and Finance Data
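One rule layer of such a hybrid matching model is deterministic normalization, sketched here. The legal-suffix list and the single-letter fusing rule are illustrative assumptions, not the actual model.

```python
import re

# Illustrative list of legal suffixes to drop before matching.
SUFFIXES = {'inc', 'corp', 'corporation', 'co', 'ltd', 'llc', 'company'}

def standardize(name):
    """Normalize a raw company name: lowercase, strip punctuation,
    drop legal suffixes, and fuse runs of single letters so that
    'I.B.M.' and 'IBM' map to the same key."""
    tokens = re.findall(r'[a-z0-9]+', name.lower())
    tokens = [t for t in tokens if t not in SUFFIXES]
    out, run = [], []
    for t in tokens:
        if len(t) == 1:
            run.append(t)           # collect single letters ('i', 'b', 'm')
        else:
            if run:
                out.append(''.join(run)); run = []
            out.append(t)
    if run:
        out.append(''.join(run))
    return ' '.join(out)
```

Rules like these collapse many spelling variants into one key; fuzzy matching (as on the next slide) then handles the remaining near-misses.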
17. Standardizing names of plaintiffs and defendants across a litigation data-set containing approximately 4.7 million unique names
Used benchmark algorithms based on Levenshtein distance and N-grams to standardize names, implemented in Python
Prepared and optimized hybrid parameter models combining the above algorithms
Implemented the above on a five-node cluster using Hadoop Distributed File System (HDFS) technology
Data Standardization
Standardizing Names of individuals
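Minimal Python versions of the two similarity measures might look like this; the production implementation parallelized these pairwise comparisons across the five-node cluster.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via
    row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ngram_sim(a, b, n=2):
    """Jaccard similarity over character n-grams (default bigrams)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0
```

A hybrid model would combine both scores (e.g. a weighted threshold) to decide whether two names refer to the same party.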
19. Developing a mobile and cloud-based survey application for researchers to conduct, analyze, and report surveys
Technology Implementation
Developing Mobile Applications
Capturing survey data, digitizing the results, and performing analyses used to be a long-drawn, error-prone manual process.
We worked with researchers at the university to develop an Android application for collecting data on the smartphones of on-ground surveyors.
The result: automated skip logic, no manual digitization, audio and video responses, an analysis dashboard, reliable GPS locations, all within a click of a button.
20. Building a platform to assist in collecting, tracking, and analyzing large data-sets
Technology Implementation
Developing Desktop Applications
Developed a login-based offline application on the .NET framework that collects survey data, syncs local data with the cloud, and allows users to analyze data on a continual basis, backed by a reliable and scalable cloud database, Google Cloud SQL
Considering physical constraints such as electricity and internet shortages, our team developed a solution that can be installed at any remote terminal where data collection activities are carried out
All local copies of the data are kept on the local machines until an internet connection becomes available, at which point the data syncs to a cloud database that the client's core team members can access from anywhere, at any time
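The store-and-forward pattern described above can be sketched as a local append-only queue that is flushed when connectivity returns. `upload` and `online` are hypothetical stand-ins for the Cloud SQL sync and the connectivity check; the actual product was built on the .NET framework.

```python
import json
import os

class OfflineStore:
    """Records are appended to a local file and pushed to the cloud
    only when a connection is available; the local queue is cleared
    after a successful sync."""
    def __init__(self, path):
        self.path = path

    def save(self, record):
        with open(self.path, 'a') as f:
            f.write(json.dumps(record) + '\n')

    def sync(self, upload, online):
        """Return the number of records synced (0 if offline or empty)."""
        if not online() or not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            records = [json.loads(line) for line in f]
        upload(records)          # push to the cloud database
        os.remove(self.path)     # clear the local queue after success
        return len(records)
```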
21. Programmed a mail engine to send and monitor distributed email campaigns
Technology Implementation
Developing Email Delivery and Analytics Engine
One of the major challenges faced by the researcher was the lack of a comprehensive solution for disseminating information to a defined target segment. Conventional programs were cluttered, repeatedly hit by delivery failures, and could not provide key insights.
The application was hosted on App Engine with scalable infrastructure to enable real-time monitoring and delivery, while the database was centrally managed on Google Cloud SQL
The application was programmed using distributed loading techniques to send more than 1 million emails per day.
27. Contact
We like challenges.
Keep throwing questions at us and hear from our engagement managers how we can help.
Write to us at info@innovaccer.com
Or have a look at www.innovaccer.com
Or call us at +91 120 431 1139