2. Self-Introduction
I am Shihao Mao, and I also go by Jackson. After receiving
my undergraduate degree from the W. P. Carey School of
Business at Arizona State University, I started an
internship at IDG Capital, where I gained experience in
venture capital, mergers and acquisitions, and other
projects. (During this period, the investment projects I
participated in mainly involved Internet companies in the
consumer finance field and merger and acquisition loans
for real estate companies, and I learned theories of
financial leverage.) After that, I participated in the IPO
projects of two electronic chip companies and in an
investment in a freight logistics platform company that is
expected to be listed on the SSE STAR Market (Sci-Tech
Innovation Board) in 2021 (ZTF Securities Co., Ltd.). In
addition, I have some basic understanding of, and
experience in, consumer finance, auto finance (Chery
Automobile mixed-ownership reform), and biodegradable
plastic bag production.
B.S. in Entrepreneurship | W. P. Carey School of Business, Arizona State University
LinkedIn: https://www.linkedin.com/in/%E4%B8%96%E8%B1%AA-%E6%AF%9B-780a6010a/
GitHub Repo Link: https://github.com/Msh-Jackson/NYU_Integrated_Marketing
Kaggle Notebook Link: https://www.kaggle.com/shihaomao
3. Part II: Summary
What I learned from this class is statistical analysis used in marketing,
including sampling techniques and marketing test design and analysis.
What benefited me most was sample size estimation and test evaluation,
and how they connect with modeling/joint analysis, rank correlation,
and so on. This content will be part of my daily work if I join a
marketing department or take a marketing job in the future.
From my personal perspective, my biggest improvement came from using
Kaggle and GitHub to conduct sample size estimation and test
evaluation. In this process, I learned to find data sources. In addition,
I learned the logic of these models.
4. Part III: Your own market research report
• Session 1: Find a new dataset
• Session 2: Reproduce: Capstone Project Milestone 2: Research Design and The Data
• Session 3: Reproduce: Capstone Project Milestone 3: Hypothesis Testing, OR Capstone Project Milestone 4: Regression, Capstone Project Milestone 5: Clustering
5. Session 1: Find a new dataset
Data source:
https://www.kaggle.com/ranja7/vehicle-insurance-customer-data
Context:
The data contain socio-economic details of each customer together with
details about the insured vehicle. The data include both categorical and
numeric variables. The customer lifetime value, based on historical data,
is also provided, which is essential for understanding customer purchase
behavior.
6. Vehicle Insurance Customer Data
VIC: 5106
Total Claims and Vehicle Class
This is customer data together with each customer's vehicle insurance
policy. It provides detailed information about the customers and their
vehicle insurance, which can be used to segment similar customers.
7. Reproduce: Capstone Project Milestone 3 - Hypothesis
URL of GitHub Report:
https://colab.research.google.com/drive/1Y2q-ZWvUuwSDy-N8Q2M0DDPzatRM6rWG#scrollTo=3G7BH6yqxh31
Data source:
https://www.kaggle.com/ranja7/vehicle-insurance-customer-data
8. One Sample T-test
I use the One Sample T-test on “Months Since Last Claim” because I want to check my guess about its mean.
Null Hypothesis: The mean of Months Since Last Claim is 30.
Result: The p-value < 0.05. At the 0.05 significance level, we reject the null hypothesis that
the mean of Months Since Last Claim equals 30.
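A minimal sketch of the one-sample test, written with the Python standard library only. The `months` values are hypothetical stand-ins for the “Months Since Last Claim” column, and the p-value uses a normal approximation that is only reasonable for large samples; in a notebook, `scipy.stats.ttest_1samp` would give the exact Student's t p-value.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, mu0):
    """Return (t statistic, approximate two-sided p-value) for H0: mean == mu0."""
    n = len(sample)
    se = stdev(sample) / math.sqrt(n)  # standard error of the sample mean
    t = (mean(sample) - mu0) / se
    # Assumption: large-sample shortcut, standard normal instead of Student's t.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical values, not taken from the Kaggle dataset.
months = [10, 12, 15, 18, 20, 22, 25, 14]
t_stat, p_value = one_sample_t(months, mu0=30)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```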
9. Two Sample T-test
1. I use the Two Sample T-test to test whether the
mean Income of the two Gender groups is the
same.
2. Null Hypothesis: The mean Income of the two
Gender groups is equal.
3. Result: The p-value > 0.05, so the difference is not
significant. We cannot reject the null hypothesis
that the mean Income of the two groups is equal.
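The two-group comparison can be sketched as below, assuming Welch's version of the test (no equal-variance assumption). The income lists are hypothetical, and the p-value again uses a large-sample normal approximation; `scipy.stats.ttest_ind(a, b, equal_var=False)` is the usual library call.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Return (t, Welch-Satterthwaite df, approximate two-sided p) for H0: equal means."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    # Assumption: normal approximation to the t distribution (large samples).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p


# Hypothetical incomes (in thousands) for the two Gender groups.
income_f = [48, 52, 50, 47, 54]
income_m = [49, 51, 50, 48, 52]
t_stat, df, p_value = welch_t(income_f, income_m)
print(f"t = {t_stat:.3f}, df = {df:.1f}, p = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

A p-value above 0.05 means the data do not contradict equal means; it does not prove that the means are equal.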
10. One-Way ANOVA
• I use the One-Way ANOVA to test whether the
mean Income of people with different
Education levels is equal.
• Null Hypothesis: The mean Income of
people with different Education levels is equal.
• Result: The p-value < 0.05. At the 0.05
significance level, we reject the null
hypothesis that the mean Income of people
with different Education levels is equal.
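As an illustration of what the ANOVA computes, the sketch below derives the F statistic by hand from between-group and within-group sums of squares. The three income groups are hypothetical; the p-value comes from the F distribution with (k - 1, n - k) degrees of freedom, which `scipy.stats.f_oneway` reports directly.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical incomes (in thousands) for three Education levels.
high_school = [30, 32, 31]
bachelor = [40, 42, 41]
master = [50, 52, 51]
f_stat = one_way_anova_f([high_school, bachelor, master])
print(f"F = {f_stat:.1f}")
```

A large F means the group means differ by much more than the spread inside each group would explain.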
11. Power Analysis and Final Remarks
• We want to find the appropriate sample size for the research, so we set Cohen's d,
power, and alpha to run the power analysis.
• Result: For a Cohen's d effect size of 0.1, a power of 0.8, and a significance level (alpha) of 0.05, we
need a sample size of 1571.
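A rough cross-check of the sample size, assuming a two-sided, two-group comparison and the standard normal approximation n = 2 * ((z_alpha + z_power) / d)^2 per group. The function name `n_per_group` is only illustrative; solvers based on the t distribution, such as the one in statsmodels, come out slightly higher.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


n = n_per_group(d=0.1)
print(n)
```

With d = 0.1 the approximation lands within a couple of units of the 1571 reported above. Because n scales with 1 / d^2, halving the effect size roughly quadruples the required sample.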
12. Conclusion
For the One Sample T-test, my null hypothesis was that the mean number of months since
the last claim is 30. The p-value is less than 0.05, so I rejected the null hypothesis.
For the Two Sample T-test, I compared the mean income of customers of different
genders. The p-value is greater than 0.05, so I cannot reject the null hypothesis that the
mean incomes are equal. For the One-Way ANOVA, I grouped income by education level.
The p-value is less than 0.05, so I conclude that mean income differs across education
levels.
13. Part VI: Appendix (include your revised and polished version of previous submissions)
• Capstone Project Milestone 2: Research Design and The Data
• Capstone Project Milestone 3: Hypothesis Testing
• Capstone Project Milestone 4: Regression
• Capstone Project Milestone 5: Clustering
14. Capstone Project Milestone 2: Research Design and The Data
Abstract:
In essence, UMA allows counterparties to digitize and
automate any real-world financial derivative (such
as futures, contracts for difference (CFDs), or total
return swaps). It can also create self-fulfilling
derivative contracts based on digital assets, just like
other cryptocurrencies. Traditional financial
markets have high barriers to entry in the form of
regulations and regulatory requirements, which
often prevent individuals from participating.
Prospective traders and investors often find it difficult to
participate in markets outside their local financial
system. This prevents the emergence of truly
inclusive global financial markets. The purpose of
analyzing the highest transaction value is to
maximize customers' gains in short-term
trading and to predict those gains as reliably as
possible, so that customers can trade at the peak.
15. Page 1 Executive Summary
Bank Marketing Data
• https://data.world/data-society/bank-marketing-data
Data source for value
• https://stats.oecd.org/index.aspx?queryid=33940#
URL of My Github repo
• https://github.com/MshJackson/NYU_Integrated_Marketing/blob/main/%E2%80%9CJackson%E2%80%9D%E7%9A%84%E5%89%AF%E6%9C%AC.ipynb
Capstone Project Milestone 3: Hypothesis Testing
16. Summary
Our data come from these two websites: one provides EU G20
market data, and the other covers the direct marketing
activities of a Portuguese banking institution.
The market data also let us compare two or more
quarters to clarify the quarterly trend.
17. Paired T-Test
Conclusion:
The data show that the 2018 Q2 and 2019 Q2
figures differ. The epidemic still affected GDP,
and because the difference is negative, GDP
increased in the second year.
• Why did you choose this test?
The same countries are measured twice, once in each quarter.
• What is the null hypothesis?
2018 Q2 is lower than 2019 Q2.
18. Conclusion: Because the normality check returns False, I choose corr(method="pearson").
• Why did you choose this test?
The data are normally distributed with no outliers, so we choose Pearson.
• What is the null hypothesis?
2018 Q2 and 2019 Q2 are not correlated.
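A small sketch of the Pearson coefficient and the t statistic used to test the null hypothesis of no correlation. The two lists are hypothetical quarterly values, not the OECD figures; pandas `Series.corr(method="pearson")` or `scipy.stats.pearsonr` would normally do this in a notebook.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_t(r, n):
    """t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical growth rates for the same five countries in two quarters.
q2_2018 = [0.4, 0.6, 0.3, 0.8, 0.5]
q2_2019 = [0.5, 0.7, 0.2, 0.9, 0.6]
r = pearson_r(q2_2018, q2_2019)
t_stat = correlation_t(r, len(q2_2018))
print(f"r = {r:.3f}, t = {t_stat:.2f}")
```

Pearson measures linear association and is sensitive to outliers, which is why the normality and outlier checks matter before choosing it over a rank-based method such as Spearman.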
20. Conclusion: For the degree of freedom of 0.783, effect size of 0.3, a power of 0.80, and a type I error of 0.05,
we need a sample size of 81.
21. Conclusion: For a Cohen's d effect size of 0.27, a power of 0.80, and a type I error of 0.05, we need a sample
size of 1571 (for each group).
22. Final Remark
Limitation and Improvement:
The paired test identifies and describes differences between two
means, and it has functional limitations. It cannot compare more than
two quarters, so it misses short-term fluctuations and the overlap
between repeated measurements. I therefore believe the paired test
does not cover the full range of intervals, and short-term growth
trends cannot be analyzed completely with it.
24. Executive Summary
The data come from Kaggle.
• URL: https://www.kaggle.com/c/customer-churn-prediction-2020
In this report I use a Logit model to analyze the probability of
success.
In our analysis, X includes the number of voice mail messages,
total day calls, and total night calls.
P is the probability of success.
The results show that precision is high across the whole test
and that total day calls has the greatest effect on churn.
25. Logit Regression Result
Of the three X variables, only the number of voice mail messages has a p-value below 0.05
(0.000, significant). The other two (day calls and night calls) have no significant influence on
churn, so we cannot reject the null hypothesis for them.
26. Odds ratio
Total day calls has the greatest influence:
each additional day call multiplies the
odds of churn by 1.000907.
The picture shows:
• Total day calls
• Number of voice mail messages
• Total night calls
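To show how an odds ratio relates to a logit coefficient, the sketch below converts between the two and applies the ratio to a baseline probability. The coefficient is back-computed from the reported odds ratio of 1.000907; the 14% baseline churn rate is a made-up number for illustration.

```python
import math


def odds_ratio(coef):
    """Odds ratio for a one-unit increase in a logit predictor."""
    return math.exp(coef)


def prob_after_increase(p, coef, delta=1):
    """Probability after raising the predictor by `delta`, starting from probability p."""
    odds = p / (1 - p) * math.exp(coef * delta)
    return odds / (1 + odds)


# Coefficient implied by the odds ratio reported for total day calls.
coef_day_calls = math.log(1.000907)
print(f"odds ratio = {odds_ratio(coef_day_calls):.6f}")

# Hypothetical baseline: 14% churn probability, then 10 more day calls.
baseline = 0.14
p_new = prob_after_increase(baseline, coef_day_calls, delta=10)
print(f"churn probability: {baseline:.4f} -> {p_new:.4f}")
```

An odds ratio multiplies the odds, not the probability, so a value this close to 1 changes the predicted churn probability only slightly per call.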
27. Evaluate the result
The accuracy rate is 0.87. The precision is high
based on the test result.
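Accuracy and precision are different metrics, and the sketch below computes both (plus recall) from true and predicted labels. The label lists are hypothetical, with 1 standing for churn; `sklearn.metrics.classification_report` produces the same figures in a notebook.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of predicted churners who really churned.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real churners the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels: 3 of 10 customers churned.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When churners are a minority, a model can reach high accuracy while missing most of them, so precision and recall are worth reporting alongside the accuracy rate.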
29. Executive Summary
• Kaggle Notebook URL:
https://www.kaggle.com/shihaomao/customer-segmentation-sm9555
• Data Set: https://www.kaggle.com/sunshineluyaozhang/customer-segementation-lz2520
• The country I chose is Germany. I build an RFM clustering and choose
the best set of customers for the company to target. The metric I use
is Calinski_Harabasz; based on the result, the best K is 4. For
“Cluster_Id” I select 2, and for “Cluster_Labels” the selection is
also 2.
31. K-Means Clustering
By the RFM criteria, we should choose the customer clusters with a lower recency, a higher frequency and amount. From
the K-means clustering results, we can see that customers with Cluster Labels=2 best fit the criteria.
Lowest recency is 2. Highest frequency is 2. Highest amount is 2.
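The Calinski-Harabasz metric used to pick K can be sketched by hand as the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The points and labels below are toy values, not the RFM table; `sklearn.metrics.calinski_harabasz_score` is the library version.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: higher means tighter, better separated clusters."""
    n = len(points)
    dims = range(len(points[0]))
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in dims]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    # Dispersion of cluster centroids around the overall centroid.
    between = sum(len(rows) * sq_dist(centroid(rows), overall)
                  for rows in clusters.values())
    # Dispersion of points around their own cluster centroid.
    within = sum(sq_dist(p, centroid(rows))
                 for rows in clusters.values() for p in rows)
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two obvious groups along the first axis.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(points, [0, 0, 1, 1])
bad = calinski_harabasz(points, [0, 1, 0, 1])
print(good, bad)
```

Running K-means for several values of K and keeping the K with the highest index is the selection rule the metric supports.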
34. Visualize Cluster Id vs Frequency
By the RFM criteria, we should choose the customer clusters with a lower recency, a higher frequency and amount. From
the K-means clustering results, we can see that customers with Cluster_Labels=1 best fit the criteria.
Recency (Low) Frequency (High) Amount (High)
We can see that Hierarchical Clustering returns 2 target customers
for customer cluster 2, which is a much smaller group than the
one that K-Means Clustering returns.