Assignment Content
About
This assignment must be completed in a group of 3 or 4 students.
This assignment is a prelude to the third assignment. It aims to
give you an authentic experience of carrying out a simple data
science project that covers all essential stages of the data
science lifecycle. Since most professional data science projects
are performed by teams, you are required to complete this
assignment in a team.
Tasks
In short, you are required to complete the following tasks:
1. Pitch a public, open dataset of your choice.
2. Pitch 3 or 4 initial hypotheses to be pursued later in
Assignment 3.
3. Profile the data using descriptive and/or inferential statistics
techniques (which also requires that you demonstrate proficient
data wrangling skills).
4. Present items 1, 2, and 3 above via a recorded presentation.
Your tasks are open-ended, similar to most real data science
projects. This means no two teams are likely to go in the same
direction or produce similar results. You will find that your
group becomes the expert in interpreting your own data and
answering your own problems. Comparing performance across teams
may not be meaningful, so your team will be assessed solely
against the rubric.
Your Python code base must be available in your GitHub repo.
The extent of the group's collaboration and each member's
individual contribution will be evaluated based solely on GitHub.
General advice:
Select open (publicly available) data: data that can be freely
downloaded, preferably with an open license that allows you to
share the data freely. Choosing non-public data is not advisable
as your instructor may be restricted from accessing the data.
Choose data from a domain in which at least one team member has
some background.
Formulate open-ended hypotheses.
Work with fresh data and/or carry out a fresh analysis.
Where possible, choose a dataset and formulate problems
pertaining to practical Australian contexts.
Data
Choose only 1 dataset.
It is fine to choose a dataset that has been analysed by others
outside of the university. This is a natural consequence of
selecting open data. However, you should either show that the
analysis and exploration you plan have not been done before, or
show that no code is already available to do the analysis you
intend. Your instructor is likely to regard any original
investigation highly.
Sources of open datasets include but are not limited to:
https://data.gov.au/
https://data.nt.gov.au/
https://data.worldbank.org/
https://www.data.gov/
https://datasetsearch.research.google.com/
https://www.kaggle.com/datasets
- Be careful. Many Kaggle datasets have published analyses.
Choose something that has not been done before.
GitHub Classroom
Group work activities must be visible on GitHub Classroom.
The instructor will send an invitation to all students to join
GitHub Classroom after all groups are formed. To accept this
invitation, every student must have a free GitHub account. If
you do not already have one, please sign up. This is compulsory.
Marking
You should refer to the detailed marking rubric that appears on
the side panel of this window.
Submission item
Please submit the following via Learnline no later than 11:59 pm
on the due date:
One URL per team to a recorded presentation published privately
on YouTube. Do not submit multiple recordings, and do not submit
a recording file unless specifically requested.
(Optional) Supplementary information, where applicable.
The latest Python code base in your GitHub repo must be
accessible to your instructor. A snapshot of the repo will be
taken at the time of submission.
The duration of the presentation is commensurate with the team
size. In line with the Unit Information, 2 to 3 minutes of
presentation per team member is required. Not complying with
this requirement may attract a mark penalty.
Example:
For a team of 3: the minimum duration is 6 minutes (2 minutes
x 3 members) and the maximum duration is 9 minutes (3
minutes x 3 members).
For a team of 4: the minimum duration is 8 minutes (2 minutes
x 4 members) and the maximum duration is 12 minutes (3
minutes x 4 members).
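The same arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function name presentation_bounds and its parameter names are made up for this example, and the 2 and 3 minute limits are the per-member figures quoted above.

```python
def presentation_bounds(team_size, min_per_member=2, max_per_member=3):
    """Return (minimum, maximum) presentation length in minutes for a team."""
    # Teams in this assignment have 3 or 4 members.
    if team_size not in (3, 4):
        raise ValueError("teams must have 3 or 4 members")
    return team_size * min_per_member, team_size * max_per_member


for size in (3, 4):
    low, high = presentation_bounds(size)
    print(f"Team of {size}: {low} to {high} minutes")
```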
Academic integrity and assessment irregularities
Academic integrity is a core value at CDU and must be upheld
at all times when completing this assignment. You must not
plagiarise the work of others. Please refer to the Students -
Breach of Academic Integrity Procedures.
Other assessment irregularities are governed by CDU's Higher
Education Assessment Procedures.
Tips and example
Broadly speaking, your instructor is looking for evidence of
your demonstrated competency in the following key data science
skills implemented in Python: (1) hypothesis formulation,
(2) exploratory data analysis, (3) data wrangling, and
(4) data visualisation.
When pitching your dataset, consider addressing the following
concerns:
source of data
accessibility of data
validity of data
why the dataset matters (in practical or academic terms)
domain knowledge
relevance to you
etc.
In profiling the data, consider addressing the following
concerns:
dimensionality
data types
centrality
spread
shape of data
distributions
etc.
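As a rough illustration of the profiling concerns listed above, the sketch below uses only the Python standard library. Everything in it is an assumption made for the example: the toy records, the column names region and rainfall_mm, and the helper profile_numeric are hypothetical, and a real project would more likely load its chosen dataset with a data analysis library.

```python
import statistics
from collections import Counter

# Hypothetical toy records standing in for a real open dataset.
rows = [
    {"region": "north", "rainfall_mm": 2.0},
    {"region": "north", "rainfall_mm": 4.0},
    {"region": "south", "rainfall_mm": 4.0},
    {"region": "south", "rainfall_mm": 4.0},
    {"region": "east", "rainfall_mm": 5.0},
    {"region": "east", "rainfall_mm": 5.0},
    {"region": "west", "rainfall_mm": 7.0},
    {"region": "west", "rainfall_mm": 9.0},
]


def profile_numeric(values):
    """Summarise centrality, spread, and shape for a list of numbers."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Population skewness: mean of cubed z-scores (0 for symmetric data).
    if stdev > 0:
        skew = sum(((v - mean) / stdev) ** 3 for v in values) / len(values)
    else:
        skew = 0.0
    return {
        "count": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        "iqr": q3 - q1,
        "min": min(values),
        "max": max(values),
        "skewness": skew,
    }


# Dimensionality: number of records and number of columns.
n_rows, n_cols = len(rows), len(rows[0])

# Data types: inferred here from the first record only.
dtypes = {column: type(value).__name__ for column, value in rows[0].items()}

# Distribution of a categorical column as frequency counts.
region_counts = Counter(row["region"] for row in rows)

summary = profile_numeric([row["rainfall_mm"] for row in rows])

print(f"shape: {n_rows} rows x {n_cols} columns")
print("dtypes:", dtypes)
print("region counts:", dict(region_counts))
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A positive skewness value suggests a longer right tail, which is worth confirming with a histogram before drawing any conclusion.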
The last task is to pitch 3 or 4 initial hypotheses. Consider
addressing the following concerns:
what the data might tell us
what you would like to explore first based on your initial data
profiling
what you would like to predict
which existing assumption or previous finding you want to test
what new idea you want to test
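One way to make an initial hypothesis concrete is to phrase it as a comparison that the data can support or contradict. The sketch below is illustrative only and is not a method prescribed by this assignment: it uses a simple permutation test on a difference in group means, and the function name permutation_p_value, the group names, and the numbers are all hypothetical.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups (not real data).
coastal = [1, 2, 3, 4, 5]
inland = [11, 12, 13, 14, 15]

p_value = permutation_p_value(coastal, inland)
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value would suggest the two groups differ by more than random relabelling explains, which makes the hypothesis worth pursuing in Assignment 3.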