Assignment-2_CS5710/Assignment-2_CS5710_Logistic-Regression.pdf
Introduction to Machine Learning (CS 4710/5710)
Assignment-2 (80 points)
Logistic Regression
Due by 23rd March (Monday), 11:59 pm
You are allowed to discuss the problem and solution design with others, but the code you submit
must be your own. Your solution must include the certification of authenticity “I certify that the
codes/answers of this assignment are entirely my own work.”
Datasets
The training and test files will follow the same format as the text files in the UCI datasets.
The datasets and their descriptions are uploaded to Blackboard with this assignment. For
each dataset, a training file and a test file are provided. The name of each file indicates what
dataset the file belongs to, and whether the file contains training or test data. Your code should
also work with ANY OTHER training and test files using the same format as the files in the UCI
datasets.
Logistic Regression
You must implement a Python executable file called logistic_regression that uses logistic
regression to fit a function to the given UCI datasets. Your code should work for all three
datasets, which have different numbers of features. Your program should be invoked as
logistic_regression with the following two command-line arguments: <training_file> <test_file>
• <training_file>: The first argument, <training_file>, is the path name of the training file,
where the training data is stored. The path name can specify any file stored on the local
computer.
• <test_file>: The second argument, <test_file>, is the path name of the test file, where the
test data is stored. The path name can specify any file stored on the local computer.
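A minimal sketch of how the two arguments might be read. The helper name parse_args, the usage message, and the file names train.txt and test.txt are hypothetical and are not prescribed by the assignment.

```python
def parse_args(argv):
    """Return (training_file, test_file) from an argv-style list.

    argv[0] is the program name, so exactly two more entries are expected.
    """
    if len(argv) != 3:
        raise SystemExit("usage: logistic_regression <training_file> <test_file>")
    return argv[1], argv[2]


# Example with an explicit argument list; a real program would pass sys.argv.
training_file, test_file = parse_args(["logistic_regression", "train.txt", "test.txt"])
print("training file:", training_file)
print("test file:", test_file)
```

Raising SystemExit on a wrong argument count stops the program with a usage message instead of failing later with an IndexError.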
Training Stage for Logistic Regression
You need to apply the gradient descent method during the training stage of your model. The hypothesis
of the logistic regression is:
Gradient Descent Method:
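For reference, the standard textbook forms of the logistic regression hypothesis and of the batch gradient descent update are shown here. The notation is an assumption and may differ from the course's own: x_0 = 1 is the constant input paired with the intercept 𝜃0, n is the number of dimensions, m is the number of training objects, and α is the learning rate.

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}},
\qquad \theta^{T} x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n

\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
    \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr) \, x_j^{(i)},
\qquad j = 0, 1, \dots, n
```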
At the end of the training stage, your program should print out the values of the weights that you
have estimated. The output of the training phase should be a sequence of lines like this:
𝜃0=%.4f
𝜃1=%.4f
𝜃2=%.4f
...
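The training stage and its printed output could look roughly like the sketch below. It assumes binary class labels 0 and 1, batch gradient descent, and zero-initialised weights. The function names, the learning rate alpha, the iteration count, and the toy data are all hypothetical, and the assignment does not state how labels with more than two classes should be handled.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(features, labels, alpha=0.1, iterations=1000):
    """Batch gradient descent; returns the weights [theta0, theta1, ...]."""
    m = len(features)
    n = len(features[0])
    theta = [0.0] * (n + 1)  # theta[0] is the intercept
    for _ in range(iterations):
        grad = [0.0] * (n + 1)
        for x, y in zip(features, labels):
            z = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))
            err = sigmoid(z) - y
            grad[0] += err  # the intercept's input is the constant 1
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        # Update every weight at once, using the gradient averaged over m objects.
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta


def format_weights(theta):
    """One line per weight, in the format requested for the training output."""
    return ["𝜃%d=%.4f" % (j, t) for j, t in enumerate(theta)]


# Made-up data: one feature, label 1 when the feature is large.
toy_x = [[0.0], [1.0], [2.0], [3.0]]
toy_y = [0, 0, 1, 1]
for line in format_weights(train(toy_x, toy_y)):
    print(line)
```

Features on very different scales may need normalisation or a smaller alpha for the updates to converge; that choice is left open here.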
Test Stage for Logistic Regression
After the training stage, you should apply the function that you have learned on the test data. For
each test object (following the order in which each test object appears in the test file), you should
print a line containing the following info:
• Object ID. This is the line number where that object occurs in the test file.
• Predicted class (the result of the classification).
• True class (the last column on the line where the object occurs).
• Accuracy. This is defined as follows:
o If the predicted class is correct, the accuracy is 1.
o If the predicted class is incorrect, the accuracy is 0.
The output of the test stage should be a sequence of lines like this:
ID=%5d, output=%14.4f, target value = %10.4f, Misclassification error = %4d
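A sketch of how the test-stage lines might be produced with this format string. Several points are assumptions rather than requirements from the assignment: labels are binary (0 or 1), the predicted class is 1 when the estimated probability is at least 0.5, "output" holds the predicted class, and the misclassification error is 0 for a correct prediction and 1 otherwise. The weights and test objects are made up.

```python
def predict(theta, x):
    """Return 1 if the estimated probability is at least 0.5, else 0."""
    z = theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))
    # z >= 0 is equivalent to sigmoid(z) >= 0.5, so no exponential is needed.
    return 1 if z >= 0 else 0


def report_lines(theta, test_features, test_labels):
    """Build one output line per test object, in file order, IDs starting at 1."""
    lines = []
    rows = zip(test_features, test_labels)
    for object_id, (x, target) in enumerate(rows, start=1):
        output = predict(theta, x)
        error = 0 if output == target else 1  # assumed meaning of the last field
        lines.append(
            "ID=%5d, output=%14.4f, target value = %10.4f, "
            "Misclassification error = %4d" % (object_id, output, target, error)
        )
    return lines


# Hypothetical weights and two made-up test objects with one feature each.
for line in report_lines([-1.5, 1.0], [[0.0], [3.0]], [0, 0]):
    print(line)
```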
Submission Guidelines and Requirements
• Please zip up logistic_regression.py (or logistic_regression.ipynb). Submit the zip file via
Blackboard.
• Include your name, UCM ID and certification statement in your solution:
//Your name
//Your UCM ID
//Certificate of Authenticity: “I certify that the codes/answers of this assignment are
entirely my own work.”
Assignment-2_CS5710/UCI_Dataset/Description of UCI Datasets.docx
Description of UCI Datasets
The files in the directory contain training files and test files for
three datasets. Both the training file and the test file are text
files, containing data in tabular format. Each value is a number,
and values are separated by white space. The i-th row and j-th
column contain the value for the j-th dimension of the i-th
object. The only exception is the LAST column, which stores the
class label for each object. Make sure you do not use data from
the last column (i.e., the class labels) as part of the input
vector.
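The file layout described above can be read with a few lines of Python. This is a sketch only: the function names parse_rows and load_dataset are hypothetical, and class labels are kept as floats simply because every value in the file is a number. The two sample rows are taken from pendigits_test.txt.

```python
def parse_rows(lines):
    """Split whitespace-separated rows into (features, labels).

    The last value on each row is the class label and is kept out of the
    input vector.
    """
    features, labels = [], []
    for line in lines:
        values = line.split()
        if not values:  # skip blank lines
            continue
        features.append([float(v) for v in values[:-1]])
        labels.append(float(values[-1]))
    return features, labels


def load_dataset(path):
    """Read a training or test file from the given path."""
    with open(path) as handle:
        return parse_rows(handle)


# Two rows in the described format: 16 dimensions followed by the class label.
sample = [
    "88 92 2 99 16 66 94 37 70 0 0 24 42 65 100 100 8",
    "95 82 71 100 27 77 77 73 100 80 93 42 56 13 0 0 9",
]
features, labels = parse_rows(sample)
print(len(features[0]), labels)
```

Because the number of dimensions is taken from each row rather than hard-coded, the same loader works for any file that follows this format.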
The datasets are copied from the . Here are some details on each
dataset:
· The pendigits dataset. This dataset contains data for pen-based
recognition of handwritten digits.
  · 7494 training objects.
  · 3498 test objects.
  · 16 dimensions.
  · 10 classes.
· The satellite dataset. The full name of this dataset is Statlog
(Landsat Satellite) Data Set, and it contains data for
classification of pixels in satellite images.
  · 4435 training objects.
  · 2000 test objects.
  · 36 dimensions.
  · 6 classes.
· The yeast dataset. This dataset contains some biological data.
  · 1000 training objects.
  · 484 test objects.
  · 8 dimensions.
  · 10 classes.
For each dataset, a training file and a test file are provided. The
name of each file indicates what dataset the file belongs to, and
whether the file contains training or test data.
Note that, for the purposes of your assignments, it does not
matter at all where the data come from. The methods that you
are asked to implement should work on all three datasets, as
well as ANY other datasets following the same format.
Assignment-2_CS5710/UCI_Dataset/pendigits_test.txt
88 92 2 99 16 66 94 37 70 0 0 24 42 65 100 100 8
80 100 18 98 60 66 100 29 42 0 0 23 42 61 56 98 8
0 94 9 57 20 19 7 0 20 36 70 68 100 100 18 92 8
95 82 71 100 27 77 77 73 100 80 93 42 56 13 0 0 9
68 100 6 88 47 75 87 82 85 56 100 29 75 6 0 0 9
70 100 100 97 70 81 45 65 30 49 20 33 0 16 0 0 1
40 100 0 81 15 58 100 57 47 87 50 88 40 42 36 0 4