31. May 2023

Case study

ssuser31398b


- 1. Group 4. Team members: Ravi, Richa, Sabarish, Vijay
- 2. Problem statement: Flight Delay Prediction
- 3. Dataset understanding and description: no. of features = 29; dataset shape = 484551 rows x 28 columns; target variable = ArrDelay
- 4. Dataset understanding and description:
  - Missing values: Org_Airport - 1177, Dest_Airport - 1479
  - Duplicate data: 2 rows are duplicates
  - Duplicate columns: Org_Airport and Dest_Airport repeat information already present in the dataset; Origin is the three-letter code for Org_Airport and Dest is the three-letter code for Dest_Airport
  - Categorical variables: UniqueCarrier, FlightNum, TailNum, Origin, Dest
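The audit above can be sketched in pandas; the toy frame and values here are illustrative stand-ins for the real dataset, but the column names follow the slide:

```python
import pandas as pd

# Toy stand-in for the flight dataset (real data has 484551 rows).
df = pd.DataFrame({
    "Origin": ["JFK", "ORD", "JFK", "JFK"],
    "Org_Airport": ["John F. Kennedy Intl", "O'Hare Intl", None,
                    "John F. Kennedy Intl"],
    "ArrDelay": [10, 5, 10, 10],
})
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
df = df.drop_duplicates()
# Org_Airport duplicates the information in Origin, so it can be dropped:
df = df.drop(columns=["Org_Airport"])
```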
- 5. Outliers: columns having outliers: ArrDelay, DepDelay, TaxiOut, CarrierDelay, SecurityDelay, LateAircraftDelay
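One way to flag those outliers programmatically is the 1.5 x IQR rule, the same criterion a box plot visualizes; this is a sketch on toy data, not the group's exact code:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1-1.5*IQR, Q3+1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

df = pd.DataFrame({"ArrDelay": [5, 7, 6, 8, 5, 240]})  # 240 is an extreme delay
mask = iqr_outliers(df["ArrDelay"])
print(df[mask])  # only the 240-minute row is flagged
```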
- 6. Data visualization techniques used: box plots, heat maps, histograms, line graphs, pie charts, pair plots; the sweetviz library was used for additional visualizations
- 7. Screenshots of different visualizations. With a correlation threshold of 0.9, the following features are correlated: {'AirTime', 'CRSElapsedTime', 'DepDelay', 'Distance'}
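A threshold check like the one above can be done by scanning the upper triangle of the correlation matrix; the column names and synthetic data below are illustrative:

```python
import numpy as np
import pandas as pd

def highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> set:
    """Return columns having |Pearson correlation| > threshold with an earlier column."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is counted once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return {col for col in upper.columns if (upper[col] > threshold).any()}

rng = np.random.default_rng(0)
dist = rng.normal(700, 200, 500)
df = pd.DataFrame({
    "Distance": dist,
    "AirTime": dist / 8 + rng.normal(0, 2, 500),  # nearly proportional to Distance
    "DepDelay": rng.normal(10, 5, 500),           # independent of the others
})
print(highly_correlated(df, threshold=0.9))  # {'AirTime'}
```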
- 8. EDA report: flight_eda_report.html
- 9. Data processing and feature engineering:
  - Imputation: check for null values in the dataset and handle them with different techniques
  - Categorical encoding: target encoding is used because of the high cardinality
  - Handling outliers: box plots to analyse outliers in the data
  - Scaling: Min-Max scaler is used
  - Feature selection: correlation showed how features relate to each other and to the target
  - Feature split: features derived from the date and time columns
  - Dataset post feature engineering: 24 features for modelling
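Mean target encoding, the technique named above for high-cardinality categoricals, maps each category to the mean of the target over its training rows; this is a hedged sketch with illustrative names, not the group's exact implementation:

```python
import pandas as pd

def target_encode(train: pd.DataFrame, col: str, target: str) -> pd.Series:
    """Replace each category with the mean target value of its training rows;
    unseen categories fall back to the global mean."""
    means = train.groupby(col)[target].mean()
    return train[col].map(means).fillna(train[target].mean())

train = pd.DataFrame({
    "Origin": ["JFK", "JFK", "ORD", "ORD", "SFO"],
    "ArrDelay": [10.0, 20.0, 5.0, 15.0, 30.0],
})
train["Origin_enc"] = target_encode(train, "Origin", "ArrDelay")
print(train["Origin_enc"].tolist())  # [15.0, 15.0, 10.0, 10.0, 30.0]
```

In practice the encoding is learned on the training split only and then applied to the test split, to avoid leaking target information.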
- 10. Our research: We analysed the data deeply from a domain perspective, which gave very interesting insights. We researched different ways of handling categorical variables and settled on target encoding. Since deleting features is always an impactful decision, we used both visualization and domain knowledge to guide it. For missing-value handling we tried different approaches before finalizing one. We derived attributes from the given features that we felt would help further analysis.
- 11. Future tasks: more feature engineering; training the model on the selected features; model development; model assessment. Takeaways from the last meet: group dynamics, elements of data, dynamic data
- 12. Elements of Data
- 14. Feature engineering: Logistic Regression with SFS (sequential forward selection)
- 15. Logistic Regression – SFS for feature selection
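SFS with a logistic regression estimator can be sketched with scikit-learn's `SequentialFeatureSelector`; the synthetic data here stands in for the flight-delay features, and the parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic classification data standing in for the engineered features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,   # greedily keep the 3 features that most improve CV score
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask over the 8 features; exactly 3 are True
```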
- 16. Splitting the data: test_size=0.2 gives an 80/20 train/test split. Shapes after splitting: X_train (387639, 24), X_test (96910, 24), y_train (387639,), y_test (96910,)
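The split above can be reproduced with `train_test_split`; placeholder arrays of the same shape (484549 usable rows x 24 engineered features) yield exactly the slide's shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's dimensions (values don't matter here).
X = np.zeros((484549, 24), dtype=np.float32)
y = np.zeros(484549, dtype=np.float32)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (387639, 24) (96910, 24)
```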
- 17. Linear Regression model interpretation: R² measures how much of the variance in the data the model explains; R² = 0.90 means 10% of the variance is not explained by the model, and in the limiting case R² = 1 the model fits perfectly and explains all the variance.
- 18. Y = a + bX, where a = intercept, b = slope (the feature's coefficient), X = feature value
- 19. Ridge Regression model: Ridge regression is a model-tuning method used to analyse data that suffers from multicollinearity; it performs L2 regularization. When multicollinearity occurs, least-squares estimates remain unbiased but their variances are large, which results in predicted values far from the actual values. Mean squared error with Ridge Regression on train data: 0.0033528868720357806; R² on train data: 0.9999999965474273; mean squared error on test data: 0.0027463365607708775; R² on test data: 0.9999999976478994
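The Ridge step can be sketched as follows on synthetic data (the real pipeline used the engineered flight features); `alpha` is the L2 regularization strength, and scikit-learn's default of 1.0 is assumed:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: a known linear signal plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 0.1, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)  # alpha controls the L2 penalty
pred = model.predict(X_test)
print(mean_squared_error(y_test, pred), r2_score(y_test, pred))
```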
- 20. SVC
- 21. Random Forest Regressor
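A minimal Random Forest regressor baseline might look like this; the synthetic data and hyperparameters are illustrative stand-ins, not the group's exact configuration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the 24 engineered features.
X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 100 trees is sklearn's default; deeper/more trees trade runtime for accuracy.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(r2_score(y_test, rf.predict(X_test)))
```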
- 22. Neural Network
- 23. Future recommendations: 1. Reconsider framing the task as regression vs classification; 2. the dataset could include more records with delay = 0; 3. the dataset could include more relevant features based on domain knowledge/experience
- 24. Comparison of Models
- 25. Interpretation: Simple linear regression led to overfitting, giving an unrealistic accuracy of 100%. This overfitting is well addressed by applying regularization to the regression model; we used L2 regularization (Ridge Regression) to overcome the issue. The SVM model is extremely unsuitable for this problem: it takes an unreasonable amount of time (about 3 hours) to run and still gives subpar accuracy; it is computationally expensive and inappropriate for problems with large datasets such as this one. Random Forest also gives good accuracy (98%), and the ANN gives 98%.
- 26. Dynamic data: dynamic data handled using linear regression and Ridge Regression