This slideshow presents a data-driven analysis of NYC shootings. By employing cluster analysis, we uncover hidden patterns within these incidents, providing insights that can aid crime prevention strategies. For more such analyses, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
5. DATA CLEANING & FORMATTING
• First, I created a copy of the given data so that the cleaning process could be performed while keeping the original data unaltered.
• Then I imported the data and checked for the required columns.
• Then I dropped the columns that were not required, along with the columns that contained many blank values.
• Then, in the given dataset, I changed the "NULL" and "UNIDENTIFIED" values to "UNKNOWN" for easy identification.
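The cleaning steps above can be sketched in pandas as follows. The sample frame and its column names are assumptions for illustration only, not the exact schema of the shooting dataset:

```python
import pandas as pd

# Hypothetical sample resembling the dataset; column names are illustrative.
raw = pd.DataFrame({
    "INCIDENT_KEY": [1, 2, 3],
    "BORO": ["BROOKLYN", "QUEENS", "BRONX"],
    "PERP_RACE": ["NULL", "UNIDENTIFIED", "BLACK"],
    "LOCATION_DESC": [None, None, None],  # mostly blank -> candidate to drop
})

# 1) Work on a copy so the original data stays unaltered
df = raw.copy()

# 2) Drop columns that are not required or contain mostly blank values
df = df.drop(columns=["LOCATION_DESC"])

# 3) Normalize "NULL" / "UNIDENTIFIED" to "UNKNOWN" for easy identification
df = df.replace({"NULL": "UNKNOWN", "UNIDENTIFIED": "UNKNOWN"})

print(df["PERP_RACE"].tolist())  # ['UNKNOWN', 'UNKNOWN', 'BLACK']
```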
6. DROPPING COLUMNS
• These are the columns that I dropped.
• These columns were dropped because they contained either too little data or unwanted data.
7. DROPPING COLUMNS
• Dropping columns refers to the process of removing certain columns or variables from a dataset. This is often done during the data preprocessing phase, when some columns are deemed unnecessary or redundant for the analysis or modeling task at hand. Here are some common scenarios where dropping columns might be necessary:
• Irrelevant features: some columns may not contribute relevant information to the analysis or prediction task.
• Highly correlated features: if two or more columns are highly correlated, meaning they contain similar information, dropping one of them can reduce redundancy and multicollinearity in the dataset. This can improve the stability and interpretability of the models.
8. • Missing values: if a column has a high percentage of missing values and imputation isn't feasible or appropriate, dropping the column might be necessary to maintain the integrity of the dataset.
• Data leakage: columns that contain information about the target variable, or are derived from it, should be removed to prevent data leakage, which could artificially inflate the model's performance during training.
• Computational efficiency: large datasets with many columns can be computationally expensive to process and train models on. Dropping irrelevant or redundant columns can help reduce the dimensionality of the dataset and improve computational efficiency.
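Two of the criteria above (high missing percentage, high correlation) can be applied programmatically. This is a minimal sketch on an invented frame; the thresholds (50% missing, |r| > 0.95) and column names are assumptions, not values from the slides:

```python
import numpy as np
import pandas as pd

# Hypothetical frame illustrating the drop criteria; names are invented.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_copy": [1.1, 2.1, 3.1, 4.1],           # highly correlated with "x"
    "mostly_missing": [np.nan, np.nan, np.nan, 5.0],
    "y": [0, 1, 0, 1],
})

# Drop columns with more than 50% missing values
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.5].index)

# Drop one of each pair of highly correlated columns (|r| > 0.95),
# scanning only the upper triangle so each pair is considered once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(df.columns.tolist())  # ['x', 'y']
```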
9. FEATURE ENGINEERING
• Feature engineering focuses on creating new features or modifying existing ones to improve the performance of machine learning models. This process involves selecting, transforming, or combining features to extract useful information and represent the data more effectively.
• Feature engineering techniques include creating polynomial features, binning, discretization, dimensionality reduction (e.g., PCA), feature scaling, and creating interaction terms.
• The goal of feature engineering is to enhance the predictive power of the model by providing it with more informative and discriminative features, ultimately improving its accuracy and generalization ability.
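Two of the techniques named above, binning and feature scaling, can be sketched as follows. The `victim_age` column and the bin edges are illustrative assumptions, not taken from the dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column; name and values are illustrative only.
df = pd.DataFrame({"victim_age": [12, 25, 37, 58, 74]})

# Binning / discretization: turn a continuous age into categories
df["age_group"] = pd.cut(
    df["victim_age"],
    bins=[0, 18, 40, 65, 120],
    labels=["minor", "young_adult", "adult", "senior"],
)

# Feature scaling: standardize to zero mean and unit variance
scaler = StandardScaler()
df["age_scaled"] = scaler.fit_transform(df[["victim_age"]])
```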
11. CREATING DUMMY VARIABLES
• Using the above code, I created dummy variables for certain columns.
• Since these columns play a major role in developing the model, they should not be dropped, but they cannot remain in string format either.
• Thus, dummy variables are created.
12. • Here, in the column "BORO", the values are strings. Since the data has to be in numerical format, dummy variables are created.
• Each string value becomes a column, and the values are then given in 0/1 format based on true or false.
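The encoding described above can be done with `pd.get_dummies`. The borough values are illustrative; the slide's actual code is not shown, so this is a sketch of the technique:

```python
import pandas as pd

# Illustrative values for the "BORO" column from the slide
df = pd.DataFrame({"BORO": ["BROOKLYN", "QUEENS", "BROOKLYN", "BRONX"]})

# One-hot encode: each borough becomes its own 0/1 column
dummies = pd.get_dummies(df["BORO"], prefix="BORO", dtype=int)
df = pd.concat([df.drop(columns=["BORO"]), dummies], axis=1)

print(df.columns.tolist())
# ['BORO_BRONX', 'BORO_BROOKLYN', 'BORO_QUEENS']
```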
13. DATA SUMMARIZATION / DESCRIPTIVE STATISTICS
• Data summarization describes the process of condensing and presenting key characteristics or insights from a dataset. It involves various techniques for summarizing and analyzing data to gain a better understanding of its structure, patterns, and relationships.
• The df.describe() function is commonly used in Python with libraries like pandas to generate descriptive statistics of a DataFrame. It provides summary statistics for the numerical columns in the DataFrame, such as count, mean, standard deviation, minimum, maximum, and quartile values.
• Using df.describe() is a quick way to get an overview of the distribution and central tendency of numerical data in a DataFrame. It helps in understanding the range of values, the presence of outliers, and the overall shape of the data.
14. DATA SUMMARIZATION / DESCRIPTIVE STATISTICS
Here's what each statistic represents:
• count: number of non-null values in each column.
• mean: average value of each column.
• std: standard deviation, a measure of the dispersion of values around the mean.
• min: minimum value in each column.
• 25%: first quartile, or 25th percentile.
• 50%: median, or 50th percentile.
• 75%: third quartile, or 75th percentile.
• max: maximum value in each column.
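A minimal example of df.describe() on a small illustrative column (the values are invented, not from the dataset):

```python
import pandas as pd

# Small illustrative frame; the real DataFrame is much larger.
df = pd.DataFrame({"victim_age": [12, 25, 37, 58, 74]})

summary = df.describe()
print(summary)
# The index holds: count, mean, std, min, 25%, 50%, 75%, max
```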
15. FINDING K VALUE
• Import libraries: "from sklearn.cluster import KMeans" imports the KMeans clustering algorithm from the scikit-learn library, a widely used machine learning library in Python.
• Initialize an empty list: "wcss = []" initializes an empty list called wcss. It will be used to store the within-cluster sum of squares (WCSS) for different values of K.
• Loop over K values: the loop iterates over a range of values for K, from 1 to 10.
• Instantiate the KMeans model: "kmeans = KMeans(n_clusters=k, init='k-means++')" — inside the loop, a KMeans model is instantiated with n_clusters=k, where k is the current value of K.
• Then specify the initialization method for centroids, which is "k-means++". This initialization method helps in choosing initial cluster centroids in a way that speeds up convergence.
16. FINDING K VALUE
• Fit the KMeans model: the KMeans model is fitted to the data using the fit method. The data used for clustering is obtained from the DataFrame df by excluding the first column. This assumes that the first column contains labels or identifiers and the remaining columns are the features used for clustering.
• Compute WCSS: the within-cluster sum of squares (WCSS) is computed. WCSS represents the sum of squared distances of samples to their closest cluster center.
• After computing WCSS for all values of K, a line plot is created.
• The x-axis represents the values of K (from 1 to 10), and the y-axis represents the corresponding WCSS values. The plot visualizes the relationship between K values and WCSS.
• Finally, the plot is displayed.
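The elbow-method procedure described in slides 15–16 can be sketched as follows. Since the preprocessed shooting DataFrame is not shown in the slides, synthetic blob data stands in for it here:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed feature matrix (the real df,
# minus its first identifier column, is not shown in the slides).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this k

# Elbow plot: K on the x-axis, WCSS on the y-axis
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```

The "elbow" is the value of K after which WCSS stops dropping sharply; that K is chosen for the final model.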
24. CONCLUSION
• Thus, a clustering model was developed, and these steps played a significant role in developing it:
• Data preprocessing
• Feature engineering
• Clustering model development
• Visualization