Meta Data and Quality of Data for OGD Platform India
1. Open Government Data Platform India
(https://data.gov.in)
Meta Data and Quality of Data
By: Sunil Babbar, Scientist-C, NIC
2. Data Contributors and Their Role
• Nominated by Chief Data Officer
• Coordinate and Identify datasets which can be
contributed
• Preparing the datasets
– Getting them cleaned
– Metadata preparation for datasets in the predefined format
– Ensuring the quality and correctness of the datasets of his/her
unit/division.
• Contributing Catalogs/Resources (Datasets) through the predefined
workflow
(Data Contributor → Chief Data Officer (CDO) for review and publish → PMU to publish on OGD Platform)
3. Resources (Datasets / Apps)
A data set (or dataset) is a collection of data
A data set corresponds to the contents of a single table or
statistical data matrix, where
every column represents a particular variable, and
each row corresponds to a given member of the data set in
question
Open Data Formats:
CSV
XLS
ODF
XML/RDF
JSON
RSS/Atom
KML/GML
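As a minimal sketch of the "rows are members, columns are variables" idea, the snippet below reads a small table from CSV and emits the same rows as JSON, two of the open formats listed above. The sample values and the column names (`state`, `year`, `rainfall_mm`) are hypothetical and are not taken from any OGD dataset.

```python
import csv
import io
import json

# Hypothetical sample data: each row is one member of the dataset,
# each column is one variable.
CSV_TEXT = """state,year,rainfall_mm
Kerala,2010,3000.5
Goa,2010,2900.0
"""


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text into a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(csv_to_json(CSV_TEXT))
```

Note that `csv.DictReader` returns every value as a string; typing the columns is a separate step.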
4. Catalog
A catalog is a grouping of similar resources (Datasets/Apps)
A catalog represents a collection of resources that you
group together
Acts like a directory of information about resources
Benefits of Catalog
To facilitate data access by users who are first interested in a
particular kind of data
A catalog helps in grouping resources with the same theme/subject and
thus helps the user search for a specific dataset/resource easily
Ministries/Departments need less effort to upload the same set of resources
or update a dataset for a new period, without writing the metadata
again and again
To make navigation and searching for relevant data easier for
users.
5. Catalog Formation
Catalog with the same resource for different time periods
(Annual, Quarterly, Monthly, Weekly and Daily)
E.g. Annual Rainfall Data
Catalog with the same resource but with different jurisdictions
(India, States, Districts, Block, Village)
E.g. States/UTs-wise Forest and Tree Cover
Catalog with the same resource but different categories
(Scheduled Caste, Scheduled Tribe, General, Religion etc.)
E.g. District-wise crimes committed against Scheduled Castes
Catalog with similar types of resources under the same report
(Resources of similar nature) from the same report/survey
can be grouped under the same catalog
E.g. Primary Census Abstract 2011 - India and States
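One way to picture catalog formation is as a mapping from a catalog title to the resources grouped under it. The sketch below is illustrative only: the resource titles are hypothetical, and `group_by_catalog` is not a function of the OGD Platform.

```python
from collections import defaultdict

# Hypothetical (catalog title, resource title) pairs.
RESOURCES = [
    ("Annual Rainfall Data", "Annual Rainfall Data, 2010"),
    ("Annual Rainfall Data", "Annual Rainfall Data, 2011"),
    ("States/UTs-wise Forest and Tree Cover", "Forest and Tree Cover, Kerala"),
]


def group_by_catalog(resources):
    """Group resource titles under the catalog they belong to."""
    catalogs = defaultdict(list)
    for catalog_title, resource_title in resources:
        catalogs[catalog_title].append(resource_title)
    return dict(catalogs)


if __name__ == "__main__":
    for catalog, items in group_by_catalog(RESOURCES).items():
        print(catalog, "->", items)
```

The first catalog groups by time period and the second by jurisdiction; in both cases the catalog metadata is written once and reused for every resource.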
6. MetaData
• Is the information that describes the data
– What is that data (About Data)
– Data source
– Who Created
– When created
– Etc.
• Metadata allows the data to be traced to
its origin and its quality to be known
7. Metadata Elements for Catalogs
Title (Required): A unique name for the catalog (group of resources)
Should contain general terms which describe the essential
properties/characteristics of the datasets/resources
Should be in plain English and include sufficient detail to facilitate search and
discovery
The time period should not normally be mentioned in the catalog title, so that similar
resources containing the same type of data for the next time period/periodic update can
be accommodated in the same catalog
However, in exceptional cases it can contain the time period, particularly for periodic
surveys/censuses which contain a huge number of datasets/resources belonging to the
same period/year
E.g. Current Population Survey, Consumer Price Index, Variety-wise Daily Market Prices
Data, State-wise Construction of Deep Tubewells over the years, etc.
Description (Required): Provide a detailed description of the catalog
An abstract stating the nature and purpose of the catalog
Contains the names of the variables which are available in the datasets
Can also contain the definitions of some variables
8. Metadata Elements for Catalogs
Keywords (Required): A list of terms, separated by commas,
describing and indicating the content of the catalog. Example:
rainfall, weather, monthly statistics.
Help users discover your dataset; please include terms that would be used by
technical and non-technical users.
Group Name: This is an optional field to provide a Group Name to
multiple catalogs in order to show that they may be presented as
a group or a set.
Sector & Sub Sector (Required): Choose the
sector(s)/subsector(s) that most closely apply to your
catalog.
Asset Jurisdiction (Required): This is a required field to identify
the exact location or area to which the catalog and
resources (datasets/apps) cater, viz. entire country,
state/province, district, city, etc.
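The required and optional catalog fields listed in these two slides can be sketched as a small record type with a completeness check. This is only an illustration: the attribute names and the sample values are assumptions, not the OGD Platform's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CatalogMetadata:
    # Attribute names are hypothetical, chosen to mirror the slide labels.
    title: str
    description: str
    keywords: List[str]
    sectors: List[str]
    asset_jurisdiction: str
    group_name: str = ""  # optional field

    def missing_required(self) -> List[str]:
        """Return the names of required fields that are empty."""
        required = {
            "title": self.title,
            "description": self.description,
            "keywords": self.keywords,
            "sectors": self.sectors,
            "asset_jurisdiction": self.asset_jurisdiction,
        }
        missing = []
        for name, value in required.items():
            if isinstance(value, str):
                empty = not value.strip()
            else:
                empty = len(value) == 0
            if empty:
                missing.append(name)
        return missing


if __name__ == "__main__":
    draft = CatalogMetadata(
        title="Annual Rainfall Data",
        description="Rainfall recorded per state and year.",
        keywords=[],  # not yet filled in
        sectors=["Water Resources"],  # hypothetical sector name
        asset_jurisdiction="India",
    )
    print(draft.missing_required())
```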
9. Example - Creation of catalog
Catalog Title:
Company Master Data 2015
(Incorrect - Contains a time frame, so in future if we want to add data under this
catalog, e.g. Company master data for 2016, it would not be possible to upload
data under this catalog)
Company Master Data (Correct)
Catalog Description:
Get data of Company master data..??
(Incorrect - Does not contain detailed information. Description should contain the names
of variables which are available in the datasets)
Get data on master details of any company registered with Registrar of Companies (RoC).
Data contains various information like Corporate Identification Number (CIN), Company Name,
Company Status, Company Class, Company Category, Authorized Capital in INR, Paid-up
Capital in INR, Date of Registration, Registered State, Registrar of Companies, Principal
Business Activity, Registered Office Address and Sub Category. (Correct)
Keywords:
Company Master Data,….??
(Incorrect - Keywords are a list of terms describing and indicating the content of the catalog; all the
possible search keywords should be included)
Registered Companies, Company Master Data, Company Data, Indian Companies,
Company, Company Details, Corporate Identification Number, CIN, Company Address
(Correct)
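The title rule in this example can be approximated with a simple check that flags a four-digit year in a catalog title. The function name and the regular expression are assumptions made for this sketch; since periodic surveys and censuses are allowed to carry a time period, a match is better treated as a warning for the reviewer than as a hard error.

```python
import re

# Matches a standalone four-digit year from 1900 to 2099,
# even when it is glued to letters (e.g. "Data2015").
_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def title_mentions_year(title: str) -> bool:
    """Return True if the catalog title appears to contain a year."""
    return _YEAR.search(title) is not None


if __name__ == "__main__":
    for title in ("Company Master Data 2015", "Company Master Data"):
        print(title, "->", title_mentions_year(title))
```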
11. Metadata Elements for Resources
If Resource Category is Dataset
Granularity of Data: The time interval over which the
data inside the dataset is collected/updated on a regular basis (one-time,
annual, hourly, etc.)
Frequency (Required): The time interval over which the
dataset is published on the OGD Platform on a regular basis (one-time,
annual, hourly, etc.).
Access Type: The type of access, viz. Open, Priced, Registered
Access or Restricted Access (G2G).
If Resource Category is App
App Type (Required): The type of App being contributed, viz.
Web App, Web Service, Mobile App, Web Map Service, RSS, APIs etc.
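Because the required fields depend on the Resource Category, a validation step could branch on it. The sketch below assumes the metadata arrives as a plain dictionary; the key names (`resource_category`, `frequency`, `app_type`, `access_type`) are hypothetical, and only the access types named on this slide are accepted.

```python
ACCESS_TYPES = {"Open", "Priced", "Registered Access", "Restricted Access (G2G)"}


def resource_metadata_problems(meta: dict) -> list:
    """Return a list of human-readable problems found in resource metadata."""
    problems = []
    category = meta.get("resource_category")
    if category == "Dataset":
        # Frequency is marked as required for datasets.
        if not meta.get("frequency"):
            problems.append("frequency is required for a Dataset")
    elif category == "App":
        # App Type is marked as required for apps.
        if not meta.get("app_type"):
            problems.append("app_type is required for an App")
    else:
        problems.append("resource_category must be 'Dataset' or 'App'")
    access = meta.get("access_type")
    if access is not None and access not in ACCESS_TYPES:
        problems.append(f"unknown access_type: {access}")
    return problems


if __name__ == "__main__":
    print(resource_metadata_problems({"resource_category": "Dataset"}))
```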
12. Metadata Elements for Resources
Date Released: The release date of the Dataset/App.
Note: Any further information the contributor/Chief Data
Officer wishes to provide to the data consumer about the resource
The resource note should contain proper explanations of any special
characters/notations like *, #, NA etc. which are used in the datasets
Other relevant information regarding the dataset should also be provided in the note
section.
Information regarding the figures in the data should also be provided, e.g. Figures are in
numbers, Unit: (Rs./qtl.)
Footnotes available in a report should be part of the Resource Note
NDSAP Policy Compliance: This field indicates whether the dataset is in
conformity with the National Data Sharing and Access Policy of the Govt. of
India.
13. Example - Creation of Resource
Resource Title:
Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi
(Incorrect - Resource title should contain the time frame, so that no duplication will
occur in future)
Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi during 2009-2010
(Correct)
Resource Note:
NIL
(Incorrect - No note, but the dataset contains some special notations like *, #
etc., some cells contain NA, and some other relevant information is
also present for this particular dataset)
Figures are in numbers; NA: Not available; $: Category-wise data not received; *: Included in cars;
Totals are provisional, representing summation of available data (Correct)
Resource Category:
Application
(Incorrect - As it is a dataset, not an application)
Datasets (Correct)
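The Resource Note check in this example could be partly automated: collect the special notations that occur in the data cells and report any that the note does not explain. The sketch assumes a note written as `marker: meaning` entries, as in the correct note above; the function name and the sample cells are hypothetical.

```python
def undocumented_notations(cells, note):
    """Return special notations used in the cells but not explained in the note."""
    markers = set()
    for cell in cells:
        text = cell.strip()
        if text == "NA":
            markers.add("NA")
        # Collect the single-character markers mentioned in the slides.
        markers.update(ch for ch in text if ch in "*#$")
    # A marker counts as explained if the note has an entry "<marker>:".
    return sorted(m for m in markers if f"{m}:" not in note)


if __name__ == "__main__":
    cells = ["120", "NA", "45*", "$"]  # hypothetical cell values
    good_note = (
        "Figures are in numbers; NA: Not available; "
        "$: Category-wise data not received; *: Included in cars"
    )
    print(undocumented_notations(cells, "NIL"))
    print(undocumented_notations(cells, good_note))
```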
14. Quality of Datasets
• Data Compositeness/Completeness/Consistency
– Check for the constituent elements (variables) within the
dataset
– The dataset should be well explained in terms of the variables
present in it, through descriptive metadata
– The metadata should clearly describe the time period, units,
definitions, frequency, data source, jurisdiction and any notes needing
special mention in the dataset
– The time series data should be continuous in nature
• Data Coverage
– Dataset should be made available at the lowest possible levels
to allow users to correctly describe the phenomena being
measured
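The continuity requirement for time series can be checked mechanically. A minimal sketch for annual data, assuming the years are available as integers; `missing_years` is a hypothetical helper, and other frequencies (monthly, daily) would need a different step.

```python
def missing_years(years):
    """Return the years absent between the first and last year present."""
    present = set(years)
    if not present:
        return []
    return [y for y in range(min(present), max(present) + 1) if y not in present]


if __name__ == "__main__":
    print(missing_years([2008, 2009, 2011]))
```

An empty result means the annual series has no gaps between its first and last year.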
15. Quality of Datasets
• Standard process of “data cleansing”:
– Assigning string, date, character and number types to the required
fields
– Abbreviations and acronyms to be replaced by full forms.
– No special characters or blank cells (replace blanks with NA) in
the matrix.
– Column headers should be self-explanatory
– Uniform font size, with no formulas and no merged columns.
– Dataset should be de-normalized, without any merged columns
– No formula or calculated column, such as a Total or Average of
the available columns or rows, should appear in the dataset
– Above all, it must be in a machine-readable format, viz. CSV, XML,
JSON, ODS, XLS etc.
– File name should not contain special characters except _ and -;
no blank spaces should be present in the file name.
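Two of the cleansing rules above lend themselves to a short sketch: validating the file name and replacing blank cells with NA. The helper names are hypothetical, and the file-name pattern assumes that a single dot before the extension is permitted, which the slide does not state explicitly.

```python
import re

# Letters, digits, "_" and "-" only, followed by one ".ext" suffix.
# Allowing the dot before the extension is an assumption of this sketch.
_FILE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_valid_file_name(name: str) -> bool:
    """Return True if the name has no spaces or special characters."""
    return _FILE_NAME.fullmatch(name) is not None


def fill_blank_cells(rows):
    """Replace empty or whitespace-only cells with 'NA'.

    Non-blank cells are also stripped of surrounding whitespace.
    """
    return [[cell.strip() or "NA" for cell in row] for row in rows]


if __name__ == "__main__":
    print(is_valid_file_name("annual_rainfall-2010.csv"))
    print(is_valid_file_name("annual rainfall.csv"))
    print(fill_blank_cells([["Kerala", ""], ["Goa", "  "]]))
```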