DWH Concepts

What is a DATA WAREHOUSE?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
• A data warehouse is a database designed to support a broad range of decision tasks in a specific organization. It is usually batch updated and structured for rapid online queries and managerial summaries. Data warehouses contain large amounts of historical data. The term data warehousing is often used to describe the process of creating, managing and using a data warehouse.

What are the characteristics of a DATA WAREHOUSE?
The characteristics of a DWH are:
• Subject-Oriented: DWHs are designed to help you analyze data. For example, to learn more about the company's sales data, you can build a warehouse that concentrates on sales. This ability to define a DWH by subject matter, sales in this case, makes the DWH subject oriented.
• Integrated: This is closely related to subject orientation. DWHs put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
• Nonvolatile: Once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred, and whatever once happened never changes.
• Time-Variant: In order to discover trends, analysts need large amounts of data. This is very much in contrast to OLTP systems, where performance requirements demand that historical data be moved to an archive. A DWH's focus on change over time is what is meant by the term time variant.

What are the goals of a DATA WAREHOUSE?
The goals of a data warehouse are:
• To provide a reliable, single integrated source of key corporate information.
• To give end users access to their data without a reliance on reports produced by the information systems department.
• To allow analysts to analyze corporate data and even produce predictive "what if" models from that data.
The data warehouse is simply one component of modern reporting architectures. The real goal of reporting systems is decision support, or its modern equivalent, business intelligence: to help people make better, more intelligent decisions.

When should a company consider implementing a data warehouse?
A data warehouse, or a more focused database called a data mart, should be considered when a significant number of potential users are requesting access to a large amount of related historical information for analysis and reporting purposes. So-called active or real-time data warehouses can provide advanced decision support capabilities.

What are the uses of a DATA WAREHOUSE?
• It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
• It manages the process of gathering data and delivering it to business users.
• It is used to analyze data.
• It puts data from disparate sources into a consistent format.

What are the benefits of data warehousing?
Some of the potential benefits of putting data into a data warehouse include:
1. Improving turnaround time for data access and reporting;
2. Standardizing data across the organization so there will be one view of the "truth";
3. Merging data from various source systems to create a more comprehensive information source;
4. Lowering costs to create and distribute information and reports;
5. Sharing data and allowing others to access and analyze the data;
6. Encouraging and improving fact-based decision-making.

What are the limitations of data warehousing?
The major limitations associated with data warehousing are related to user expectations, lack of data and poor data quality. Building a data warehouse creates some unrealistic expectations that need to be managed. A data warehouse doesn't meet all decision support needs. If needed data is not currently collected, transaction systems need to be altered to collect the data. If data quality is a problem, the problem should be corrected in the source system before the data warehouse is built. Software can provide only limited support for cleaning and transforming data. Missing and inaccurate data cannot be "fixed" using software. Historical data can be collected manually, coded and "fixed", but at some point source systems need to provide quality data that can be loaded into the data warehouse without manual clerical intervention.

What data is stored in a data warehouse?
In general, organized data about business transactions and business operations is stored in a data warehouse. But any data used to manage a business, or any type of data that has value to a business, should be evaluated for storage in the warehouse. Some static data may be compiled for initial loading into the warehouse. Any data that comes from mainframe, client/server, or web-based systems can then be periodically loaded into the warehouse. The idea behind a data warehouse is to capture and maintain useful data in a central location. Once data is organized, managers and analysts can use software tools like OLAP to link different types of data together and potentially turn that data into valuable information that can be used for a variety of business decision support needs, including analysis, discovery, reporting and planning. Database administrators (DBAs) have always said that having non-normalized or de-normalized data is bad.

What are the methodologies of Data Warehousing?
Every company has a methodology of its own. To name a few, the SDLC methodology and the AIM methodology are widely used. Other methodologies include AMM, World Class methodology and many more.

How does my company get started with data warehousing?
Build one! The easiest way to get started with data warehousing is to analyze some existing transaction processing systems and see what type of historical trends and comparisons might be interesting to examine to support decision making. See if there is a "real" user need for integrating the data. If there is, then IS/IT staff can develop a data model for a new schema, load it with some current data and start creating a decision support data store using a database management system (DBMS). Find some software for query and reporting and build a decision support interface that's easy to use. Although the initial data warehouse/data-driven DSS may seem to meet only limited needs, it is a "first step". Start small and build more sophisticated systems based upon experience and successes.
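
As a rough illustration of such a first step, the sketch below summarizes a few rows from a transaction table into a small decision support table and runs one trend query. It uses Python's built-in sqlite3 module with an in-memory database. The table names (orders, monthly_sales) and the sample rows are invented for this example and do not come from any particular system.

```python
import sqlite3

# In-memory database standing in for both the source system and the new store.
conn = sqlite3.connect(":memory:")

# Hypothetical transaction table, as an OLTP system might hold it.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [
        ("2023-01-05", 120.0),
        ("2023-01-20", 80.0),
        ("2023-02-11", 200.0),
        ("2023-03-02", 50.0),
        ("2023-03-28", 75.0),
    ],
)

# Hypothetical decision support table: one summarized row per month.
conn.execute(
    "CREATE TABLE monthly_sales (month TEXT PRIMARY KEY, order_count INTEGER, total_amount REAL)"
)

# Load step: summarize the transaction history into the new table.
conn.execute(
    """
    INSERT INTO monthly_sales (month, order_count, total_amount)
    SELECT substr(order_date, 1, 7), COUNT(*), SUM(amount)
    FROM orders
    GROUP BY substr(order_date, 1, 7)
    """
)

# A simple historical trend query against the summarized data.
monthly_trend = conn.execute(
    "SELECT month, order_count, total_amount FROM monthly_sales ORDER BY month"
).fetchall()

for month, order_count, total_amount in monthly_trend:
    print(month, order_count, total_amount)
```

A real project would load from the actual source systems on a schedule, but even a small table like this is enough to check whether users find the trend data useful.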
What are the data warehouse implementation schemes?

What type of indexing mechanism do we need to use for a typical data warehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap indexes and/or the other index types (clustered/non-clustered, unique/non-unique).
To my knowledge, SQL Server does not support bitmap indexes, whereas Oracle does.

What are the steps to build the data warehouse?
• Gathering business requirements
• Identifying sources
• Identifying facts
• Defining dimensions
• Defining attributes
• Redefining dimensions and attributes
• Organizing the attribute hierarchy and defining relationships
• Assigning unique identifiers
• Additional conventions: cardinality/adding ratios

How often should data be loaded into a data warehouse from transaction processing and other source systems?
It all depends on the needs of the users, how fast data changes and the volume of information that is to be loaded into the data warehouse. It is common to schedule daily, weekly or monthly dumps from operational data stores during periods of low activity (for example, at night or on weekends). The longer the gap between loads, the longer the processing time for the load when it does run. A technical IS/IT staffer should make some calculations and consult with potential users to develop a schedule to load new data.

What are the different architectures of a data warehouse? (What are the different approaches of a data warehouse?)
There are two main approaches:
• Top down (Bill Inmon)
• Bottom up (Ralph Kimball)

What are the types of a data warehouse?

What is the main difference between Inmon and Kimball philosophies of data warehousing?
The two differ in their concept of how to build the data warehouse.
Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization, and the data warehouse is held together by the conformed dimensions of the data marts. Hence a unified view of the enterprise can be obtained from the dimension modeling on a local departmental level.
Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary.
In short:
• Kimball: first data marts, then combined into a data warehouse.
• Inmon: first the data warehouse, later data marts.

When should I consider a Data warehouse solution?

What is the process of warehousing data?

Explain the architecture of a data warehouse with the diagram.

What is Staging Area?

What is a general purpose scheduling tool?
The basic purpose of the scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.

What is real time data warehousing?
Real-time data warehousing is a combination of two things:
1. Real-time activity
2. Data warehousing
Real-time activity is activity that is happening right now. The activity could be anything, such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into
the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

What is ODS?
ODS means Operational Data Store: a collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to assure that summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational database.

What is Active data warehousing?
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day, even minute-to-minute, decision making in a way that increases the quality of those customer touches, which encourages customer loyalty and thus secures an organization's bottom line. The marketplace is coming of age as we progress from first-generation "passive" decision-support systems to current- and next-generation "active" data warehouse implementations.
• An active data warehouse means every user can access the database at any time, 24/7.
• Active transformation means data can change and pass.

What is meant by OLTP?
OLTP stands for On-Line Transaction Processing. It uses a standard, normalized database structure. OLTP is designed for day-to-day transactions. An OLTP database has hundreds of users connected to it. These databases are normalized to reduce the redundancy of the data and to increase performance while inserting data. The number of records being inserted is greater than the number of records being updated or deleted. OLTP systems are not designed for analysis, reporting and decision support. Examples: ATMs, online shopping, online application filing, and online railway reservations.

Why are OLTP database designs not generally a good idea for a Data Warehouse?
In OLTP, tables are normalized, so query response will be slow for the end user. OLTP also doesn't contain years of data, so it cannot support historical analysis.

Why is de-normalized data now ok when it's used for Decision Support?
Normalization of a relational database for transaction processing avoids processing anomalies and results in the most efficient use of database storage. A data warehouse for decision support is not intended to achieve these same goals. For data-driven decision support, the main concern is to provide information to the user as fast as possible. Because of this, storing data in a de-normalized fashion, including storing redundant data and pre-summarizing data, provides the best retrieval results. Also, data warehouse data is usually static, so anomalies will not occur from operations like adding, deleting or updating a record or field.

Why should you put your data warehouse on a different system than your OLTP system?
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model). That is why we design a separate, subject-oriented OLAP system. Moreover, a complex query fired on an OLTP system causes heavy overhead on the OLTP server, which affects the day-to-day business directly.

What is Business Intelligence?
Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.

What are the important concerns of OLTP and DSS systems?
• Number of users. OLTP: many. DSS: few.
• Data. OLTP: (1) stored in a complex, normalized data format; (2) normally in third normal form, because normalization enhances transaction performance; (3) small volumes of data; (4) data is volatile in nature. DSS: (1) stored in multidimensional structures, e.g. a cube (3 dimensional); (2) stored in de-normalized format; (3) large volumes of data; (4) static in nature with periodic loads.
• Operations. OLTP: transactions. DSS: reporting.
• Indexes. OLTP: few. DSS: many.
• Joins. OLTP: many (because it is normalized). DSS: few (because it is de-normalized).
• Performance. OLTP: concurrency and availability are the more important aspects, e.g. ATMs. DSS: response time is most important.

The same comparison, item by item:
• Data structures. OLTP: complex data structures. DSS: multidimensional data structures.
• Indexes. OLTP: few. DSS: many.
• Joins. OLTP: many. DSS: some.
• Duplicated data. OLTP: normalized DBMS. DSS: de-normalized DBMS.
• Derived data and aggregates. OLTP: rare. DSS: common.
• Number of users. OLTP: many. DSS: few.
• Workload. OLTP: predefined operations. DSS: ad-hoc queries.
• Data modifications. OLTP: volatile. DSS: updated on a regular basis.
• Data. OLTP: small volumes. DSS: large volume (historical data).
• Priority. OLTP: availability must be high. DSS: response time must be good.

What is the difference between ODS and OLTP?
ODS: It is a collection of tables created in the data warehouse that maintains only current data, whereas OLTP maintains data only for transactions; OLTP systems are designed for recording the daily operations and transactions of a business.
• ODS: Data held with the data warehouse that stands alone. No further transactions take place on the current data that is part of the data warehouse. Current data changes only when it is uploaded through ETL on a scheduled basis.
OLTP: Data held in an online system that is connected to the network, where updates from transactions happen within seconds. Summarized values change every second.

What is an OLAP? What are the types of OLAP?
OLAP is software for manipulating multidimensional data from a variety of sources. The data is often stored in a data warehouse. OLAP software helps a user create queries, views, representations and reports. OLAP tools can provide a "front-end" for a data-driven DSS.
• OLAP: On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
• OLAP stands for On-Line Analytical Processing. An OLAP system stores data in multidimensional databases. You then access these databases to perform financial and statistical analysis on different combinations of the data. An OLAP database is generally used to analyze data. It is optimized so that you can quickly retrieve data. An OLAP database is generally created from the information you have put in an OLTP database.
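
To give a feel for what "a wide variety of possible views" over multidimensional data can mean, here is a minimal sketch in plain Python that pre-aggregates a tiny fact set for every combination of its dimensions. The fact rows, the dimension names and the build_cube helper are all hypothetical, and real OLAP engines use far more sophisticated storage; this only illustrates the idea of precomputed roll-ups.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, quarter, sales amount).
FACTS = [
    ("East", "Widget", "Q1", 100),
    ("East", "Widget", "Q2", 150),
    ("East", "Gadget", "Q1", 70),
    ("West", "Widget", "Q1", 90),
    ("West", "Gadget", "Q2", 60),
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(facts, dimensions):
    """Pre-aggregate the measure (last field) for every subset of dimensions."""
    cube = {}
    indexes = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(indexes, size):
            totals = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in subset)
                totals[key] += row[-1]
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


cube = build_cube(FACTS, DIMENSIONS)

# Different views of the same data, read straight from the precomputed cube.
sales_by_region = cube[("region",)]
sales_by_region_and_quarter = cube[("region", "quarter")]
grand_total = cube[()][()]

print(sales_by_region)
print(grand_total)
```

Because every combination is aggregated up front, each view is a dictionary lookup rather than a fresh scan of the facts, at the cost of extra storage.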
OLAP products can be grouped into three categories.
MOLAP (Multidimensional OLAP):
• Data is stored in multidimensional arrays in order to be viewed in a multidimensional manner.
• Multidimensional arrays provide efficiency in storage and operations.
• Examples: Oracle Express Server, Essbase by Hyperion Software, PowerPlay by Cognos.
• MOLAP does not support ad-hoc queries well because it is optimized for multidimensional operations.
• Retrieval is fast.
• Storage is very efficient.
ROLAP (Relational OLAP):
• Data is stored in a relational model because OLAP capabilities are best provided against the relational database.
• Examples: Oracle, SQL Server, etc.
• ROLAP integrates naturally with existing technology and standards.
• ROLAP can readily take advantage of parallel relational technology.
HOLAP (Hybrid OLAP):
• These products combine MOLAP and ROLAP.
• With HOLAP products, a relational database stores most of the data.
• A separate multidimensional database stores a small portion of the data.

Are OLAP databases called decision support systems? True or false?
True.

What does the term 'Metadata' mean?
Very loosely, it is documentation about data; it is how you provide context for data people might be using. Metadata is basically the wrapping you put around data you use in everyday life to transform it into meaningful information.

What is the difference between data warehousing and OLAP?
The terms data warehousing and OLAP are often used interchangeably. As the definitions suggest, warehousing refers to the organization and storage of data from a variety of sources so that it can be analyzed and retrieved easily.
OLAP deals with the software and the process of analyzing data, managing aggregations, and partitioning information into cubes for in-depth analysis, retrieval and visualization. Some vendors are replacing the term OLAP with the terms analytical software and business intelligence.
• A data warehouse is the place where the data is stored for analysis, whereas OLAP is the process of analyzing the data, managing aggregations, and partitioning information into cubes for in-depth visualization.

What is OLAP, MOLAP, ROLAP, DOLAP, and HOLAP?
OLAP - On-Line Analytical Processing: Designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.
MOLAP - Multidimensional OLAP: This term designates a Cartesian data structure more specifically. In effect, MOLAP contrasts with ROLAP. In the former, joins between tables are already computed, which enhances performance. In the latter, joins are computed during the request. MOLAP is targeted at groups of users because it's a shared environment. Data is stored in an exclusive server-based format. It performs more complex analysis of data.
ROLAP - Relational OLAP: Designates one or several star schemas stored in relational databases. This technology permits multidimensional analysis with data stored in relational databases. It is used for large departments or groups because it supports large amounts of data and users.
DOLAP - Desktop OLAP: Small OLAP products for local multidimensional analysis. There can be a mini multidimensional database (using Personal Express), or extraction of a data cube (using Business Objects). It is designed for a low-end, single, departmental user. Data is stored in cubes on the desktop. It's like having your own spreadsheet. Since the data is local, end users don't have to worry about performance hits against the server.
HOLAP: Hybridization of OLAP, which can include any of the above.

What is meant by metadata in the context of a Data warehouse, and why is it important?
Metadata is data about data. A business analyst or data modeler usually captures information about data in a data dictionary, a.k.a. metadata: the source (where and how the data originated), the nature of the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived, and its life cycle). Metadata is also presented at the data mart level, and for subsets, facts and dimensions, the ODS, etc.
For a DW user, metadata provides vital information for analysis / DSS.

What is the difference between MOLAP and ROLAP?
ROLAP (tactical):
• Detailed data
• Simple calculations
• Analyzes past trends
• Data storage structure: tables
• Advantage: requires less memory storage space
• Disadvantage: data access is slow
MOLAP (strategic):
• Summary data
• Complex calculations
• Predicts future trends
• Data storage structure: cube
• Advantage: data access is faster
• Disadvantages: requires more memory storage space; the cube is sparsely filled as the number of dimensions increases

What is the difference between OLTP and OLAP?
The main differences between OLTP and OLAP are:
1. User and system orientation
OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data contents
OLTP: manages current data, very detail-oriented.
OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision making process.
3. Database design
OLTP: adopts an entity relationship (ER) model and an application-oriented database design.
OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented database design.
4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.
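
To make the database design point more concrete, here is a minimal star schema sketch using Python's built-in sqlite3 module. The tables (dim_product, dim_date, fact_sales) and their rows are hypothetical. The only point is that one fact table joined to its dimension tables can be summarized at different levels of granularity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables (hypothetical names and columns).
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact table: one row per product per month, keyed by the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        units_sold INTEGER
    );

    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
    INSERT INTO dim_date VALUES (202301, 2023, 1), (202302, 2023, 2);
    INSERT INTO fact_sales VALUES
        (1, 202301, 10), (2, 202301, 5), (3, 202301, 2),
        (1, 202302, 7), (3, 202302, 4);
    """
)

# Coarse granularity: units per product category per year.
units_by_category = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category
    """
).fetchall()

# Finer granularity: units per product per month.
units_by_product_month = conn.execute(
    """
    SELECT p.name, d.month, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.name, d.month
    ORDER BY p.name, d.month
    """
).fetchall()

print(units_by_category)
print(units_by_product_month)
```

Both queries read the same fact table; only the grouping columns taken from the dimensions change.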

What types of Metadata are there and when will they be available?
Metadata will be made available on the Decision Support website as each increment goes live. We have two classifications of metadata: one that is business and one that is technical. Technical metadata is fairly clear-cut: where did the data come from, or how was it transformed along the way? Business metadata deals more with the possible meaning of the data and how it can be used.

Why is Metadata important to the DWH User?
Metadata is what makes the data in the Data Warehouse meaningful. The Data Warehouse is very different from an operational application. When you're using an operational application, you can get clues from the screen that tell you to update a particular field on the window. If I'm processing a new employee, I know exactly what needs to be updated for that new employee record, and can move through the process based on the context that the application provides. In a data-warehousing environment, you don't have that context or workflow. You have data that is interrelated, and it is raw out there in a form, but there is no application between you and the data. Basically, you have a number of tables and structures that you have access to without a business layer, without a definition on top of it. So metadata is very important to be able to provide that context to people so they know how to go between subject areas, or how data within a subject area is related and what it defines and represents.

Is Metadata a description of what the data represents?
In the simplest terms it is. As an example, if a user of the Data Warehouse is interested in a field called "campus code", then the metadata might have a definition of what the campus code represents, such as "an indicator for one of the three campuses". That is a form of metadata, although it is not a complete picture of what metadata can be.

What types of Metadata will be made available to the User?
Decision Support has identified several kinds of metadata that will be published on the website. Some basic categories are the data model, source-to-target mapping, and the logical and physical model. The logical model gives more of a grouping, or identifies logically what would be expected from the business side. The physical model goes into more detail with more of the data dictionary definition, but it gives the user a pictorial representation of the data, not just a list of columns and tables. It provides a visual so people can see how data elements relate to each other. There is also a category of metadata that we call usage notes. These expand on how someone might query the Data Warehouse or use a query against a data mart. Based on going through the requirements process and working with the focus groups, as data is available, we expect to expand the metadata categories.
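
One way to picture these categories is as a single record per column that carries both business and technical metadata. The sketch below is only an illustration in Python: the ColumnMetadata fields, the describe helper, and the table and source names are assumptions made for this example, not the actual format used by Decision Support. The "campus code" definition is the one quoted in the answer above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    """One data-dictionary style entry for a warehouse column (illustrative fields)."""

    table: str
    column: str
    data_type: str
    business_definition: str  # business metadata: what the value means
    source: str               # technical metadata: where the value comes from
    transformation: str       # technical metadata: how it was changed on the way
    usage_note: str = ""      # optional hint on how to query the column


# Hypothetical catalog entry; the table and source names are invented.
CATALOG = [
    ColumnMetadata(
        table="student_dim",
        column="campus_code",
        data_type="CHAR(1)",
        business_definition="An indicator for one of the three campuses",
        source="registration_system.student.campus",
        transformation="Trimmed and upper-cased",
        usage_note="Join to campus_dim for the campus name",
    ),
]


def describe(table, column, catalog=CATALOG):
    """Return the business definition for a column, or None if it is undocumented."""
    for entry in catalog:
        if entry.table == table and entry.column == column:
            return entry.business_definition
    return None


definition = describe("student_dim", "campus_code")
missing = describe("student_dim", "unknown_column")
print(definition)
print(missing)
```

Keeping the source and transformation next to the business definition is what lets one entry serve as both a source-to-target mapping and a usage reference.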

Is Metadata also useful to the average User of the DWH, in addition to a department's technical staff?
Yes. For an "ad hoc" user, there may be questions as to what a field represents. Another form of metadata at a business user level would be sample queries that Decision Support's Services area would publish based on findings from the requirements process and focus groups. These queries provide samples of relating data to answer a business question.

What Challenges are involved when providing Metadata?
Historically, organizations find it a challenge to manage metadata over time. So I think the biggest challenge that we face at Decision Support is learning from those mistakes and from what we've read in the industry. We need to make sure the metadata we have is 'live'; that it's not something that is static and put on the shelf. Decision Support has formed a Custodial Data Council that will take ownership in making sure we have business definitions and work with the user community. I think we also need to technically streamline those processes as much as possible, publish the metadata, and make it as consistent as possible.

What is the difference between DWH and BI?
There may be a feature film (movie) without a trailer, but there will be no trailer without a movie. Similarly, data warehousing is a concept related to extracting a client's business data, applying business processing to that data according to user needs and finally loading the processed data into a database. This database is what we call a warehouse or data warehouse. After the completion of a data warehouse the business user ultimately wants to view the data (precise and summarized data), but as a business person may not have knowledge of accessing a database (a computer person can access the database with SQL).
So there come OLAP tools, which help that person to access the database. We can call these OLAP tools business intelligence tools (intelligent in the sense that they generate SQL queries internally and provide reporting developers with many facilities for formatting the data and presenting it in a highly convenient manner). So the data warehouse (movie) is a database, and business intelligence tools (trailers) present the content of a database in an efficient manner.
• Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse so that a business decision can be reached depending on the result of the analysis.
• Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of its business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term "business intelligence" is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.
• DATA WAREHOUSE: A data warehouse is an integrated, time-variant, subject-oriented and non-volatile collection of data in support of the management decision making process.
BUSINESS INTELLIGENCE: Business intelligence is the process of extracting data, converting it into information and then into a knowledge base.
• A data warehouse is a database geared towards the business intelligence requirements of an organization. It integrates data from the various operational systems and is typically loaded from these systems at regular intervals.
BI is a category of technologies that allows for gathering, storing, accessing and analyzing data to help business users make better decisions.
• To make business analysis effective and efficient we require a specialized form of storage. This special form of storage of data is called a data warehouse, and the process is called data warehousing.
Business intelligence is the mechanism of using data according to the type of industry for predictive analysis, fault finding, process improvement, etc.

What is a Data Dictionary?
A data dictionary is a kind of metadata. A data dictionary explains how data physically resides in an environment. A data dictionary identifies the type of column it is, whether it is character or numeric or some other value.
It identifies the width of a column as well as the name of the column. Sometimes in data dictionaries you see descriptions; sometimes you don't. But basically it is how that field is physically represented in Oracle or Sybase or some other platform, if that's where the data resides. It's difficult to do any meaningful query or report without basic metadata.

What are the possible data marts in Retail sales?
Product information, sales information.

What are data validation strategies for data mart validation after loading process?
Data validation is to make sure that the loaded data is accurate and meets the business requirements.
Strategies are the different methods followed to meet the validation requirements.

What is a Data Mart?
A Data Mart is a focused subset of a DWH that deals with a single area of data and is organized for quick analysis. It contains the summarized data of the warehouse and is referred to as a High Performance Query Structure. Data marts consist of materialized views and special indexes. In some businesses these data marts may be maintained within the warehouse, whereas in other scenarios they may be maintained apart from the DWH.
• A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers.
• A data mart is a system designed for a particular line of business.

What are Data Marts?
Data Marts are designed to help managers make strategic decisions about their business. Data Marts are subsets of the corporate-wide data that are of value to a specific group of users.
There are two types of Data Marts:
1. Independent data marts: sourced from data captured from OLTP systems, from external providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.

What are the levels of Data mart?

What are the differences between Database, DATA WAREHOUSE and Data Marts?
A Database is an organized collection of data.
A DWH is a very large database with a special set of tools to extract and cleanse data from operational systems and to analyze data.
A Data Mart is a focused subset of a DWH that deals with a single area of data and is organized for quick analysis.

What is Data Sampling?

What is Data Scrubbing?
What is Data Acquisition Process?

What is data mining?
Data mining is the process of extracting hidden trends from a data warehouse. For example, an insurance data warehouse can be mined to find the highest-risk people to insure in a certain geographical area.

What is a transformation?
It is a repository object that generates, modifies or passes data.
Transformations: Transformations are the manipulation of data from how it appears in the source systems into another form in the DWH or data mart, in a way that enhances or simplifies its meaning. In other words, you transform data into information. This includes the following:
Data Merging: The process of standardizing data types and fields. Suppose one source system stores integer data as smallint whereas another stores the same data as decimal. The data from the two source systems needs to be rationalized when moved into the Oracle data format called number.
Cleansing: The process of validating the data brought from multiple sources. This involves identifying and changing any inconsistencies or inaccuracies. Cleansing:
• Eliminates inconsistencies in the data from multiple sources.
• Converts data from different systems into a single consistent data set suitable for analysis.
• Meets a standard for establishing data elements, codes, domains, formats and naming conventions.
• Corrects data errors and fills in missing data values.
Aggregation: The process whereby multiple detailed values are combined into a single summary value, typically by summing numbers representing dollars spent or units sold. It generates summarized data for use in aggregate fact and dimension tables.

What are the advantages of data mining over traditional approaches?
Data mining is used for estimating the future. For example, for a company or business organization, data mining can be used to predict the future of the business in terms of revenue, employees, customers, orders, etc.
Traditional approaches use simple algorithms for estimating the future, and generally do not give results as accurate as those of data mining.

What is ETL?
ETL stands for extraction, transformation and loading. ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.
• Extraction: Takes data from an external source and moves it to the warehouse pre-processor database.
• Transformation: The transform data task allows point-to-point generating, modifying and transforming of data.
• Loading: The load data task adds records to a database table in a warehouse.

Explain the classification of Tables in a Data warehouse?

What is Fact table?
A fact table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.

Why fact table is in normal form?
Basically the fact table consists of the index keys of the dimension/lookup tables and the measures. So whenever we have the keys in a table, that itself implies that the table is in normal form.

What is a level of Granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. Level of granularity then means how much detail you are willing to store for each transactional fact: for example, product sales for each individual transaction, or product sales aggregated up to the minute before being stored.

What does level of Granularity of a fact table signify?
Granularity: The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This involves two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.

What is aggregate fact table?
An aggregate table contains the [measure] values, aggregated/grouped/summed up to some level of hierarchy.

What is fact less fact table? Where you have used it in your project?
A factless table means that only the keys are available in the fact table; there are no measures.

What is the common use of creating a Factless Fact Table?

What are the different types of Fact Table? Explain with an example.
1. Cumulative Fact Table:
2. Snapshot Fact Table:

What are the types of Facts?
Additive: A fact that can be summed up across any of the dimensions is called an additive fact.
• The measure can participate in arithmetic calculations using all or any dimensions. Ex: Sales profit
Semi-additive: A fact that can be summed up across some of the dimensions, but not all, is called a semi-additive fact.
• The measure can participate in arithmetic calculations using some dimensions. Ex: account balance (it can be summed across accounts but not across time)
Non-additive: A fact that can be summed up across none of the dimensions is called a non-additive fact.
• The measure can't participate in arithmetic calculations using dimensions. Ex: temperature

What are Semi-additive and factless facts and in which scenario will you use such kinds of fact tables?
Snapshot facts are semi-additive; when we maintain aggregated facts we go for semi-additive facts. Ex: Average daily balance
A fact table without numeric fact columns is called a factless fact table. Ex: Promotion Facts, used to maintain the promotion values of a transaction (ex: product samples), because this table doesn't contain any measures.

What are non-additive facts in detail?
A fact may be a measure, a metric or a dollar value. Measures and metrics are non-additive facts. A dollar value is an additive fact: if we want to find the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount.
A non-additive fact is, for example, the measured heights of citizens by geographical location: when we roll up city data to state-level data we should not add the heights of the citizens; rather, we may want to use them to derive a count.

What is conformed fact?
Conformed dimensions are the dimensions which can be used across multiple Data Marts in combination with multiple fact tables accordingly. By the same principle, a conformed fact is a measure that is defined identically in every fact table and Data Mart in which it appears.

What is a continuously valued fact?
What is Centipede Fact Table?
What is Fact Constellation?
What are the categories of Snapshot Fact Table Grains?

What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.

How are the Dimension tables designed?
Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF. The design steps are:
Find where the data for this dimension is located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension (see more on this in the next section).
Change the fact table and DW population routines.

What are the Different methods of loading Dimension tables?
Conventional load: Before loading the data, all the table constraints are checked against the data.
Direct load (faster loading): All the constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and the bad data won't be indexed.

Can a dimension table contain numeric values?

What is hierarchy relationship in a dimension? Whether it is:
1. 1:1
2. 1:m
3. m:m

What are the different types of dimensions? Explain with examples.
1. Regular Dimensions
2. Shared dimensions

What are the different types of dimension tables? Explain with examples.
Why dimensions are de-normalized in nature?
Can 2 fact tables share same dimension tables?

What is junk dimension?
Junk dimension: A grouping of random flags and text attributes in a dimension, moved to a separate sub-dimension.
• A dimension which does not change the grain level is called a junk dimension. Grain: the lowest level of reporting.
• The junk dimension is simply a structure that provides a convenient place to store the junk attributes.
• A junk dimension is a convenient grouping of flags and indicators.

What are Conformed Dimensions?
A dimension that is used in more than one cube.
• The use of conformed dimensions and shared measures is the primary way a set of data marts can be united into one consolidated data warehouse.
• Conformed dimensions are dimensions which are common to the cubes (cubes are the schemas that contain fact and dimension tables). Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4, where the F's are facts and the D's are dimensions. Here D1 and D2 are the conformed dimensions.
• Conformed dimensions mean the exact same thing with every possible fact table to which they are joined. Ex: the Date dimension is connected to all facts, such as Sales facts, Inventory facts, etc.

What is degenerated dimension?
Degenerate dimension: Keeping the control information on the fact table. Ex: Consider a dimension table with fields like order number and order line number that has a 1:1 relationship with the fact table. In this case the dimension is removed and the order information is stored directly in the fact table, in order to eliminate unnecessary joins while retrieving order information.

What is degenerate dimension table?
Degenerate dimensions: Values in a table which are neither dimensions nor measures are called degenerate dimensions. Ex: invoice id, empno.

What is Audit dimension? Explain with an example.
What is a Fact Dimension?
What is a Mini Dimension?
What are Role-playing dimensions?
What is a Mystery Dimension?

How do you connect the facts and dimensions in the tables?
1. Smart matching columns
2. You can link them manually

Which columns go to the fact table and which columns go the dimension table?
The Primary Key columns of the Tables (Entities) go to the Dimension Tables as Foreign Keys.
The Primary Key columns of the Dimension Tables go to the Fact Tables as Foreign Keys.

What is Associate Table?
What is Bridge Table?
What is cross reference table?
What is Event-Tracking Table?

What is a lookup table?
A lookup table is one which is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based upon the primary key of the target, it updates the table by allowing only new or updated records, based on the lookup condition.

What is the data type of the surrogate key?
The data type of the surrogate key is either integer, numeric or number.

What is a Schema?

What is a Star Schema?
A star schema is a way of organizing the tables such that we can retrieve results from the database easily and quickly in the warehouse environment. Usually a star schema consists of one or more dimension tables around a fact table, which looks like a star; that is how it got its name.

Differences between star and snowflake schemas?
Star schema: A single fact table with N dimensions.
Snowflake schema: Any dimensions with extended dimensions are known as a snowflake schema.
• Star schema: all dimensions are linked directly to a fact table.
Snowflake schema: dimensions may be interlinked or may have one-to-many relationships with other tables.

What is Snow-Flake Schema?
When do you go for Star Schema, and when do you go for Snow-Flake Schema?

What is the main difference between schema in RDBMS and schemas in Data Warehouse?
RDBMS Schema
• Used for OLTP systems
• Traditional and old schema
• Normalized
• Difficult to understand and navigate
• Cannot solve extract and complex problems
• Poorly modeled
DWH Schema
• Used for OLAP systems
• New generation schema
• De-normalized
• Easy to understand and navigate
• Extract and complex problems can be easily solved
• Very good model

Why did you choose STAR SCHEMA only? What are the benefits of STAR SCHEMA?
Because of its de-normalized structure, i.e., the dimension tables are de-normalized. Why de-normalize? The first (and often only) answer is: speed. An OLTP structure is designed for data inserts, updates, and deletes, but not data retrieval. Therefore, we can often squeeze some speed out of it by de-normalizing some of the tables and having queries go against fewer tables. These queries are faster because they perform fewer joins to retrieve the same record set. Joins are also confusing to many end users. By de-normalizing, we can present the user with a view of the data that is far easier for them to understand.
Benefits of STAR SCHEMA:
Far fewer tables.
Designed for analysis across time.
Simplifies joins.
Less database space.
Supports "drilling" in reports.
Flexibility to meet business and technical needs.

Difference between Snow flake and Star Schema. What are situations where Snow flake Schema is better than Star Schema to use and when the opposite is true?
Star schema: It contains the dimension tables mapped around one or more fact tables. It is a denormalized model. There is no need to use complicated joins, and queries return results quickly.
Snowflake schema: It is the normalized form of the star schema. It contains in-depth joins, because the tables are split into many pieces. We can easily make modifications directly in the tables. We have to use complicated joins, since we have more tables, so there will be some delay in processing the query.

Which is preferable? Star Schema or Snow-Flake Schema?
If you have 2 fact tables connected in the schema, do you know the name of the schema?
What is Galaxy Schema?
What is Multi-Star Schema?

How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.

What are slowly changing dimensions?
SCD stands for slowly changing dimensions. Slowly changing dimensions are of three types:
SCD1: Only the updated values are maintained. Ex: when a customer's address is modified, we update the existing record with the new address.
SCD2: Historical information and current information are maintained by using
A) Effective Date
B) Versions
C) Flags
or a combination of these.
SCD3: Historical information and current information are maintained by adding new columns to the target table.
• Type 1: Most recent value
Type 2: Full history, using
i) Version Number
ii) Flag
iii) Date
Type 3: Current and one previous value
• Type 1: the data is overwritten.
Type 2: current, recent and history data should be there.
Type 3: current and recent data should be there.

What is BUS Schema?
A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.

What is hybrid slowly changing dimension?
What are Critical columns?
What is a surrogate key? Why is it used? What is its need? Give an example.

Explain in detail what do you mean by Slicing and Dicing?
Slicing and dicing refers to the ability to combine and re-combine the dimensions to see different slices of the information. Picture slicing a three-dimensional cube of information in order to see what values are contained in the middle layer. Dicing is the ability to view the cube from different perspectives. Slicing and dicing a cube allows an end user to do the same thing with multiple dimensions.

What is a Measure? What are the types of Measures?
How can you create Measures & Dimensions?
Can we group a measure?
What do you mean by Multi-dimensional Analysis?
What is a Grain?
What is Drill-up, Drill-down & Drill-Across?

Differentiate between Level and Category?
A level is a logical subdivision of a dimension. E.g.: if order date is a dimension, the levels are year, quarter, month, week, day, etc.
A category is one of the different instances of a level. E.g.: if year is a level, the categories are 1996, 1997, 1998, etc.

What is a CUBE in data warehousing concept?
Cubes are logical representations of multidimensional data. The edges of the cube contain dimension members and the body of the cube contains data values.

What is a Virtual Cube?

Difference between filter and condition?
The parameter is the only difference.
• A condition returns true or false. Ex: if Country = India then ...
A filter returns two types of results:
1. Detail information, which is equivalent to the WHERE clause in a SQL statement
2. Summary information, which is equivalent to the GROUP BY and HAVING clauses in a SQL statement
• In a filter we just create a parameter on which we can filter the fields, but in a condition we can have static functions, such as: if yes then color it green, if no then color it red, etc. So here we can create conditions for filtering in the report; that is, we can apply different filtering functions at the same time by using conditional formatting.

What is snapshot?
You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.

What is a linked cube?
A linked cube is one in which a subset of the data can be analyzed in great detail. The linking ensures that the data in the cubes remains consistent.

What is VLDB?
VLDB stands for Very Large Database.
It is an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information. In this sense, VLDB doesn't refer to the size of the database or to the vast amount of information stored; it refers to the window of opportunity to take a backup of the database.
The window of opportunity is the time interval available for the backup: if the DBA is unable to take the backup in the specified time, the database is considered a VLDB.

What is batch processing?

What is incremental loading?
Incremental loading means loading the ongoing changes in the OLTP.

Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you put your TX logs.
Transaction logs are written sequentially and don't need to be read at all. The ideal is to have each on RAID 1/0 because it has much better write performance than RAID 5.
RAID 1 is also better for TX logs and costs less than 1/0 to implement. It has a tad less reliability, and performance is a little worse, generally speaking.
RAID 5 is best for data generally because of cost and the fact that it provides great read capability.

What is BAS? What is the function?
The Business Application Support (BAS) functional area at SLAC provides administrative computing services to the Business Services Division and Human Resources Department. We are responsible for software development and maintenance of the PeopleSoft applications and consultation to customers with their computer-related tasks.
BAS is also called Broadcast Agent Server. Its function is to run the scheduled jobs or reports, which can be monitored using the Broadcast Agent Console.

What are modeling tools available in the Market?
There are a number of data modeling tools:
Tool Name          Company Name
Erwin              Computer Associates
Embarcadero        Embarcadero Technologies
Rational Rose      IBM Corporation
Power Designer     Sybase Corporation
Oracle Designer    Oracle Corporation

What are the various Reporting tools in the Market?
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
• Some of the standard Business Intelligence tools in the market, according to their performance:
1) MICROSTRATEGY
2) BUSINESS OBJECTS, CRYSTAL REPORTS
3) COGNOS REPORT NET
4) MS-OLAP SERVICES
Or
1. Seagate Crystal report
2. SAS
3. Business objects
4. Microstrategy
5. Cognos
6. Microsoft OLAP
7. Hyperion
8. Microsoft integrated services and some more.

What are the various ETL tools in the Market?
Various ETL tools used in the market are:
Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction

Name some of the real time data-warehousing tools?

What is Outsourcing, Offshoring & Insourcing? And what is the difference between them.
Outsourcing is not strictly IT. Any function of an organization that is executed by non-employees is essentially an outsourced task.
Insourcing is the use of external resources (not employees of the organization) to accomplish some function, but they are predominantly carrying out the function at the client's site. So, the function is "sourced" but not "out" sourced. These resources are also typically managed more closely by the client directly, with little management involvement from the supplier.
Offshoring is a subset of outsourcing which is generally understood to involve a country in which costs remain lower than in the client's country of operations.
While most offshoring situations are indeed an example of outsourcing, for those companies (HP for example) who now own their offshore operations and have folded them into the company, the line gets blurred. In other words, offshoring is not always outsourcing anymore.

What is ER Diagram?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended, and today it is commonly used for database design. For the database designer, the utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be transformed into relational tables.
It is simple and easy to understand with a minimum of training.
Therefore, the model can be used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.

What Oracle tools can be used to build and design a warehouse?
What Oracle features can be used to optimize my warehouse system?

What is Data Modeling?
Data modeling represents information in terms of entities, attributes and relationships. It is a visual representation of the information.

What are the different steps for Data Modeling?
1. Define the problem and scope of the problem.
2. Information gathering.
3. Analysis (normalization).
4. Create a logical data model (independent of platform).
5. Decide on the physical platform, such as Oracle or SQL etc.
6. Create a physical data model, which is platform specific.
7. Database creation.

What is Dimensional Modeling?
Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.
Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints. The goal of the data model is to make sure that all the data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users. The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the users' requirements.

What is Logical Modeling?
The Logical Model: In Erwin, the logical model is the version of the model that represents all of the logical business requirements of an organization.
There are three levels of logical models that are used to capture these requirements:
The Entity Relationship Diagram: A high-level data model that includes all major entities and relationships. The Entity Relationship Diagram does not contain much detail and is often used in the initial planning phase.
The Key Based Model: A model that describes major data structures such as entities, primary keys, and sample attributes.
The Fully Attributed Model: A complete model that includes all required entities, attributes, key groups, and relationships.
In Erwin, a logical model can be created in conjunction with the physical model, or independent of the physical model. Logical models can also be derived from other models using the Derive Model Wizard.
In addition, Erwin supports the definition of model objects in a logical model as logical only and in a physical model as physical only. These options allow for the logical model to be fully normalized and for the corresponding physical model to be de-normalized. Erwin also allows for the automatic conversion of many-to-many and super type/subtype relationships when you change from a logical model to a physical model.

What are the types of Dimensional Modeling?
What is Conceptual Modeling?
What is Physical Modeling?

Comparing Logical and Physical Models in a Logical/Physical Model:
In an Erwin logical/physical model, each model that you create automatically includes both a logical and a physical model. By default, the logical model is closely related to the physical model. If you make a change in the logical model, the change is automatically reflected in the physical model and vice versa.
You can use either the logical model or the physical model to define and document database structures, although the model you use typically depends on the type of work you want to perform. You can use the logical model to represent business information and define business rules in a fully normalized model, while the physical model supports the needs of the database administrator, who focuses on the physical implementation of the model in a database.

Comparing Logical and Physical Model Objects:
Most of the objects in the logical model correspond to a related object in the physical model. For example, the logical model contains entities, attributes, and key groups, which are represented in the physical model as tables, columns, and indexes, respectively.
The following table compares the logical and physical components in an Erwin model.

What is Difference between E-R Modeling and Dimensional Modeling?
The basic difference is that E-R modeling has both a logical and a physical model, whereas a dimensional model has only a physical model.
E-R modeling is used for normalizing the OLTP database design.
Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

What is Entity, Attribute and Relationship?
Entity: An entity is an object about which an organization wants to maintain information. E.g.: Employee.
Attribute: An attribute is an object that maintains the information about an entity.
Key attribute: A key attribute consists of one or more attributes of an entity which uniquely identify the entity. E.g.: a bank account number identifies an account.
Relationship: Defines the association between different entities: one to one, one to many, many to one, many to many.

What is meant by De-Normalization?

What is the definition of normalized and denormalized view and what are the differences between them?
Normalization is the process of removing redundancies.
Denormalization is the process of allowing redundancies.

Why Denormalization is promoted in Universe Designing?
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table called a DIMENSION table, for performance and for slicing data. Because the tables are merged into one large dimension table, complex intermediate joins are avoided: dimension tables are joined directly to fact tables. Although redundancy of data occurs in the DIMENSION table, the size of the DIMENSION table is only 15% of that of the FACT table. This is why denormalization is promoted in Universe Designing.

What is Cardinality?
What is Referential Integrity?
What are Integrity Constraints?

What is the difference between view and materialized view?
View: stores the SQL statement in the database and lets you use it as a table. Every time you access the view, the SQL statement executes.
Materialized view: stores the results of the SQL in table form in the database. The SQL statement executes only once, and after that every time you run the query the stored result set is used. Pros include quick query results.
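The view versus materialized view distinction above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: SQLite has no materialized views, so a plain table created from the query result stands in for one, and the table and column names (sales, v_sales_by_region, mv_sales_by_region) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("west", 250), ("east", 50)],
)

SUMMARY_SQL = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

# View: only the SELECT statement is stored; it runs on every access.
conn.execute(f"CREATE VIEW v_sales_by_region AS {SUMMARY_SQL}")

# Stand-in for a materialized view (assumption: SQLite has none):
# the result set itself is stored as a table.
conn.execute(f"CREATE TABLE mv_sales_by_region AS {SUMMARY_SQL}")


def totals(relation):
    """Return {region: total} read from the given view or table."""
    rows = conn.execute(f"SELECT region, total FROM {relation}")
    return dict(rows.fetchall())


def refresh_mv():
    """Recompute the stored result set, as a refresh would."""
    conn.execute("DELETE FROM mv_sales_by_region")
    conn.execute(f"INSERT INTO mv_sales_by_region {SUMMARY_SQL}")


# A new sale arrives after both objects were created.
conn.execute("INSERT INTO sales VALUES ('west', 25)")

view_after = totals("v_sales_by_region")   # re-executes, sees the new row
mv_stale = totals("mv_sales_by_region")    # stored result, now out of date

refresh_mv()
mv_refreshed = totals("mv_sales_by_region")
```

The stored copy answers queries without re-running the aggregation, which is the "quick query results" benefit, at the cost of going stale until it is refreshed.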
What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?
1. Normalization is a process for assigning attributes to entities. It:
– Reduces data redundancies
– Helps eliminate data anomalies
– Produces controlled redundancies to link tables
2. Normalization is the analysis of functional dependency between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.
1NF: Repeating groups must be eliminated; dependencies can be identified; all key attributes are defined; there are no repeating groups in the table.
2NF: The table is already in 1NF and includes no partial dependencies (no attribute is dependent on a portion of the primary key). It is still possible for it to exhibit transitive dependency: attributes may be functionally dependent on non-key attributes.
3NF: The table is already in 2NF and contains no transitive dependencies.

What is a Table space? What does it contain?
What is a Composite Key or Concatenated Key? What is its use?
What are Unique Identifiers?
What is an Index? What are the types of Indexes?
What do you mean by Partitioned Indexes?
What is partitioning? What are the methods of partitioning?
What is Parallelism?

What are the advantages and disadvantages of reporting directly against the database? Do you always need to copy the data before reporting on it? (Example, real-time & on-demand reporting is a requirement)
There isn't any need to copy the data before reporting on it as long as the data is clean. But if the data is not clean it should be cleansed, so go for an ETL process.
Advantage of reporting directly against the database (OLTP): No need to separately maintain a database for it (space consumption is reduced).
Disadvantage of reporting directly against the database (OLTP): It slows down the process, because an OLTP system is designed for the online application, whereas a data warehouse application has to do analysis on the same data and so takes a long time.
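To make the normal forms listed earlier in this section concrete, here is a minimal Python sketch that decomposes one flat table into 3NF relations. The data and column names are invented for illustration. The key of the flat table is assumed to be (order_id, product_id), so product_name and customer_id are partial dependencies (violating 2NF) and customer_city is a transitive dependency through customer_id (violating 3NF).

```python
# Flat, 1NF-only table; assumed key is (order_id, product_id).
flat = [
    # (order_id, product_id, product_name, customer_id, customer_city, qty)
    (1, "P1", "Pen", "C1", "Pune", 10),
    (1, "P2", "Ink", "C1", "Pune", 2),
    (2, "P1", "Pen", "C2", "Delhi", 5),
]

# 2NF step: move attributes that depend on only part of the key
# into relations keyed by that part.
products = {pid: pname for _, pid, pname, _, _, _ in flat}   # product_id -> name
orders = {oid: cid for oid, _, _, cid, _, _ in flat}         # order_id -> customer_id

# 3NF step: customer_city depends on customer_id, not on the order,
# so it gets its own relation.
customers = {cid: city for _, _, _, cid, city, _ in flat}    # customer_id -> city

# What remains depends on the whole key.
order_lines = {(oid, pid): qty for oid, pid, _, _, _, qty in flat}

# Joining the four relations back together reproduces the flat table,
# so the decomposition loses no information.
rejoined = [
    (oid, pid, products[pid], orders[oid], customers[orders[oid]], qty)
    for (oid, pid), qty in order_lines.items()
]
```

After the split, each product name and each customer city is stored once instead of once per order line, which is the redundancy reduction that normalization aims for.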
What are the most frequent data errors that slow down data input process?

Data mining is the process of data selection, exploration and building models using vast data stores to uncover previously unknown patterns. What does this mean to you?
You can produce new knowledge to better inform decision makers before they act. Build a model of the real world based on data collected from a variety of sources, including corporate transactions, customer histories and demographics, and even external sources such as credit bureaus. Then use this model to produce patterns in the information that can support decision making and predict new business opportunities. Text mining capabilities enable you to apply such analyses to text-based documents. With SAS's rich suite of text processing and analysis tools, you can uncover underlying themes or concepts contained in large document collections, group documents into topical clusters, classify documents into predefined categories and integrate text data with structured data for enriched predictive modeling endeavors.
Before you begin, you should know the answers to the following questions:
• What is Data?
• What is a Database?
• What is an RDBMS?
• What is a Data Model?
• Why do we follow Normalization while designing a data model?
• What is an OLTP system?

WHAT IS DATA WAREHOUSING:
• A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
• In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
• A data warehouse is a complete set of Subject Oriented, Integrated, Time Variant, Nonvolatile data which helps the business in taking organizational decisions.

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

When should an organization create a Data Warehouse?
An organization should create one once it holds so much information that it becomes too difficult to extract the meaningful information the business needs to make strategic decisions. The decisions we make using data warehouse data affect the entire organization rather than a single customer or employee. Examples of such decisions are: should we continue with specific product offerings to our customers or not? Should we move the customer support department to a different location to save costs?

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
• Workload
  Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
• Data modifications
  A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse. In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
• Schema design
  Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
• Typical operations
  A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
• Historical data
  Data warehouses usually store many months or years of data. This is to support historical analysis. OLTP systems usually store data from only a few weeks or months. The OLTP system stores historical data only as needed to successfully meet the requirements of the current transaction.

END USER OF APPLICATION:
What do you mean by end user in an OLTP system?
• An end user is someone who enters data into the system or reads a particular report from it.
• A bank teller, for example, enters the account number to see the balance, deposit a cheque, etc.
• A customer representative must see the customer information to be more effective.
What kind of information does management want to know? This matters because the DW data is primarily used by management.
• Which are our lowest/highest margin customers?
• What is the most effective distribution channel?
• What product promotions have the biggest impact on revenue?
• What impact will new products/services have on revenue and margins?
• Which customers are most likely to go to the competition?
• Who are my customers and what products are they buying?

In OLTP applications, end users are individuals who take care of day-to-day operations.
In DW applications, end users are managers and above who make decisions based on trends, history, predictions, etc.
If end users are not satisfied with the application, then the product is considered to be a failure even though technology-wise it is a great achievement.

Data Warehouse Architecture:
Source Data:
An organization will have many OLTP applications; all of this operational data becomes the source for the Data Warehouse database.

ETL: (Extract, Transform and Load)
We extract data from various operational systems and clean the data so that we get only the information that makes sense to have in the Data Warehouse. While cleansing the data we may reject some records or fill in missing information. Once we have transformed the operational data into the format the DW expects, we load the data into the DW. This process takes most of the time while developing DW applications.

DW Database
This is the area where we store the data required by the business so that they can run any report against the data. In data warehouses we have current and historical information, which is very useful for trend analysis, behavioral analysis, etc.

What is Data Mart?
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales or Finance or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

Difference between Data Warehouse and Data Mart

Data Warehouse                               Data Mart
Enterprise-wide                              Departmental
Structure for corporate view of data         Star Schema based (Facts and dimensions)
Organized E-R Model or Galaxy of Star        Quick turn around (up and running quickly,
(Multiple Star schemas in the Data Model)    as there are fewer stakeholders)
Long turn around time

Data Granularity
What is Granularity of your DW?
Granularity is the level of detail we want to store in the data warehouse. For a retail store, Point of Sale (POS) is the lowest granularity information available. For banking it is the account-level detail based on every day's transactions. As a DSS leans towards analyzing the data as a whole, the data warehouse will not necessarily hold all the details down to daily transactions. Possible levels of granularity are:
• Daily sales by date, product and customer
• Weekly sales by product and customer
• Monthly sales by product and customer
• Quarterly sales by product and customer
• Yearly sales by product and customer
Usually in Data Warehouses (EDW) we tend to keep POS-level data, whereas in data marts we keep it aggregated by week or month, so that we never lose the detailed information. This detailed data can be used to study the micro behaviors of our customers (especially in Data Mining).

Data Warehousing Objects:
Data warehousing consists of only two objects:
• Fact
• Dimension

Fact Tables:
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive
facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.

Dimension Tables:
A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time.
Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Hierarchies:
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.
Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business structures, for example, a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values.
For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

[Figure: time hierarchy with the levels YEAR, QUARTER, MONTH, WEEK]
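The ideas above (a fact table with foreign keys to a dimension, rollups along a time hierarchy, and additive versus semi-additive facts) can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names time_dim and sales_fact, their columns, and the sample rows are hypothetical and do not come from the original material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (              -- hypothetical time dimension
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    quarter  TEXT,
    month    TEXT
);
CREATE TABLE sales_fact (            -- hypothetical fact table
    date_key     INTEGER REFERENCES time_dim(date_key),
    product      TEXT,
    sales_amount REAL,               -- additive fact
    stock_level  INTEGER             -- semi-additive fact
);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
    (1, 2007, "Q1", "Jan"),
    (2, 2007, "Q1", "Feb"),
    (3, 2007, "Q2", "Apr"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, "widget", 100.0, 40),
    (2, "widget", 150.0, 30),
    (3, "widget", 200.0, 25),
])

def rollup(level_columns):
    """Sum the additive sales fact at the given level of the time hierarchy."""
    cols = ", ".join(level_columns)
    sql = (f"SELECT {cols}, SUM(f.sales_amount) "
           "FROM sales_fact f JOIN time_dim t ON f.date_key = t.date_key "
           f"GROUP BY {cols} ORDER BY {cols}")
    return conn.execute(sql).fetchall()

# Month-level rows aggregate into quarters, and quarters into years.
by_quarter = rollup(["t.year", "t.quarter"])
by_year = rollup(["t.year"])

# Semi-additive fact: stock levels are not summed over time;
# this sketch takes the level at the latest date instead.
latest_stock = conn.execute(
    "SELECT stock_level FROM sales_fact ORDER BY date_key DESC LIMIT 1"
).fetchone()[0]

print(by_quarter, by_year, latest_stock)
```

Drilling down is the reverse of the rollup: passing a longer list of level columns to rollup returns the finer-grained rows behind a total.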
How to handle Slowly Changing Dimensions (SCDs) in data model design?
Posted by Dylan Wan on January 13, 2007

There are multiple methods to handle slowly changing dimensions. Which technique to use depends on your business requirements. The choice among these three methods is not a technical design decision, since their behaviors are different.

Type One: Overwrite the old data with new data
Using this method, you do not store the history. For example, let's say each customer can have one salesrep at any given point in time. When the salesrep of ABC Inc. changes from Sandy to Laura, the fact that Sandy was a salesrep of ABC will not be kept anywhere. Any report by salesrep will assume that Laura has been the salesrep of ABC Inc. forever and count all the sales done by Sandy as Laura's.
The above example may not seem to make business sense. However, if you only report the sales of the current period, and the salesrep does not change during the period, this method is fine to use.
Many OLTP tables do not need to track the history of changes, and thus this method may be used by the source application. However, if you want to report on historical data, then even if your OLTP system does not track history, the data warehouse can still use other methods to track it.

Type Two: Add a new record at the time of the change
Using this method, all prior history is saved. There are several alternative methods to model the key of this table.

Method A – No surrogate key – Use timestamp
When a change happens, a new record is added to the table. All the attributes are copied from the previous record except the changed values. The natural key is copied as well, so the timestamp is used to differentiate the records.
When a fact table is joined with the dimension, if you are interested in the historical data, the timestamp will be used as part of the join condition. To ease the join, the record typically uses two date columns: the effective start date and the effective end date.
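Method A can be sketched as follows, using Python's sqlite3 module and the Sandy/Laura example. The table and column names, the sample dates and sales, and the use of '9999-12-31' as an open-ended end date are assumptions made for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (          -- hypothetical dimension table
    customer_id     TEXT,            -- natural key, repeated across versions
    salesrep        TEXT,
    effective_start TEXT,            -- ISO dates compare correctly as text
    effective_end   TEXT
);
CREATE TABLE sales_fact (            -- hypothetical fact table
    customer_id TEXT,
    sale_date   TEXT,
    amount      REAL
);
""")

OPEN_END = "9999-12-31"  # assumed marker for "still current"

def change_salesrep(customer_id, new_rep, change_date):
    """Close the current version (if any) and add a new record for the change."""
    conn.execute(
        "UPDATE customer_dim SET effective_end = ? "
        "WHERE customer_id = ? AND effective_end = ?",
        (change_date, customer_id, OPEN_END))
    conn.execute(
        "INSERT INTO customer_dim VALUES (?, ?, ?, ?)",
        (customer_id, new_rep, change_date, OPEN_END))

change_salesrep("ABC", "Sandy", "2006-01-01")
change_salesrep("ABC", "Laura", "2006-07-01")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("ABC", "2006-03-15", 100.0),   # sold while Sandy was the salesrep
    ("ABC", "2006-09-20", 250.0),   # sold after Laura took over
])

# The effective dates are part of the join condition, so each sale is
# attributed to the salesrep who was current on the sale date.
sales_by_rep = conn.execute("""
    SELECT d.salesrep, SUM(f.amount)
    FROM sales_fact f
    JOIN customer_dim d
      ON d.customer_id = f.customer_id
     AND f.sale_date >= d.effective_start
     AND f.sale_date <  d.effective_end
    GROUP BY d.salesrep
    ORDER BY d.salesrep
""").fetchall()

print(sales_by_rep)
```

Unlike Type One, Sandy's sales stay attributed to Sandy after the change, because the older dimension record is kept rather than overwritten.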
Method B – No surrogate key – Use version number
Instead of using the date column, a version number is used to differentiate the different versions of the records.
This technique requires the fact table to store both the natural key and the version number in order to retrieve a given version of the dimension data.

Method C – Use a surrogate key
When an attribute changes, a sequence-generated key is assigned to the new record, and the fact table uses this key column as the foreign key.

Type Three: Track changes using a separate column
Using this method, you use a separate column of the dimension table to store the values of the previous year, in addition to the current year's data.
This method does not track all the history, but just one prior version.
If the data changes, the old value needs to be moved from the current value column to the prior column, and the new value overwrites the current column.
This method is used when the changes are not random but occur at a predefined interval, such as annually.
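A minimal sketch of Type Three, again with Python's sqlite3 module. The customer_dim table, the prior_salesrep column and the third salesrep "Maria" are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (          -- hypothetical dimension table
        customer_id    TEXT PRIMARY KEY,
        salesrep       TEXT,             -- current value
        prior_salesrep TEXT              -- one prior version only
    )
""")
conn.execute("INSERT INTO customer_dim VALUES ('ABC', 'Sandy', NULL)")

def change_salesrep(customer_id, new_rep):
    """Move the current value to the prior column, then overwrite the current one."""
    # The right-hand sides of SET see the row's old values, so
    # prior_salesrep receives the salesrep that is being replaced.
    conn.execute(
        "UPDATE customer_dim "
        "SET prior_salesrep = salesrep, salesrep = ? "
        "WHERE customer_id = ?",
        (new_rep, customer_id))

def current_and_prior(customer_id):
    return conn.execute(
        "SELECT salesrep, prior_salesrep FROM customer_dim WHERE customer_id = ?",
        (customer_id,)).fetchone()

change_salesrep("ABC", "Laura")
after_first = current_and_prior("ABC")

change_salesrep("ABC", "Maria")
after_second = current_and_prior("ABC")

print(after_first, after_second)
```

After the second change Sandy no longer appears anywhere, which is the limitation noted above: only one prior version is kept.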
Structured Query Language
SQL is a database language used to create, manipulate and control access to database objects. SQL is a non-procedural language used to access relational databases. It is a flexible, efficient language with features designed to manipulate and examine relational data.
SQL is only used for definition and manipulation of database objects. It cannot be used for application development such as form definitions, creation of procedures, etc. For that you need a 3GL language such as COBOL or a 4GL language such as dBase to provide front-end support to the database.

Key features of SQL are:
• Non-procedural language
• Unified language
• Common language for all relational databases (syntax may differ between RDBMSs)

SQL is made of three sub-languages:
• Data Definition Language (DDL)
• Data Manipulation Language (DML)
• Data Control Language (DCL)

Data Definition Language (DDL): Allows you to define database objects at the conceptual level. It consists of commands to create objects and alter the structure of objects, such as tables, views, indexes, etc. Commonly used DDL statements are CREATE, DROP, etc.
If you want to create a table STUDENT, use the following syntax:
    CREATE TABLE STUDENT (
        STUDENT_ID INTEGER PRIMARY KEY,
        STUDENT_NM VARCHAR(30),
        COURSE_ID  VARCHAR(15),
        PHONE      VARCHAR(12),
        ADDRESS    VARCHAR(50)
    );
To drop a table from the database:
    DROP TABLE STUDENT;

Data Manipulation Language (DML): Allows you to retrieve or update data within a database. It is used for query, insertion, deletion and updating of information stored in databases. E.g.: SELECT, INSERT, UPDATE, DELETE.
STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1001        JAMES       Oracle        972-888-9018  888, North Central Exp, Dallas, TX - 75089
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039

Select statement:
The SELECT statement in SQL is used to display certain data from a table. For example, suppose you want to know what course Jim is taking. A SELECT statement fetches the information you want by using the information you have. In this scenario the information you have is student_nm, which is Jim, and the information you want is course_id; the intersection of those two columns in that table is what you are looking for.
    SELECT (what you want)
    FROM (which tables)
    WHERE (what you have)
The SELECT statement to find Jim's course_id looks like this:
    SELECT COURSE_ID
    FROM STUDENT
    WHERE STUDENT_NM = 'JIM';
You will get the result as:
    COURSE_ID
    MSSql Server
If you want to see all the rows in the table then your select will be:
    SELECT * FROM STUDENT;
If you would like to show the student_nm and address of everyone attending the Oracle course in the form of a report, then your select will look like:
    SELECT STUDENT_NM, ADDRESS
    FROM STUDENT
    WHERE COURSE_ID = 'Oracle';
The result will be:
STUDENT_NM ADDRESS
JAMES 888, North Central Exp, Dallas, TX - 75089

Insert Statement
The INSERT statement is used to insert a new row into the table. For example, if a new student DAVE is joining the Java course, use the INSERT SQL statement:
    INSERT INTO STUDENT (STUDENT_ID, STUDENT_NM, COURSE_ID, PHONE, ADDRESS)
    VALUES (1004, 'DAVE', 'Java', '972-912-4008', '567, Washington Ave, Dallas - 75543');
After executing the insert statement, your table should look like the one below when you issue a select from the student table:
STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1001        JAMES       Oracle        972-888-9018  888, North Central Exp, Dallas, TX - 75089
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039
1004        DAVE        Java          972-912-4008  567, Washington Ave, Dallas - 75543

Update Statement
The UPDATE statement is used to change existing information in the table. For example, if DAVE moved to another address then we need to change the ADDRESS column for DAVE's record. If the new address is 146, Dallas Parkway, Dallas - 75240 then your update should be:
    UPDATE STUDENT
    SET ADDRESS = '146, Dallas Parkway, Dallas - 75240'
    WHERE STUDENT_NM = 'DAVE';
To make sure you updated the ADDRESS column for DAVE, issue the following SQL:
    SELECT * FROM STUDENT WHERE STUDENT_NM = 'DAVE';
You should then see the following result:
STUDENT_ID  STUDENT_NM  COURSE_ID  PHONE         ADDRESS
1004        DAVE        Java       972-912-4008  146, Dallas Parkway, Dallas - 75240

Delete Statement
is used to delete a row from the table, i.e. to remove records from the table. For example, JAMES moved to a different city and does not want to take the course. To remove JAMES's record from the table we use the DELETE statement:
    DELETE FROM STUDENT
    WHERE STUDENT_NM = 'JAMES';
Once you delete the record and select all the information from the student table, you should see the following:
STUDENT_ID  STUDENT_NM  COURSE_ID     PHONE         ADDRESS
1002        JIM         MSSql Server  972-678-8909  567, Preston Road, Dallas, TX - 75240
1003        BRUCE       Java          214-571-1567  1234, Elm Street, Dallas, TX - 75039
1004        DAVE        Java          972-912-4008  146, Dallas Parkway, Dallas - 75240
If you don't include a WHERE clause in the delete statement, it will remove all the rows from the table.

Data Control Language (DCL)
One of the main advantages of an RDBMS is the security it provides for the data in the database. You can allow a user to perform a specific operation or all operations on certain objects. Examples of DCL statements are the GRANT and REVOKE statements.
GRANT is used to grant a permission to a user so that the user can perform that operation.
REVOKE is used to take that permission on that object back from the user.
For example, we have two users, JAMES and DAVID.
If JAMES created a table called ITEMS then JAMES becomes the owner of that table.
DAVID cannot access the ITEMS table because he is not the owner of that table.
DAVID can access ITEMS if JAMES gives him permission on the table.
JAMES can give DAVID different types of access to the ITEMS table, such as Select, Update, Delete and Insert.
For example:
If JAMES wants to provide only Select on ITEMS to DAVID then he can issue:
    GRANT SELECT ON ITEMS TO DAVID;
If JAMES wants to provide only Select and Insert on ITEMS to DAVID then he can issue:
    GRANT SELECT, INSERT ON ITEMS TO DAVID;
If JAMES wants to provide all the operations on ITEMS to DAVID then he can issue:
    GRANT ALL ON ITEMS TO DAVID;
Once you grant all permissions on an object to a user, that user can manipulate the table almost as if he were the owner.
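The DML walkthrough on the STUDENT table (SELECT, INSERT, UPDATE, DELETE) can be replayed end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for the RDBMS, it does not enforce VARCHAR lengths, and PHONE is declared here as VARCHAR(12) so that the sample phone numbers fit. SQLite has no GRANT or REVOKE, so the DCL statements are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE STUDENT (
        STUDENT_ID INTEGER PRIMARY KEY,
        STUDENT_NM VARCHAR(30),
        COURSE_ID  VARCHAR(15),
        PHONE      VARCHAR(12),
        ADDRESS    VARCHAR(50)
    )
""")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?, ?, ?)", [
    (1001, "JAMES", "Oracle", "972-888-9018",
     "888, North Central Exp, Dallas, TX - 75089"),
    (1002, "JIM", "MSSql Server", "972-678-8909",
     "567, Preston Road, Dallas, TX - 75240"),
    (1003, "BRUCE", "Java", "214-571-1567",
     "1234, Elm Street, Dallas, TX - 75039"),
])

# SELECT: what you want (COURSE_ID) from what you have (STUDENT_NM).
jim_course = conn.execute(
    "SELECT COURSE_ID FROM STUDENT WHERE STUDENT_NM = ?", ("JIM",)
).fetchone()[0]

# INSERT: DAVE joins the Java course.
conn.execute(
    "INSERT INTO STUDENT (STUDENT_ID, STUDENT_NM, COURSE_ID, PHONE, ADDRESS) "
    "VALUES (?, ?, ?, ?, ?)",
    (1004, "DAVE", "Java", "972-912-4008",
     "567, Washington Ave, Dallas - 75543"))

# UPDATE: DAVE moves to a new address.
conn.execute(
    "UPDATE STUDENT SET ADDRESS = ? WHERE STUDENT_NM = ?",
    ("146, Dallas Parkway, Dallas - 75240", "DAVE"))
dave_address = conn.execute(
    "SELECT ADDRESS FROM STUDENT WHERE STUDENT_NM = ?", ("DAVE",)
).fetchone()[0]

# DELETE: JAMES leaves; the WHERE clause keeps the other rows.
conn.execute("DELETE FROM STUDENT WHERE STUDENT_NM = ?", ("JAMES",))
remaining = [row[0] for row in conn.execute(
    "SELECT STUDENT_NM FROM STUDENT ORDER BY STUDENT_ID")]

print(jim_course, dave_address, remaining)
```

Passing values as ? parameters instead of pasting them into the SQL text avoids the quoting mistakes that are easy to make with string literals.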
Oracle datatypes
Data in a database is stored in the form of tables. Each table consists of rows and columns to store the data.
A particular column in a table must contain the same type of data.
For example:
PLAYER_NAME (char)  COUNTRY (char)  DATE_OF_BIRTH (date)  ROOM_NO (number)
AGASSI              USA             10/12/1969            1004
WILLIAM             USA             01/15/1975            1006
JIM                 RUSSIA          05/25/1980            1007
HINGIS              SWITZERLAND     06/25/1979            1009
Every column holds a certain kind of information: PLAYER_NAME is a char column, DATE_OF_BIRTH is a date column, and ROOM_NO is a number column.

Different datatypes available in an Oracle database:
CHAR: Used to store character data, for example the name of a person (you can save anything in a character field).
VARCHAR: Same as CHAR. The only difference between CHAR and VARCHAR is the way the database saves the data.
To understand the difference better, take the following example.
    CREATE TABLE EMPLOYEE (EMP_NO NUMBER(4), ENAME CHAR(15))
EMP_NO  ENAME
888     CLARK
889     KING
890     DAVID COOPER
As the ENAME column is defined as CHAR(15), every value you put in that column occupies all 15 bytes; that is, CLARK is a 5-byte string, so the database pads it with 10 spaces.
    CREATE TABLE EMPLOYEE (EMP_NO NUMBER(4), ENAME VARCHAR(15))
EMP_NO  ENAME
888     CLARK
889     KING
890     DAVID COOPER
Here, as ENAME is defined as VARCHAR(15), it occupies only the required space, so in the above table the ename CLARK occupies only 5 bytes in the database.
So what are the advantages and disadvantages? The rule of thumb is that if you use a character column as a primary key, it had better be a CHAR field; if you use a column to hold comments, you should use VARCHAR.
NUMBER: Used to store numbers. For example, if you want to store employee numbers then you define the column's data type as NUMBER. If you want to define a column to store currency then you can define the column as NUMBER(7,2).
DATE: Used to store dates, such as the date of birth of a person, the joining date in a company, etc.
LONG: Used to store variable-length character data.
RAW, LONG RAW: Used to store binary data of variable length.
LOB: Large objects, used to store binary files.
In addition, Oracle 8 supports CLOB, BLOB and BFILE:
CLOB - stores large character data. A table can have multiple columns of this type.
BLOB - can store large binary objects such as graphics, video and sound files.
BFILE - stores file pointers to LOBs managed by file systems external to the database.

Constraints
When you bind a business rule to a column in a table, those rules are called constraints. Constraints are defined while creating the table. For example, if you cannot have an employee who does not have a name, then the employee name column in the employee table should be a NOT NULL column. NOT NULL is a constraint.
The following table shows the constraint types and short descriptions.
Constraint Type   Description
NOT NULL          You must provide a value in that column; you cannot leave that column blank.
PRIMARY KEY       No duplicate values allowed; for example, Empno in the Employee table
                  should be unique.
CHECK             Checks the value and controls which values can be inserted or updated.
DEFAULT           Assigns a default value if no value is given.
REFERENCES        Maintains referential integrity (foreign key).

Examples of business rules that are usually implemented through constraints:

NOT NULL
If we have a business rule saying that all customers should have a name, we cannot have any customer without a name. To implement that business rule we can create the customer table and specify the customer name column as NOT NULL (constraint).
Example
    CREATE TABLE EMPLOYEE (EMPNO NUMBER(4) PRIMARY KEY, ENAME VARCHAR(15) NOT NULL);

CHECK
A check constraint is used where we define a condition on a column. It consists of the keyword CHECK followed by a condition:
    col_name datatype CHECK (col_name IN (value1, value2))
Example
If you have a business rule saying that all employees in the organization should get at least $500 then we can use a CHECK constraint while creating the table.
    CREATE TABLE EMPLOYEE (EMPNO NUMBER(4) PRIMARY KEY, ENAME VARCHAR(15) NOT NULL, SALARY NUMBER(7,2) CHECK (SALARY >= 500));

DEFAULT
When a row is inserted into a table without values for every column, SQL must insert a default value to fill in the excluded columns, or the command will be rejected. The most common default value is NULL. This can be used with columns not defined as NOT NULL.
A default value is assigned to a column while creating the table with the CREATE TABLE operation.
Example
    CREATE TABLE ITEM (ITEM_ID NUMBER(4) PRIMARY KEY, ITEM_NAME VARCHAR(15), ITEM_DESC VARCHAR(100), QOH NUMBER(4) DEFAULT 100)
Assigning a default value of 0 to numeric columns simplifies computation, because arithmetic on NULL yields NULL.

PRIMARY KEY
The primary key of a table is a unique identifier of a row. For example, if you are maintaining customer profiles, you should assign a particular number to each customer. So customer_number should be defined as the primary key in the Customer table.
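To see the constraints described above reject bad rows, here is a small sketch using Python's sqlite3 module. It assumes SQLite rather than Oracle, so the column types are SQLite's; the BONUS column and the sample rows are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        EMPNO  INTEGER PRIMARY KEY,              -- unique row identifier
        ENAME  VARCHAR(15) NOT NULL,             -- every employee needs a name
        SALARY NUMERIC CHECK (SALARY >= 500),    -- at least $500
        BONUS  NUMERIC DEFAULT 0                 -- hypothetical column
    )
""")

def try_insert(empno, ename, salary):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO EMPLOYEE (EMPNO, ENAME, SALARY) VALUES (?, ?, ?)",
            (empno, ename, salary))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert(888, "CLARK", 900)   # satisfies every constraint
no_name = try_insert(889, None, 900)       # violates NOT NULL
low_pay = try_insert(890, "KING", 100)     # violates CHECK
dup_key = try_insert(888, "DAVID", 900)    # violates PRIMARY KEY

# BONUS was not supplied, so the DEFAULT value was filled in.
default_bonus = conn.execute(
    "SELECT BONUS FROM EMPLOYEE WHERE EMPNO = 888").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM EMPLOYEE").fetchone()[0]

print(accepted, no_name, low_pay, dup_key, default_bonus, row_count)
```

Only the first insert succeeds; the other three are rejected by the database itself, so the business rules hold no matter which application issues the statement.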