Basic Concepts

"The set of data structures and tools useful to obtain, starting from the legacy data, information that can be used by manager in the business decision process.

In more detail, a DW can be seen as a process composed by three steps:

Schema DW

  1. Data extraction: it is a batch operation carried out periodically (every day/week/month); it produces summary data by scanning operational data. This step is usually automatized by using tools that are able to select useful information and map them into the DW framework (calculating possibly derived data).

  2. Data storage and reorganization: Data are stored within a strongly denormalized and indexed database. In comparison with a conventional Information System, a DW entails limited transactional issues. In fact, only read-only accesses are allowed to the users; the primary target in DW applications is to provide results in a really short time.

  3. Information evalutation:: different classes of tools can be used. In particular, OLAP tools are widely recognized as the most common way to access DW information. Newertheless, some kinds of analysis require an enourmous amount of data that cannot be analyzed by a human user. To overcome this problem, DATA MINING tools have been introduced. These tools are able to automatically analyze the data stored in a DW searching for correlations and clusters.



Consequently, the basic requirements for DWs can be defined as follows:

Design

The database comunity is devoting increasing attention to the research themes concerning DW; in fact, the development of decision-support systems will probably be one of the leading issues for the next years. The enterprises, after having invested a lot of time and resources to build huge and complex information systems, ask for support in obtaining quickly summary information wich may help managers to acquire and integrate information from different sources and to query efficiently very large databases.

Due to the importance of speeding up information retrieval DW-related research mostly concern physical design. The lack of interest in issues concerning conceptual design can be explained by examining the DW history. In fact, the early trials in organising the enterprise data so that they could be easily accessed are due to the practical requirements of firms; this kind of forum is obviously interested in short period results and is not worried about the medium/long-period problems which will arise due to an inadequate conceptual design.

The word DW is really young and not all the issues and the potentialities related to it have been discovered. We claim that the problem of design will assume main relevance in the next years. Experience shows that complex systems can survive only if the design phase has been carried out accurately. Our considerations are confirmed by the results coming from United States where significant investments have been made in DW Systems: leading figures of this reality agree on the necessity of adding a DW system to 'conventional' information Systems in order to better exploit data. Nevertheless, the high number of failures in terms of results/investments results/time rates poses many questions. Statistics show that the main cause of DW project failures can be found in the design phase. In the following we will point out the factor thats we consider fundamental for the success of a DW project.

Choosing the design metodology: Early DWs were designed using the TOP-DOWN metodology in order to obtain systems caracterized by high consistance and completeness. The entire information system was involved at the same time. This approach requires strong investments and cannot produce results within reasonable time (6 months or 1 year). These drawback causes lack of interest and mistrust in the project, that eventually is often abandoned. A more careful approach follows the BOTTOM-UP metodology: a first data-mart is developed and then an incremental approach is adopted. Also this solution presents many risks, in fact it could determine partial views of the reality hiding inconsistencies and incompatibilities between modules. Choosing the first data mart is fondamental in order to achieve a compromise between technical and political intersts:

System design: During design, the specific aims of the DW must be taken into account. These are totally different with respect to a conventional information system and thus they require an ad hoc approach. In particular let us consider the following table:



Data warehouse VS Operational System
90% Extemporary queries 90% Predefined queries
Read only accesses Read/Write accesses
Denormalized Normalized
Maintains historical versions of data Not always maintains historical versions of data
Optimized for accesses that involve a great part of the data Optimized for accesses that involve a little part of the data
Contains mostly numerical data Contains both numerical and alphanumerical data
Based on synthesis data Based on elementary data

These fundamental differencies affect strongly the designer choices, the sequence of steps and the relevant information.

Tool selection: A data warehouse is a process and not only a collection of software. This means that, in order to built a complete system, several tools are necessary. Usually, they are produced by different companies and it not always easy to integrate them. Choosing a component means to adapt the system to the company characteristics; this could be a very hard work since scalabity requirements must also be taken into account. In fact, as resulting from some statistics, the DW workload usually becomes, within a few years, 4-5 times the one extimated during design.