Acquire new customers; develop a more intimate picture of those you already have. Optimize your current process by backing your decisions with data. Harness the potential of big data or predictive modelling. And complement all of these with external data sources that will give you more meaningful insights.
Whatever your challenge, we’ll propose the right data strategy for your business.
Logikview’s data mining services ensure timely, reliable results by following the Cross-Industry Standard Process for Data Mining (CRISP-DM), the industry-standard process for data mining projects. Created by industry experts, CRISP-DM provides step-by-step guidelines, tasks, and objectives for every stage of the data mining process.
CRISP-DM demands that data mining be seen as an entire process, from communication of the business problem through data collection and management, data pre-processing, model building, and model evaluation to, finally, model deployment. Mastering the methodology therefore requires a combination of abilities, ranging from data affinity and quantitative reasoning to sound business acumen and well-developed communication skills.
The methodology consists of six steps, each of them equally important in the generation of meaningful analytical insights and the production of actionable results.
Know “who, what, when, where, why, and how” from a business perspective.
Develop a thorough understanding of the project parameters: the current business situation, the primary business objective of the project, the criteria for success, and who will determine the success of the project.
Make sure the data is available.
Gather all of the data you will need for your project. If your data will come from more than one source, make sure your data mining tool can integrate the data. The data understanding phase starts with an initial data collection and proceeds with activities to get familiar with the data, identify data quality problems, discover first insights, and detect interesting subsets that suggest hypotheses about hidden information.
Produce a Project Plan
The project plan describes how the data mining goals will be achieved, including the specific steps and a proposed timeline, an assessment of potential risks, and an initial assessment of the tools and techniques needed to support the project. Generally accepted industry timeline standards are: 50 to 70 percent of the time and effort in a data mining project involves the Data Preparation Phase; 20 to 30 percent involves the Data Understanding Phase; only 10 to 20 percent is spent in each of the Modeling, Evaluation, and Business Understanding Phases; and 5 to 10 percent is spent in the Deployment Planning Phase.
Select your data
Decide what data to use for analysis and list the reasons for your decisions. This involves:
- Performing significance and correlation tests to determine which fields to include
- Selecting data subsets
- Using sampling techniques to review small chunks of data for appropriateness
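As a minimal sketch of the first and last items in this list, assuming a small in-memory table and plain Python: the field names, the 0.3 cut-off, and the helper names `pearson` and `select_fields` are illustrative assumptions, and only a correlation check is shown, not a full significance test.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant field carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_fields(rows, target, threshold=0.3):
    """Keep fields whose absolute correlation with the target meets the threshold."""
    target_values = [row[target] for row in rows]
    selected = {}
    for field in rows[0]:
        if field == target:
            continue
        r = pearson([row[field] for row in rows], target_values)
        if abs(r) >= threshold:
            selected[field] = round(r, 3)
    return selected


# Hypothetical customer records; every value is made up for illustration.
rows = [
    {"age": 25, "salary": 30, "shoe_size": 42, "widgets": 2},
    {"age": 35, "salary": 45, "shoe_size": 39, "widgets": 4},
    {"age": 45, "salary": 60, "shoe_size": 44, "widgets": 6},
    {"age": 55, "salary": 80, "shoe_size": 40, "widgets": 9},
    {"age": 65, "salary": 70, "shoe_size": 43, "widgets": 8},
]

# Fields worth keeping for a model that predicts "widgets".
chosen = select_fields(rows, target="widgets")

# Draw a small, reproducible random sample of records to review by hand.
sample = random.Random(0).sample(rows, k=2)
```

In this toy table, `age` and `salary` pass the cut-off while `shoe_size` does not. On real data the threshold would need to be justified and recorded along with the other selection decisions.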
The data preparation phase covers all activities needed to construct the final dataset (the data that will be fed into the modelling tools) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.
Assess the Situation
In this step, the data analyst outlines the resources, from personnel to software, that are available to accomplish the data mining project. Particularly important is discovering what data is available to meet the primary business goal. At this point, the data analyst should also list the assumptions made in the project, such as “To address the business question, a minimum number of customers over age 50 is necessary.” The data analyst should also list the project risks and potential solutions to those risks, create a glossary of business and data mining terms, and construct a cost-benefit analysis for the project.
Determine the Data Mining Goals
The data mining goal states project objectives in technical terms, such as “Predict how many widgets a customer will buy given their purchases in the past three years, demographic information (age, salary, city, etc.), and the item price.” Success should be defined in the same terms; for instance, success could be defined as achieving a certain level of predictive accuracy.
Collect the Initial Data
Here a data analyst acquires the necessary data, including loading and integrating this data if necessary. The analyst should make sure to report problems encountered and his or her solutions to aid with future replications of the project. For instance, data may have to be collected from several different sources, and some of these sources may have a long lag time. It is helpful to know this in advance to avoid potential delays.
Describe the Data
During this step, the data analyst examines the “gross” or “surface” properties of the acquired data and reports on the results, examining issues such as the format of the data, the quantity of the data, the number of records and fields in each table, the identities of the fields, and any other surface features of the data. The key question to ask is: Does the data acquired satisfy the relevant requirements? For instance, if age is an important field and the data does not reflect the entire age range, it may be wise to collect a different set of data. This step also provides a basic understanding of the data on which subsequent steps will build.
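A data description report of this kind can be sketched in a few lines. The example below is only an illustration in plain Python: the helper names `describe` and `covers_range`, the sample records, and the required age range of 18 to 65 are all assumptions.

```python
def describe(rows):
    """Summarize surface properties: record count, fields, value types, numeric ranges."""
    fields = sorted({name for row in rows for name in row})
    summary = {"records": len(rows), "fields": {}}
    for name in fields:
        values = [row[name] for row in rows if row.get(name) is not None]
        info = {
            "types": sorted({type(v).__name__ for v in values}),
            "present": len(values),  # number of records with a value
        }
        numbers = [v for v in values if isinstance(v, (int, float))]
        if numbers:
            info["min"], info["max"] = min(numbers), max(numbers)
        summary["fields"][name] = info
    return summary


def covers_range(summary, field, low, high):
    """Return True if the observed values of a numeric field span [low, high]."""
    info = summary["fields"][field]
    if "min" not in info:
        return False  # no numeric values observed for this field
    return info["min"] <= low and info["max"] >= high


# Hypothetical records, made up for illustration.
rows = [
    {"age": 22, "city": "Pune"},
    {"age": 48, "city": "Delhi"},
    {"age": 31, "city": None},
]

summary = describe(rows)
age_ok = covers_range(summary, "age", 18, 65)
```

Here the ages run only from 22 to 48, so the check against the assumed 18 to 65 requirement fails, which is the situation in which collecting a different data set may be wise.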
Explore the Data
This task tackles the data mining questions, which can be addressed using querying, visualization, and reporting. For instance, a data analyst may query the data to discover the types of products that purchasers in a particular income group usually buy. Or the analyst may run a visualization analysis to uncover potential fraud patterns. The data analyst should then create a data exploration report that outlines first findings, or an initial hypothesis, and the potential impact on the remainder of the project.
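The income-group query mentioned above could look roughly like this in plain Python. It is a sketch under assumptions: the field names `income_band` and `product`, the helper `top_products_by_group`, and the purchase records are invented for illustration.

```python
from collections import Counter, defaultdict


def top_products_by_group(purchases, group_field, product_field, n=1):
    """For each group, list the n most frequently purchased products."""
    counts = defaultdict(Counter)
    for purchase in purchases:
        counts[purchase[group_field]][purchase[product_field]] += 1
    return {
        group: [product for product, _ in counter.most_common(n)]
        for group, counter in counts.items()
    }


# Hypothetical purchase log, made up for illustration.
purchases = [
    {"income_band": "low", "product": "bread"},
    {"income_band": "low", "product": "bread"},
    {"income_band": "low", "product": "tea"},
    {"income_band": "high", "product": "wine"},
    {"income_band": "high", "product": "wine"},
    {"income_band": "high", "product": "wine"},
    {"income_band": "high", "product": "bread"},
]

favourites = top_products_by_group(purchases, "income_band", "product")
```

Findings such as these would go into the data exploration report as first observations, not as conclusions.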
Verify Data Quality
At this point, the analyst examines the quality of the data, addressing questions such as: Is the data complete? Missing values often occur, particularly if the data was collected over long periods of time. Common items to check include: missing attributes and blank fields; whether all possible values are represented; the plausibility of values; the spelling of values; and whether different values of an attribute carry the same meaning (e.g., “low fat” and “diet”). The data analyst should also review any records whose values conflict with common sense (e.g., teenagers with high income).
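Several of these checks can be automated. The following is a minimal sketch in plain Python, not a complete quality audit: the helper name `quality_report`, the field names, the list of allowed values, and the plausibility rule (under 20 years old with income above 100,000) are all assumptions chosen to mirror the examples in the text.

```python
def quality_report(rows, required, allowed=None, plausible=None):
    """Count missing values, list unexpected values, and count implausible records."""
    allowed = allowed or {}
    plausible = plausible or {}
    report = {"missing": {}, "unexpected": {}, "implausible": {}}

    # Missing attributes and blank fields.
    for field in required:
        values = [row.get(field) for row in rows]
        report["missing"][field] = sum(v in (None, "") for v in values)

    # Values outside the agreed vocabulary (misspellings, synonyms).
    for field, ok_values in allowed.items():
        seen = {row[field] for row in rows if row.get(field) not in (None, "")}
        report["unexpected"][field] = sorted(seen - set(ok_values))

    # Records that fail a common-sense rule supplied by the analyst.
    for field, check in plausible.items():
        report["implausible"][field] = sum(
            1 for row in rows
            if row.get(field) not in (None, "") and not check(row)
        )
    return report


# Hypothetical records, made up for illustration.
rows = [
    {"age": 16, "income": 250000, "diet": "low fat"},
    {"age": 42, "income": 60000, "diet": "diet"},
    {"age": None, "income": 38000, "diet": "regular"},
    {"age": 35, "income": 52000, "diet": "reguler"},
]


def income_is_plausible(row):
    """Assumed rule: flag anyone under 20 with an income above 100,000."""
    return not (row["age"] is not None and row["age"] < 20 and row["income"] > 100000)


report = quality_report(
    rows,
    required=["age", "income", "diet"],
    allowed={"diet": {"low fat", "regular"}},
    plausible={"income": income_is_plausible},
)
```

The report flags one missing age, two unexpected diet labels (a synonym and a misspelling), and one implausible income. Whether a flagged value is actually wrong remains a judgment for the analyst and the business.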
In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase may be necessary. Modeling steps include the selection of the modeling technique, the generation of test design, the creation of models, and the assessment of models.
Select the Modeling Technique
This task refers to choosing one or more specific modeling techniques, such as decision tree building with C4.5 or neural network generation with backpropagation. If assumptions are attached to the modeling technique, these should be recorded.
Generate a Test Design
Before a model is built, the data analyst needs a procedure for testing its quality and validity empirically. In supervised data mining tasks such as classification, it is common to use error rates as quality measures. The data set is therefore typically split into a training set and a test set: the model is built on the training set, and its quality is estimated on the separate test set. This lets the data analyst measure how well the model can predict history before using it to predict the future. Because the test design also has implications for data preparation, it is usually appropriate to design the test procedure before building the model.
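A holdout test of this kind can be sketched as follows. This is an illustration in plain Python under stated assumptions: the 70/30 split, the fixed seed, the synthetic data, and the one-threshold classifier (`fit_stump`) are stand-ins, not a recommended modeling technique.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(threshold, data):
    """Share of records misclassified by the rule 'predict 1 when x >= threshold'."""
    wrong = sum((x >= threshold) != bool(y) for x, y in data)
    return wrong / len(data)


def fit_stump(train):
    """Pick the threshold, among observed x values, with the lowest training error."""
    candidates = sorted({x for x, _ in train})
    return min(candidates, key=lambda t: error_rate(t, train))


# Synthetic (x, label) pairs, made up for illustration: label is 1 when x >= 10.
records = [(x, int(x >= 10)) for x in range(20)]

train, test = train_test_split(records)
threshold = fit_stump(train)          # model built on the training set only
test_error = error_rate(threshold, test)  # quality estimated on unseen records
```

The point of the design is that `test_error` is computed on records the model never saw during fitting, so it is a fairer estimate of future performance than the training error.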
Build the Model
With the test design in place, the data analyst runs the modeling tool on the prepared data set to create one or more models.
Assess the Model
The data mining analyst interprets the models according to his or her domain knowledge, the data mining success criteria, and the desired test design. The data mining analyst judges the success of the application of modeling and discovery techniques technically, but he or she should also work with business analysts and domain experts in order to interpret the data mining results in the business context. The data mining analyst may even choose to have the business analyst involved when creating the models for assistance in discovering potential problems with the data.
For example, a data mining project may test the factors that affect bank account closure. If data is collected at different times of the month, it could cause a significant difference in the account balances of the two data sets collected. (Because individuals tend to get paid at the end of the month, the data collected at that time would reflect higher account balances.) A business analyst familiar with the bank’s operations would note such a discrepancy immediately.
In this phase, the data mining analyst also tries to rank the models. He or she assesses the models according to the evaluation criteria and takes into account business objectives and business success criteria. In most data mining projects, the data mining analyst applies a single technique more than once or generates data mining results with different alternative techniques. In this task, he or she also compares all results according to the evaluation criteria.
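Ranking candidates against a shared criterion can be as simple as the sketch below. The model names, the error rates, and the 0.15 success threshold are hypothetical placeholders, and a real ranking would also weigh the business objectives described above.

```python
def rank_models(test_errors, max_error):
    """Sort models by test error (best first) and flag those meeting the criterion."""
    ranked = sorted(test_errors.items(), key=lambda item: item[1])
    return [(name, error, error <= max_error) for name, error in ranked]


# Hypothetical test-set error rates, made up for illustration.
test_errors = {
    "decision_tree": 0.18,
    "neural_network": 0.12,
    "logistic_regression": 0.15,
}

ranking = rank_models(test_errors, max_error=0.15)
best_name = ranking[0][0]
```

Each entry in `ranking` holds the model name, its test error, and whether it meets the data mining success criterion.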
Evaluate your data mining results
Determine whether and how well the results delivered by a given model will help you achieve your business goals. Is there any business reason why the model is deficient?
At this stage in the project, the data analyst has built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment, it is important to evaluate the model more thoroughly and to review the steps executed to construct it, to be certain it properly achieves the business objectives. Here it is critical to determine whether some important business issue has not been sufficiently considered. At the end of this phase, the project leader should decide exactly how to use the data mining results. The key steps are the evaluation of results, the process review, and the determination of next steps.
Previous evaluation dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and determines if there is some business reason why this model is deficient. Another option here is to test the model(s) on real-world applications—if time and budget constraints permit. Moreover, evaluation also seeks to unveil additional challenges, information, or hints for future directions. At this stage, the data analyst summarizes the assessment results in terms of business success criteria, including a final statement about whether the project already meets the initial business objectives.
It is now appropriate to review the data mining engagement more thoroughly to determine whether any important factor or task has been overlooked. This review also covers quality assurance issues (e.g., Did we build the model correctly? Did we use only attributes that are allowed and will be available for future deployment?).
Determine Next Steps
At this stage, the project leader must decide whether to finish this project and move on to deployment or whether to initiate further iterations or set up new data mining projects.
Create a deployment plan
Take the project results and decide how best to use them to address your business issue:
- Summarize deployable models or software results
- Develop and evaluate alternative deployment plans
- Confirm how the results will be distributed to recipients
- Determine how to monitor the use of the results and measure the benefits
- Identify possible problems and pitfalls of deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the customer can use. Deployment often involves applying “live” models within an organization’s decision-making processes. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
In order to deploy the data mining result(s) into the business, this task takes the evaluation results and develops a strategy for deployment.
Plan Monitoring and Maintenance
Monitoring and maintenance are important issues if the data mining result is to become part of the day-to-day business and its environment. A carefully prepared maintenance strategy avoids incorrect usage of data mining results.
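One possible ingredient of such a strategy is a routine check that the deployed model still performs as it did during evaluation. The sketch below is only an assumption about how that could be done: the helper name `needs_review`, the 0.05 tolerance, and the minimum of 30 observations are illustrative choices, not part of CRISP-DM.

```python
def needs_review(baseline_error, recent_outcomes, tolerance=0.05, min_samples=30):
    """Flag the model when its recent error rate exceeds the baseline by more than the tolerance.

    recent_outcomes is a list of booleans, True where the deployed model's
    prediction turned out to be correct.
    """
    if len(recent_outcomes) < min_samples:
        return False  # too little evidence to judge either way
    recent_error = 1 - sum(recent_outcomes) / len(recent_outcomes)
    return recent_error > baseline_error + tolerance


# Hypothetical outcome logs, made up for illustration.
degraded = needs_review(0.10, [True] * 40 + [False] * 10)  # 20% recent error
stable = needs_review(0.10, [True] * 46 + [False] * 4)     # 8% recent error
```

A flag from a check like this would prompt a review of the model and its inputs; it does not by itself say why performance changed.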
Produce Final Report
At the end of the project, the project leader and his or her team write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences (if they have not already been documented as an ongoing activity) or it may be a final and comprehensive presentation of the data mining result(s). This report includes all of the previous deliverables and summarizes and organizes the results. Also, there often will be a meeting at the conclusion of the project, where the results are verbally presented to the customer.
The data analyst should assess failures and successes as well as potential areas of improvement for use in future projects. This step should include a summary of important experiences during the project and can include interviews with the significant project participants. This document could include pitfalls, misleading approaches, or hints for selecting the best-suited data mining techniques in similar situations. In ideal projects, experience documentation also covers any reports written by individual project members during the project phases and tasks.
All Logikview Analytics consultants are systematically trained in this methodology and put it into practice whenever appropriate. This has allowed us to develop unparalleled excellence in its application, enabling us to reach maximum analytical precision and efficiency when offering solutions to our clients.