Data Science Project - College Credit Refund Process using the CRISP-DM

Data Science Project - Analysis of a College Credit Refund Process using the CRISP-DM Process

23rd Jan 2024
00:05 am
Admin

The student credit refund eligibility process is a critical component of any educational institution's financial management strategy. It can affect the institution's liquidity, revenue projections, and cash budgeting, among other factors. A deep-dive analysis of this process can uncover essential insights that can improve the efficiency of the process and lead to better financial outcomes for the institution. In this paper, we present an innovative deep-dive analysis of a student credit refund eligibility process.

I think it’s safe to say that colleges face the daunting task of effectively managing their Billing and Collections functions to ensure real-time accountability and transparency of their revenue streams. The challenges stem from data complexity and the need to deal with thousands of students and several third-party entities involved in student billing and revenue cycle management. One particular area of concern in the management of the billing cycle is the credit refund process.

Credit refund focuses on determining a student's eligibility for a credit balance refund. A credit balance occurs when the sum of credits posted to a student's account (payments, student loan disbursements, and scholarships) is higher than the sum of charges applicable to that account. Knowing which types of students are eligible for credit refunds, and which are likely to request them, is therefore valuable information for determining the impact of refunds on the cash flow of the college.
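The credit balance definition above can be sketched in a few lines of Python. The function name, parameter names, and sample amounts are illustrative assumptions; they are not fields from the project's dataset.

```python
from decimal import Decimal

ZERO = Decimal("0")

def credit_balance(payments, loan_disbursements, scholarships, charges):
    """Return the refundable credit balance, or zero when charges exceed credits."""
    total_credits = (
        sum(payments, ZERO)
        + sum(loan_disbursements, ZERO)
        + sum(scholarships, ZERO)
    )
    total_charges = sum(charges, ZERO)
    # A credit balance exists only when credits are higher than charges.
    return max(total_credits - total_charges, ZERO)

# Hypothetical account: 9000 in credits against 8000 in charges.
example = credit_balance(
    payments=[Decimal("2000")],
    loan_disbursements=[Decimal("5500")],
    scholarships=[Decimal("1500")],
    charges=[Decimal("7200"), Decimal("800")],
)
print(example)
```

Using `Decimal` rather than floating point avoids rounding drift when many small postings are summed.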

Phase 1 - Business Understanding

Business Understanding focuses on understanding the objectives and requirements of the project. It is used to get a clear understanding of the problem to solve, how it impacts the organization, and the goals for addressing it.

Task 1.1 Determination of business objectives

Aim and objectives - The objective of the project is to undertake a substantive analysis of the demographic types of students requesting refunds and to draw attention to the impact this may have on the school’s net cash flow and liquidity. The project aims to examine college student data to identify those who are eligible for a credit refund and have received financial aid payments. This data will be further analyzed to define the categories of students who have collected refund payments.

Background - The entire process of credit refunds for students is a critical component in the functionality of the school’s administration. Third parties are involved in this process and for this reason, it is necessary to deal with credit refunds effectively. Determining the students who are eligible for credit refunds is a complex process that depends on the critical information of each student. Data mining will be used in this project to complete the process successfully.

Business goals - The school’s business goal is to better manage the cash outflow that results from refunding enrolled students. This requires understanding, and being able to predict, which enrolled students are more likely to request credit refunds.

Success criteria of business - The success of the proposed project will depend on the net-positive cash flow that results from curtailing the amount subject to credit refund. This will be measured through both qualitative and quantitative criteria. Data will be collected to gain the insight needed to measure results. A large amount of data is involved in determining students' eligibility for credit refunds.

Constraints - Multiple policies determine credit refund eligibility, which makes the entire process complex. Given this complexity, a selected number of categories have been identified for this project. Individual tasks have limitations, and the limitations that apply to data mining are described here. The usage of modern technology such as IoT helps data mining operate smoothly (Alsrehin et al., 2019). There is limited time to complete this project; however, every effort will be made to utilise the most effective data-mining process to make it successful.

Impact on business - This project helps to define the eligibility criteria of individual students for credit refunds. The project has been approved after considering its success in finding out credit refunds among students.

Task 1.2 Assessment of situation

Deliverables - Accessibility of data is one of the most important factors in a data mining project. A large amount of data is collected during the process in order to reach the expected outcomes.
Information - Information is an important aspect of a project and helps to measure its continuity. Information related to the individual activities of the project is to be identified in order to check workflow. A database system, Microsoft Access, has been used in this project to manage the large amount of data gathered and to find the criteria for credit refunds. Information on credit policies and payment categories has been collected to define the credit refund criteria among students. Data on third-party involvement, overpayments, student loans, and nonfederal funding sources has been gathered through legal procedure to categorize eligible students and achieve the goals of the project. The collected data has helped to define the eligibility criteria of individual students for credit refunds.
Energy resources - Energy is an essential resource to operate activities of a project. Electricity is to be used in this project and for this purpose, a local electricity office is to be approached to ensure the supply of enough energy for the project.
Time - Time management is necessary to complete the project with allocated capital investment. The entire project based on finding out the criteria of eligible students for credit refunds has been completed within a limited time.
Needs, constraints, and assumptions - Security and legal factors are important when setting up a project that uses student data. Addressing them removes obstacles from the proposed project and helps it finish on time. A legal approach provides the security needed to complete a project (Ryan, 2020). Legal procedure has been followed in data usage for this project.
Risks and contingencies - Delays can affect the effectiveness of the proposed project. Several kinds of issues can occur in data usage, and these can delay the entire project. A slow internet connection, for example, is a risk that can delay data-related work.
Terminology - Several criteria are considered in this process. To identify, from their demographic profiles, which students are eligible for credit refunds, several conditions have to be defined. These conditions include financial disposition (based on tax returns), independent or dependent status, part-time or full-time enrolment, and each student's semester GPA.
Benefits and costs - Cost-benefit is an important factor in data mining projects. This project has been completed at minimal cost, which improves its cost-benefit outcome.

Task 1.3 Determination of data mining goals

Deliverables - Data mining goals have been used to identify the reasons for undertaking the credit refund project. The goal of this data mining project was to determine the criteria that make students eligible for a credit refund.

Data mining goals - A data mining goals model is used to identify attrition rates across customer segments. A clustering model can be used to organize a large amount of data in a way that helps reveal an individual customer's propensity to purchase products. Clustering models can divide customers into segments by attributes such as size, brand, and cost, and this approach is beneficial for attracting customers.

Success of data mining goals - In this project, both quantitative and qualitative data mining criteria have been used. Quantitative methods help to make accurate predictions for the proposed project (Thota et al., 2020). A qualitative method, on the other hand, helps to identify the particular people from whom the required data should be collected.

Task 1.4 Producing project plan

Deliverables - Planning is important to execute a project properly. Various tools are used to identify and rectify risks for management that help to operate project activities smoothly. In this task, the project plan and required techniques and tools have been described.
Project plan - A proper plan has been used to operate the proposed project through data mining. The plan starts with an input step, which is used to ensure resource management. Data has been collected to meet the criteria of the data mining project. The output step, on the other hand, is used to execute the entire project through data mining. Collected data has been cleaned, and the cleaned data has then been used in the model to take measurements. The measurements help to produce reports on the project, and in this way the data mining project is managed. Cleaned data has been used to find the eligibility criteria for credit refunds among students.
Initial tools for assessment and techniques - Data mining tools are used to measure collected data and produce results. Hadoop will be used for data collection on this project.

Phase 2 - Data Understanding

During the second phase of the project, the focus was on data gathering, description, exploration, and verification of data quality with focus on the data that would be used to achieve project goals. Additionally, the data understanding phase entailed identification of issues that could require an update in the business understanding and the project goals, which would also affect the project plan. The phase has four deliverables, which are described in the subsequent subsections.

Task 2.1: Data Collection Report

2.1.1. Data Requirements

The objective of this project was to conduct a substantive analysis of the demographics of students who request credit refunds and to draw attention to how such credit refunds affect the school's net cash flow and the degree of liquidity in school accounts. Attaining this objective required many kinds of student data in several data types. These data include student registration details, address, program name, student semesters, campus, billing details, study terms, admission types, office grants, and student dorm and housing program, among others. The data types include text, numerical data, date and time, and alphanumeric characters.

2.1.2 Data Availability

The data required for the project has been confirmed available from the school database. The school authorities have approved use of their student data based on request, and a total of three semesters' data has been gathered. To ensure that the right data was targeted in the request, a review of the initial project proposal was conducted, and it was confirmed that the available data would suit the project needs. The data spans 3 semesters and has more than 6900 student entries. The data source has been the university students’ database, which has all these details. From a preliminary review, the data seems to be quite extensive and there were temptations to reduce the scope of the study. However, it was realized that some of the students listed in the data did not satisfy the criterion on reporting student credit refunds and some of the data in the various columns were missing.

2.1.3 Selection Criteria

All the data for the project were sourced from the institutional databases. The data were obtained from linked tables: student details tables (student ID, age, gender, date of birth, and home address); student finance tables containing data such as term tuition fees, housing costs, dormitory costs, grant amounts, loan and sub-loan amounts, and refund due; student admission tables with data such as student admission number, admission status, and admission type; and student service tables including student service status and veteran status. All these data were selected for their relevance to the research project. Part of the project's objective was to compare student categories and types of fee payment to determine which categories of students were most likely to have credit refunds and which were most likely to ask for them. All the data were selected because they help differentiate student refund status.
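As a minimal sketch of how linked tables like these can be combined on the student ID, the example below joins a student details table to a finance table using plain Python dictionaries. The column names and sample rows are hypothetical, and the source does not describe how the project actually linked its tables.

```python
# Hypothetical rows standing in for the student details and finance tables.
students = [
    {"student_id": 1001, "gender": "F", "admission_type": "Freshman"},
    {"student_id": 1002, "gender": "M", "admission_type": "Transfer"},
]
finance = [
    {"student_id": 1001, "term": "Fall", "tuition": 6000.0, "refund_due": 450.0},
    {"student_id": 1002, "term": "Fall", "tuition": 6000.0, "refund_due": 0.0},
    {"student_id": 1001, "term": "Spring", "tuition": 6000.0, "refund_due": 300.0},
]

def join_on_student_id(student_rows, finance_rows):
    """Attach student details to every finance row that has a matching student."""
    by_id = {s["student_id"]: s for s in student_rows}
    joined = []
    for row in finance_rows:
        student = by_id.get(row["student_id"])
        if student is None:
            continue  # finance row with no matching student record is skipped
        joined.append({**student, **row})
    return joined

merged = join_on_student_id(students, finance)
for row in merged:
    print(row["student_id"], row["term"], row["admission_type"], row["refund_due"])
```

Each finance row keeps its own term, so a student with several terms appears once per term in the joined result.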

Task 2.2: Data Description Report

The data gathered for this project all came from the university database and was considered adequate and effective for data mining through Hadoop. The data comprises 43 fields of diverse data types. Each field represents an aspect of student life, ranging from admission details and personal details to the billing details for each student. However, not every student entry has values for all of the fields. For instance, there are fields for military participation (veteran type and duty status), which only have data for specific students. Also, students are at different positions in their educational journey (master's programs, bachelor's programs, and academic majors).

A total of 6919 student records are available, and these records cut across three semesters. The fact that the data runs across three semesters is considered a potential issue for the project since it increases the probability of subject duplication, whereby a single student may be studied over three semesters while other students only have data for a single semester. Despite the absence of data in some fields for some of the students, the available data is considered adequate, as the missing data does not affect the data processing outcomes. The data is considered suitable for the data mining goals of the project. To determine whether the data was suitable, the principles and standards recommended by Fluxicon (2022) were used. The principles include data adequacy, structure, relevance, and consistency in data characteristics. Regarding adequacy, it was established that the available data is adequate for the project. Since this is a research project, only a sample is required. To ensure that the sample is neither too small nor too large for project analysis, a total sample that is nearly one quarter of the total student population in the institution was selected. Notably, it is assumed that such a sample size has sufficient diversity to represent the entire student population of the university.
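The duplication concern can be checked by counting how many distinct semesters each student appears in. The sketch below uses plain Python rather than the Hadoop tooling the project mentions, and the field names (`student_id`, `semester`) and semester codes are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records; one row per student per semester.
records = [
    {"student_id": 1001, "semester": "2022FA"},
    {"student_id": 1001, "semester": "2023SP"},
    {"student_id": 1002, "semester": "2022FA"},
    {"student_id": 1003, "semester": "2023FA"},
    {"student_id": 1001, "semester": "2023FA"},
]

def semesters_per_student(rows):
    """Map each student ID to the set of semesters it appears in."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["student_id"]].add(row["semester"])
    return seen

def repeated_students(rows):
    """Return the IDs of students who appear in more than one semester."""
    per_student = semesters_per_student(rows)
    return sorted(sid for sid, sems in per_student.items() if len(sems) > 1)

print(len(records), "records for", len(semesters_per_student(records)), "students")
print("appear in several semesters:", repeated_students(records))
```

A large gap between the record count and the distinct student count would indicate that per-student analysis needs one row per student, or an explicit semester dimension.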

The data was confirmed to be well-structured and consistent. In terms of structure, the data was organized using Hadoop and found to be well-structured. Structure is often a major concern when handling institutional data since much of the data is not contained in clear columns and rows (Fluxicon, 2022). In this project, the challenge of obtaining structured data was addressed through a combination of diverse data sources and the use of an efficient data mining tool to structure the data. In terms of consistency, examples noted included all student IDs being numerical, all fee fields being currency data types, and all address fields being text data types. As such, it was possible to group specific data categories.
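A consistency check of this kind could look like the following sketch, which verifies that IDs are numeric and that fee values parse as currency amounts. The helper names, column names, and sample rows are hypothetical, and this is plain Python rather than the project's actual tooling.

```python
from decimal import Decimal, InvalidOperation

def is_numeric_id(value):
    """True when the value consists only of digits."""
    return str(value).strip().isdigit()

def is_currency(value):
    """True when the value parses as a finite amount after removing '$' and ','."""
    text = str(value).replace("$", "").replace(",", "").strip()
    try:
        amount = Decimal(text)
    except InvalidOperation:
        return False
    return amount.is_finite()

def inconsistent_rows(rows):
    """Return the indexes of rows whose ID or fee field breaks the expected type."""
    bad = []
    for index, row in enumerate(rows):
        if not is_numeric_id(row["student_id"]) or not is_currency(row["tuition"]):
            bad.append(index)
    return bad

# Hypothetical sample: the second row has a bad ID, the third a bad fee.
rows = [
    {"student_id": "1001", "tuition": "$6,000.00"},
    {"student_id": "10A2", "tuition": "6000"},
    {"student_id": "1003", "tuition": "n/a"},
]
print(inconsistent_rows(rows))
```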

The data was largely relevant to the planned project objectives. From the start, the objective of the project was clearly defined and required the utilization of diverse student data categories. The project background and objectives clearly identify distinctions between students and further describe how those distinctions cause discrepancies in the need for fee refunds. By bearing these differences in mind, it was possible to determine the scope of data required, as defined by the fields included in the data collected. Since the data satisfies all the conditions of adequacy, relevance, structure, and consistency, it was concluded that it is suitable for this project.

Task 2.3: Data Exploration Report

Data collected for this project was largely subjective, limiting the possibility of using descriptive statistics as part of data analysis. For instance, there is extensive qualitative data, such as student addresses, admission numbers, program details, and semesters, which cannot be summarized numerically. Similarly, data such as school fees for the different programs cannot be averaged since the present study is subjective. The study focuses on individual student details first and then draws general conclusions regarding the impact of refunds on institutional financial status, which is qualitative information. Although the data is largely suitable and adequate for the current project, a few issues were noticed that could lead to data quality concerns. The most glaring issue is the presence of unexpected data types in some fields. For instance, the student gender field contains some values that are not expected there, such as automatic date entries, which cannot be interpreted easily and will require cleaning. Additionally, there are concerns about probable data duplication, which may also pose data quality issues.
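One way to surface stray entries such as dates in the gender field is to profile the distinct values and flag anything outside an expected code list. In the sketch below, `EXPECTED_GENDERS`, the two date formats, and the sample values are placeholders chosen for illustration; the institution's real code list and date formats would need to be substituted.

```python
from collections import Counter
from datetime import datetime

EXPECTED_GENDERS = {"F", "M"}  # placeholder code list, not taken from the dataset
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats of the stray date entries

def looks_like_date(value):
    """True when the value parses under one of the assumed date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

def profile_gender(values):
    """Count each distinct value and separate out unexpected and date-like ones."""
    counts = Counter(v.strip() for v in values)
    unexpected = {v: n for v, n in counts.items() if v not in EXPECTED_GENDERS}
    date_like = sorted(v for v in unexpected if looks_like_date(v))
    return counts, unexpected, date_like

# Hypothetical column values, including two dates and one blank.
sample = ["F", "M", "F", "2021-08-23", "M", "01/15/2022", ""]
counts, unexpected, date_like = profile_gender(sample)
print(counts)
print(date_like)
```

Flagging rather than deleting the unexpected values keeps the affected records available for later analysis.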

Task 2.4: Data Quality Report

To determine whether the data obtained is good enough to support the goals of the project, data quality issues have to be identified and rectified where necessary. Fluxicon (2022) provides a checklist of 10 items against which a dataset for data mining is to be evaluated for quality; the current project used this checklist in its data quality review. The first item in the checklist is the absence of errors during data importation. The current dataset satisfies this condition since the data was imported from the institution’s database to Hadoop without any error messages. Second, there are no gaps in the data timelines; all data collected were for specific sequential semesters with no skipped or blank periods in the data collection duration. Third, the expected amount of data was imported: the total number of line items indicated in the source database was also observed in the destination database. The fourth, fifth, and sixth conditions that the data satisfies are that the data has no cases with unexpected steps, the data corresponds to the specific timeframe for which it was requested (as determined by the semesters), and the data has no unexpected ordering issues. Together, these items support the quality of the data.
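Two of the checklist items, the imported row count and the absence of timeline gaps, lend themselves to simple automated checks. The sketch below is an illustration in plain Python; the function names, the `semester` field, and the semester codes are assumptions rather than part of the Fluxicon checklist or the project's Hadoop setup.

```python
def row_counts_match(source_count, imported_rows):
    """Check that the number of imported rows equals the count reported by the source."""
    return source_count == len(imported_rows)

def missing_terms(expected_terms, imported_rows):
    """Return expected semesters that have no imported rows, in the expected order."""
    present = {row["semester"] for row in imported_rows}
    return [term for term in expected_terms if term not in present]

# Hypothetical import covering three sequential semesters.
expected = ["2022FA", "2023SP", "2023FA"]
imported = [
    {"student_id": 1001, "semester": "2022FA"},
    {"student_id": 1002, "semester": "2023SP"},
    {"student_id": 1003, "semester": "2023FA"},
]

print("row count matches:", row_counts_match(3, imported))
print("missing semesters:", missing_terms(expected, imported))
```

An empty list from `missing_terms` corresponds to the "no gaps in the data timelines" condition described above.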

On the downside, the data may pose quality issues on one of the checklist items. The distribution of attribute values in some of the fields is unexpected, for example, the student gender field. However, there are no unexpected empty values in the dataset. The solution to this issue is to go back to the data pre-processing and extraction stages to determine the possible reason for the unexpected values. Some of the possible issues include missed steps in extraction, as well as absence of those values in the data source files. The data with problems will, however, not be excluded as it could prove essential in predicting refund amounts based on the moderating effects of other factors, such as loan type and student admission details. All the quality issues have been documented and will be considered in the data preparation phase.

Conclusion

The data understanding phase of the project has been completed. The project data was collected from a university database. The data is extensive, adequate, relevant, and effectively structured for the project. Some data quality concerns were identified, such as unexpected data types in some fields and possible data duplication, and these will be addressed during the data preparation phase. The data can then be used for further analysis and modeling to predict refund amounts based on factors such as loan type and student admission details.
