The Epic Journey of a Data Scientist (Data Science for Dummies: Part 2)

Data science has many steps and pathways. Here is part of the journey that a data scientist takes to get the job done.

This is Data Science for Dummies, part 2. For part 1 you can click here. In part 1, I explained what data science is using a Venn diagram. In part 2, we are going to follow the journey of a data scientist.

From : http://ltue.net/taking-first-step-writing-journey/

Planning

Data science is a team job. It can be done solo, but that takes a unicorn (a very special someone who has all of the data science skills). A data science team needs a leader, who is responsible for the planning and for dividing up the work. When a project comes up, the team first has to define the goal of the project clearly. Once the goal has been defined, the leader takes stock of the resources the team has, in order to know which resources are needed to achieve the goal. The next step of planning is to coordinate the people so that the team can reach the goal effectively. After that, the leader creates a timeline or schedule for the project, so that the team has both the motivation and a deadline to finish every part of it. That covers the planning steps, so now let’s go to the preparation of the data.

Preparation of the Data

This step is crucial for a data scientist, because data is everything in this job. A great source of data can lead to significantly better insights and models. That’s why the preparation of the data is one of the big four steps of the data scientist’s journey.

The first thing to do is get the data. The data can come from anywhere: websites, applications, surveys, etc. It is a good idea to collect as much relevant data as possible, because more data usually leads to better insights and models. After the data has been acquired, clean it. This step is also very important, because the data we acquire is often not good data. It can contain outliers, bias, or blank values. Cleaning prepares the data so that it is fit for getting insights and building a model. Sometimes this is the most difficult and time-consuming step.

After the data has been cleaned, we can explore it. This is the first step toward getting insight from the data before making a model, and it gives us a quick glance at the data we are dealing with. After exploring the data, we know which attributes should be used and which should not. The last step is filtering the data. Not every attribute is useful for getting an insight or building a model, so we filter down to the attributes that will actually be used.
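To make the cleaning, exploring, and filtering steps a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the survey records are made up, the helper names (`drop_blanks`, `drop_outliers`, `summarize`, `keep_attributes`) are hypothetical, and the interquartile-range rule is just one common way to flag outliers, not the only one.

```python
from statistics import mean, quantiles

# Hypothetical survey records (made up for this sketch); None marks a blank value.
raw_rows = [
    {"id": 1, "age": 25, "income": 3200, "favorite_color": "blue"},
    {"id": 2, "age": 31, "income": None, "favorite_color": "red"},
    {"id": 3, "age": 28, "income": 3500, "favorite_color": "green"},
    {"id": 4, "age": 35, "income": 4100, "favorite_color": "blue"},
    {"id": 5, "age": 29, "income": 95000, "favorite_color": "red"},
    {"id": 6, "age": 40, "income": 3900, "favorite_color": "green"},
]


def drop_blanks(rows):
    """Clean: remove rows that contain any blank (None) value."""
    return [row for row in rows if all(value is not None for value in row.values())]


def drop_outliers(rows, column, k=1.5):
    """Clean: remove rows whose value falls outside the interquartile-range fences."""
    values = [row[column] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [row for row in rows if low <= row[column] <= high]


def summarize(rows, column):
    """Explore: take a quick glance at one numeric attribute."""
    values = [row[column] for row in rows]
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def keep_attributes(rows, columns):
    """Filter: keep only the attributes we decided to use."""
    return [{name: row[name] for name in columns} for row in rows]


cleaned = drop_outliers(drop_blanks(raw_rows), "income")
income_summary = summarize(cleaned, "income")
prepared = keep_attributes(cleaned, ["age", "income"])

print(income_summary)
print(prepared)
```

In this made-up data, the row with the blank income and the row with the extreme income are dropped, and only the two attributes we chose to keep survive the filtering step. On real projects a library such as pandas would usually do this work, but the idea is the same.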

Modeling from the Data

Modeling is also a very important part of the data science journey. A model is used to predict the outcome for new data based on old data. In modeling, we use a training data set and a test data set. The model can be a regression or a classification, depending on the type of the data.

The first thing to do is to create the model from the data. This step requires machine learning knowledge. After you create the model, you have to validate it: can the model predict an accurate outcome or not? To validate the model, you test it with data it has not seen before (the test data set) and look at the score. If the score is high enough, you can assume your model is quite accurate. The next step is to evaluate the model. When evaluating, you look at the model and at the attributes of the data, and ask whether the model can be made more accurate. After you evaluate the model, you have to refine it. This means going back to the data, or transforming it again, to build a better model. You can create a benchmark so that you have a goal to aim for when building a better model.
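Here is a small sketch of the create, validate, and benchmark steps described above, again in plain Python. The data is synthetic (a noisy straight line generated just for this example), the model is a simple least-squares line, and the benchmark is an assumption of mine: a "model" that always predicts the average of the training outcomes. The score used here is the mean absolute error, so in this sketch a lower score is better.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption for this sketch): y is roughly 3*x + 5 plus noise.
points = [(float(x), 3.0 * x + 5.0 + random.gauss(0.0, 2.0)) for x in range(50)]

# Split into a training set (80%) and a test set (20%).
random.shuffle(points)
cut = int(len(points) * 0.8)
train, test = points[:cut], points[cut:]


def fit_line(rows):
    """Create the model: fit y = slope * x + intercept by least squares."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_absolute_error(rows, predict):
    """Validate: average distance between predictions and real outcomes."""
    return mean(abs(y - predict(x)) for x, y in rows)


slope, intercept = fit_line(train)
baseline_value = mean(y for _, y in train)  # benchmark: always predict the average

model_error = mean_absolute_error(test, lambda x: slope * x + intercept)
baseline_error = mean_absolute_error(test, lambda x: baseline_value)

print(f"model:    y = {slope:.2f} * x + {intercept:.2f}")
print(f"model error on test data:    {model_error:.2f}")
print(f"baseline error on test data: {baseline_error:.2f}")
```

If the model does not clearly beat the benchmark on the test data, that is a sign to go back and refine the data or the model. In practice you would probably reach for a machine learning library instead of writing the fitting code by hand.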

Review or Follow Up the Insight or Model

The final big step is to review our model. The first thing to do is test the model. This means the model is deployed to the real world and used to predict outcomes from new data. After the model has been tested, you have to revisit it. Data can change over time, so you have to keep updating your model so that it continues to predict accurately. The last step is to save the model. The model has to be tracked and saved so that, in the future, you can alter and change it without confusion.
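As a rough illustration of the "track and save" step, the sketch below stores a model's parameters together with a version number and a timestamp, then loads them back. The file layout, the helper names (`save_model`, `load_model`), and the example parameters are all hypothetical, and a temporary folder is used only so the example cleans up after itself.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_model(params, directory, version):
    """Save model parameters with a version and timestamp so changes can be tracked."""
    record = {
        "version": version,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "params": params,
    }
    path = Path(directory) / f"model_v{version}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def load_model(path):
    """Load a previously saved model record."""
    return json.loads(Path(path).read_text())


# Hypothetical parameters of a simple line model: y = slope * x + intercept.
with tempfile.TemporaryDirectory() as folder:
    saved_path = save_model({"slope": 3.0, "intercept": 5.0}, folder, version=1)
    restored = load_model(saved_path)

print(restored["version"], restored["params"])
```

Keeping one file per version means that when the data changes and the model is updated, the older versions are still there to compare against.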

Just a “keep learning” guy who wants to share his knowledge — Programming enthusiast