Data scientists are a precious resource. So I thought I'd ask some basic questions to try and shed a little light on the basics.
(I thought I'd bust some of the hype too, but some of the hype is true.)
So here's an interview with Pranshuk Kathed, machine and deep learning enthusiast...
Questions? Let me know, and we’ll get answers on a future interview!
Everyone says you spend most of your time prepping data. Is that really true?
It’s true: 80% of my time goes into data munching. For simple reporting projects, I might spend 8 hours getting the right data and then just a couple of hours producing the needed visualizations.
Why is there so much data prepping?
We need to integrate data from different sources: the accounting system, the ticketing system, and others. They all hold data, but in different places and formats. Pulling data from those disparate sources, building one view of it all, and performing cleaning, transformation, reduction, and so on: that’s where 80% of my time goes.
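As a rough sketch of that consolidation step (the table and column names here are invented for illustration; real systems will differ), pandas can merge two exports on each system's unique ID:

```python
import pandas as pd

# Hypothetical exports from two systems; names and values are made up.
accounting = pd.DataFrame({
    "cust_id": [101, 102, 103],
    "revenue": [2500.0, 1800.0, None],
})
ticketing = pd.DataFrame({
    "customer": [101, 103, 104],
    "open_tickets": [2, 5, 1],
})

# Join on each system's unique ID to build one consolidated view.
combined = accounting.merge(
    ticketing, left_on="cust_id", right_on="customer", how="outer"
)

# Basic cleaning: fill missing revenue, drop the duplicate key column.
combined["revenue"] = combined["revenue"].fillna(0.0)
combined = combined.drop(columns="customer")
print(combined)
```

The `how="outer"` join keeps customers that appear in only one system, which is exactly the mismatch that makes this consolidation work time-consuming.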
Before working as a data scientist, did you expect to spend that much time in “data munching” as you call it?
Absolutely, that’s an expected part of the job.
Is spending that much time a problem?
Not really, actually. To run a business efficiently, we know that no single solution can provide all the value. Since we want to look at all the data, there’s no single place where it all resides. There are always different applications, databases, and servers.
We know this is the reality. That’s why it’s important to create value with the data regardless of where it is stored. No software application will ever serve every facet of the organization effectively and efficiently.
Why can’t the systems just talk to each other?
Their infrastructure, architecture, and table layouts are all different.
How do you sort it all out?
First, I have to understand the business model so I can actually see what I’m looking at. I look at the tables and find the unique IDs. Each system’s data has its own unique IDs. I need to understand why those IDs were chosen for each system, and then create a consolidated view.
If I’m using Power BI, I can pull data from databases or the cloud (Azure, AWS, any of those) and understand the business value of the data. Once I see how the data is laid out and find the unique keys, I can create one version of the truth.
OK, so once you do all that work, you don’t have to do it again for the same data, do you?
No, once we create the connections and define the relationships, we can set up a job on a refresh schedule.
So, no more spending 80% of your time on prep?
Data preparation is different from creating the connections. We must have a thorough business understanding of the data. Take total revenue, for example. I must know how revenue is recognized. Every company includes or excludes different things, like freight and transportation costs.
We need to understand the business before diving into the financial data. Once we have the right understanding of the business process, we can easily relate the tables and make accurate sense of the data.
Do you have to do all the business discovery yourself?
Yes, business acumen is difficult to collect. You must engage the right people and perform proper requirements gathering.
What metrics does the business want to see, and why are they valuable? Then, based on the gathered requirements, the interviews, and our understanding of the business, we discover other metrics we can create to provide value along the way.
How often is the first question you’re asked to answer the right question?
Haha, well, I ask them why the answer matters and look at current reports to see how they’re getting their current figures. Then I ask more questions to build an understanding of how important those metrics and data are to that person or department.
Sounds like data scientists have to be great communicators. Were you prepared for that?
It’s complex: you build a model, predict outcomes, and now you have to convince business leaders to trust and believe in your model, and to understand how it works.
That involves technical terms that can be hard to follow. We should use layman’s terms to explain the model and build trust in it. Then we can implement and deploy. I have learned that trust is required to take a data model into production.
What is machine learning?
Machine learning. Split it. Machine. Learning. We’re basically making the machine learn by itself. The ways we do that are called machine learning algorithms. They’re based on the type and nature of the data. There are different algorithms that will help the machine understand the data.
What is an algorithm?
It’s kind of like a step-by-step process that views the data from all perspectives to understand what the data says: the most important elements of it. It teaches the machine to make predictions that are better, more accurate, and more precise.
What are some examples of effective machine learning?
Let’s take Amazon as an example. You order one thing and see recommendations. Then you order again. You see recommendations for the new product, but also others based on your past purchases.
Based on your real-time actions, the machine is learning what you are thinking and needing, using all the data it has, and trying to recommend what would be helpful to you, which drives more business for Amazon. The same goes for Netflix and others.
How do you get machine learning to become more accurate?
The data scientist does this. First, we have the data. We’ve cleaned, transformed, reduced, and consolidated it, and put it into the right form with a data science workbench. Once we have the right data, we do some descriptive analytics, which tells us each column’s mean, median, mode, standard deviation, variance, bias, and skewness: in short, how the data is spread.
We plot visualizations and look at the data to understand how often different categories occur in each column. We get a better understanding of what the data is saying.
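As a minimal sketch of that descriptive step with pandas (the dataset and column names are invented for illustration):

```python
import pandas as pd

# A tiny assumed dataset; real data would come from the consolidated view.
df = pd.DataFrame({
    "revenue": [120.0, 95.0, 200.0, 150.0, 95.0],
    "region": ["east", "west", "east", "south", "east"],
})

# How a numeric column is spread: mean, median, mode, std, variance, skew.
col = df["revenue"]
print(col.mean(), col.median(), col.mode().iloc[0])  # 132.0 120.0 95.0
print(col.std(), col.var(), col.skew())

# How often different categories occur in a categorical column.
print(df["region"].value_counts())
```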
Then we can dive into creating a model. For example, we pick one column as the target variable and treat all the other columns as predictor variables. Our aim is to predict the target based on the predictors. We divide the data into two parts: a training set (usually 70% of the data) and a test set (30%).
Using the training set, we fit the model on the predictor data so it aligns with what the target is about; the model can then predict the target, and it gets better with more data over time.
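A minimal sketch of the 70/30 split and training step with scikit-learn, using the bundled iris sample dataset as a stand-in for real business data (the model choice here is just an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris stands in for real data: four predictor columns, one target column.
X, y = load_iris(return_X_y=True)

# 70% training data, 30% test data, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit a simple classifier on the predictors to learn the target.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out 30%.
print(model.score(X_test, y_test))
```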
So more data equals higher accuracy?
Yes. Also, every language has libraries we use for learning: to make the model learn what the target (dependent) variable is about based on the predictor (independent) variables. This also shows us how strong the relationship between the target and the predictors is.
So, once we have that, we test the model on different data and check how it is performing. Using ROC curve analysis, we can understand how accurately and precisely the model is predicting.
We also use measures like Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), computed from the predicted and actual values of the target variable in the test set, to determine how accurate the model’s predictions are. And we use a confusion matrix, which compares actuals against predictions so we can assess false positives and true positives.
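A sketch of those checks with scikit-learn, on toy actual versus predicted values for a binary classifier (all the numbers below are invented):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

# Toy actuals, hard predictions, and predicted scores (assumed values).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3])

# Confusion matrix: rows are actual classes, columns are predictions.
print(confusion_matrix(y_true, y_pred))

# MSE and RMSE from predicted vs. actual target values in the test set.
mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))
print(mse, rmse)  # 0.25 0.5

# Area under the ROC curve summarizes performance across all thresholds.
print(roc_auc_score(y_true, y_score))
```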
All these things tell us how accurate the model is, and whether we can go live with it or not; it may need more training. Those are the steps to create a model and deploy it.