The huge rise in data, both in its collection and in its usage, has fuelled the growth of the data analytics sector. The skills of data analysts are increasingly in demand, and so is the ability to find meaning in the vast volumes of data that we collect today.
With this increased pressure and higher stakes, the tolerance for mistakes in this industry is quite low. Data analysts are expected to crunch the numbers and find profitable patterns and trends in them in a short space of time. And yet, there are a few common mistakes that many data analysts continue to make and that hamper their work:
Starting the analysis without a clear goal/hypothesis
Before starting to work on the data, there must be clarity about the theories one wishes to pursue and the reasons for pursuing them. The assumptions must be specified, and so must the hypothesis. Without clarity at this step, further analysis or interpretation will not reveal any clear insights.
Not cleaning up or normalizing the data
Data cleansing and normalization are perhaps the most tedious parts of the whole analysis process, and they are quite time-consuming as well. However, if they are skipped, the analysis will be polluted and the inferences drawn from it may be simply wrong.
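As a rough illustration of what this step can involve, here is a minimal sketch in plain Python. The sample records (age, income), the `clean` helper, and the choice of min-max scaling are all assumptions made for the example; real projects will have their own rules for what counts as dirty data and which normalization is appropriate.

```python
def clean(rows):
    """Drop rows with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # incomplete record
        if row in seen:
            continue  # exact duplicate of an earlier record
        seen.add(row)
        cleaned.append(row)
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw records: (age, yearly income)
raw = [(25, 40000.0), (None, 52000.0), (31, 61000.0), (25, 40000.0), (47, 90000.0)]

rows = clean(raw)
ages = min_max_normalize([r[0] for r in rows])
incomes = min_max_normalize([r[1] for r in rows])
```

After cleaning, only three of the five records remain, and both columns sit on the same 0 to 1 scale, so neither dominates a later comparison simply because of its units.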
Allowing bias to creep in
Bias must not be allowed to dictate the analysis. Once the assumptions and the hypothesis have been specified, it is important not to let any bias creep in. If intuition is allowed to override the evidence, issues like confirmation bias, selection bias, selective outcome reporting, and outlier bias may skew your analysis.
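To make one of these concrete, the short sketch below shows how a single outlier can skew a summary statistic. The order values are invented for the example, and comparing the mean with the median is just one simple way to spot the effect, not a complete treatment of outliers.

```python
from statistics import mean, median

# Hypothetical order values; the last one is an extreme outlier.
orders = [48, 50, 51, 49, 52, 500]
typical = orders[:-1]  # the same data without the outlier

mean_with = mean(orders)        # pulled far upward by the single outlier
median_with = median(orders)    # barely moves
mean_without = mean(typical)
median_without = median(typical)

print("mean:  ", mean_without, "->", mean_with)
print("median:", median_without, "->", median_with)
```

Whether such a point should be kept, capped, or removed depends on the question being asked; the important thing is to decide that by a stated rule rather than by whichever choice favours the expected result.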
Overfitting or underfitting the data
A common error, overfitting refers to the model being needlessly complex and fitting the existing limited data points so closely that it captures their noise, so the error on those points drops significantly. However, if the model is overfitted, the inferences that it draws for real-world data points will be wide of the mark, and the predictions from the analysis will not be good ones.
Underfitting lies at the other end of the spectrum. It is said to occur when the model is so simply defined that it has a large error on both the existing data points and the real-life data points.
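The contrast between the two can be sketched with a small experiment. The example below is an illustration under stated assumptions: the data are synthetic (y = x squared plus random noise), and a k-nearest-neighbours average stands in for "the model", with k controlling its complexity. With k = 1 the model memorises every training point; with k equal to the size of the training set it always predicts the overall average.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the average target of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model (fitted on train) over data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


def sample(xs):
    """Hypothetical data source: y = x^2 plus Gaussian noise."""
    return [(x, x * x + random.gauss(0, 2.0)) for x in xs]


random.seed(0)
train = sample([i / 10 for i in range(0, 40, 2)])  # the "existing" data points
test = sample([i / 10 for i in range(1, 40, 2)])   # unseen "real-world" points

for k in (1, 5, len(train)):
    print(f"k={k:2d}  train MSE={mse(train, train, k):7.3f}  "
          f"test MSE={mse(train, test, k):7.3f}")
```

With k = 1 the training error is exactly zero because every point is its own nearest neighbour, yet the error on unseen points is not, which is the signature of overfitting. With k equal to the number of training points the error is large on both sets, which is underfitting. An intermediate k usually lands between the two extremes, although the exact figures depend on the noise and the random seed.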