Every fourth person I meet talks about big data and data analytics. Indeed, people are investing in big data and data analytics like never before. Yet, some questions continue to haunt us:
- Do we really need a big data solution for the given problem?
- Is it being done right?
- What problem are we really solving?
These questions stem largely from the fact that most top management isn’t seeing the expected outcomes from their big data investments.
Who is to blame? The intent or the content?
For starters, having terabytes of data doesn’t make a company eligible to invest in big data, especially if the data isn’t good enough or detailed enough. You are only as good as your data.
When data analytics doesn’t yield the expected returns, the first factor to look at is the data itself, and how well your models fit it. A model can overfit or underfit. Overfitting occurs when a statistical model or machine learning algorithm begins to capture the noise in the data; more specifically, the model shows low bias but high variance. Underfitting, on the other hand, occurs when a model cannot capture the underlying trend of the data; it shows low variance but high bias.
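To make the bias/variance contrast concrete, here is a minimal, self-contained sketch on hypothetical toy data (all numbers invented for illustration): an “underfit” model that ignores the input and always predicts the training mean, versus an “overfit” one-nearest-neighbour model that memorises the training set, compared on held-out data.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 2x plus Gaussian noise.
def make_data(n):
    return [(x, 2 * x + random.gauss(0, 1))
            for x in (random.uniform(0, 10) for _ in range(n))]

train, test = make_data(20), make_data(20)

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit: ignore x entirely and predict the training mean (high bias).
mean_y = sum(y for _, y in train) / len(train)
def underfit(x):
    return mean_y

# Overfit: memorise the training set via 1-nearest-neighbour (high variance).
def overfit(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print(mse(overfit, train))   # exactly 0.0: the noise has been memorised
print(mse(overfit, test))    # nonzero on unseen data: the fit doesn't generalise
print(mse(underfit, train))  # large on training AND test data: the trend is missed
```

The telltale signatures: the overfit model’s error jumps between training and test data (variance), while the underfit model’s error is high on both (bias).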
Data decay is another key issue to look into. Sameer, in his blog post, brilliantly describes the issue and its mitigations, so I will not delve deeper into that aspect here.
Data analytics: Now, the analytical part
If you have good data, analytics can do wonders for you. You can slice and dice customer data, unearth amazing insights, and make your team rethink its current strategy. Recently, one of our customers was awestruck by the insights our platform deciphered about their cross-sell revenue across different countries. The point is: always consider the larger picture, break the process into logical steps, and connect the dots.
For example, let us consider a retail / eCommerce scenario. Profiling a large customer base into selected personas is an initial step for ad recommendations, discount offers, and so on. Understandably, every user’s buying pattern will be different; clustering similar buying patterns will therefore improve ad targeting and thereby the response rate.
The cluster-differentiating signals could be anything from your location and the type of operating system you are using to your detailed transaction history (buying patterns).
Let’s assume that we are looking to identify “Influential Buyers”. They typically care more about the longevity of a product than its brand, color or price, so one key identifier could be whether the user has read all the reviews about the product. When we do the analytics part (here, clustering), we run into some common problem areas: data sources, data validation, data transformations and domain knowledge.
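As a sketch of the clustering step, here is a tiny k-means on hypothetical two-feature buying-pattern vectors; the features (monthly purchase frequency and fraction of reviews read before buying) and the two customer groups are invented for illustration, and in practice you would use a library such as scikit-learn with many more signals.

```python
import random

random.seed(1)

# Hypothetical customers: (purchases per month, fraction of reviews read).
# Two invented segments: casual buyers, and review-reading "influential buyers".
customers = (
    [(random.gauss(2, 0.3), random.gauss(0.2, 0.05)) for _ in range(30)] +
    [(random.gauss(1, 0.3), random.gauss(0.9, 0.05)) for _ in range(30)]
)

def kmeans(points, centers, iters=20):
    """Plain k-means: assign each point to its nearest center, recompute centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Seed with one point from each region (deterministic for the sketch).
centers, clusters = kmeans(customers, [customers[0], customers[30]])
print(len(clusters[0]), len(clusters[1]))  # roughly 30 / 30
```

The resulting cluster centers differ sharply on the “reviews read” feature, which is exactly the kind of separation the “Influential Buyers” persona relies on.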
- Data Source: The source of the data is critical. In the above problem, the data source is the retail / eCommerce company itself, so there is no need to validate the source. The Fiind Smart Signals library comprises 2M+ companies (and growing) across different countries, which makes us look for new data sources proactively, since old data sources may sometimes end abruptly.
- Data Validation: Data validation is generally done by cross-checking data across different sources. Not all data needs validation from more than one source, but a signal like ‘leadership change in a company’ needs to be validated across multiple sources unless there is an update in the company’s press release.
- Data Transformations: This is the phase where data is converted into customer signals, and it is prone to errors. This is also where we tend to use proxy signals: when there is no valid data source for a specific signal in a specific country, we use a proxy signal for that country based on data from other countries. Documenting the transformations, using tools or simple acyclic graphs, helps us assess the analytics later.
- Domain Knowledge: Be it e-commerce customer profiling, ad targeting or financial models, sound domain knowledge is a must for successful data analysis. Once you have tried and tested signals across different domains, you get a fairly strong idea of which signals will work and which won’t. Add customer feedback, and the result can be well-framed analytics that you will rejoice in.
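The transformation-documentation idea from the list above can be sketched as a simple acyclic graph: each signal records the inputs it is derived from (all node names here are hypothetical), and a topological sort yields an auditable processing order while catching accidental cycles. Note that a proxy signal is just another node, so its lineage stays visible.

```python
# Hypothetical signal-transformation graph: signal -> list of inputs it derives from.
pipeline = {
    "raw_transactions": [],
    "raw_reviews": [],
    "buying_pattern": ["raw_transactions"],
    "reviews_read": ["raw_reviews"],
    "influential_buyer": ["buying_pattern", "reviews_read"],
    # Proxy signal: no local review data for country X, so borrow the global signal.
    "influential_buyer_country_x": ["influential_buyer"],
}

def topo_order(graph):
    """Return a valid processing order; raise if the graph has a cycle."""
    order, state = [], {}  # state: 1 = visiting, 2 = done
    def visit(node):
        if state.get(node) == 1:
            raise ValueError(f"cycle involving {node}")
        if state.get(node) != 2:
            state[node] = 1
            for dep in graph[node]:
                visit(dep)
            state[node] = 2
            order.append(node)
    for node in graph:
        visit(node)
    return order

print(topo_order(pipeline))  # raw signals first, derived and proxy signals after
```

Walking this order during an audit shows exactly which raw sources and proxies feed each customer signal, which is what makes the analytics assessable.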
To sum up, it all starts with data. Analytical and data science models are only as good as the data; the data is only as good as how actionable it is; and actionability is only as good as how it is put to use.
Looking forward to hearing what you think!