Data scientists in every sector are grappling with the implications of Big Data. They are finding it difficult to deal with the increasing volume, variety and detail of information captured by enterprises. The use of video, emojis and text in social media, and the range of information that Internet of Things devices emit every minute, will fuel exponential growth in data for the foreseeable future.
In my day-to-day job, I interact closely with data scientists. They are some of the brightest people around, looking to solve the next set of business challenges. Unfortunately, many of them are frustrated today because of three key challenges they face:
Poorly defined business use-case
The data scientist is expected to do some magic, identify hidden patterns in data and provide ground-breaking insights. Unfortunately, there is no magic in data science; that is why it is called a "science". The basis of a successful project is a well-defined business use-case. The selection of data and of statistical techniques is secondary. The more concretely customers and business users define the scope of work and the expected outcome, the more effective the data scientist can be. So always start with a business use-case and not a statistical technique. Avoid being a hammer looking for a nail.
Unclean, disconnected data
Preparing data is probably the least value-adding activity for a data scientist, and something they are not trained for in the first place. Yet a foundation of clean, connected data is critical for a successful outcome. Cleansing data, connecting keys across data sources and managing the data set add up to a significant amount of work that requires special skills. For example, in the customer analytics world, the ability to draw signals from customer data in one channel and predict behavior in another channel is the holy grail. People want to know the value of a positive review on their brand page. If a social marketer could quantify that a customer who leaves a positive review is twice as likely to become a high-value customer, they would be able to justify the value of social marketing. The data scientist can do this, but they need a single view of the customer to start their work. Today the data scientist spends 70 to 80% of their effort getting the data and not doing their real job. Avoid "garbage in, garbage out" analytics.
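To make the single-view idea a little more concrete, here is a minimal Python sketch of the kind of work involved. Everything in it is an assumption for illustration: the records are made up, email is assumed to be the key shared by the two channels, and the field names, function names and the 1000 spend threshold for "high value" are hypothetical. Real customer matching is usually far messier than this.

```python
from collections import defaultdict

# Made-up records from two channels (hypothetical field names and values).
reviews = [  # social channel
    {"email": "Ann@Example.com ", "sentiment": "positive"},
    {"email": "bob@example.com", "sentiment": "negative"},
    {"email": "cy@example.com", "sentiment": "positive"},
]
purchases = [  # sales channel
    {"email": "ann@example.com", "amount": 700},
    {"email": "ann@example.com", "amount": 500},
    {"email": "bob@example.com", "amount": 150},
    {"email": "cy@example.com", "amount": 90},
    {"email": "dee@example.com", "amount": 2000},
    {"email": "eli@example.com", "amount": 100},
    {"email": "fay@example.com", "amount": 80},
]


def normalize_key(email):
    """Cleanse the join key so the same customer matches across channels."""
    return email.strip().lower()


def build_single_view(reviews, purchases, high_value_threshold=1000):
    """Merge both channels into one record per customer, keyed by email."""
    view = defaultdict(lambda: {"positive_review": False, "total_spend": 0})
    for review in reviews:
        record = view[normalize_key(review["email"])]
        if review["sentiment"] == "positive":
            record["positive_review"] = True
    for purchase in purchases:
        view[normalize_key(purchase["email"])]["total_spend"] += purchase["amount"]
    for record in view.values():
        # Assumed definition of "high value": total spend at or above the threshold.
        record["high_value"] = record["total_spend"] >= high_value_threshold
    return dict(view)


def high_value_rate(view, positive_review):
    """Share of high-value customers among those with / without a positive review."""
    group = [c for c in view.values() if c["positive_review"] == positive_review]
    if not group:
        return None
    return sum(c["high_value"] for c in group) / len(group)


def relative_likelihood(view):
    """How many times more likely positive reviewers are to be high value."""
    with_review = high_value_rate(view, True)
    without_review = high_value_rate(view, False)
    if with_review is None or not without_review:
        return None  # not enough data to form a ratio
    return with_review / without_review


view = build_single_view(reviews, purchases)
print(relative_likelihood(view))
```

Note how much of the sketch is spent cleaning the key and stitching the two sources together, and how little on the actual question. A ratio computed on a handful of rows like this proves nothing, of course; the point is only that the question cannot even be asked until the channels are joined.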
Difficulty moving models into production
Once the data has been reasonably organized, the data scientist can develop models that are suitable for the use case at hand. Now moving the model into production becomes the bottleneck. Typically, advanced data scientists work with their own preferred tools. Some of these tools are flexible and allow quick development, but are not meant for large-scale deployment. They lack the ability to run at scale in a production environment. With multiple models in play, the data scientist also finds it hard to manage all of them. This is where industry standards like PMML (Predictive Model Markup Language) and advanced customer analytics platforms come into play. Once the model is developed, it can be converted into PMML format and run on any modern customer analytics platform that supports the standard. Using such formats and platforms enables the data scientist to move models into production quickly, without bugs creeping in at the last stage. It also helps them share results with business users and automate the usage of their models in a scalable manner.
Today, businesses that believe analytics can be a key competitive advantage should not just focus on buying the shiniest piece of technology in the marketplace, but should also spend effort on addressing these three challenges. Fixing them will dramatically improve the accuracy, development times and adoption of analytics across the organization.