Good data scientist hunting – the sexiest job of the 21st century
It may be the “sexiest job of the 21st century”, but beyond that there isn’t a great deal of consensus on how to define a data scientist.
Part of the reason it is so hard to pin down a meaningful definition of a data scientist is that the scope of what a data scientist does seems impossibly broad. The organizations that make headway in developing their data science capabilities are the ones that emphasize cultivating data science teams, whose members bring a variety of backgrounds and technical capabilities, instead of trying to identify and recruit the proverbial “unicorn”: a single rock-star, all-round data scientist. But even so, how does one begin to put together a data science team without resorting to the simplistic approach of peppering job descriptions with “Hadoop”, “machine learning”, and other buzzwords du jour?
These have become pressing questions in the big data scene in Malaysia. In May 2012, Big Data Malaysia, a professional networking group, was set up to help connect the ‘supply and demand’ of big data talent. Amongst our network we now see many startup, corporate and government initiatives being pursued. In mid-2013 we organized a broad survey of our community to get a better sense of interests, activities, and challenges faced by the group. Amongst other things, we looked into the issue of skills that were in demand.
The analyzers, analyzed
Part of our study was inspired by Analyzing the Analyzers, a report published in 2012 based on survey findings of the data science community in Washington DC. From the wide range of data science professionals they surveyed, the authors identified four classes of data scientist, who may roughly be characterized as follows:
• ‘Data Businesspeople’ assume a leadership role and focus on bridging the gap between business imperatives and how data assets may be leveraged for business impact.
• ‘Data Creatives’ wear many hats to quickly conceptualize and hammer out a prototype of an idea, then iterate on it with teammates to create a sustainable solution.
• ‘Data Developers’ have more of a ‘back-end’ focus but their role goes far beyond system administration to include designing and implementing policies and mechanisms to manage data.
• ‘Data Researchers’ bring academic practices into business-oriented data science, adding rigor and sanity checking to analytical methods and models, which is crucial to lend credibility to any findings arising from a data project.
Organizations will always differ in their specific needs, which is why greater specificity in data scientist requirements can only be a good thing. Besides providing at least a basic framework to guide recruitment efforts, these four data scientist tracks may also serve as a guide for targeted training initiatives. However, even if this four-track model is accepted as canonical, the right mix of data scientist types remains a question that can only be resolved by recognizing specific organizational requirements.
Wanted: distributed data analysis
Overall, the survey by Big Data Malaysia demonstrated that the greatest demand for skills is in the area of specialized data analysis, modeling, and simulation. This encompasses specialist expertise in areas like operations research and machine learning. The next in-demand skill is distributed systems deployment and administration, which includes capabilities in Hadoop and similar systems.
There are some differences in perspective apparent here. Respondents in the ICT sector were more likely to rate distributed systems as a slightly higher priority than specialized data analysis skills. For non-ICT respondents, however, specialized data analysis is clearly where the greatest demand is, with distributed systems a distant second. Both segments recognize the importance of strong computer science fundamentals, though ICT respondents place slightly more importance on seeking domain specialists, likely to aid in application development for their client markets.
Heatmap comparing Malaysian respondents in ICT vs. non-ICT industries for skills in demand
In response to the question “To deliver your Big Data initiatives, how much need is there to recruit the following skills?”, respondents scored each of the following skills on a 5-point scale, ranging from “No need” to “Critical need”.
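As a rough illustration of how scores like these can be turned into heatmap cells, the sketch below averages the scores for each segment and skill pair. It is not the method used for our heatmap: the numeric coding (1 for “No need” through 5 for “Critical need”), the use of a simple mean, the function name heatmap_cells, and the toy responses are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy responses, NOT survey data: (segment, skill, score).
# Assumed coding: 1 = "No need" ... 5 = "Critical need".
RESPONSES = [
    ("ICT", "distributed systems", 5),
    ("ICT", "distributed systems", 4),
    ("ICT", "specialized data analysis", 4),
    ("non-ICT", "distributed systems", 2),
    ("non-ICT", "specialized data analysis", 5),
    ("non-ICT", "specialized data analysis", 4),
]


def heatmap_cells(responses):
    """Return {(segment, skill): mean score}, one value per heatmap cell."""
    buckets = defaultdict(list)
    for segment, skill, score in responses:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for a 5-point scale: {score}")
        buckets[(segment, skill)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


cells = heatmap_cells(RESPONSES)
for (segment, skill), value in sorted(cells.items()):
    print(f"{segment:8s} {skill:28s} {value:.2f}")
```

Averaging treats the 5-point scale as if its steps were evenly spaced, which is a simplification; reporting the share of respondents at or above a given scale point would be an equally reasonable choice.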
We correlated the data underlying our skills heatmap against the data scientist types presented in the Analyzing the Analyzers report, first mapping our skills list to the report's skills groups, then performing a simple correlation to discern the target data scientist type pursued by each respondent. As expected, not all responses fit the prescribed model cleanly: some did not map to any type, and some mapped to multiple types. We found that 36% of responses matched exactly 1 type, and 46% matched 2 types. We discarded as noise the responses that matched 3 types (12% of responses) and 0 types (5% of responses). No responses matched all 4 types.
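The matching step described above might look something like the following minimal sketch. It assumes a Pearson correlation between each response and a per-type skill profile, with a cut-off above which a response counts as a match. The skill groups, the 0/1 profiles, the 0.5 threshold, and the four toy responses are hypothetical placeholders, not the values or the exact procedure used in our analysis.

```python
from collections import Counter
from math import sqrt

# Hypothetical skill groups and type profiles (1 = group emphasized by the type).
SKILL_GROUPS = ("business", "analysis", "programming", "systems")
TYPE_PROFILES = {
    "Data Businesspeople": (1, 0, 0, 0),
    "Data Creatives": (1, 1, 1, 0),
    "Data Developers": (0, 0, 1, 1),
    "Data Researchers": (0, 1, 0, 0),
}
THRESHOLD = 0.5  # assumed cut-off for counting a correlation as a match


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    denom = sqrt(sum(d * d for d in dx)) * sqrt(sum(d * d for d in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def matched_types(scores):
    """Names of the types whose profile correlates with the response above THRESHOLD."""
    return [name for name, profile in TYPE_PROFILES.items()
            if pearson(scores, profile) > THRESHOLD]


# Hypothetical responses: one 1-5 score per skill group, in SKILL_GROUPS order.
RESPONSES = {
    "A": (5, 1, 1, 1),
    "B": (5, 5, 1, 1),
    "C": (3, 3, 3, 3),
    "D": (1, 5, 5, 1),
}

matches = {rid: matched_types(scores) for rid, scores in RESPONSES.items()}
match_counts = Counter(len(m) for m in matches.values())
# Keep responses matching exactly 1 or 2 types; treat the rest as noise.
kept = {rid: m for rid, m in matches.items() if len(m) in (1, 2)}
```

In this toy data, a response that scores every skill group identically has no variance and therefore matches no type, while a response with two strong skill groups can clear the threshold for several overlapping profiles at once. Both effects are one plausible reading of why some real responses fell outside the one-or-two-type range.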
The results indicate a very large demand for Data Creatives. This is unsurprising, given that most data initiatives in Malaysia are still rather nascent, and will therefore benefit tremendously from the bootstrapping abilities of Data Creatives. Our findings also indicate significant interest in Data Researchers and Data Businesspeople, the latter more so from ICT respondents. Despite all the hype surrounding Hadoop, our findings indicate relatively little current demand for pure Data Developers. This may be because most projects are in their infancy, so there isn’t yet a need for the kind of “at scale” capabilities that a Data Developer might provide.
Recruit vs. Outsource
Once organizations have figured out what mix of data scientists they need, they face yet another curveball: outsource or build internal capacity? Outsourcing may simply be a necessity. “Big Data experts are few and far between”, observes Robin Woo, Senior IT Manager at Western Digital Corporation, who was one of our survey respondents.
Despite the scarcity of experts, our findings indicate some reluctance towards outsourcing amongst Malaysian organizations. Respondents were asked, “How willing would you be to outsource high-skill tasks in your Big Data initiatives to external consultants?” On a scale from “Not at all willing” to “Extremely willing”, fewer than half of our respondents answered “Moderately willing” or above.
Drilling into the data, we found that this opinion was not evenly spread across segments. Although only 45% of respondents in the overall survey sample were willing to outsource high-skill tasks in their Big Data initiatives, this percentage climbs to 50% when considering only respondents who identify as being in the ICT industry.
In light of the fact that data science is a multi-faceted discipline, and especially if one assumes a four-track model like the one proposed by Analyzing the Analyzers, the issue of attitudes towards outsourcing of data science roles needs more careful study. A blanket outsource-vs-recruit rule is far from ideal, which is perhaps why the single question posed in our survey was not sufficient to capture the nuances of the issue.
For instance, although Robin Woo recognizes the scarcity of talent in this area, he cautions against outsourcing everything; in particular, he feels any analytics operations that are tied to business imperatives need to be kept internal. However, for the more technology-centric bits (perhaps analogous to the role of Data Developers), he argues that it makes sense to outsource where possible, especially given the shifting technological landscape. “For practical purposes it is better to engage proven experts.”
While there is no ‘one-size-fits-all’ specification for a data scientist or data science team, and best practice is only just emerging, it can be useful to bear in mind the experiences of those at the cutting edge. Although our understanding of data science as a discipline will surely evolve over time, having a four-track model provides a place to start, and at least in the case of Malaysia, it’s clearly a good time to be a Data Creative.
The authors would like to acknowledge Olygen (organizers of the Big Data World Show), Merlien Institute (organizers of Market Research in the Mobile World), and Revolution Analytics, for their support of our survey lucky draw.
About the authors
Tirath Ramdas (firstname.lastname@example.org) is the founder of Big Data Malaysia. If he had to pigeonhole himself, he would say that he is a Data Developer, though he has some experience that may let him pretend to be a Data Researcher or Data Businessperson. Prior roles include academic research in computational chemistry, non-profit strategy consulting, and software storage system engineering. Currently he is a consultant working in the field of information security.
Sandra Hanchard (email@example.com) is an industry analyst with expertise in consumer adoption of new technologies. Formerly, Sandra was the Asia Pacific lead for custom research with global Internet measurement firm, Experian Hitwise. Currently, Sandra is affiliated with The Swinburne Institute. She is the recipient of an Australian Postgraduate Award for her doctoral thesis which investigates social media information use in Malaysia.
For more information on Big Data Malaysia and our survey “Big Data: Emerging Sector Profile” please visit www.bigdatamalaysia.org or email either of the authors.