THE NEED FOR DATA SCIENCE
Organisations worldwide are increasingly looking to data science teams to provide business insight, understand customer behaviour and drive new product development. The broad field of Artificial Intelligence (AI) including Machine Learning (ML) and Deep Learning (DL) is exploding both in terms of academic research and business implementation. Some of the world’s biggest companies including Google, Facebook, Uber, Airbnb, and Goldman Sachs derive much of their value from data science effectiveness. These companies use data in very creative ways and are able to generate massive amounts of competitive advantage and business insight through the effective use of data.
Have you ever wondered how Google Maps predicts traffic? How does Facebook know your preferences so accurately? Why would Google give a platform as powerful as Gmail away for free? Having data and a great idea is a start – but the likes of Facebook’s and Google’s have figured out that a key step in the creation of amazing data products (and the resultant business value generation) is the formation of highly effective, aligned and organisationally-supported data science teams.
EFFECTIVE DATA SCIENCE TEAMS
How exactly have these leading data companies of the world established effective data science teams? What skills are required and what technologies have they employed? What processes do they have in place to enable effective data science? What cultures, behaviours and habits have been embraced by their staff and how have they set up their data science teams for success? The focus of this blog is to better understand at a high level what makes up an effective data science team and to discuss some practical steps to consider. This blog also poses several open-ended questions worth thinking about. Later blogs in this series will go into more detail in each of the dimensions discussed below.
Drew Harry, Director of Science at Twitch wrote an excellent article titled “Highly Effective Data Science teams”. He states that “Great data science work is built on a hierarchy of basic needs: powerful data infrastructure that is well maintained, protection from ad-hoc distractions, high-quality data, strong team research processes, and access to open-minded decision-makers with high leverage problems to solve” .
In my opinion, this definition accurately describes the various dimensions that are necessary for data science teams to be effective. As such, I would like to attempt to decompose this quote further and try to understand it in more detail.
Drew Harry’s Hierarchy of Basic Data Science Needs
GREAT DATA SCIENCE REQUIRES POWERFUL DATA INFRASTRUCTURE
A common pitfall of data science teams is that they are sometimes forced either through lack of resources or through lack of understanding of the role of data scientists, to do time-intensive data wrangling activities (sourcing, cleaning, preparing data). Additionally, data scientists are often asked to complete ad-hoc requests and build business intelligence reports. These tasks should ideally be removed from the responsibilities of a data science team to allow them to focus on their core capabilities: that is utilising their mathematical and statistical abilities to solve challenging business problems and find interesting patterns in data rather than expending their efforts on housekeeping work. To do this, ideally data scientists should be supported by a dedicated team of data engineers. Data engineers typically build robust data infrastructures and architectures, implement tools to assist with data acquisition, data modeling, ETL, data architecture etc.
An example of this is at Facebook, a world leader in data engineering. Just imagine for a second the technical challenges inherent in providing over one billion people a personalised homepage full of various posts, photos and videos on a near-real time basis. To do this, Facebook runs one of the world’s largest data warehouses storing over 300 petabytes of data  and employs a range of powerful and sophisticated data processing techniques and tools . This data engineering capability enables thousands of Facebook employees to effectively use their data to focus on value enhancing activities for the company without worrying about the nuts and bolts of how the data got there.
I realise that we are not all blessed with the resources and data talent inherent in Silicon Valley firms such as Facebook. Our data landscapes are often siloed and our IT support teams where data engineers traditionally reside mainly focus on keeping the lights on and putting out fires. But this model has to change – set up your data science teams to have the best chance of success. Co-opt a data engineer onto the data science team. If this is not possible due to resource constraints then at least provide your data scientists with the tools to easily create ETL code and rapidly spin up bespoke data warehouses thus enabling them with rapid experimentation execution capability. Whatever you do, don’t let them be bogged down in operational data sludge.
GREAT DATA SCIENCE REQUIRES EASILY ACCESSIBLE, HIGH-QUALITY DATA
Data should be trusted, and be of a high quality. Additionally, there should be enough data available to allow data scientists to be able to execute experiments. Data should be easily accessible, and the team should have processing power capable of running complex code in reasonable time frames. Data scientists should, within legal boundaries, have easy, autonomous, access to data. Data science teams should not be precluded from the use of data on production systems and mechanisms need to be put in place to allow for this rather than being banned from use just because “hey – this is production – don’t you dare touch!”
In order to support their army of business users and data scientists, eBay, one of the world’s largest auction and shopping sites, has successfully implemented a data analytics sandbox environment separate from the company’s production systems. eBay allows employees that want to analyse and explore data to create large virtual data marts inside their data warehouse. These sandboxes are walled off areas that offer a safe environment for data scientists to experiment with both internal data from the organisation as well as providing them with the ability to ingest other types of external data sources.
I would encourage you to explore the creation of such environments in your own organisations in order to provide your data science teams with easily accessible, high quality data that does not threaten production systems. It must be noted that to support this kind of environment, your data architecture must allow for the integration of all of the organisation’s (and other external) data – both structured and unstructured. As an example, eBay has an integrated data architecture that comprises of an enterprise data warehouse that stores transactional data, a separate Teradata deep storage data base which stores semi-structured data as well as a Hadoop implementation for unstructured data . Other organisations are creating “data lakes” that allow raw, structured and unstructured data to be stored in a vast, low-cost data stores. The point is that the creation of such integrated data environments goes hand in hand with providing your data science team with analytics sandbox environments. As an aside, all the efforts going into your data management and data compliance projects will also greatly assist in this regard.
GREAT DATA SCIENCE REQUIRES ACCESS TO OPEN-MINDED DECISION-MAKERS WITH HIGH LEVERAGE PROBLEMS TO SOLVE
DJ Patel stated that “A data-driven organisation acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape” . This culture of being data-driven needs to be driven from the top down. As an example, Airbnb promotes a data-driven culture and uses data as a vital input in their decision-making process . They use analytics in their everyday operations, conduct experiments to test various hypotheses, and build statistical models to generate business insights to great success.
Data science initiatives should always be supported by top-level organisational decision-makers. These leaders must be able to articulate the value that data science has brought to their business . Wherever possible, co-create analytics solutions with your key business stakeholders. Make them your product owners and provide feedback on insights to them on a regular basis. This will help keep the business context front of mind and allows them to experience the power and value of data science directly. Organisational decision-makers will also have the deepest understanding of company strategy and performance and can thus direct data science efforts to problems with the highest business impact.
GREAT DATA SCIENCE REQUIRES STRONG TEAM RESEARCH PROCESSES
Data science teams should have strong operational research capabilities and robust internal processes. This will enable the team to be able to execute controlled experiments with high levels of confidence in their results. Effective internal processes can assist in promoting a culture of being able to fail fast, fail quickly and provide valuable feedback into the business experiment/data science loop. Google and Facebook have mastered this in their ability to amongst other things; aggregate vast quantities of anonymised data, conduct rapid experiments and share these insights internally with their partners thus generating substantial revenues in the process.
Think of this as employing robust software engineering principles to your data science practice. Ensure that your documentation is up to date and of a high standard. Ensure that there is a process for code review, and that you are able to correctly interpret the results that you are seeing in the data. Test the impact of this analysis with your key stakeholders. As Drew Harry states, “controlled experimentation is the most critical tool in data science’s arsenal and a team that doesn’t make regular use of it is doing something wrong” .
This blog is based on a decomposition of Drew Harry’s definition of what enables great data science teams. It provides a few examples of companies doing this well and some practical steps and open-ended questions to consider.
To summarise: A well-balanced and effective data science team requires a data engineering team to support them from a data infrastructure and architecture perspective. They require large amounts of data that is accurate and trusted. They require data to be easily accessible and need some level of autonomy in accessing data. Top level decision makers need to buy into the value of data science and have an open mind when analysing the results of data science experiments. These leaders also need to be promoting a data-driven culture and provide the data science team with challenging and valuable business problems. Data science teams also need to keep their house clean and have adequate internal processes to execute accurate and effective experiments which will allow them to fail and learn quickly and ultimately become trusted business advisors.
SOME FINAL QUESTIONS WORTH CONSIDERING AND NEXT STEPS
In writing this, some intriguing questions come to mind: Surely there is an African context to consider here? What are we doing well on the African continent and how can we start becoming exporters of effective data science practices and talent. Other questions that come to mind include: To what end does all of the above need to be in place at once? What is the right mix of data scientists/engineers and analysts? What is the optimal mix of permanent, contractor and crowd-sourced resources (e.g. Kaggle-like initiatives )? Academia, consultancies and research houses are beating the drum of how important it is to be data-driven, but to what extent is this always necessary? Are there some problems that shouldn’t be using data as an input? Should we be purchasing external data to augment the internal data that we have, and if so, what data should we be purchasing? One of our competitors recently launched an advertising campaign explicitly stating that their customers are “more than just data” so does this imply that some sort of “data fatigue” is setting in for our clients?
My next blog will explore in more detail, the ideal skillsets required in a data engineering team and how data engineering can be practically implemented in an organisation’s data science strategy. I will also attempt to tackle some of the pertinent open-ended questions mentioned above.
The dimensions discussed in this blog are by no means exhaustive, and there are certainly more questions than answers at this stage. I would love to see your comments on how you may have seen data science being implemented effectively in your organisations or some vexing questions that you would like to discuss.
References https://medium.com/mit-media-lab/highly-effective-data-science-teams-e90bb13bb709  https://blog.keen.io/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest-9b7cd881af54  https://www.wired.com/2013/02/facebook-data-team/  http://searchbusinessanalytics.techtarget.com/feature/Data-sandboxes-help-analysts-dig-deep-into-corporate-info  https://books.google.co.za/books?id=wZHe0t4ZgWoC&printsec=frontcover#v=onepage&q&f=false  https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c?s=keen-io  https://www.kaggle.com/
by Nicholas Simigiannis