Executive Brief
In this article, we continue our series on technology in government with a look at Big Data. We review the impact of Big Data on government and the technologies commonly applied to manage it. First, let's establish some basic definitions and the scope of this article.
What is big data?
While Roger Magoulas of O’Reilly Media is most commonly credited with coining the term “Big Data” back in 2005 and launching it into mainstream consciousness, the term had been floating around for years before that (researchers have found examples dating from the mid-1990s in Silicon Graphics (SGI) slide decks). Nevertheless, Big Data essentially refers to data sets so large that their size becomes an encumbrance when trying to manage and process them using traditional data management tools.
According to IBM, we create 2.5 quintillion bytes of data each day. Big Data is commonly described by three characteristics:
- Volume: Big Data involves large amounts of data generated across a variety of applications and industries. At the time of writing, data sets ranging from hundreds of gigabytes to terabytes and petabytes easily qualify under the definition.
- Variety: With a wide and disparate range of sources, the data can be structured (like a database), semi-structured (indexed) or unstructured.
- Velocity: The data is generated at high speed and needs to be processed in relatively short durations (seconds).
Why is big data important?
Big Data marks an important shift in how we interpret data to find meaning in the world. The advent of social networking and e-commerce created a need for suppliers of rapidly commoditizing online services to learn about the behavior of online users in order to tailor a superior user experience. Some of the most successful companies in the world (hint: it starts with the letter ‘G‘) have based their entire business models on delivering customized ads to users based on their search queries. Prominent research projects such as the SETI (Search for Extra-Terrestrial Intelligence) program, NASA’s Mars Rover missions and the Human Genome sequencing effort called for something similar:
The ability to perform lightning-fast computation on extremely large data sets that are also subject to frequent change.
The challenges of traditional data management tools
The problem with conventional approaches to managing data is that the data primarily has to be structured. Picture a database that supports the catalog of a conventional e-commerce website and holds hundreds of thousands of items. The database is structured and relational, meaning that each item put up for sale on the site is stored as an object described by a number of attributes: the item’s name, SKU number, category, price, description, and so on. For each item we load into the database, we can perform searches by product category and description, and even sort the products by price. This is efficient, because almost every object in the database has the same types of attributes. Relational database technologies such as MySQL, Oracle and SQL Server are great at handling this and are still very much in use today.
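To make the contrast concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the table and values are invented for illustration. Every product conforms to one schema defined up front, which is exactly what makes filtering and sorting cheap:

```python
import sqlite3

# A fixed schema, declared before any data arrives: every product
# shares the same attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        sku         TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        category    TEXT NOT NULL,
        price       REAL NOT NULL,
        description TEXT
    )
""")
conn.execute(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?)",
    ("SKU-1001", "Desk Lamp", "Lighting", 24.99, "Adjustable LED lamp"),
)

# Because every row conforms to the schema, category filters and
# price sorts are straightforward.
for row in conn.execute(
    "SELECT name, price FROM products WHERE category = ? ORDER BY price",
    ("Lighting",),
):
    print(row)
```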
The problem we encounter with Big Data is that the data is subject to frequent change. With a relational system, we need to define a structure, or schema, ahead of time. That’s not a big problem for an online shopping cart database, since most items share the same attributes as described above. But what if we don’t know the attributes of the data we plan to store? Imagine a service that crawls the web for real estate websites in a particular region, with the objective of building an aggregated repository of properties for sale or rent that users can query. The collected data can come in a variety of sizes and types: HTML files, media files (JPEGs and MPEGs), or plain strings of characters. In some cases it is impossible to define a structure ahead of time, because we simply don’t know what’s out there.
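By contrast, here is a toy sketch of what such a crawler’s output might look like (all field names are invented): three records from three sites, no two sharing the same attributes, which is why no relational schema can be fixed in advance:

```python
# Records scraped from three hypothetical real-estate sites. The
# attributes (and their types) differ per record, so no single
# relational schema fits them all ahead of time.
listings = [
    {"source": "site-a", "price": 450_000, "bedrooms": 3, "photos": ["1.jpg"]},
    {"source": "site-b", "rent_per_month": 1_200, "description_html": "<p>Cosy flat</p>"},
    {"source": "site-c", "raw_text": "3bd/2ba, call agent for price"},
]

# A schema-less store keeps whatever fields each record has;
# queries must tolerate missing attributes.
for_sale = [listing for listing in listings if "price" in listing]
print(for_sale)
```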
So what happens each time we need to change the structure of a relational database? Rolling out schema changes is a potentially complex, time- and resource-intensive process with a definite performance impact on the database while the change is applied. Conventional remedies such as adding more computing resources or splitting the database into shards are feasible, but do not fundamentally change how the data is managed.
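As a rough illustration of why such changes are costly, consider what even the simplest migration involves (sqlite3 again, with invented columns): an ALTER TABLE followed by a backfill of every existing row. On a production-scale table with live traffic, both steps take time and degrade performance while they run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL)")
conn.execute("INSERT INTO products VALUES ('SKU-1001', 24.99)")

# Adding a single attribute is a schema migration: alter the table,
# then backfill every existing row.
conn.execute("ALTER TABLE products ADD COLUMN currency TEXT")
conn.execute("UPDATE products SET currency = 'USD'")
```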
Solution: Big Data Technology
In the previous section, we explored the need for corporations and organizations to manage increasingly large amounts of data, as well as the ineffectiveness of existing database management systems in dealing with these large data sets. In this section, we briefly cover the most commonly deployed solutions in the industry for Big Data management.
Hadoop: Some industry executives have likened Hadoop to the brand “Kleenex”, meaning that Hadoop has become synonymous with Big Data. Hadoop was largely developed at Yahoo and named after the toy elephant of creator Doug Cutting’s son. Hadoop’s mechanism and components are described briefly below:
- Distributed Cluster Architecture: A Hadoop cluster comprises a collection of nodes (a master plus workers). The master node is responsible for assigning and coordinating tasks via the JobTracker role. Hadoop has two basic layers:
- The HDFS layer: The Hadoop Distributed File System maintains consistency of data distributed across a large number of data nodes. Large files are split across the cluster and managed via a metadata server known as the NameNode. Each data node serves up data over the network using a block protocol specific to HDFS. HDFS offers a number of high-availability features, including replication and rebalancing of data across nodes. A major advantage of HDFS is location awareness: computational processes are scheduled on nodes close to the data they need, thereby reducing network traffic.
- The MapReduce layer: The processing logic of MapReduce consists of a Map function and a Reduce function. The Map function transforms input records into intermediate key-value pairs (e.g. (word, 1)); the framework then groups the pairs by key, and the Reduce function aggregates the values for each key (e.g. summing the counts). A word-count sketch appears after this list.
- Additional Components: Hadoop is commonly implemented with a number of additional services. We’re listing the most common components here:
- Pig: Pig is a scripting language for creating MapReduce queries.
- Hive: Hive is a data warehouse infrastructure that provides an SQL-like query language (HiveQL) over data stored in Hadoop.
- Sqoop: Sqoop is a relational database connector that transfers bulk data between Hadoop and structured data stores, allowing connectivity into a company’s Business Intelligence layer.
- Scheduling: Scheduling tools such as Facebook’s Fair Scheduler and Yahoo’s Capacity Scheduler allow users to prioritize jobs and implement some degree of Quality of Service.
- Other tools: A number of other tools are available for managing Hadoop, including HCatalog, a table management service for access to Hadoop data, and Ambari, a monitoring and management console.
- Batch Processing: Hadoop fundamentally uses a batch processing system to manage data. Processing is typically divided up into the following steps:
- Data is divvied up into small units and distributed across a cluster
- Each data node receives a subset of the data and applies the map and reduce functions to data stored locally (or in cloud storage)
- The JobTracker coordinates jobs across the cluster
- Data may be processed in a workflow where outputs of one map/reduce pair become inputs for the next
- Data results may be applied to additional analysis/ reporting or BI tools
- Hadoop Distributions: Hadoop originated as an Apache project and has very recently (circa October 2012) been released by Microsoft as Microsoft HDInsight Server for Windows and the Windows Azure HDInsight Service for the cloud. Other large-vendor support for Hadoop includes the Oracle Big Data Appliance, which integrates with Cloudera’s distribution of Apache Hadoop; Amazon’s AWS Elastic MapReduce service for the cloud; and Google’s AppEngine-MapReduce on Google App Engine.
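To tie the layers and the batch steps together, here is a single-process Python sketch of the classic word-count pattern (the input strings are invented). In a real cluster, HDFS distributes the chunks and the JobTracker coordinates the map, shuffle and reduce phases across nodes; this sketch only mimics the data flow:

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # Map: emit one intermediate (key, value) pair per word, e.g. ("data", 1).
    return [(word.lower(), 1) for word in document.split()]

def reduce_fn(key, values):
    # Reduce: aggregate all values that share the same key.
    return key, sum(values)

# 1. Input is split into chunks, which Hadoop would spread across data nodes.
chunks = ["big data needs big tools", "data about data"]

# 2. Each node applies the map function to its local chunk.
mapped = chain.from_iterable(map_fn(chunk) for chunk in chunks)

# 3. The shuffle phase groups intermediate pairs by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 4. Reduce aggregates each group into the final result.
results = dict(reduce_fn(key, values) for key, values in groups.items())
print(results)  # {'big': 2, 'data': 3, 'needs': 1, 'tools': 1, 'about': 1}
```

In a workflow, the output of one map/reduce pair simply becomes the input of the next: here, `results` could be fed to a further job for analysis or reporting.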
Latest Trends in Government
Now that we’ve covered the basics of Big Data, we are ready to explore common implementations in the government sector around the world. Large governments have led the charge, with more than 160 large Big Data programmes being pursued by the US Government alone.
- Search Engine Analytics: A pressing need to search vast amounts of data made publicly available by recent policy changes has produced a great practical application for Hadoop and Hive. For example, the UK government uses Hadoop to pre-populate relevant and possible search terms as a user types into a search box (a toy version of this lookup appears after this list).
- Digitization Programs: The cost implications of ‘going digital’ are large, and regulators are taking notice, with some estimates that online transactions can be 20 times cheaper than by phone, 30 times cheaper than face-to-face, and up to 50 times cheaper than by post. For example, the UK government stated in its November 2012 Government Digital Strategy that it can save up to £1.2 billion by 2015 just by making public services digital by default. A number of large government bodies have been tasked with identifying high-volume transactions (more than 100,000 a year) that can be digitized. Successful digitization requires a number of key elements:
- Non-exclusive policies: Bodies or groups that lack the capability to go digital must not be penalized; the choice to go digital should remain open. Users unfamiliar with accessing digital information should also be offered alternative channels such as contact centers.
- Consolidation of processes: A number of governments are moving toward a single consolidated online presence. For example, the UK government is consolidating publishing activities across all 24 UK central government websites into the GOV.UK website. Consolidating information without incurring performance penalties requires standardization on common platforms and technologies.
- Large Agency initiatives: The largest agencies and ministries are spearheading programs on Big Data, with applications in health, defense, energy and meteorology attracting significant interest:
- Health Services: The US Centers for Medicare and Medicaid Services (CMS) is developing a data warehouse based on Hadoop to support the analytic and reporting requirements of the Medicare and Medicaid programs. The US National Institutes of Health (NIH) is developing the Cancer Imaging Archive, an image data-sharing service that leverages imaging technology used to assess therapeutic responses to treatment.
- Defense: The US Department of Defense listed 9 major projects in a March 2012 White House paper on the adoption of Big Data analysis across the government. Major applications involve artificial intelligence, machine learning, image and video recognition, and anomaly detection.
- Energy: The US Department of Energy is investing in research through its Next Generation Networking program to move large data sets (more than 1 petabyte per month) for the Open Science Grid, ESG and biology communities.
- Meteorology: The US National Weather Service uses Big Data in its modeling systems to improve tornado forecasting. Modern weather prediction systems process vast amounts of data collected from ground sources and from a geostationary satellite planned for launch in 2014; because weather conditions change constantly, rapid processing of this high-velocity data is paramount.
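Returning to the first bullet above, here is a toy version of that search-suggestion lookup (the query log is invented). In practice, the popularity counts would be precomputed offline with Hadoop or Hive over far larger query logs, and only the lookup would run interactively:

```python
from collections import Counter

# Hypothetical log of past searches against a public-data portal.
query_log = [
    "housing benefit", "housing grants", "hospital waiting times",
    "housing benefit", "school admissions",
]
popularity = Counter(query_log)

def suggest(prefix, limit=3):
    # Return the most popular logged queries that start with the prefix.
    matches = [q for q in popularity if q.startswith(prefix)]
    return sorted(matches, key=popularity.get, reverse=True)[:limit]

print(suggest("hous"))  # ['housing benefit', 'housing grants']
```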
Strategic Value
Big Data is transformative in the sense that it gives us an opportunity to perform deep, meaningful analysis of information beyond what is normally available. The idea is that with more information at our fingertips, we can make better decisions.
Positive Implications
- Greater Transparency: Big Data creates the opportunity for greater transparency by making more data accessible, more often, to wider constituencies of people.
- More opportunities for enhancing performance: By giving users access to not only greater amounts but also greater varieties of data, we create more opportunities to identify patterns and trends by connecting information from more sources, helping us capitalize on opportunities and expose threats. The result is an overall improvement in the quality of decision making that can translate into greater performance.
- Better Decisions: By allowing systems to collect more data and then applying Big Data analysis techniques to draw meaningful information from these data sets, we can make better, more timely and informed decisions.
- Greater segmentation of stakeholders: By exposing our analytics to greater pools of raw data, we can find interesting ways to segment our constituents, identifying unique patterns at a more granular level and devising solutions and services to meet those needs. For example, we could use Big Data to identify elderly residents of a particular part of a city who live alone and have a medical condition requiring specialist care, then use this information to plan staffing and service availability, as sketched below.
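A minimal sketch of that segmentation (the records and field names are entirely invented, and a real deployment would require strict anonymization and access controls):

```python
# Hypothetical, anonymized citizen records; every field name is invented.
residents = [
    {"id": 1, "age": 82, "district": "north", "lives_alone": True,  "condition": "dementia"},
    {"id": 2, "age": 79, "district": "north", "lives_alone": False, "condition": None},
    {"id": 3, "age": 88, "district": "south", "lives_alone": True,  "condition": "dementia"},
]

# Segment: elderly residents of one district who live alone and
# need specialist care.
segment = [
    r for r in residents
    if r["age"] >= 75 and r["district"] == "north"
    and r["lives_alone"] and r["condition"] == "dementia"
]
print(len(segment))  # an input to staffing and service planning
```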
Negative Implications
- Big Brother: Governments are sensitive to the perception that data is being used to investigate and monitor individuals, and the storage and analysis of data by government has long provoked a strong public reaction. However, the enactment of information transparency legislation and freedom-of-information policies, together with the formation of public watchdog sites, has created a more encouraging environment for governments to pursue Big Data.
- Implementation Hurdles: Implementing Big Data requires a holistic effort that goes beyond adopting a new technology. Everything from effectively identifying data that can be combined and analyzed to securely managing that data over its lifetime must be carefully planned.
Where to Start?
We’ve distilled a number of important lessons from around the web that could guide your Big Data implementation:
- Focus first on requirements: Decision makers are encouraged to look for the low-hanging fruit; in other words, situations with a pressing need for Big Data solutions. Big Data is not a silver bullet, and target implementations should be evaluated thoroughly.
- Start small: Care should be taken to manage stakeholder expectations before Big Data takes on the image of a large, disruptive technology in the workplace. Focusing on small pilot projects that show tangible, visible benefits is the best way to go, and such projects often pave the way for much larger ones down the line. Extending the pipeline of Big Data projects also gives technology stakeholders time to get over the learning curve of adoption.
- Reuse infrastructure: Big Data technologies can happily coexist with relational database systems on conventional infrastructure in existing IT environments.
- Obtain high-level support: Big Data delivers the greatest benefits in performance and cost savings when different systems are combined. But with that kind of endeavor come greater complexity and risks from differing priorities. Managing this challenge requires the appointment of senior stakeholders who can align priorities and provide the visibility needed to keep moving forward.
- Push for standardization and educate decision makers: The Policy Exchange, a UK think tank, recommends that “… public sector leaders and policymakers are literate in the scientific method and confident combining big data with sound judgment.”
- Address ethical issues first: A major obstacle to adopting Big Data is pressure from individuals who do not wish to be tracked, monitored or singled out. Governments should tackle this issue head-on by developing a code for responsible analytics.
Useful Links
Article: InformationWeek on Microsoft’s Big Data strategy.
Paper: UPenn research paper on the development of Big Data.
Report: Research Trends on the evolution of Big Data as a research topic.
Whitepapers: Cloudera on government implementations.
Article: Big Data’s success in government.
Article: UK government in talks to use Hadoop.
Paper: UK Government Digital Strategy.
Paper: US Federal Government Big Data strategy.
Article: Big Data in government.
Article: The National Weather Service’s use of Big Data.
Research: McKinsey Global Institute paper on Big Data.
Report: Policy Exchange report on Big Data.