Creative forces launch innovation and provide alternatives to a market where choices may not be all that dissimilar. This couldn’t be more apparent than with the current range of alternatives (BigData, NoSQL, and the analytic appliance, or ADBMS) looking to displace the de facto RDBMS. The King is dead…long live the new King; or, as some call it, cyclical history in progress.
It’s not that the current disruption is lacking in technical or business rationale; rather it’s the misguided approach some have taken by ruling out the applicability of the traditional dbms. Over the past 20+ years the rdbms space has become dominated by the likes of Oracle, Microsoft, and IBM, who have exerted a sense of entitlement in creating a “one size fits all” offering to serve a broad range of applications.
This path to generalized optimization has manifested itself in bloated functionality and, worse yet, in scalability constraints, all at the expense of the captive user community. The cumbersome and primitive methods of data access and storage that gave way to the mainstream adoption of the rdbms are not that far removed from what is motivating organizations to seek out alternatives today. It may be worth a momentary step back to understand the real problem, so we don’t reconstruct what we are trying to replace.
MySQL gave organizations that first sense of freedom as a low-cost, lightweight alternative that could be supported on commodity hardware platforms. For many this removed the licensing constraints of the enterprise dbms that inhibited them from addressing growing performance overhead on their transactional systems. They could move this activity to what was being termed a “distributed server farm” to perform tasks like ad-hoc querying, data mining, and online shopping. Innovative at the time, and compelling to many clients I worked with.
Point of clarification: when I refer to commodity hardware, I’m not talking about clustering together recycled 386 desktops, but about moving away from supercomputers that scale up and toward clusters of physical hardware that can scale out using more elastic methods like virtualization.
BigData
The “BigData” problem being bandied about today is not a new problem; many organizations have been working with and analyzing terabytes and petabytes of data for some time. AT&T and Bell Labs developed the Daytona project outside the confines of a traditional dbms to tackle the analysis of their growing volumes of call data.
What this is becoming more about is the openness and availability of the data, along with the increasing number of sources and devices generating streams of data we will need to make sense of. Moreover, unlike AT&T, organizations don’t have to build their own hardware or software infrastructure; the cloud combined with open source projects for the most part takes care of this.
Amazon supplies us with a public cloud that gives us a scalable server infrastructure at a fractional cost. BigTable and MapReduce (Google), Hadoop (Yahoo), Cassandra (Facebook), and Dynamo (Amazon) have made available the solutions used to manage their BigData problems, either as published designs or as open source projects. Commercially, IBM has put its collective effort into working with large data problems, and has generated plenty of talk around Bigsheets.
The Hadoop project in particular has garnered the most attention. Built on the framework of the Hadoop file system for storage and MapReduce for data processing, it has evolved into an ecosystem that scales with our data storage and processing needs. From HBase for analysis to Hive and Pig, which simplify data queries, Hadoop is on course to righting the data problem.
To learn and understand the data, the methods used to store and process it are critical. The school bell rang in my ear while I was prototyping statistical models against a number of medium-sized data sets, ranging from 500 GB to 1 TB in size and containing 50+ columns per row. Added into the mix was R, an open source statistical language, to process, data mine, and predict possible outcomes. To get to this point I needed to build a dbms, create a data model, load data, aggregate data, and extract and flatten data so R could process it…STOP and RESET.
Refine the approach and think again about what my initial problem was: understanding and learning the patterns in my data. I took the problem to the cloud and utilized the Amazon EMR platform to load, process, and flatten the data. EMR is constructed on the combined framework of Hadoop and MapReduce and provides APIs that can be interfaced using Python to perform sort, compare, map, and reduce routines and to output aggregated results that I could then analyze in R. All completed with credit card in hand, for around 10 dollars and under 45 minutes of processing time.
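The map, sort, and reduce routines mentioned above can be sketched locally in plain Python. This is a minimal illustration of the pattern, not the job I actually ran: the two-column input layout (group, amount), the function names, and the sample rows are all hypothetical, and on a real cluster the framework would handle the sort and distribution between the map and reduce steps.

```python
from itertools import groupby
from operator import itemgetter


def map_record(line):
    """Map step: emit one (key, value) pair from a comma-separated line.

    The (group, amount) layout is a hypothetical example.
    """
    group, amount = line.strip().split(",")
    yield group, float(amount)


def reduce_group(key, values):
    """Reduce step: aggregate every value seen for one key."""
    values = list(values)
    return key, len(values), sum(values)


def run_job(lines):
    # Map: turn each input line into key/value pairs.
    pairs = [pair for line in lines for pair in map_record(line)]
    # Sort: bring identical keys together, as the framework would.
    pairs.sort(key=itemgetter(0))
    # Reduce: one aggregated row per key.
    return [
        reduce_group(key, (value for _, value in grouped))
        for key, grouped in groupby(pairs, key=itemgetter(0))
    ]


def to_flat_rows(results):
    """Flatten the aggregates into CSV text that R can read."""
    rows = ["group,count,total"]
    rows += [f"{key},{count},{total}" for key, count, total in results]
    return "\n".join(rows)


sample = ["a,1.5", "b,2.0", "a,2.5"]
results = run_job(sample)
print(to_flat_rows(results))
```

The flat output is the point: one aggregated row per key, in a shape that can be loaded into R without building a data model first.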
I was able to solve my computational problem by prototyping on Amazon EMR in minimal time, and was isolated from many of the complexities and limitations that can exist with an internally managed Hadoop distributed architecture. Being a batch problem, this can exist and grow in the cloud for some time; a point will be reached where, from a cost perspective, it makes sense to move to dedicated hardware. I guess that’s why companies like Cloudera exist: to help organizations when that time arrives.
My problem isn’t alone in being outside the typical web data use case that Hadoop is being applied to. BioTech, and more specifically bioinformatics, is greatly benefiting from this framework to solve its growing data needs.
NoSQL
1. A movement to make a software engineer’s life easier when dealing with modernized data.
2. Post-relational distributed computing; and, as Michael Stonebraker has described it, breaking down the overhead associated with maintaining the consistency (ACID) properties of a transaction.
3. Tossing out a legacy, overused query language that adds to the overhead of managing data transactions.
One and two: yes. Three: yes and no. Having dealt with point-in-time analysis using SQL, I see it as a good example of where alternative approaches are required. I’m impressed with how SQLStream is addressing this problem.
There are a number of alternatives being released to solve these data scaling problems; MongoDB (document), Redis (key-value), and Neo4j (graph) are the ones I probably get asked about most often. Each has its advantages, but more importantly each brings an open discussion to a hard problem. Consider how the alternative manages consistency, persistence, and availability of the data. How does that fit into the application requirements? The thinking that it’s open source and can be customized will be a full circle trip to the same problems that the rdbms would have caused.
I was introduced to the NoSQL term, and more so to CouchDB, while on a data mining project. It was a case of two projects converging: one analyzing test results, another sharing the test results and supporting documents across a number of labs. The problem spelled out: high transaction volumes, and a data structure that didn’t fit into a relational model without frequent altering.
A sought-out opinion became a quick introduction to the CouchDB architecture:
Not an rdbms: a distributed document database with a JSON interface, schema free, maintaining both unstructured and structured data, with data access through a number of open source methods.
I ask:
What are they getting that a traditional dbms can’t provide, and what are they giving up?
This was an environment with a number of research documents along with 100+ million rows of generated test results that needed to be accessed more frequently, and structures that could change based on the test being executed. Managing and constantly updating a relational data model wasn’t in the plans; nor was the resource to take this on.
The CouchDB platform gave them a distributed, replicated data environment to handle requests as well as transactional volume; and while it does preserve the ACID properties, ensuring both data consistency and availability, there was a willingness to work in an eventually consistent state (BASE) for the performance gain. This wasn’t a dbms replacement; it was a new problem that until now couldn’t be solved easily.
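To make the schema-free point concrete, here is a minimal sketch using plain Python dictionaries and the standard json module rather than any CouchDB client. The document ids, field names, and values are hypothetical; the only thing it is meant to show is that two test results with different structures can sit side by side as JSON documents without a shared table definition being altered first.

```python
import json

# Two hypothetical test-result documents; field names are illustrative only.
docs = [
    {"_id": "result-001", "test": "assay_a", "lab": "lab-1", "reading": 0.82},
    {
        "_id": "result-002",
        "test": "assay_b",
        "lab": "lab-2",
        "readings": [0.4, 0.6],
        "attachments": ["protocol.pdf"],
    },
]


def fields_by_test(documents):
    """Collect the set of fields used by each test type."""
    shapes = {}
    for doc in documents:
        shapes.setdefault(doc["test"], set()).update(doc.keys())
    return shapes


# Each document serializes to JSON as-is; no schema change is needed
# when a new test introduces new fields.
payloads = [json.dumps(doc, sort_keys=True) for doc in docs]

shapes = fields_by_test(docs)
extra_fields = sorted(shapes["assay_b"] - shapes["assay_a"])
print(extra_fields)
```

In a relational model, the fields that appear only for the second test would each mean another schema alteration; that is the maintenance burden the labs wanted to avoid.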
Analytic Database
Disrupting a space that has long been dominated by Teradata, this new breed of DBMSs has set out to change the economies of scale for many organizations needing to perform analysis and proactive learning on their BigData.
Recently it was described as over-crowded; I like the competitive field and find that it produces much more open discussion, and innovation for that matter. Leveraging either the scalability of standard commodity hardware or optimized devices for performance advantages, and in many cases built on some foundation of the PostgreSQL open source dbms engine, Netezza (relational), Greenplum (relational), Vertica (columnar), Aster Data (relational), and ParAccel (columnar) are now the growing forces in the space.
This is leading the way to a steady stream of product evolution and innovation: massive parallelism; in-memory processing; integration of the Hadoop and MapReduce framework to minimize data processing; deployment of their platforms in the cloud; support for solid-state storage (SSD), leading to significant performance gains in data access; and a reduction of the data movement bottleneck with the hybrid data-application server introduced by Aster Data, which embeds processing logic into the database engine.
Much of the focus to date has been on reducing the access and processing time on the stored data; at some point attention will need to shift to figuring out how to efficiently handle the blend of loading large batch files with real-time data streams.
Co-existing going forward
Neither the rdbms nor SQL is going away anytime soon, for the simple fact that these alternatives were not built as replacements for the dbms but rather to address new data problems. I wouldn’t recommend pulling the rug from under the dbms managing your organization’s financial or ERP systems; there are plenty of best practices around addressing performance concerns with these applications. Most replacements today are due to standardization on a single dbms platform.
I see the offerings being developed by RethinkDB and Drizzle, which are lightening the MySQL framework to improve scalability in their products, as the model or approach going forward. Functionality in these alternatives will be adopted by the dbms vendors to address shortcomings in their own products. The MapReduce model is now being included in Teradata, and I wouldn’t be surprised if Oracle released some integration with the MapReduce/Hadoop framework to meet customer demand. The addition of schema-free support to MySQL may have changed the direction of the project referenced above.
Many of the NoSQL vendors are wrapping a SQL-like interface around their platform to simplify access; but how much functional bloat can be taken on before the platform deviates from its applicable use? Rather than getting caught up in the CAP theorem discussions, or worse the coolness factor, remember this is about living with trade-offs and labeling data as a commodity.
Measuring your Measures a BI Afterthought
“An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts – for support rather than for illumination.” – Andrew Lang though unattributed
A recent blog post highlighting the importance of a Business Intelligence Competency Center (BICC) shed some light on establishing value and trust in the data we use to measure results.
The difference is in “how” an organization comes to the realization that a business strategy must be defined to direct the BI initiative if it is to attain any ongoing value. Partial failure, identified early on, can be an invaluable source for the needed measurements and process monitoring; prolonged, the process will reach a state of total failure, leaving few remnants to gather value from.
Absence of Strategy?
An all too familiar recurring task for many: a multitude of reports are whittled down to the required columns; data is manipulated and reformatted in Excel, then distributed to stakeholders without appropriate documentation to support the question being answered; data quality is suspect; and the data produced, now in doubt and lacking in value and context, is misinterpreted.
or
Resembling the Sunday Times, the same reports are bundled up and circulated among the stakeholders, who individually analyze the data, duplicating the effort of manipulating and summarizing it and leaving the data and its message in a state similar to what Dante described in his journey through the 8th Circle.
Is one more detrimental than the other to the business? In the first scenario the data is summarized to ease consumption at the cost of context; the second leaves its audience to sort out the details, resulting in the confusion of repeated manipulation and messaging. The conclusion, though, is identical: neither provides the beginnings of, nor builds upon, a sustainable business intelligence strategy.
So why and how does the process, or lack of one, get to this point? Simply stated, either the question was not sufficiently answered or the question being asked was not understood. The business may be efficiently utilizing the data, and IT may be effectively maintaining all the pieces of the BI environment; what has been lost is the point of reference that could curtail the bad habits that form when people attempt to consume data. It now becomes a lesson in improving the organization’s understanding, communication, and measurement of what can be and is being done with the data today.
Fostering a BI Strategy
A fresh approach, or fixing what’s in place? While new does have its advantages, either path must still identify and shape the strategy from the business goals and needs captured. It’s crucial that these goals become the blueprint for data democratization, and that the outcome empowers the organization with the ability to learn from the data, process transparency, measurement of the effectiveness of the answers, and trust in the data produced.
STATIC it is not. Goals, questions, and measurements will evolve over time; to avoid the scenarios above, the strategy must be dynamic enough to optimize and adapt as well. This is not a list of tasks, capabilities, or a technical description of the BI environment; and while there is a starting point, there is no predetermined end point.
TECHNICAL it is not. To embed technology or solutions into the strategy would only inhibit its growth; promote data silos; and lead an organization down the road to failure.
ORGANIC it is. It’s critical that the process, data, and effectiveness of the measures are made transparent; this will enable the organization to build on the foundation from the ground up. Questions will be evaluated and determined to be inefficient, redundant, or no longer applicable. From this, new goals, measurements, and data will emerge.
ITERATIVE it is. Whether starting off or salvaging what exists, the design of the strategy and how it is executed must be agile. Start by focusing on a specific business problem or question (e.g. sales and opportunities) to test out how the strategy takes form. As the process matures, measuring and tracking the consumption of measurements and questions will identify which questions can be weeded out based on a state of inactivity.
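Weeding out questions based on inactivity can be as simple as tracking when each one was last consumed. The sketch below is only an illustration under assumed inputs: the question names, the dates, and the 90-day idle threshold are all hypothetical, and a real implementation would pull the last-used dates from report usage logs.

```python
from datetime import date, timedelta


def inactive_questions(last_used, today, max_idle_days=90):
    """Return the questions not consumed within the idle window.

    max_idle_days is an assumed threshold, not a recommendation.
    """
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(q for q, used in last_used.items() if used < cutoff)


# Hypothetical questions and the date each was last consumed.
last_used = {
    "pipeline by region": date(2024, 5, 20),
    "win rate by rep": date(2024, 1, 10),
    "legacy quota report": date(2023, 9, 1),
}

stale = inactive_questions(last_used, today=date(2024, 6, 1))
print(stale)
```

Questions that show up as stale are candidates for review with the business, not automatic deletions; some measures are only consumed quarterly or annually.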
Transitioning Strategy to the Tactical
Where should the strategy live? A document or collection of guidelines it is; but it must also be measured against and updated, collaborative, and transparent to all stakeholders. It must be able to collect and capture the required data, mapping the relationships and workflow between goals and questions asked, performance indicators, and measurements.
Reconcile, reconcile, and reconcile. This is an exercise in data discovery: reviewing all available reports, identifying redundancies, asking whether each report is ever consumed, surfacing undocumented calculations and measurements, and connecting the dots back to a specific business question or process. The purpose here is to reduce output, bring meaning to the data produced, and make what exists usable.
Construct a Data Presentation Architecture program going forward. A mix of business intelligence roles and skill sets, it directly contributes to the optimization and success of the BI strategy. It’s designed to bring relevance to the data, communicate meaning, and produce actionable results.
Both the growth of data and the need to analyze the information are increasing daily. It will likewise be imperative to measure, validate, and adjust with this growing need, to ensure the right data is being applied to the right question.
Posted in Business Intelligence, Case Study, Commentary
Tagged Accuracy, Business Intelligence, Data Governence, Measurement, Strategy, Sustainability