Creative forces launch innovation and provide alternatives to a market where choices may not be all that dissimilar. This couldn’t be more apparent than with the current range of alternatives (BigData, NoSQL, and the analytic appliance, or ADBMS) looking to displace the de facto RDBMS. The King is dead…long live the new King; or, as some call it, cyclical history in progress.
It’s not that this current disruption is lacking in technical or business rationale; rather, it’s the misguided approach some have taken by ruling out the applicability of the traditional dbms. The rdbms space over the past 20+ years has become dominated by the likes of Oracle, Microsoft, and IBM, who have exerted a sense of entitlement in creating a “one size fits all” offering to serve a broad range of applications.
This path to generalized optimization has manifested itself in bloated functionality and, worse yet, in scalability constraints, all at the expense of the captive user community. The cumbersome and primitive methods of data access and storage that gave way to the mainstream adoption of the rdbms are not that far removed from what is motivating organizations to seek out alternatives today. It may be worth a momentary step back to understand the real problem, so we don’t reconstruct what we are trying to replace.
MySQL gave organizations that first sense of freedom as a low-cost, lightweight alternative that could be supported on commodity hardware platforms. For many this removed the licensing constraints of the enterprise dbms that inhibited them from addressing growing performance overhead on their transactional systems. They moved this activity to what was being termed a “distributed server farm” to perform tasks like ad-hoc querying, data mining, and online shopping. Innovative at the time, and compelling to many clients I worked with.
A point of clarification: when I refer to commodity hardware, I’m not talking about clustering together recycled 386 desktops, but about moving away from supercomputers that scale up and toward clusters of physical hardware that can scale out using more elastic methods like virtualization.
BigData
The “BigData” problem being bandied about today is not a new problem; many organizations have been working with and analyzing terabytes and petabytes of data for some time. AT&T and Bell Labs developed the Daytona project outside the confines of a traditional dbms to tackle the analysis of their growing volumes of call data.
What this is becoming more about is the openness and availability of the data, along with the increasing number of sources and devices generating streams of data we will need to make sense of. Moreover, unlike AT&T, organizations don’t have to build their own hardware or software infrastructure; the cloud combined with open source projects for the most part takes care of this.
Amazon supplies us with a public cloud that gives us a scalable server infrastructure at a fraction of the cost. Google (BigTable, MapReduce), Yahoo (Hadoop), Facebook (Cassandra), and Amazon (Dynamo) have made available the solutions used to manage their BigData problems, either as published designs or as open source projects. Commercially, IBM has put its collective effort into working with large data problems, and has generated plenty of talk around BigSheets.
The Hadoop project in particular has garnered the most attention. Built on the framework of the Hadoop file system for storage and MapReduce for data processing, it has evolved into an ecosystem that scales with our data storage and processing needs. From HBase for analysis to Hive and Pig, which simplify data queries, Hadoop is on course to righting the data problem.
To learn and understand the data, the methods used to store and process it are critical. The school bell rang in my ear while I was prototyping statistical models against a number of medium-sized data sets, ranging from 500 GB to 1 TB in size and containing 50+ columns per row. Added into the mix was R, an open source statistical language, to process, data mine, and predict possible outcomes. To get to this point I needed to build a dbms, create a data model, load data, aggregate data, and extract and flatten data so R could process it…STOP and RESET.
Refine the approach, and think about what my initial problem was again: understanding and learning the patterns in my data. I took the problem to the cloud, utilizing the Amazon EMR platform to load, process, and flatten the data. EMR is constructed on the combined framework of Hadoop and MapReduce, providing APIs that can be interfaced using Python to perform sort, compare, map, and reduce routines and to output aggregated results that I could then analyze in R. All completed with credit card in hand, for around 10.00 dollars and under 45 minutes of processing time.
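To make the map, sort, and reduce steps concrete, here is a minimal sketch in Python of the kind of routine described above. It is not the code I ran on EMR: the record layout, the sample values, and the function names (mapper, reducer, run_local) are all assumptions made up for illustration, and the local sort simply stands in for the shuffle step that Hadoop performs between map and reduce.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical tab-separated records: group key, source, measurement.
SAMPLE_LINES = [
    "t1\tsrcA\t2.0",
    "t2\tsrcB\t4.0",
    "t1\tsrcB\t6.0",
    "t2\tsrcA\t8.0",
]


def mapper(lines):
    """Emit one (key, value) pair per input record."""
    for line in lines:
        key, _source, measurement = line.rstrip("\n").split("\t")
        yield key, float(measurement)


def reducer(pairs):
    """Aggregate values per key; the input must already be sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _key, value in group]
        yield key, len(values), sum(values) / len(values)


def run_local(lines):
    """Local stand-in for the sort/shuffle that sits between map and reduce."""
    mapped = sorted(mapper(lines), key=itemgetter(0))
    return list(reducer(mapped))


if __name__ == "__main__":
    for key, count, mean in run_local(SAMPLE_LINES):
        # Flat, tab-separated rows: key, record count, mean value.
        print(f"{key}\t{count}\t{mean}")
```

The flattened, tab-separated output is the sort of file that can then be pulled into R for the actual modeling.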
I was able to solve my computational problem by prototyping on Amazon EMR in minimal time, and was isolated from many of the complexities and limitations that can exist with an internally managed Hadoop distributed architecture. Being a batch problem, this can exist and grow in the cloud for some time; a point will be reached where, from a cost perspective, it makes sense to move to dedicated hardware. I guess that’s why companies like Cloudera exist: to help organizations when that time arrives.
My problem isn’t alone in being outside the typical web data use case that Hadoop is being applied to. Biotech, and more specifically bioinformatics, is benefiting greatly from this framework to solve its growing data needs.
NoSQL
1. A movement to make a software engineer’s life easier when dealing with modernized data.
2. Post-relational distributed computing; and, as Michael Stonebraker has described it, breaking down the overhead associated with maintaining the consistency (ACID) properties of a transaction.
3. Tossing out a legacy, overused query language that adds to the overhead of managing data transactions.
One and two: yes. Three: yes and no. Having dealt with point-in-time analysis using SQL, I find it a good example of where alternative approaches are required. I’m impressed with how SQLStream is addressing this problem.
There are a number of alternatives being released to solve these data scaling problems; MongoDB (document), Redis (key-value), and Neo4j (graph) are the ones I probably get asked about most often. Each has its advantages, but more importantly each brings an open discussion to a hard problem. Consider how the alternative manages consistency, persistence, and availability of the data. How does that fit into the application requirements? The thinking that it’s open source and can be customized will be a full-circle trip to the same problems that the rdbms would have caused.
I was introduced to the NoSQL term, and more so CouchDB, while on a data mining project. It was a case of two projects converging: one analyzing test results, another sharing the test results and supporting documents across a number of labs. The problem spelled out: high transaction volumes, and a data structure that didn’t fit into a relational model without frequent altering.
A sought-out opinion became a quick introduction to the CouchDB architecture:
Not an rdbms; a distributed document database with a JSON interface, schema free, maintaining both unstructured and structured data, with data access through a number of open source methods.
I ask:
What are they getting that a traditional dbms can’t provide, and what are they giving up?
This was an environment with a number of research documents along with 100+ million rows of generated test results that needed to be accessed more frequently, and structures that could change based on the test being executed. Managing and constantly updating a relational data model wasn’t in the plans, nor were the resources to take this on.
The CouchDB platform gave them a distributed/replicated data environment to handle requests as well as transactional volume; and while it does preserve the ACID properties, ensuring both data consistency and availability, there was a willingness to work with an eventually consistent state (BASE) for the performance gain. This wasn’t a dbms replacement; it was a new problem that until now couldn’t be solved easily.
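The schema-free point is easier to see with a small example. The sketch below is plain Python over JSON-style documents, not the CouchDB API; the documents, field names, and helper functions (map_by_test, build_view, query) are hypothetical. It only illustrates the idea that records for different tests can carry different fields, and that a map-style function emitting key/value pairs can still index them without a fixed table definition.

```python
import json

# Hypothetical documents: each test type carries its own set of fields.
DOCS = [
    {"_id": "r1", "test": "assay", "lab": "north", "ph": 7.1},
    {"_id": "r2", "test": "sequencing", "lab": "south", "reads": 1200, "quality": 0.98},
    {"_id": "r3", "test": "assay", "lab": "south", "ph": 6.8, "notes": "rerun"},
]


def map_by_test(doc):
    """Map-style view function: emit (key, value) for documents that qualify."""
    if "test" in doc:
        yield doc["test"], doc["_id"]


def build_view(docs, map_fn):
    """Apply the map function to every document and sort the emitted rows by key."""
    rows = [row for doc in docs for row in map_fn(doc)]
    return sorted(rows)


def query(view, key):
    """Return the document ids emitted under one key."""
    return [doc_id for k, doc_id in view if k == key]


if __name__ == "__main__":
    # Round-trip through JSON text to show the documents are ordinary JSON.
    stored = [json.loads(json.dumps(doc)) for doc in DOCS]
    view = build_view(stored, map_by_test)
    print(query(view, "assay"))
```

Adding a new test type with new fields means adding documents, not altering a relational model, which was the trade the labs were after.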
Analytic Database
Disrupting a space that has long been dominated by Teradata, this new breed of DBMSs has set out to change the economies of scale for many organizations needing to perform analysis and proactive learning on their BigData.
Recently it was described as over-crowded; I like the competitive field and find that it produces much more open discussion, and innovation for that matter. Leveraging either the scalability of standard commodity hardware or optimized devices for performance advantages, and built on some foundation of the PostgreSQL open source dbms engine, Netezza (relational), Greenplum (relational), Vertica (columnar), Aster Data (relational), and ParAccel (columnar) are now the growing forces in the space.
This has led to a steady stream of product evolution and innovation: massive parallelism; in-memory processing; integration of the Hadoop and MapReduce framework to minimize data processing; deployments of their platforms in the cloud; support for solid-state storage (SSD), leading to significant performance gains in data access; and a reduction of the data movement bottleneck with the hybrid data-application server introduced by Aster Data, which embeds processing logic into the database engine.
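The relational (row) versus columnar labels attached to these vendors come down to how the data is laid out. As a rough sketch only, with a made-up three-row table and hypothetical function names, the Python below contrasts the two layouts for a single-column aggregate. It says nothing about how any of the named products are actually implemented.

```python
# Hypothetical table of three rows, stored row by row.
ROWS = [
    {"customer": "a", "region": "east", "amount": 10.0},
    {"customer": "b", "region": "west", "amount": 20.0},
    {"customer": "c", "region": "east", "amount": 30.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_store(rows, field):
    # Row layout: every full record is visited to read a single field.
    return sum(row[field] for row in rows)


def total_column_store(columns, field):
    # Column layout: only the one column needed by the query is read.
    return sum(columns[field])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_row_store(ROWS, "amount"))
    print(total_column_store(columns, "amount"))
```

Both functions return the same total; the difference is how much data has to be touched to get there, which is why the columnar layout tends to suit analytic queries that aggregate a few columns across many rows.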
Much of the focus to date has been on reducing the access and processing time on the stored data; at some point attention will need to shift to figuring out how to efficiently handle the blend of loading large batch files with real-time data streams.
Co-existing going forward
Neither the rdbms nor SQL is going away anytime soon, for the simple fact that these alternatives were not built as replacements for the dbms but rather to address new data problems. I wouldn’t recommend pulling the rug from under the dbms managing your organization’s financial or ERP systems; there are plenty of best practices around addressing performance concerns with these applications. Most replacements today are due to standardization on a single dbms platform.
I see the offerings being developed by RethinkDB and Drizzle, which are lightening the MySQL framework to improve scalability in their products, as the model or approach going forward. Functionality in these alternatives will be adopted by the dbms vendors to address shortcomings in their own products. The MapReduce model is now being included in Teradata, and I wouldn’t be surprised if Oracle released some integration with the MapReduce/Hadoop framework to meet customer demand. The addition of schema-free support to MySQL may have changed the direction of the projects referenced above.
Many of the NoSQL vendors are wrapping a SQL-like interface around their platform to simplify access; but how much functional bloat can be taken on before the platform deviates from its applicable use? Rather than getting caught up in CAP theorem discussions, or worse, the coolness factor, remember this is about living with trade-offs and labeling data as a commodity.