Intelligence On Open Source?

The rise of business intelligence and business analytics, which aim to improve corporate performance and boost profits by crunching the behavior patterns of millions of consumers at a time, has spurred interest in super-large data warehouses.
Nearly 15 petabytes of data are created every day, eight times more than the information held in all the libraries in the U.S. Keep in mind that every Web 2.0 company is ultimately a database company (Amazon, eBay, YouTube, Facebook, Yahoo, Google...).

  • eBay Inc. reportedly operates databases that process 10 billion records per day while also supporting deep business analysis; collectively they store more than 6 petabytes of data.
  • The National Energy Research Scientific Computing Center's archives include 3.5 petabytes of atomic energy research data.
  • The World Data Centre for Climate in Hamburg, Germany, holds 220 terabytes of data.
  • The Internal Revenue Service's data warehouse is sized at about 150 terabytes.
  • Yahoo, Google, and Facebook all run data warehouses measured in petabytes!
But did you know that...
  • The world's single largest data warehouse, Yahoo Inc.'s 2-petabyte installation, runs on a free and open-source engine (PostgreSQL)?
  • Yahoo Inc.'s transactional databases are based on another free and open-source engine, MySQL, running on a free and open-source operating system (FreeBSD)?
  • Facebook runs on 1,800 MySQL database servers, 10,000 web servers, and 805 Memcached servers, all of them Linux-based (free) servers? (A sketch of how the Memcached tier shields the databases follows this list.)
  • Google runs its superfast search engine on grids of low-cost PCs running free and open-source software, and relies on the free and open-source MySQL for key systems such as its advertising platform?
  • YouTube's one-billion-views-per-day engine is based on the free and open-source database engine MySQL?
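The Facebook stack above pairs its MySQL servers with a Memcached tier precisely so that most reads never touch a database at all. Below is a minimal, hypothetical cache-aside sketch in Python; the host addresses, the users table, and the choice of the pymemcache and mysql-connector-python client libraries are illustrative assumptions, not a description of Facebook's actual code.

    # Cache-aside reads: try Memcached first, fall back to MySQL on a miss.
    import json

    import mysql.connector
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))        # one of the Memcached servers
    db = mysql.connector.connect(               # one of the MySQL servers
        host="127.0.0.1", user="app", password="secret", database="appdb"
    )

    def get_user(user_id):
        """Serve the read from cache when possible; hit MySQL only on a miss."""
        key = "user:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)           # cache hit: zero database work

        cur = db.cursor(dictionary=True)
        cur.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
        cur.close()

        if row is not None:
            cache.set(key, json.dumps(row), expire=300)  # cache for 5 minutes
        return row

On a cache hit the database does no work at all, which is how a few hundred Memcached boxes can shield a much larger, and much more expensive to scale, MySQL tier.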
The reason behind this move is not far-fetched: Total Cost of Ownership!

For a business to make solid, intelligent decisions, it needs large amounts of data from various sources, which traditionally would trigger the need for a large data warehouse running on supercomputer-class hardware. However, such products on the market are super-expensive, and this often puts the ROI of the initiative out of consideration.

  • Not only is the hardware needed for such an initiative expensive, but the software pricing from the big vendors is also huge!
  • The fact that such pricing is usually proportional to the processing power of the hardware further jeopardizes the business case.

Little wonder that the companies with the largest amounts of data in the world have had to be innovative enough to come up with counter-measures (let's just say alternatives!).
Their answer to this challenge: total elimination. Eliminate the super-expensive supercomputers and the super-expensive software altogether.

Their approach:
  • Deploy a large number of small computers to form a grid or cluster.
  • Use free and open-source databases (usually the component that costs the most) and get support on bug fixes via public forums.
  • Use highly redundant, cheap commodity components (added capacity becomes a low incremental cost).
  • Manage the cluster (grid) as a single unit (a minimal amount of human resources is needed to keep the solution up and running); see the sketch after this list.
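As a rough illustration of how such a grid can be addressed as a single unit, the sketch below hashes each record key to one of several cheap database nodes. The host names are made up and the simple modulo hashing is a deliberate simplification; real deployments typically use consistent hashing or a routing tier so that adding a node only relocates a small fraction of the keys.

    # Route each record key to one node so the grid behaves like one database.
    import hashlib

    # Hypothetical grid of cheap commodity database nodes.
    SHARDS = [
        {"host": "db-node-01", "port": 3306},
        {"host": "db-node-02", "port": 3306},
        {"host": "db-node-03", "port": 3306},
        {"host": "db-node-04", "port": 3306},
    ]

    def shard_for(key):
        """Pick the node responsible for a given key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Added capacity is a low incremental cost: append another cheap node to
    # SHARDS (and rebalance the affected keys) instead of buying a bigger box.
    if __name__ == "__main__":
        for user_id in ("1001", "1002", "1003"):
            node = shard_for("user:" + user_id)
            print("user:%s -> %s:%d" % (user_id, node["host"], node["port"]))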