Hypertable

Seen that? - Open-source alternatives to BigTable

Filed in archive Best of on October 3, 2009

Open-source alternatives to BigTable at Googlestack

Disclosure: I am a Hypertable MapReduce contributor. This post is not meant to increase the project's publicity, and the opinions presented here are my own. One of the most important Google stack components is BigTable, the underlying data store. With its proven scalability and availability characteristics, it is no wonder that sooner or later an open-source alternative would be created. I have already described [...]


Columns versus Rows at Googlestack

In an attempt to bring column-oriented storage closer to readers, I've found a series of really great explanations of these two distinct storage schemes. While still theoretical, they will definitely serve as a good and thorough (but not so hard to read) explanation of the subject matter. Once we get through these, we'll see some more practical things in the next posts. Here's the list [...]


Outline of The Google Software Stack at Googlestack

Introduction: The beginning of the research on software scalability spurred the research on fault tolerance and management technologies. After all, if you have a lot of software running on hundreds or even thousands of machines you have to know what's going on inside this set of machines, and when things break, you want the software to take care of it. These concepts, scalability, fault tolerance and software management, are so inherently [...]


Open Source Testing Tools at The CIO Weblog

According to a post at the O'Reilly Weblog by Kevin Shockey, SpikeSource will host their first TestFest at their Redwood City headquarters, with free food, free drinks, and free presentations, on June 17, 2005 at 3:00pm. I wish I could attend. Register Here. SpikeSource, Inc. has brought to market a set of tools. The recently announced availability of an automated testing service should prove to be an irreplaceable tool [...]


We screwed up on open source, says Sun Chief Open Source Officer at Java Entrepreneur

Yes you read it absolutely correctly; this is what Simon Phipps, Sun's Chief Open Source Officer said recently in an interview. Check out what he had to say: Open source developers have been much more skeptical of Sun; a lot of open source developers don't remember the fact that Sun was pretty much the first open source start-up in 1982. All they can remember is what happened in 2001/2002 when, to [...]


Hypertable installation guide.

Filed in archive Tools on February 21, 2009

For many months, people complained about how difficult it was to install Hypertable on their machines. The root cause was not Hypertable itself but the many sophisticated dependencies it requires for performance reasons. Time has passed and things have changed: there is now a more or less complete dependency installation guide, down to the gory details, for various Linux flavors as well as the OS X 10.4 and 10.5 families.

For more technical details, please visit the HowToBuild page on the Hypertable Wiki.


Relational vs Key/Value based

Filed in archive Articles on February 14, 2009

Tony Bain from Read Write Web is running an interesting piece that is easy to follow, yet may prove to be an eye-opener. It is brief, concise, and straight to the point in just 3 pages. This is highly recommended reading:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php

I am just wondering why they did not mention Hypertable?


Taking Hypertable 0.9.2.1 for a ride.

Filed in archive Tools on February 4, 2009

This is the first post in a series about tools and how you can leverage them to increase your ability to scale. In this post we're going to look at how to store a large quantity of data in Hypertable. This data store was chosen for a reason, as you will see in a moment.

Consider the following scenario: you have a lot of data exposed at some internet location, and you want a local copy that you can sift through. Perhaps you are building a web crawler and need the data on site rather than in a remote location, to avoid latency when processing it. Let's take Wikipedia as an example. You could get a dump of the whole Wikipedia and then either search through the exported XML file directly, maybe even extracting articles into separate files (which would be inefficient and probably a real pain for your filesystem), or store it in some more clever way. What you need in such a scenario is a data store that can handle this amount of data with ease. It turns out that Hypertable is a good tool for this kind of work.

If you want to follow this example, I'm assuming that you already have Hypertable compiled, installed and configured on some cluster. If you are having trouble, please leave me a note in the comments.

We're going to use the latest data dump of the English version of Wikipedia, which can be downloaded from the following address: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This file is 4.1 gigabytes in size and about 20 gigabytes after decompression, so you'll need at least 25 gigabytes of free space to decompress it. After decompression the compressed file is no longer needed and can be removed.

Downloading the Wikipedia dump can be done with a standard set of tools:



$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# bunzip2 deletes the .bz2 file itself after a successful decompression, so no separate rm is needed
$ bunzip2 enwiki-latest-pages-articles.xml.bz2



Once you are done, grab this script (convert_wiki.py) and save it somewhere.
The next step is to prepare the Python Thrift library so that we can connect to Hypertable from Python.

This can be done by copying the gen-py and hyperthrift directories from the source directory in which you compiled Hypertable. For example:



$ mkdir ~/wiki_example
$ cd ~/wiki_example
$ wget http://meme.pl/convert_wiki.py
$ cp -R ~/hypertable/src/py/ThriftClient/gen-py ~/wiki_example/gen-py/
$ cp -R ~/hypertable/src/py/ThriftClient/hypertable ~/wiki_example/hypertable/
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# bunzip2 deletes the .bz2 file itself after a successful decompression, so no separate rm is needed
$ bunzip2 enwiki-latest-pages-articles.xml.bz2



The next step is to create a wikipedia table. We do so by typing the following command into the hypertable command line utility:



hypertable> create table wikipedia (title,article, ACCESS GROUP meta (title), ACCESS GROUP text (article));



The above command will create a table called wikipedia with just two column families:
  • title - containing the title of the article
  • article - containing the page content

and two access groups:
  • meta
  • text

These two access groups tell Hypertable to store their member column families in separate physical files. With this configuration, titles are stored in a different file from the full text of the articles. This lets us search through titles much faster, because the files containing article text are not read at all during such a search.
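To make the effect of access groups more concrete, here is a toy Python model of the idea. It is only a sketch: the ToyTable class and its methods are invented for illustration, each dictionary merely stands in for one physical file, and none of this reflects how Hypertable is actually implemented.

```python
class ToyTable:
    """Toy model: every access group gets its own 'file' (a dict here)."""

    def __init__(self, access_groups):
        # access_groups maps a group name to its column families,
        # e.g. {"meta": ["title"], "text": ["article"]}
        self.group_of = {
            family: group
            for group, families in access_groups.items()
            for family in families
        }
        self.files = {group: {} for group in access_groups}

    def insert(self, row, family, value):
        # A cell is written only to the file of its family's access group.
        self.files[self.group_of[family]][(row, family)] = value

    def scan(self, family):
        """Return (cells, bytes_read) for a scan of one column family."""
        stored = self.files[self.group_of[family]]
        # Only the file of the matching access group is touched.
        bytes_read = sum(len(value) for value in stored.values())
        cells = sorted(
            (row, value) for (row, fam), value in stored.items() if fam == family
        )
        return cells, bytes_read


table = ToyTable({"meta": ["title"], "text": ["article"]})
table.insert("ZX Spectrum", "title", "ZX Spectrum")
table.insert("ZX Spectrum", "article", "x" * 50000)
table.insert("ZX81", "title", "ZX81")
table.insert("ZX81", "article", "y" * 30000)

titles, bytes_read = table.scan("title")
print(titles, bytes_read)
```

In this model a title scan reads only the few bytes held in the meta group, no matter how large the article bodies in the text group are.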
Once done with this setup, we go back to the command shell and run the conversion script:



$ cd ~/wiki_example
$ env PYTHONPATH="gen-py" python convert_wiki.py



This is the last step, and it can take a few hours depending on your cluster configuration. If everything goes well, you'll have the Wikipedia dump imported into your Hypertable installation.
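The conversion script is only linked from this post, not reproduced. As a rough idea of what its parsing half might look like, here is a minimal sketch that streams (title, text) pairs out of a MediaWiki XML export. The function names iter_articles and local_name, the sample document and its namespace URI are assumptions made for illustration; the actual convert_wiki.py may work quite differently, and the step that writes cells to Hypertable over Thrift is deliberately left out.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, text) pairs from a MediaWiki XML export, one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None:
                yield title, text
            title, text = None, ""
            # Drop the finished page's children so the 20 GB dump
            # is never held in memory all at once.
            elem.clear()


# Tiny stand-in for the real dump; the namespace URI is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Dirichlet kernel</title>
    <revision><text>Some article text</text></revision></page>
  <page><title>ZX Spectrum</title>
    <revision><text>Other article text</text></revision></page>
</mediawiki>"""

for title, text in iter_articles(io.BytesIO(SAMPLE)):
    # In a real import the title would become the row key and both
    # values would be written to the title and article column families.
    print(title, len(text))
```

For the real dump, the open file enwiki-latest-pages-articles.xml would be passed in place of the in-memory sample.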

The import script uses article titles as row keys, so titles are what we query on.
Let's search for articles whose titles start with Dirichlet:



hypertable> select title from wikipedia where row =^ "Dirichlet";
[... output truncated for brevity ...]
Elapsed time: 0.21 s
Avg value size: 22.02 bytes
Avg key size: 23.02 bytes
Throughput: 13245.13 bytes/s
Total cells: 63
Throughput: 294.13 cells/s



We can see there are 63 articles whose titles start with the word Dirichlet.
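A prefix match on the row key is cheap when keys are kept in sorted order, because every key with a given prefix falls into one contiguous range. The Python sketch below illustrates that idea on a plain sorted list. The helper names prefix_to_range and scan_prefix and the sample keys are made up for this example, and it is not a description of how Hypertable executes the =^ operator internally.

```python
import bisect


def prefix_to_range(prefix):
    """Return (start, end) such that start <= key < end holds exactly
    for the keys that begin with prefix."""
    if not prefix:
        raise ValueError("prefix must be non-empty")
    # Bumping the last character gives the smallest string that sorts
    # after every key starting with the prefix. (A sketch: it ignores
    # the corner case of a prefix ending in the highest code point.)
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, end


def scan_prefix(sorted_keys, prefix):
    """Return all keys with the given prefix using two binary searches."""
    start, end = prefix_to_range(prefix)
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


keys = sorted([
    "Dirac delta function",
    "Dirichlet",
    "Dirichlet kernel",
    "Dirichlet's theorem",
    "Dirk",
    "Time",
    "Time zone",
])

print(prefix_to_range("Dirichlet"))
print(scan_prefix(keys, "Dirichlet"))
```

Seen this way, a prefix query for "A" is the same thing as the half-open row range from 'A' up to, but not including, 'B'.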
Running the same type of query for the word Time yields the following results:



hypertable> select title from wikipedia where row =^ "Time";
[... output truncated for brevity ...]
Elapsed time: 0.04 s
Avg value size: 26.74 bytes
Avg key size: 27.74 bytes
Throughput: 4348717.61 bytes/s
Total cells: 3585
Throughput: 79815.66 cells/s



The throughput numbers are really high, but to be honest with readers, I have to admit that this is heavy caching in action: running a query for the first time does not yield such impressive numbers. For example, running the query for articles starting with ZX for the first time yields these results:



hypertable> select title from wikipedia where row =^ "ZX";
[... output truncated for brevity ...]
Elapsed time: 0.19 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 3783.18 bytes/s
Total cells: 35
Throughput: 188.89 cells/s



Now, running it a second time gives us much higher numbers:



Elapsed time: 0.01 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 47612.58 bytes/s
Total cells: 35
Throughput: 2377.23 cells/s



Let's assume, for the sake of example, that you're interested in articles starting with the letter 'A'. You can get these easily by running this query:



hypertable> select title from wikipedia where ('A' <= ROW < 'B');



The numbers you get for the cached and uncached versions are really high in both cases. The initial query results in these performance numbers:



[... output truncated for brevity ...]
Elapsed time: 14.17 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 1005019.91 bytes/s
Total cells: 376198
Throughput: 26557.52 cells/s



Subsequent queries result in this kind of performance:



Elapsed time: 2.77 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 5131230.21 bytes/s
Total cells: 376197
Throughput: 135591.75 cells/s



As you can see for yourself, the numbers are really high, and they could be even better. To be objective, I have to confess that the first test was much slower because its output went to the console over a wireless network, which adds delays when printing results.
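The throughput lines in the outputs above appear to follow from the other reported figures: total cells times the average key plus value size, divided by the elapsed time. This is an observation drawn from the numbers themselves rather than from Hypertable's documentation, and the function name scan_rates is just an illustrative choice. A short Python sketch that recomputes the two 'A' range runs:

```python
def scan_rates(total_cells, avg_key_bytes, avg_value_bytes, elapsed_s):
    """Estimate (bytes/s, cells/s) from the other figures the shell prints."""
    total_bytes = total_cells * (avg_key_bytes + avg_value_bytes)
    return total_bytes / elapsed_s, total_cells / elapsed_s


# Figures copied from the uncached and cached runs of the 'A' range query.
uncached = scan_rates(376198, 19.42, 18.42, 14.17)
cached = scan_rates(376197, 19.42, 18.42, 2.77)

# How much faster the cached run finished, based on elapsed time alone.
speedup = 14.17 / 2.77

print("uncached: %.0f bytes/s, %.0f cells/s" % uncached)
print("cached:   %.0f bytes/s, %.0f cells/s" % cached)
print("speedup:  %.1fx" % speedup)
```

The recomputed rates land within a fraction of a percent of the printed ones; the small gap is presumably due to the elapsed time being rounded to two decimals in the output.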

The goal of this post was not a thorough performance review of Hypertable, so if you are interested in numbers you should definitely take Hypertable for a spin yourself. If you're interested in how you can make use of the extra scalability, stay tuned for the next post in the Tools series, where I'm going to explain how to make the Wikipedia data imported into Hypertable searchable.

Update: I forgot to mention that these tests were performed on x86_64 Ubuntu 8.10 Linux, with Hypertable compiled and running in 64-bit mode, on a single Intel Core 2 Quad Q6600 processor clocked at 2.4 GHz with 4 GB of physical memory and a single 500 GB SATA II disk.


Hypertable project has a new sponsor!

Filed in archive News on January 24, 2009

After a few months of evaluating the Hypertable project, many of you have probably noticed that Hypertable has a new sponsor: Baidu. Here are the numbers according to Wikipedia:

Baidu provides an index of over 740 million web pages, 80 million images, and 10 million multimedia files.[5] The domain baidu.com attracted at least 5.5 million visitors annually by 2008 according to a Compete.com survey.

What's more, Baidu is going to throw Hypertable at their engineers, and the whole infrastructure will be operating under the control of Hypertable. Neat! This is no doubt one of the early signs that the project's quality is being acknowledged by the community.

Here's the official announcement: Baidu now an Official Sponsor of Hypertable

Update: Here you can find the official announcement on the Hypertable PR page
