Friday, December 10, 2010

HDD vs SSD vs RAM

Recently I wanted to add a machine to a cluster that does some heavy lifting with Wikipedia. CPU consumption is modest, but I/O bandwidth is pretty much maxed out on all our cluster nodes.

All our cluster nodes have large amounts of RAM, in excess of 8 GB, although that does not sound like much these days. About 6 GB of this RAM is turned into a RAM disk using freely available software such as Dataram RAMDisk.

HDD and SSD comparisons are available all over the internet, and I just used this one as a reference. The results of those tests can be summarized as follows:

Disk Drive | Avg Read Rate (MB/s) | Avg Write Rate (MB/s)
-----------|----------------------|----------------------
HDD        | 98                   | 87
SSD        | 196                  | 85


Those numbers are impressive for SSDs compared to HDDs. But RAM is much faster still, and if you can fit all your disk-intensive data into RAM, you won't need a fast HDD, an SSD, a RAID 0 SSD array, or anything of the sort.

Here are the results of the benchmark that ran on the RAM disk.

Disk Drive | Avg Read Rate (MB/s) | Avg Write Rate (MB/s)
-----------|----------------------|----------------------
RAM disk   | 787                  | 785


Now that is fast. I have decided not to buy any SSDs, since all my data fits in RAM. I hope this article helps others decide whether they really need expensive SSDs, or even RAID 0 arrays of SSDs.
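
If you want to sanity-check numbers like these on your own drive or RAM disk, a crude sequential read/write test is easy to put together. The sketch below is only an illustration, not the benchmark tool used for the tables above; the target path, file size and buffer size are assumptions you would change for your setup, and OS file caching will make the read figure optimistic.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    // Crude sequential throughput test: write a large file to the target
    // drive, sync it to the device, then read it back and report MB/s.
    public class DiskSpeedSketch {
        public static void main(String[] args) throws Exception {
            // Assumption: R: is the RAM disk; pass any path on the drive under test.
            File target = new File(args.length > 0 ? args[0] : "R:\\speedtest.bin");
            final int MB = 1024 * 1024;
            final int totalMb = 1024;          // size of the test file in MB
            byte[] buffer = new byte[4 * MB];  // write/read in 4 MB chunks

            long start = System.nanoTime();
            try (FileOutputStream out = new FileOutputStream(target)) {
                for (int i = 0; i < totalMb / 4; i++) {
                    out.write(buffer);
                }
                out.getFD().sync();            // force the data out to the device
            }
            double writeSecs = (System.nanoTime() - start) / 1e9;

            start = System.nanoTime();
            try (FileInputStream in = new FileInputStream(target)) {
                while (in.read(buffer) != -1) {
                    // just stream the file back in
                }
            }
            double readSecs = (System.nanoTime() - start) / 1e9;

            System.out.printf("write: %.0f MB/s, read: %.0f MB/s%n",
                    totalMb / writeSecs, totalMb / readSecs);
            target.delete();
        }
    }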

One thing I should mention: you can run the OS itself from an SSD, but not from the RAM disk setup described here, since the RAM disk only exists after the OS has booted. So if you want a speed boost for the boot process itself, try an SSD. Personally, I still don't think they are worth the money unless you use them in a RAID.

Remember, if your data fits in RAM, you can see blazing speeds. If any Photoshop users are reading this, it is definitely worth trying a RAM disk, or a RAM disk plus SSD. I will be setting this up for my dad (a professional photographer) sometime soon.

Tuesday, August 10, 2010

Set Up or Install MediaWiki with Wikipedia

Some months ago I set up MediaWiki with the latest Wikipedia dump. It was a horrible experience; working with Linux kernel code is easier than installing MediaWiki and loading Wikipedia into it. So here are the steps:

1. Use the BitNami MediaWiki stack
The BitNami MediaWiki stack bundles MediaWiki, Apache, MySQL, PHP and phpMyAdmin, so everything you need for Wikipedia gets installed in one shot.

2. Drop and recreate all the tables in the MediaWiki database in MySQL with ENGINE=MyISAM, and convert everything from latin1 to utf8. Here is my SQL file to recreate the tables.
IMPORTANT: the MyISAM engine is optimized for reading from the database, so if you want to allow edits on your Wikipedia copy, MyISAM is not for you. But it is rare to allow editing on a duplicate Wikipedia site.
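
To give an idea of what the conversion involves, here is a rough sketch that walks every table in the wiki database and switches it to MyISAM and utf8 in place (my actual SQL file drops and recreates the tables instead). The JDBC URL, database name and credentials are placeholders, and MySQL Connector/J needs to be on the classpath; treat this as an illustration, not the exact script.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    // Walk every table in the MediaWiki database and convert it to
    // MyISAM + utf8. Connection details below are placeholders.
    public class ConvertWikiTables {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/wikidb"; // your MediaWiki DB
            try (Connection conn = DriverManager.getConnection(url, "root", "secret")) {
                List<String> tables = new ArrayList<>();
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SHOW TABLES")) {
                    while (rs.next()) {
                        tables.add(rs.getString(1));
                    }
                }
                try (Statement st = conn.createStatement()) {
                    for (String table : tables) {
                        st.executeUpdate("ALTER TABLE " + table + " ENGINE=MyISAM");
                        st.executeUpdate("ALTER TABLE " + table
                                + " CONVERT TO CHARACTER SET utf8");
                    }
                }
            }
        }
    }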

3. Change the MySQL configuration file to this. Basically, you increase the memory limits, since Wikipedia is massive.

4. Use mwdumper to import the Wikipedia dump into MediaWiki. All the instructions and troubleshooting notes are listed at that link. The key advice is to read every instruction and follow it exactly, including the troubleshooting steps.

5. MediaWiki is very slow with Wikipedia, so you need a caching mechanism within MediaWiki. Follow the steps listed here. Installing a PHP cache engine is a must.

6. Next, install a reverse proxy cache (caching outside MediaWiki) such as Squid if you expect a large number of hits. Wikipedia itself uses Squid extensively. This is optional and only needed for the highest level of optimization; I just used the reverse proxy caching built into IIS 7.

That should be it. I am sure you will still bang your head against the wall even with these steps, but at least the wall is no longer made of titanium.

I am going to revisit this soon and see what other optimizations I can do, as it is still slow. That can be another Wikipedia optimization post.

Thanks for reading!

Best Mediawiki parser

MediaWiki is a truly horrendous piece of software to work with, especially if you want to work with Wikipedia. The wiki markup (the language used to write articles) does not even have a formal standard, so there are no perfect parsers other than MediaWiki itself.

I have been working with Wikipedia for the past two years now and would much prefer to write my own parser, if only there were a standard for wiki markup.

MediaWiki lists its alternative parsers here.
Almost none of them are complete. I cannot say that all of them are incomplete, because I have not tried all of them, but my impression is that every one of them falls short, especially on Wikipedia articles.

The best MediaWiki parser I have used is gwtwiki, which is also listed on the alternative parsers page on MediaWiki (a minimal usage sketch is included after the list below). Even gwtwiki is incomplete, as not all articles are rendered correctly.
Kindly list any good standalone MediaWiki parsers that you have used and that actually work, and I will add them to the list at the end of the article.

Best MediaWiki Alternate Parsers
1. gwtwiki
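
For completeness, here is roughly what rendering wikitext with gwtwiki looks like. This is a minimal sketch from memory: the exact class names and method signatures may differ between gwtwiki/bliki versions, the link and image URL patterns are placeholders, and the gwtwiki jar needs to be on the classpath.

    import info.bliki.wiki.model.WikiModel;

    // Minimal gwtwiki (bliki) example: convert a snippet of wiki markup to HTML.
    public class WikiParseSketch {
        public static void main(String[] args) {
            // The two patterns tell the model how to build image and article links.
            WikiModel model = new WikiModel("/wiki/${image}", "/wiki/${title}");
            String html = model.render("This is '''bold''' text with a [[Hello World]] link.");
            System.out.println(html);
        }
    }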