Some months ago I set up MediaWiki with the latest Wikipedia dump, and it was a horrible experience. It's easier to work with Linux kernel code than to install MediaWiki and set up Wikipedia. So here are the steps:
1. Use BitNami MediaWiki
The BitNami MediaWiki stack bundles MediaWiki, Apache, MySQL, PHP and phpMyAdmin, so everything you need for Wikipedia gets installed in one shot.
2. Drop and recreate all the tables within the MediaWiki database in MySQL with ENGINE=MyISAM, and convert everything from latin1 to utf8. Here is my SQL file to recreate the tables.
IMPORTANT: The MyISAM engine is optimized for reading from the database. So if you want to allow Wikipedia edits, MyISAM is not for you. But it's rare to allow editing on a duplicate Wikipedia site.
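As a hedged sketch of what the conversion looks like (the table names below come from the standard MediaWiki schema, but verify against your version; you repeat this for every table in the database):

    -- Sketch: convert core MediaWiki tables to MyISAM and utf8.
    -- Repeat for every table in the MediaWiki database.
    ALTER TABLE page ENGINE=MyISAM;
    ALTER TABLE page CONVERT TO CHARACTER SET utf8;
    ALTER TABLE revision ENGINE=MyISAM;
    ALTER TABLE revision CONVERT TO CHARACTER SET utf8;
    ALTER TABLE text ENGINE=MyISAM;
    ALTER TABLE text CONVERT TO CHARACTER SET utf8;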
3. Change the MySQL configuration file to this. Basically you increase the memory limits, as Wikipedia is massive.
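For orientation, the relevant section of my.cnf looks roughly like the sketch below. The numbers are placeholders to size against your own RAM, not my exact values:

    # Sketch of my.cnf tuning for a large, read-mostly MyISAM database.
    # All values are illustrative; scale them to the memory you have.
    [mysqld]
    key_buffer_size = 1G        # MyISAM index cache; the most important knob here
    max_allowed_packet = 64M    # long article revisions arrive as large packets
    table_open_cache = 512
    read_buffer_size = 4M
    sort_buffer_size = 4M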
4. Use mwdumper to import the Wikipedia dump into MediaWiki. All the instructions and troubleshooting steps are listed at that link. The key advice is to read each and every instruction and follow it exactly, including the troubleshooting options.
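The core of the import is a single pipeline that streams the dump straight into MySQL (the file, user and database names below are placeholders for your own):

    # Stream the XML dump through mwdumper into the MediaWiki database.
    java -jar mwdumper.jar --format=sql:1.5 enwiki-pages-articles.xml.bz2 \
      | mysql -u wikiuser -p wikidb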
5. MediaWiki is very slow with Wikipedia, so you need a caching mechanism within MediaWiki. Follow the steps listed here. Installing a PHP cache engine is a must.
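As a sketch, assuming you installed a PHP accelerator such as APC, the caching section of LocalSettings.php ends up looking something like this:

    // Sketch: caching settings in LocalSettings.php, assuming a PHP
    // accelerator such as APC is installed.
    $wgMainCacheType    = CACHE_ACCEL;   // object cache lives in the accelerator
    $wgParserCacheType  = CACHE_ACCEL;   // cache parsed article HTML
    $wgMessageCacheType = CACHE_ACCEL;
    $wgUseFileCache     = true;          // serve rendered pages from disk to anonymous users
    $wgFileCacheDirectory = "$IP/cache";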
6. Next, install a reverse proxy cache (caching outside MediaWiki) like Squid if you will have a large number of hits. Wikipedia itself uses Squid extensively. But this is optional and only for the highest level of optimization; I just used reverse proxy caching in IIS7.
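If you do go the Squid route, the MediaWiki side of it is just a couple of settings in that era's LocalSettings.php (the address below is a placeholder for your proxy):

    // Sketch: tell MediaWiki that a Squid reverse proxy sits in front of it,
    // so it can send purge requests when pages change.
    $wgUseSquid     = true;
    $wgSquidServers = array( '127.0.0.1' );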
That should be it. I am sure you will still break your head even with these steps, but at least the wall you are banging it on is no longer made of titanium.
I am going to revisit this soon and see what other optimizations I can do, as it is still slow. That can be another Wikipedia optimization post.
Thanks for reading!
Tuesday, August 10, 2010
Best MediaWiki parser
MediaWiki is a truly horrendous piece of software, especially if you want to work with Wikipedia. The wiki markup (the language used to write articles) does not even have a standard, hence there are no perfect parsers other than MediaWiki itself.
I have been working with Wikipedia for the past two years now and would prefer to write my own parser, if only there were a standard for wiki markup.
MediaWiki lists its alternative parsers here.
Almost none of them is complete. I have not tried every one, so I cannot say that for certain, but every parser I have tried has been incomplete, especially on Wikipedia articles.
The best MediaWiki parser I have used is gwtwiki, which is also listed on that alternative parsers page.
But even gwtwiki is incomplete: not all articles are displayed correctly.
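For anyone trying it out, gwtwiki (the bliki engine) boils down to a one-call conversion from wiki markup to HTML. A minimal sketch, assuming the bliki jar is on your classpath:

    // Minimal gwtwiki (bliki) usage: convert raw wiki markup to HTML.
    import info.bliki.wiki.model.WikiModel;

    public class WikiToHtml {
        public static void main(String[] args) {
            String html = WikiModel.toHtml("This is '''bold''' text with a [[Hello World]] link.");
            System.out.println(html);
        }
    }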
Kindly list some good standalone MediaWiki parsers that you have used and that actually work.
I will add them to the list at the end of this article.
Best MediaWiki Alternate Parsers
1. gwtwiki