Friday, August 16, 2013

Using simple methods for data extraction

Here is a simple problem

IMDB publishes movies of all time. The list contains 250 movies.






Problem: In the web page, there is no sorting based on rating or votes.

Goal: I want to sort it by year, so that I can watch movies from latest to the old ones. And year is not a field at all in their table.
It will be good to have sorting by votes,rating as well.

How to go about it?

We can use very simple methods to achieve the solution instead of writing a full fledged program.

Steps

The first step is to copy the data into excel and separate it out into columns. (Excel can separate it out into columns if we copy the data into a text file and use the import operator.)

But there are problems with importing into excel. The two broad types of import are "fixed width" and "delimiter".

The length of the movie name is not fixed. So we cannot use fixed width column import.
We cannot use space as delimiter. Since movie names contain space.

Alternate Solution

I first use the left bracket "( "as the delimiter and get retrieve data.
Then I use the right bracket ")" and retrieve  data. This is because excel does not provide option for two custom delimiters.
Hence the two step process.

Once I have separated out the date and votes like this, i move them to a separate text file.

Now i come back to the original file (which just contains Rank, Rating and Title) and use fixed delimiter to separate out the Rank and Rating. What remains is the "movie title" neatly separated out into its own column.

So we have reached the solution by using a combination of custom delimiter and fixed width in Excel.

I can sort by ID, Rating, Title, Year or Votes. :)



And here is the link to the excl file i created using the above steps.

https://dl.dropboxusercontent.com/u/48173633/Top%20250%20Movies%20sorted%20by%20year.xlsx