Bug #189
Closed
IMDB scraper - use HTTP compression (gzip)
Description
Hi,
In an attempt to make scraping faster, I routed the XBMC4Xbox traffic through a Fiddler session (http://www.fiddler2.com) and found that the responses weren't compressed. Looking at some examples on the web, I found that gzip is in fact supported, and confirmed it by quickly changing imdb.xml to add the gzip option. Apparently the support has been in there for a while (ticket 17389).
Attached are a capture from the original scraper, one from the modified (compressed) scraper, and an attempt at a diff. Compression reduces the traffic for some pages from 98K down to 22K! (See the Transformer tab in Fiddler for compression info.)
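To give a feel for why HTML responds so well to gzip, here is a minimal Python sketch. It does not touch IMDB or the scraper: the page body is synthetic, repetitive markup invented for illustration, and `compression_report` is a hypothetical helper. Synthetic markup compresses far better than a real page would, so the printed sizes will not match the 98K/22K figures from the capture.

```python
import gzip


def compression_report(body: bytes) -> tuple:
    """Return (original size, gzip-compressed size) in bytes."""
    compressed = gzip.compress(body)
    return len(body), len(compressed)


# Synthetic stand-in for a scraped page (assumption: not real IMDB markup).
row = '<tr class="cast"><td class="name"><a href="/name/">Some Actor</a></td></tr>\n'
page = ("<html><body><table>\n" + row * 1000 + "</table></body></html>").encode("utf-8")

original_size, compressed_size = compression_report(page)

# gzip is lossless: decompressing returns the exact original bytes.
round_trip = gzip.decompress(gzip.compress(page))

print("original:", original_size, "bytes")
print("gzipped: ", compressed_size, "bytes")
```

Over HTTP, a client asks for this by sending an `Accept-Encoding: gzip` request header, and a server that supports it answers with `Content-Encoding: gzip` and a compressed body.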
There might be more to do around the "cache" option. As you can see if you open the captures in Fiddler, the scraper requests the same URL several times. I wonder whether that could be cached locally instead of being downloaded again, but that might be a subject for another ticket.
Just wondering if we could modify the scraper to use HTTP compression, and whether there was a reason for not using it in the first place.
Thanks,
Dan
Updated by dandar3 almost 13 years ago
I can't really say it was a lot quicker. We might need to look at the "cache" option, which would avoid loading the same resource again.
http://akas.imdb.com/find?s=tt;q=team%20america%20%2d%20world%20police%20(2004)
http://akas.imdb.com/title/tt0372588/?fr=c2M9MXxsbT01MDB8ZmI9dXx0dD0xfG14PTIwfGh0bWw9MXxjaD0wfGNvPTB8cG49MHxmdD0wfGt3PTB8cXM9dGVhbSBhbWVyaWNhIC0gd29ybGQgcG9saWNlICgyMDA0KXxzaXRlPWFrYXxxPXRlYW0gYW1lcmljYSAtIHdvcmxkIHBvbGljZXxubT0w;fc=1;ft=20
http://akas.imdb.com/title/tt0372588/
http://akas.imdb.com/title/tt0372588/plotsummary
http://akas.imdb.com/title/tt0372588/
http://akas.imdb.com/title/tt0372588/
http://akas.imdb.com/video/imdb/vi3129803033/player
http://akas.imdb.com/title/tt0372588/posters
http://akas.imdb.com/title/tt0372588/
I think most of the time might be spent in the RegExp expressions, though. Even so, the gzip option is still worth considering to reduce network traffic.
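The repeated requests in the list above suggest what a local cache could save. The following is a minimal Python sketch, not the scraper's actual "cache" mechanism: `CachingFetcher` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in that performs no network access. It replays the seven title-related requests from the list (the search and redirect URLs are left out for brevity) and counts how many would really need downloading.

```python
from typing import Callable, Dict, List


class CachingFetcher:
    """Wrap a fetch function so that each URL is downloaded at most once."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}
        self.downloads = 0  # number of calls that reached the real fetcher

    def get(self, url: str) -> bytes:
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
            self.downloads += 1
        return self._cache[url]


def fake_fetch(url: str) -> bytes:
    """Stand-in for an HTTP download (assumption: no network is used)."""
    return ("<html>" + url + "</html>").encode("utf-8")


# Title-related requests, in the order they appear in the capture.
REQUESTS: List[str] = [
    "http://akas.imdb.com/title/tt0372588/",
    "http://akas.imdb.com/title/tt0372588/plotsummary",
    "http://akas.imdb.com/title/tt0372588/",
    "http://akas.imdb.com/title/tt0372588/",
    "http://akas.imdb.com/video/imdb/vi3129803033/player",
    "http://akas.imdb.com/title/tt0372588/posters",
    "http://akas.imdb.com/title/tt0372588/",
]

fetcher = CachingFetcher(fake_fetch)
for url in REQUESTS:
    fetcher.get(url)

print(len(REQUESTS), "requests,", fetcher.downloads, "downloads")
```

In this replay, four of the seven requests go to distinct URLs; the other three are repeats of the title page and would be served from the cache.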
Updated by buzz over 12 years ago
- Status changed from New to Closed
- Assignee set to dandar3
- Target version set to 3.2
- Resolution set to fixed