Daily Archives: October 7, 2008

v0.2 of my Visit/PageView Cube – Creating a Hierarchy for the Source Dimension

So I am still playing around with building an Analysis Services cube from the pageview data I recently started collecting.  Over this past weekend I added a hierarchy to the Source dimension of my cube that gives me a bit more insight into how visitors find their way to my site.  If you read my last post, you saw that I could view my pageview data broken down by two pretty general traffic sources: direct traffic and referring links …

[image: pageviews broken down by direct vs. referring traffic]

Well, now with my new hierarchy, I can not only view hit counts by direct traffic, referring site, or search engine …

[image: hit counts by direct traffic, referring site, and search engine]

… but I can also drill into each of these sources and view my pageviews at a more granular level.

[image: drilling into each traffic source for a more granular view]

I can do this because I have added a bit of preprocessing logic to my cube creation process that sends all referring URLs through a very simple rule engine that assigns each URL a category and subcategory.  There is some source code a little further below, but the rules look something like this …

-- Search Engine Rules

if ref_url's domain like google.com and ref_url querystring contains the q token then Category=SearchEngine, SubCategory=Google

if ref_url's domain like yahoo.com and ref_url querystring contains the p token then Category=SearchEngine, SubCategory=Yahoo

-- Community Referrers

if ref_url's domain like dotnetkicks.com then Category=Referrer, SubCategory=Community

if ref_url's domain like digg.com then Category=Referrer, SubCategory=Community

-- Internal Referrers

if ref_url's domain like mattberseth.com then Category=Referrer, SubCategory=Internal

if ref_url's domain like mattberseth2.com then Category=Referrer, SubCategory=Internal

...
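In .NET terms, the "domain like" part of each rule boils down to a case-insensitive test on the referrer's host.  A tiny standalone sketch (the URLs here are hypothetical, just for illustration):

```csharp
using System;

class MatchSketch
{
    static void Main()
    {
        //  hypothetical referring URLs
        Uri google = new Uri("http://www.google.com/search?q=gridview");
        Uri digg   = new Uri("http://digg.com/programming/some_story");

        //  "domain like google.com" becomes a case-insensitive host check
        Console.WriteLine(google.Host.IndexOf("google", StringComparison.OrdinalIgnoreCase) >= 0);   // True
        Console.WriteLine(digg.Host.IndexOf("digg.com", StringComparison.OrdinalIgnoreCase) >= 0);   // True
    }
}
```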

 

So I took these rules (and a few others) and created a quick and dirty console app that rips through all of my pageviews and assigns a category and subcategory based on these rules.  The console app is a total of 250 LOC, so I am not going to post all of the source, but this is the gist of it (I removed the SqlClient code that does the reading and writing) …

using System;
using System.Collections.Generic;
using System.Web;   //  HttpUtility (reference System.Web.dll)

class Program
{
    static void Main(string[] args)
    {
        List<TrafficSource> trafficSources = new List<TrafficSource>()
        {
            //  Search Engines
            new SearchEngine(){ Name="google", SourceID=3, QueryToken="q", IsMatch = uri => uri.Host.IndexOf("google", StringComparison.OrdinalIgnoreCase) >= 0 && uri.LocalPath == "/search" },
            new SearchEngine(){ Name="yahoo", SourceID=4, QueryToken="p", IsMatch = uri => uri.Host.IndexOf("search.yahoo", StringComparison.OrdinalIgnoreCase) >= 0 && uri.LocalPath == "/search" },

            //  Social
            new TrafficSource(){ Name="dotnetkicks", SourceID=6, IsMatch = uri => uri.Host.IndexOf("dotnetkicks.com", StringComparison.OrdinalIgnoreCase) >= 0 },
            new TrafficSource(){ Name="digg", SourceID=6, IsMatch = uri => uri.Host.IndexOf("digg.com", StringComparison.OrdinalIgnoreCase) >= 0 },
            new TrafficSource(){ Name="reddit", SourceID=6, IsMatch = uri => uri.Host.IndexOf("reddit.com", StringComparison.OrdinalIgnoreCase) >= 0 },
            new TrafficSource(){ Name="stumbleupon", SourceID=6, IsMatch = uri => uri.Host.IndexOf("stumbleupon.com", StringComparison.OrdinalIgnoreCase) >= 0 },

            //  Community
            new TrafficSource(){ Name="weblogs.asp.net", SourceID=6, IsMatch = uri => uri.Host.IndexOf("weblogs.asp.net", StringComparison.OrdinalIgnoreCase) >= 0 },

            //  Forums
            new TrafficSource(){ Name="forums.asp.net", SourceID=7, IsMatch = uri => uri.Host.IndexOf("forums.asp.net", StringComparison.OrdinalIgnoreCase) >= 0 },
            new TrafficSource(){ Name="expertsexchange", SourceID=7, IsMatch = uri => uri.Host.IndexOf("experts-exchange.com", StringComparison.OrdinalIgnoreCase) >= 0 },
            new TrafficSource(){ Name="stackoverflow", SourceID=7, IsMatch = uri => uri.Host.IndexOf("stackoverflow.com", StringComparison.OrdinalIgnoreCase) >= 0 },

            //  Internal
            new TrafficSource(){ Name="mattberseth.com", SourceID=8, IsMatch = uri => uri.Host.IndexOf("mattberseth.com", StringComparison.OrdinalIgnoreCase) >= 0 },
            new TrafficSource(){ Name="mattberseth2.com", SourceID=8, IsMatch = uri => uri.Host.IndexOf("mattberseth2.com", StringComparison.OrdinalIgnoreCase) >= 0 }
        };

        //  default uncategorized traffic source
        TrafficSource defaultTrafficSource = new TrafficSource() { Name = "Other", SourceID = 9, IsMatch = uri => true };
        //  direct traffic
        TrafficSource directTrafficSource = new TrafficSource() { Name = "Direct", SourceID = 0, IsMatch = uri => true };

        bool done = false;
        while (!done)
        {
            //  fetch the next batch of page views
            List<PageView> pageviews = PageView.Fetch();

            //  categorize each pageview based on its referring url
            foreach (PageView pageview in pageviews)
            {
                if (pageview.RefUri == null)
                {
                    //  direct traffic, assign the direct traffic source
                    pageview.Source = directTrafficSource;
                }
                else
                {
                    foreach (TrafficSource source in trafficSources)
                    {
                        //  see if it matches any of the existing rules we have setup
                        if (source.IsMatch(pageview.RefUri))
                        {
                            //  add it to the group
                            pageview.Source = source;
                            break;
                        }
                    }

                    if (pageview.Source == null)
                    {
                        //  else it's just some random ref we don't care enough about
                        //  to further categorize
                        pageview.Source = defaultTrafficSource;
                    }
                }

                //  update the record
                pageview.Update();
            }

            done = pageviews.Count == 0;
        }
    }
}

class TrafficSource
{
    public int SourceID { get; set; }
    public string HostName { get; set; }
    public string Name { get; set; }

    public Func<Uri, bool> IsMatch { get; set; }
}

class SearchEngine : TrafficSource
{
    public string QueryToken { get; set; }

    public string ParseKeywords(Uri uri)
    {
        return HttpUtility.ParseQueryString(uri.Query)[this.QueryToken];
    }
}

class PageView
{
    public int ID { get; set; }
    public Uri Uri { get; set; }
    public Uri RefUri { get; set; }
    public TrafficSource Source { get; set; }

    public void Update()
    {
        // update the pageview with the category/subcategory
        // (SqlClient code removed)
    }

    public static List<PageView> Fetch()
    {
        // get the next batch of pageviews that need to be categorized
        // (SqlClient code removed - placeholder so the sample compiles)
        return new List<PageView>();
    }
}
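A nice by-product of storing the query token on each SearchEngine rule is that you can recover the actual search phrase that brought a visitor in.  A quick hypothetical sketch of what ParseKeywords is doing under the hood (Yahoo puts the phrase in the p token, Google in q):

```csharp
using System;
using System.Web;   //  HttpUtility (reference System.Web.dll)

class KeywordSketch
{
    static void Main()
    {
        //  hypothetical Yahoo referrer
        Uri refUri = new Uri("http://search.yahoo.com/search?p=ajax+control+toolkit");

        //  ParseQueryString url-decodes the value, so '+' comes back as a space
        string keywords = HttpUtility.ParseQueryString(refUri.Query)["p"];
        Console.WriteLine(keywords);   // prints: ajax control toolkit
    }
}
```

I am not doing anything with the keywords in the cube yet, but they would make an obvious candidate for another dimension attribute.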

 

And the output of this preprocessing is a table that contains the set of domains that link to my blog, as well as each referring domain's category and subcategory.  I then take this table and use it as the data source for my Source dimension.
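Roughly, the lookup table has a shape something like this (a sketch — the actual table and column names in my schema may differ):

```sql
-- hypothetical shape of the referring-domain lookup table
CREATE TABLE SourceDomain
(
    Domain          varchar(255) NOT NULL,  -- e.g. 'www.google.com'
    SourceID        int          NOT NULL,  -- key into the Source dimension
    CategoryName    varchar(50)  NOT NULL,  -- 'Direct', 'Search Engine', 'Referrer'
    SubcategoryName varchar(50)  NOT NULL   -- 'Google', 'Community', 'Internal', ...
);
```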

[image: table of referring domains with their categories and subcategories]

And after reprocessing the cube, you can write whatever custom MDX queries you want.  The one below calculates what percentage of the daily total each traffic category makes up.

with
    member [Measures].[Direct Traffic Hits] as ([Source].[Hierarchy].[Category Name].&[Direct], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
    member [Measures].[Search Engines Hits] as ([Source].[Hierarchy].[Category Name].&[Search Engine], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
    member [Measures].[Community Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Community], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
    member [Measures].[Forums Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Forums], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
    member [Measures].[Internal Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Internal], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
    member [Measures].[Unknown Hits] as ([Source].[Hierarchy].[Subcategory Name].&[Unknown], [Measures].[Hit]) / ([Measures].[Hit]), format_string = 'Percent'
    member [Measures].[All Hits] as ([Measures].[Hit]), format_string = 'Standard'
select
non empty
{
    [Measures].[Direct Traffic Hits],
    [Measures].[Search Engines Hits],
    [Measures].[Community Hits],
    [Measures].[Forums Hits],
    [Measures].[Internal Hits],
    [Measures].[Unknown Hits],
    [Measures].[All Hits]
} on 0,
non empty
{
    [Time].[Date].children
} on 1
from PageView

 

Executing the above MDX produces the following result set …

[image: result set of the MDX query, traffic percentages by date]

 

And you can pipe this right into Excel to chart the results (thanks to ScottGu for the huge jump in referring traffic last Thursday) …

[image: Excel chart of traffic percentages by category over time]

 

Conclusion

I have gotten a ton of questions about how I am collecting my pageview data.  I don't have all of the kinks worked out yet, but I will probably write up a quick post next on how I am doing it.

 

That's it.  Enjoy!
