Sunday, July 27, 2014

Using Guava Multiset to do collated-counting


Andrew Sy (162 karma) posted on Feb 24, 2013
Recently, our team developed some Hadoop jobs that collate sets of data by type and then count the number of occurrences of each type. For example, we have the following Hadoop jobs that do this sort of collated counting:
  • TraitsByAudience
  • LookAlike
  • RelatedCounts
A Multiset is a data structure designed for exactly this kind of collated counting. Guava includes an implementation of Multiset. A very good explanation of it is given on the Guava Explained site.
I copy-pasted their example here:
Collated-Counting: Traditional Hand Crafted Solution
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String word : words) {
 Integer count = counts.get(word);
 if (count == null) {
   counts.put(word, 1);
 } else {
   counts.put(word, count + 1);
 }
}
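As an aside, the hand-crafted loop can be shortened without any library. This is only a sketch, and it assumes Java 8 or later, where Map.merge is available; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name, for illustration only.
class MergeCountSketch {

    // Counts how often each word occurs, using Map.merge (Java 8+).
    static Map<String, Integer> countWords(Iterable<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // Stores 1 for a new key, otherwise adds 1 to the existing count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "pear", "apple");
        System.out.println(countWords(words).get("apple")); // prints 2
    }
}
```

Even in this shorter form, a word that never occurred is simply missing from the map, so callers still have to handle null.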
Contrast the above with this:
Collated-Counting: Using Guava Multiset
Multiset<String> wordsMultiset = HashMultiset.create();
wordsMultiset.addAll(words);
// now we can use wordsMultiset.count(String) to find the count of a word
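To make the snippet concrete, here is a small self-contained sketch. It assumes Guava is on the classpath; the class name and sample words are made up for illustration. It also shows the difference between size() and elementSet().size().

```java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import java.util.Arrays;
import java.util.List;

// Hypothetical class name, for illustration only.
class MultisetBasicsSketch {

    // Copies the words into a Multiset, which tracks a count per distinct word.
    static Multiset<String> countWords(List<String> words) {
        Multiset<String> wordsMultiset = HashMultiset.create();
        wordsMultiset.addAll(words);
        return wordsMultiset;
    }

    public static void main(String[] args) {
        Multiset<String> m = countWords(Arrays.asList("apple", "pear", "apple"));
        System.out.println(m.count("apple"));      // prints 2
        System.out.println(m.count("plum"));       // prints 0: absent elements count as zero, not null
        System.out.println(m.size());              // prints 3: total occurrences
        System.out.println(m.elementSet().size()); // prints 2: distinct elements
    }
}
```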
Here's how we leveraged it in our TraitsByAudience Hadoop job. We use Multiset to do our collated counting in the countTraits() method.
Using Multiset to count traitIds
@VisibleForTesting Multiset<String> countTraits(Iterable<Text> listOfTraitIdsAsCsvList)
{
   Multiset<String> traitsCounter = HashMultiset.create();
    
   for (Text traitsAsCsvList : listOfTraitIdsAsCsvList)
   {
       //split "traitId_1, traitId_2,,,traitId_5," into ids, trimming whitespace and ignoring empty fields
       Iterable<String> iterTraits = Splitter.on(",").trimResults().omitEmptyStrings().split(traitsAsCsvList.toString());
       traitsCounter.addAll(Lists.newArrayList(iterTraits));
   }
    
   return traitsCounter;
}
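The splitting and counting can be tried outside Hadoop with a sketch like the one below. It assumes Guava on the classpath and uses plain String lines in place of Hadoop's Text; the class name and sample ids are made up. It also assumes that whitespace around an id should be ignored, hence the trimResults() call, and it uses Iterables.addAll to skip the intermediate list.

```java
import com.google.common.base.Splitter;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Iterables;
import com.google.common.collect.Multiset;
import java.util.Arrays;

// Hypothetical class name, for illustration only.
class TraitSplitSketch {

    // Splits on commas, trims each field, then drops fields that are empty.
    private static final Splitter CSV = Splitter.on(',').trimResults().omitEmptyStrings();

    // Counts trait ids across several comma-separated lines.
    static Multiset<String> countTraits(Iterable<String> csvLines) {
        Multiset<String> counter = HashMultiset.create();
        for (String line : csvLines) {
            // Adds every id on the line straight into the Multiset.
            Iterables.addAll(counter, CSV.split(line));
        }
        return counter;
    }

    public static void main(String[] args) {
        Multiset<String> counts = countTraits(Arrays.asList("t1, t2,,,t5,", "t2,t5,t5"));
        System.out.println(counts.count("t5")); // prints 3
        System.out.println(counts.size());      // prints 6
    }
}
```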
And here's how we sort the Multiset counter and read the aggregate values out of it in createStatisticsSortedByHighestCountFrom(). Can't get much simpler than this.
Using Multiset to sort and read aggregate statistics
@VisibleForTesting List<TraitStatistics> createStatisticsSortedByHighestCountFrom(Multiset<String> traitsCounter)
{
   final List<TraitStatistics> allTraitStats = new ArrayList<TraitStatistics>();
    
   final int totalNumberOfTraits = traitsCounter.size();
   final int numberOfUniqueTraits = traitsCounter.elementSet().size();
    
   Multiset<String> traitsSortedByHighestCount = Multisets.copyHighestCountFirst(traitsCounter);
   for (Multiset.Entry<String> traitCount : traitsSortedByHighestCount.entrySet())
   {
       TraitStatistics stat = new TraitStatistics();
       stat.totalNumberOfTraits =  totalNumberOfTraits;
       stat.numberOfUniqueTraits = numberOfUniqueTraits;
       stat.traitId = traitCount.getElement();
       stat.traitCount = traitCount.getCount();
       stat.traitPercentage = (float)traitCount.getCount() * 100 / totalNumberOfTraits;
        
       allTraitStats.add(stat);
   }
    
   return allTraitStats;
}
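For a quick feel of the sorting step, here is a stripped-down sketch. It assumes Guava on the classpath, and the class name, method name and sample ids are made up. Multisets.copyHighestCountFirst returns an ImmutableMultiset whose entries iterate from highest count to lowest; the sample counts are all different so that ties don't come into play.

```java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.ImmutableMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical class name, for illustration only.
class HighestCountFirstSketch {

    // Returns the distinct ids ordered from most frequent to least frequent.
    static List<String> idsByHighestCount(Multiset<String> counter) {
        ImmutableMultiset<String> sorted = Multisets.copyHighestCountFirst(counter);
        List<String> ids = new ArrayList<>();
        for (Multiset.Entry<String> entry : sorted.entrySet()) {
            ids.add(entry.getElement());
        }
        return ids;
    }

    public static void main(String[] args) {
        Multiset<String> counter = HashMultiset.create();
        counter.add("t1");    // one occurrence
        counter.add("t2", 2); // two occurrences
        counter.add("t5", 3); // three occurrences

        int total = counter.size(); // total occurrences across all ids
        for (Multiset.Entry<String> e : Multisets.copyHighestCountFirst(counter).entrySet()) {
            float percentage = (float) e.getCount() * 100 / total;
            System.out.println(String.format(Locale.ROOT, "%s: %d (%.1f%%)",
                    e.getElement(), e.getCount(), percentage));
        }
    }
}
```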


