Finding Second Highest Frequent Character in a text with Java

Finding second highest frequency character in a text is last weeks Thursdays Dzone Code Puzzle. Intention of this post is to try and work out this interview oriented problem in Java with you by explaining the way I am approaching this task. In return, both you and myself can learn through a discussion.

The requirement is to return the character that is the second most frequent. Before writing any code, it is important to break down this requirement into a simple set of test cases. As a TDD (Test Driven Development) oriented developer, I always try to approach a problem in that manner. The most correct way would be to start with one test case and proceed with the implementation code, but for the easiness in explanation, I will show four initial test cases and production code to support that requirement. Part 2 will cover additional scenarios which this program fails to support.

Initial Test Cases

1. shouldReturnNullForEmptyText()
> result should be null character for empty string
2. shouldReturnNullForTextOfSameCharacter()
> result should be null character for a text with the same character since there has to be at least two different characters to have a second most frequency
3. shouldReturnCorrectCharWhenTextHasTwoCharsOneBeingMostFrequent()
> result should be the character that is not the most frequent character
4. shouldReturnCorrectOneForATextWithOneBeingSecondMostFrequent()
> result should be the second most frequent character

Above test cases are implemented in the following manner with JUnit4. All test cases check the equality of expected values against actual.

package com.digizol.puzzle;

import static java.lang.Character.valueOf;
import static org.junit.Assert.assertEquals;
import org.junit.Before;
import org.junit.Test;

public class SecondFrequentCharacterInitialTest {

    private MapBasedFrequentCharacter sut;

    @Before
    public void setup() {
        sut = new MapBasedFrequentCharacter();
        // sut = new SortBasedFrequentCharacter();
    }

    @Test
    public void shouldReturnNullForEmptyText() {
        assertEquals(valueOf('\u0000'), valueOf(sut.getSecondMostFrequent("")));
    }

    @Test
    public void shouldReturnNullForTextOfSameCharacter() {
        assertEquals(valueOf('\u0000'), valueOf(sut.getSecondMostFrequent("dddddddd")));
    }

    @Test
    public void shouldReturnCorrectCharWhenTextHasTwoCharsOneBeingMostFrequent() {
        assertEquals(valueOf('y'), valueOf(sut.getSecondMostFrequent("iiiiiyiiiii")));
    }

    @Test
    public void shouldReturnCorrectOneForATextWithOneBeingSecondMostFrequent() {
        // most frequent is 'i', second most is 'd'
        assertEquals(valueOf('d'), valueOf(sut.getSecondMostFrequent("iaibicidieidif")));
    }
}

Production Code

As per the test cases, production code should have a method to take the text as an input and to return a character. One approach to solve this is to use a map to record the frequencies of each character and extract the second most frequent. Another approach is to record the character frequencies in a list, then sort by the frequency and extract the second most frequent. There is another approach to record the frequencies in an array where the array index is the integer representation of the character, but size of the character set must be known prior. In this post, both map based and sort based approaches are implemented, not the array based one.

Map based solution

In the following implementation, the characters in the text is added to a map and the frequency of those are incremented. Completed production code looks as follows.

package com.digizol.puzzle;

import java.util.*;
import java.util.Map.Entry;

public class MapBasedFrequentCharacter {

    public char getSecondMostFrequent(String text) {

        char[] charArray = text.toCharArray();
        
        // calculate char frequencies
        Map<Character, Integer> charFrequenciesMap = new HashMap<Character, Integer>();
        
        // loop1
        for (char c : charArray) {
            int frequency = 1;
            if (charFrequenciesMap.get(c) != null) {
                frequency = charFrequenciesMap.get(c) + 1;
            }
            charFrequenciesMap.put(c, frequency);
        }

        int currentMostFrequency = 0;
        int currentSecondMostFrequency = 0;
        char mostFrequentChar = '\u0000';
        char secondMostChar = '\u0000';

        // find second most frequent char
        Iterator<Entry<Character, Integer>> charFrequencies
                                                = charFrequenciesMap.entrySet().iterator();
        // loop2
        while (charFrequencies.hasNext()) {
            Entry<Character, Integer> entry = charFrequencies.next();

            char currentChar = entry.getKey();
            int frequency = entry.getValue();

            if (frequency > currentMostFrequency) {
                secondMostChar = mostFrequentChar;
                currentSecondMostFrequency = currentMostFrequency;
                currentMostFrequency = frequency;
                mostFrequentChar = currentChar;
            } else if (frequency > currentSecondMostFrequency) {
                currentSecondMostFrequency = frequency;
                secondMostChar = currentChar;
            }
        }
        return secondMostChar;
    }
}

Above solution uses two loops to complete the operation, and number of iterations in second loop is related to size of character set. This might be considerable if the character set is larger in size e.g: UNICODE. If that is a concern, number of iterations can be reduced by combining two loops; using the frequency recording loop itself to record the second most frequent character as well.

To check this, I did a simple test only to compare the time taken by each variation of the above program against a text with over 5 million ASCII characters (5,264,279 to be exact). As per the results, two loops based program performs slightly quicker than the other. This may be due to the small size of ASCII character set. Anyway one loop variation is not shown here since it is simple enough for you to work out.

Sorting based solution

In this approach, characters in text are chronologically ordered before starting to record the frequencies. A new type is introduced named 'Frequency' to hold the character and it's frequency. After recording the frequencies, a Comparator is used to sort the list of 'Frequency' in descending order of character frequencies to facilitate the extraction of second most frequent.

package com.digizol.puzzle;

import java.util.*;

public class SortBasedFrequentCharacter {

    public char getSecondMostFrequent(String text) {

        char[] charArray = text.toCharArray();
        Arrays.sort(charArray);

        List<Frequency> list = new ArrayList<Frequency>();
        char previous = '\u0000';
        Frequency f = null;

        for (int i = 0; i < charArray.length; i++) {
            char c = charArray[i];

            if (i == 0 || previous != c) {
                f = new Frequency(1, c);
                list.add(f);
                previous = c;
            } else {
                f.count = f.count + 1;
            }
        }

        Collections.sort(
                        list, 
                        new Comparator<Frequency>() {
                            public int compare(Frequency fr0, Frequency fr1) {
                                // sort in descending order
                                return fr1.count - fr0.count;
                            }
                        }
                    );

        char value = '\u0000';
        if (list.size() > 1) {
            value = list.get(1).character;
        }
        return value;
    }

    private class Frequency {
        int count;
        char character;

        public Frequency(int count, char character) {
            this.count = count;
            this.character = character;
        }
    }
}

Both of the above production codes are not re-factored and purposefully kept it as an exercise for you.

More Test Cases

After this initial implementation, we need to consider other test cases that belongs to the requirement. As you may have already noted, above implementation does not support the scenarios where multiple characters being of the same frequencies. As an example, following test case fails against both of the above programs.

@Test
public void shouldReturnCorrectWhenManyMostFrequentsAndOneSecondMostFrequent() {
    // both i & c are the most frequent, y is the second most frequent
    assertEquals(valueOf('y'), valueOf(sut.getSecondMostFrequent("iiiiiyccccc")));
}

As this article is growing in length, Part 2 covers these extra scenarios.

4 comments:

MichaelJuly 24, 2013 at 12:42 AM
I like test based development model; it covers tangible requirement, not just what developer thinks.

Thank you for sharing.
AnonymousOctober 11, 2013 at 12:20 AM
Your map based solution does not work, sir.
Kamal MettanandaOctober 12, 2013 at 9:59 AM
@1 - Thanks Michael for the comment.
Kamal MettanandaOctober 12, 2013 at 10:00 AM
@2 - Sorry if there is an issue, could you please provide more info (or a sample) so that we can work it out...

Post Top Ad

Post Top Ad

Tuesday, July 23, 2013

Finding Second Highest Frequent Character in a text with Java

Initial Test Cases

Production Code

Map based solution

Sorting based solution

More Test Cases

4 comments:

Post Top Ad

Author Details

Recent

Popular

Comments

Archive

Sponsor

Technology

Tags

Tags

Connect With us

Recent News

About Me

Contact Form

Contact

Post Top Ad

Post Top Ad

Tuesday, July 23, 2013

Finding Second Highest Frequent Character in a text with Java

Initial Test Cases

Production Code

Map based solution

Sorting based solution

More Test Cases

4 comments:

Post Top Ad

Author Details

Socialize

Recent

Popular

Comments

Archive

Sponsor

Technology

Tags

Tags

Connect With us

Recent News

About Me

Contact Form

Contact