CPSC 225, Spring 2019 -- Lab 8

CPSC 225, Spring 2019
Lab 8: Books and Sets

We have been looking at collection classes in the Java Collection Framework. This lab asks you to use JFC sets and arraylists. You will create and manipulate large sets of words. Each set will consist of all the words from a classic book, and the sets will be used to investigate and compare the vocabularies used in the books.

You will write this week's program from scratch; no starting point is given. However, if you plan to work on your own computer, you will need to copy the necessary book files to your computer.

This is not a GUI program, and you will be creating static methods exclusively.

Your program is due, as usual, at the next lab, which will be on March 26 (two weeks from this lab, because of Spring break). You should copy your program into a directory named lab8 inside your homework folder in /classes/cs225/homework. There is no need to submit copies of the book files!

Words from the Classics

The complete text of eight books (seven classic novels and one play) can be found in the folder /classes/cs225/books. If you are working on a lab computer, then it is not necessary to copy these files into your own account. Your program can access them directly from that folder. Here are a couple of global variables that you can use to store the full paths to the files:

private static final String DIRECTORY = "/classes/cs225/books/";
private static final String[] bookFiles = {
        DIRECTORY + "20000-leagues.txt",
                  // 20,000 Leagues Under the Sea (Jules Verne)
        DIRECTORY + "alice-in-wonderland.txt",
                  // Alice in Wonderland (Lewis Carroll)
        DIRECTORY + "bleak-house.txt",
                  // Bleak House (Charles Dickens)
        DIRECTORY + "hamlet.txt",
                  // Hamlet (William Shakespeare)
        DIRECTORY + "huckleberry-finn.txt",
                  // Huckleberry Finn (Mark Twain)
        DIRECTORY + "pride-and-prejudice.txt",
                  // Pride and Prejudice (Jane Austin)
        DIRECTORY + "something-new.txt",
                  // Something New (P.G. Wodehouse)
        DIRECTORY + "time-machine.txt"
                  // The Time Machine (H.G. Wells)
};

If you are working on your own computer, you will need to copy the book files onto your computer, and you will need to change the value of "DIRECTORY" in the above code. You can copy the book files into an Eclipse project, if you want. You should make sure that they are inside the project directory but not inside the src directory or any other subdirectory; in that case, you can set the value of DIRECTORY to be an empty string, or delete the references to DIRECTORY entirely.

In your program, you will need to read all the words from each file and store them in a set, which can be of type HashSet<String>. (TreeSet<String> would also work, but hashsets are OK since you will not need to traverse the set in alphabetical order.) We define a "word" to be a sequence of letters, except that a word can also include apostrophes as long as they occur between two letters, as in "can't" or "Fred's". It can be tricky to extract the words from a file, but it is possible to do it with a Scanner. You can create a Scanner for reading from a file, using code something like this:

Scanner filein = null;
try {
    filein = new Scanner( new File( filePath ) );
}
catch ( Exception e ) {
    System.out.println("Error: could not read from file " + filePath);
    System.out.println("Cannot continue with this program.");
    System.exit(1);
}

In this code segment, filePath is a String containing the name of a file or the full path to a file, such as one of the elements in the bookFiles array given above. You will need to import class File from package java.io. Attempting to open the file can generate an Exception if the file does not exist, or if you don't have the necessary permission to read it.

Usually, a Scanner skips white space (spaces, tabs, end end-of-lines) between tokens. However, you can tell a scanner to skip other stuff instead. Here, we would like the scanners to skip all non-letter characters, except for apostrophes that occur between two letters. Setting a Scanner to do this requires a fancy "regular expression," and you won't encounter regular expressions until CPC 229, if ever. Here is a line that will set the Scanner filein to skip everything except words:

filein.useDelimiter("('*[^a-zA-Z']'*|''+|^'|'$)+");

You should copy-and-paste this line into your program. I don't advise trying to type it in correctly. With this "delimiter," filein.next() will return words from the file, skipping everything else. (Creating and debugging the regular expression took me longer than writing the rest of the program!)

The Program

As a first step, you should write a a method that can read the words from a file, and store them in a HashSet<String>. The method can take the file path as its parameter, and its return value should be the set that it creates, containing all of the words from the file. Note that all words should be converted to lower case before adding them to the set (or to upper case, if you prefer). However, before returning, the method should print out the following information about the book that it has read:

The total number of words read from the file, counting duplicates (You don't need the set to do this; you just neet to count words as you read them.)
The number of different words found in the file
The number of words that occurred only once in the file
The percentage of words that occurred only once as a percentage of the number of different words.

Here is a sample output from my file-input method, with what I hope are the correct values for The Time Machine:

From the file: /classes/cs225/books/time-machine.txt:
    Total number of words: 32,653
    Number of different words: 4,607
    Number of words that occur only once: 2,415 (52.4%)

(If you are wondering how I got the commas in numbers such as 32,653, you can do that with System.out.printf using the output format %,d with a comma between the percent sign and the d.)

In order to find the number of words that occurred only once, you will need to make a second HashSet as you read the file. This second set should contain all the words that occur more that once in the in the file. That is, when you read a word, if that word is already in the first set, then you should add it to the second set instead. At the end, you can find the number of words that occur only once by subtracting the size of the second set from the size of the first set. You will not need to keep the second set. Only the first set should be returned by the method.

The main() routine of your program can call the above method for each book. The rest of the program will work with the sets that are returned from the method. It will be convenient to add the sets returned by the method to a variable of type

ArrayList<HashSet<String>>

That is, the elements of the list are themselves sets. (For stupid technical reasons in Java, you can't make an array of HashSet<String>, but you can put hashsets into an ArrayList.) Recall that for an ArrayList, list, you can get the i^th element from the list by calling list.get(i). And remember that the length of the list is given by list.size().

Once you have the eight sets in an array list, the rest of the assignment is to process the sets in various ways, using the methods in the HashSet API. You can do most of the required processing using the set methods addAll(), retainAll() and removeAll(). We have seen that these methods can be used to compute union, intersection, and set difference of sets. For some of the computations, you will have to traverse a set using a for-each loop. You should label all output clearly and arrange it attractively. You should not simply do everything in the main() routine! You should use well-designed subroutines to perform various tasks. Furthermore, as a special requirement for this lab, your program should use no global variables, aside from the DIRECTORY and bookFiles variables given above. Your methods should communicate using parameters and return values.

Here are your tasks, some expressed as questions for which you should compute an answer:

How many different words, in total, are used in all eight books combined? That is, how many words are there that occur in at least one of the eight books?
How many words are common to all eight books? That is, how many words are there that occur in every one of the eight books?
Among the words that are common to all eight books, what is the longest word? (There is only one word in the intersection that has the maximum length.)
Among the words that occur in at least one of the books, what is the longest word, and which book does it occur in? (There is only one word in the union that has the maximum length.)
For each book, determine how many words from that book are not used in any of the other seven books.

Remember that the object is to do things efficiently, using set operations! The code that you need to perform the necessary computations will not be very long, even though the computations are complex.

You can do other things for some extra credit. It would nice if you think of something yourself, but here are some ideas. (Don't try to do everything!)

Make a table of the number of words of each length in one of the sets.
Compute the average length of the words in each book.
Put the unsorted_words.txt from Lab 1 in a HashSet, and use it to find out how many words from each book are, or are not, in that list.
Since the files are very different lengths, but all greater than 26,000, repeat some of the analysis using only 26,000 words from each file (which might make the comparisons more meaningful).
Print out the set of words that appear in all eight books, in alphabetical order, in, say, six neat columns.