A Natural Language Parser to interactively view conceptual themes in text. You may view and download the code here.
The Stanford NLP package was used to parse a text file, in tandem with the GraphStream project, which visually represents the extracted data as a graph. The data was read from a text file and cast into a String via the readAllBytes function. The Stanford Natural Language Processing package was then used to annotate the text and identify every sentence and its triples. Each triple, consisting of the subject, predicate, and object of a sentence, was taken to represent the main idea of that sentence. Initially, the triples were stored in a simple ArrayList, but hundreds of duplicate subjects and predicates made parsing for information impossible. For example, the same subject would appear separately in hundreds of triples, as though it were a different subject each time. Each sentence regarding a common subject was treated as a distinct idea, making it impossible to gain any insight into the purpose of the Wikipedia article from the annotations alone. Due to the vastness and duplication of the extracted data, the data was instead represented in a more visual manner, and grouping the ideas surrounding a single subject together significantly reduced redundancies.
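A minimal, self-contained sketch of this first step is shown below. The CoreNLP pipeline call itself is replaced by hard-coded stand-in triples, since running the extractor requires the Stanford models; the class and method names here are illustrative, not the project's actual code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TripleDemo {
    // Minimal model of a (subject, predicate, object) triple; the real
    // program obtains these from Stanford CoreNLP's extraction output.
    record Triple(String subject, String predicate, String object) {}

    // Format a triple for printing, one idea per line.
    static String format(Triple t) {
        return t.subject() + " | " + t.predicate() + " | " + t.object();
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny sample text file so the sketch is self-contained.
        Path file = Files.createTempFile("sample", ".txt");
        Files.write(file, "Federer won Wimbledon. Federer plays tennis."
                .getBytes(StandardCharsets.UTF_8));

        // Read the whole file into a String via readAllBytes, as described above.
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);

        // Stand-ins for the triples CoreNLP would extract from `text`;
        // note the duplicated subject "Federer" across sentences.
        List<Triple> triples = List.of(
                new Triple("Federer", "won", "Wimbledon"),
                new Triple("Federer", "plays", "tennis"));

        System.out.println(text);
        triples.forEach(t -> System.out.println(format(t)));
    }
}
```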
The duplication removal was implemented with three inner classes representing the three main ideas in a sentence: the Subject, the Predicate (referred to as the relation in the Stanford NLP documentation), and the Object. The Subject class consisted of a String containing the extracted subject along with a HashMap of the predicates associated with that subject. The Predicate class likewise consisted of a String containing the extracted predicate along with a HashMap of the objects associated with that predicate and subject, whereas the Object class merely held the retrieved object string. The main class contained a HashMap of Subjects, so the overall data structure was a HashMap of HashMaps of HashMaps. When each triple was read, the program checked whether the subject was already present in the HashMap; if so, the associated Subject was retrieved, otherwise a new Subject was created. Likewise, the program checked whether the predicate was present within the Subject’s HashMap of Predicates, retrieving or adding it as needed, and then checked whether the associated Object was present within the Predicate’s HashMap of Objects before deciding whether to add it. The constant-time retrieval of HashMaps made sorting through massive amounts of data fast while eliminating duplication.
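A minimal sketch of that nested structure might look as follows; the class and field names mirror the description above but are illustrative rather than the project's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class TripleStore {
    // Mirrors the inner classes described above: a Subject owns a map of
    // Predicates, and each Predicate owns a map of object strings.
    static class Subject {
        final String text;
        final Map<String, Predicate> predicates = new HashMap<>();
        Subject(String text) { this.text = text; }
    }
    static class Predicate {
        final String text;
        final Map<String, String> objects = new HashMap<>();
        Predicate(String text) { this.text = text; }
    }

    // Top-level map: the HashMap of HashMaps of HashMaps described above.
    private final Map<String, Subject> subjects = new HashMap<>();

    // Retrieve-or-create at each level, exactly the check described above,
    // so duplicate subjects and predicates collapse into one entry.
    public void add(String subj, String pred, String obj) {
        Subject s = subjects.computeIfAbsent(subj, Subject::new);
        Predicate p = s.predicates.computeIfAbsent(pred, Predicate::new);
        p.objects.putIfAbsent(obj, obj);
    }

    public int subjectCount() { return subjects.size(); }

    public int predicateCount(String subj) {
        Subject s = subjects.get(subj);
        return s == null ? 0 : s.predicates.size();
    }
}
```

With this structure, a subject such as "Federer" that appears in hundreds of triples produces a single Subject entry whose HashMap accumulates all of its predicates.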
GraphStream’s existing open-source library was used to present a visual representation of the ideas contained within the text file. Each subject and object was represented as a node, connected by its predicate, which was represented as an edge. If an object from one sentence appeared as the subject of another, the graph added a new branch from the existing node, so that sentences and ideas could be read by traversing a branch of the graph. The importance of each node and link was determined by the number of HashMap entries within the given object, and a color spectrum was used to visually represent the relative importance of the objects. The node spectrum ranged from black to red, with black representing an unimportant subject or object with few links and red representing the most important subjects with many links and central ideas. Likewise, the predicates ranged from black to bright purple, with black representing unimportant links to obscure objects and bright purple representing heavily used predicates spanning multiple ideas and sentences. The initial view of the graph after parsing through excerpts of Roger Federer’s Wikipedia page appears as follows:
While a sizeable portion of the graph appears to contain individual nodes with few references (implied by the black color of the nodes and their sparse connections), the darker clusters in the center appear denser, with plenty of connections. By centering the GUI’s view on the dense area (moving around the graph is done with the arrow keys), we may zoom in on the dense mass (zooming in is done by pressing the 1 key, and zooming out by pressing the 2 key). The following is a series of closer views:
The final image shows the red subject “Federer” connected by bright purple predicates as the central idea of the Wikipedia entry. Roger Federer is indeed the subject of the entry, so we may deduce that the color scheme works, to an extent, in representing the relative importance of particular subjects and their predicates. Additional features of the graph include the ability to drag nodes and their associated links, revealing the connections between a node and its surroundings. In conclusion, the graph proved to be an interesting and efficient representation of the collected data due to its reduced duplication and its ability to highlight the important parts of the text.
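As a rough illustration of the node spectrum described above, the fill color can be computed by linearly interpolating the red channel against a node's link count. The helper below is an illustrative assumption rather than the project's actual code, though the returned string follows GraphStream's CSS-like `ui.style` syntax:

```java
public class NodeColors {
    // Map a node's link count onto the black-to-red spectrum:
    // 0 links -> rgb(0,0,0) (black), maxLinks -> rgb(255,0,0) (pure red).
    public static String nodeStyle(int links, int maxLinks) {
        int red = maxLinks == 0 ? 0 : Math.min(255, (255 * links) / maxLinks);
        return "fill-color: rgb(" + red + ",0,0);";
    }

    public static void main(String[] args) {
        System.out.println(nodeStyle(0, 10));  // black for an obscure node
        System.out.println(nodeStyle(10, 10)); // red for a central node
    }
}
```

In GraphStream, such a style could be attached with `node.setAttribute("ui.style", NodeColors.nodeStyle(links, maxLinks))`, and the black-to-bright-purple edge spectrum could be produced the same way by scaling the red and blue channels together.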