String comparison in Java

I have tons of files laying around, mostly of the manually named in a iffy. Naturally this results in some misspellings. If you want to use the file name as an input for further automated processing, like categorizing of images, this greatly diminishes the value of your collection. Therefore I looked into methods of adaptive correction. I choose an adaptive approach because I do not want to use a prefabricated dictionary, where certain words might be missing or in my case, the use of different languages makes this approach useless. Of course these techniques are not limited to file names but can also be applied to a text.

Building an adaptive dictionary

This first step is quite easy and is the basis for later operations. Parse all filename for words and make up a dictionary. The important part is that you keep track of the number of occurrences. The more often a word occurs the more probable it is, that the word is correct.The basic assumption is, that errors occur occasionally. For example sometimes I mistype ‚the‘ as ‚teh‘, but if I were to mistype it every time it could not be recognized.  In the end you will have a bunch of words that occurred only once, these might be misspelled or only be used once. There is no way to distinguish them, but these are the words you have to look at. Another group of misspelled words to look out for is the group with multiple occurrence which is very similar to another word with an even higher occurrence count. For example you mistyped ‚teh‘ four times but did it correctly 10 times.

String similarity

Investigating methods to find similar strings I stumbled on Ralph Allan Rice string-similarity library. He uses the Dice and Jaro-Winkler algorithm to compute the similarity. I checked them out and only found Jaro-Winkler of any use. This algo has some deficits, most notable its focus on matching beginnings of words: Fed is not very similar to Ned

Eliminating related words

This aspect I did not investigate further. The basic gist is that want to bring similar word together that are related like ’see‘ and ’sees‘ or ‚get‘ and ‚got‘. This would best be done with JAWS and WordNet.

Some further theoretical thoughts

These are things that I did not fully implement, the the theory seems sound, however there are probably some problems with suitable abort criteria for efficient computation. First some definitions:

  • Operation: An operation transforms string A into string B. Operations can be queued/concatenated (°). An operation is a simple action. Each operation has its complexity. Concatenated operations are not necessary commutative, meaning executing a°b on A does not result in the same string as b°a.
  • Complexity: The operation with the complexity 0 is the Identity. The complexity is a floating point value. The complexity is measured against the length of the string.
  • Minimal length: The only operation executable on a string of length 2 is a Transposition. Therefore the minimal length of a String is considered 2 characters.
  • Similarity of order: String A and B are similar if there exists a sequence of Operation that transforms String A into String B and there combined (added up) complexity is below the specified order.
  • Success: To measure if an operation leads to a more similar result, the similarity computed with one of the above algorythms must be greater.
  • Upper-Lower-Case: Most languages do not distinguish upper and lower case words. To my knowledge the is only a link between upper-lower case in the grammatical sense but not the semantical. Therefore this can be ignored as a no-issue.

Many of the operations will use common sub sequences to limit their range (already common sequences must not be transformed). Identifying the common causes for errors helps defining operators:

  • Transposition: swapping two characters in the string. The closer the characters the lesser the complexity.
  • Replacement: change a single character with another.
  • Misstyped: During spelling out the word the key besides the intended was hit. Different complexity for the direction of the two keys.
  • MissingCharacter: character is missing. Higher complexity within the word (when typing rapidly, it may happen that the first letter of the second string is the last letter of the first, this would be a transposition of the space and the letter, spaces however were removed during parsing)
  • Duplication: Duplication of an existing character
  • Reducing: Reduce double characters
  • Removal: Remove a character. Higher complexity within the word. Reverse of MissingCharacter.

The code is part of my utilities project, but not yet part of the latest released version (1.2.3)

Schreibe einen Kommentar