LukasKorsikaDesignStudy

From CSSEMediaWiki

(Difference between revisions)

Revision as of 05:15, 29 July 2010

The Problem

The project I design in this study is an application to help me manage my file system. I tend to have a number of copies of the same file scattered throughout my various file system for reasons such as:

Some partitions are only accessible under Linux
I often copy videos to my laptop to watch away from my desk.

Requirements

Must take a list of files as input, and output identical files in groups (ie, cluster all identical files together, don't output unique files)
Must support a variety of approaches for determining equality -- based on raw data, or file-type specific comparisons.
Must use reasonable amounts of memory and I/O bandwidth.
Should be file-system agnostic (and support NFS, etc)
Should be extensible

Initial Design

(converted to Java from C, so some liberties have been taken with classes, but this is essentially its original form)

Design Description

As this was a program in C, there is essentially a God Class, with a few helper classes and methods thereupon. The helper classes are:

File -- This represents a file on the file system, and has methods to find its size, and its SHA-1 hash.
Tree -- This is a simple class representing a Tree. A tree is composed of a set of TreeNode, and stores a reference to the root. It also has a prune method, which uses a recursive algorithm to remove all TreeNodes with one or fewer (<= 1) files in it.
TreeNode -- A tree node represents a node in a binary tree, stores its key (which may be size or hash depending on the tree), and a list of all files which have that value. TreeNode has a number of recursive methods to iterate over the tree, get the list of files at that node, and insert a new file with a key recursively.

Most of the functionality occurs in functions which are placed in the God class to represent their being global functions. This god class handles the parsing of arguments, the reading of the file list from stdin, building groups of identical files, and outputting the result.

Grouping is performed by first building a grouping tree using the size as the key (it calls Tree.insert(myFile.getsize(), myFile);), then pruning the tree. It then iterates over the resulting tree (consisting of pairs of sizes and file lists) to find files that might be identical (as they have the same size).

For each of these groups it then constructs a new tree using the hash values as keys, prunes the tree, and outputs the files grouped by hash. Thus, the files are grouped by size and hash.

I realise that this is terrible design. This design study will iteratively improve the design, as well as creating a Java implementation of the program.

Criticisms

Uses a God class -- many sub-issues to do with this.
TreeNode deals with both maintaining collections of files, as well as implementing a binary tree. This violates the Single responsibility principle.
The prune method should be in TreeNode, to Keep related data and behavior in one place.
The God class shouldn't have to ask the File for its size/hash. It should instead tell the Tree to insert it using whatever hashing method that tree uses Tell don't ask.
The tree should know what it's grouping by rather than that information being implied by the variable storing the tree( eg Tree sizeTree) Keep related data and behaviour in one place.
The File class shouldn't calculate hashes Single responsibility principle
It should be possible to add new key types to group by without modifying existing classes. Open closed principle Beware type switches

@@ Line 21: / Line 21: @@
 The helper classes are:
 * File -- This represents a file on the file system, and has methods to find its size, and its SHA-1 hash.
-* Tree -- This is a simple class representing a Tree. A tree is composed of a set of TreeNode, and stores a reference to the root.
+* Tree -- This is a simple class representing a Tree. A tree is composed of a set of TreeNode, and stores a reference to the root. It also has a prune method, which uses a recursive algorithm to remove all TreeNodes with one or fewer (<= 1) files in it.
 * TreeNode -- A tree node represents a node in a binary tree, stores its key (which may be size or hash depending on the tree), and a list of all files which have that value. TreeNode has a number of recursive methods to iterate over the tree, get the list of files at that node, and insert a new file with a key recursively.
+Most of the functionality occurs in functions which are placed in the God class to represent their being global functions. This god class handles the parsing of arguments, the reading of the file list from stdin, building groups of identical files, and outputting the result.
+Grouping is performed by first building a grouping tree using the size as the key (it calls Tree.insert(myFile.getsize(), myFile);), then pruning the tree. It then iterates over the resulting tree (consisting of pairs of sizes and file lists) to find files that might be identical (as they have the same size).
+For each of these groups it then constructs a new tree using the hash values as keys, prunes the tree, and outputs the files grouped by hash. Thus, the files are grouped by size and hash.
 I realise that this is terrible design. This design study will iteratively improve the design, as well as creating a Java implementation of the program.
+== Criticisms ==
+* Uses a [[God class]] -- many sub-issues to do with this.
+* TreeNode deals with both maintaining collections of files, as well as implementing a binary tree. This violates the [[Single responsibility principle]].
+* The prune method should be in TreeNode, to [[Keep related data and behavior in one place]].
+* The [[God class]] shouldn't have to ask the File for its size/hash. It should instead tell the Tree to insert it using whatever hashing method that tree uses [[Tell don't ask]].
+* The tree should know what it's grouping by rather than that information being implied by the variable storing the tree( eg Tree sizeTree) [[Keep related data and behaviour in one place]].
+* The File class shouldn't calculate hashes [[Single responsibility principle]]
+* It should be possible to add new key types to group by without modifying existing classes. [[Open closed principle]] [[Beware type switches]]

LukasKorsikaDesignStudy

Revision as of 05:15, 29 July 2010

Contents

The Problem

Requirements

Initial Design

Design Description

Criticisms

Views

Personal tools

Navigation

Search

Toolbox