Background
Data Structure and Algorithm
Data Structures: efficiently organise data in computer memory.
Algorithms: manipulate data structures to achieve a given goal.
In order for a program to terminate fast (or in time), it has to use appropriate data structures and efficient algorithms.
Outline
- various data structures
- basic algorithms
- understanding the strengths and weaknesses of those, in terms of their time and space complexities
Outcome: to be able to decide which algorithms and data structure to chose for a given task.
Preliminaries
Abstract data types (ADT)
An abstract data type is
- a type
- with associated operations
- whose representation is hidden to the user
Example: Integers are an abstract data type with operations +
, -
, *
, mod
, div
,…
- A type is a collection of values, e.g. integers, Boolean values (true and false).
- The operations on ADT might come with mathematically specified constraints, for example on the time complexity of the operations.
- Advantages of ADT’s as explained by Aho, Hopcroft and Ullman (1983):
- “At first, it may seem tedious writing procedures to govern all accesses to the underlying structures. However, if we discipline ourselves to writing programs in terms of the operations for manipulating abstract data types rather than making use of particular implementations details, then we can modify programs more readily by reimplementing the operations rather than searching all programs for places where we have made accesses to the underlying data structures. This flexibility can be particularly important in large software efforts, and the reader should not judge the concept by the necessarily tiny examples found in this book.”
Mathematical Preliminaries
Sets and Relations
The concept of a set in the mathematical sense has wide application in computer science. The notations and techniques of set theory are commonly used when de- scribing and implementing algorithms because the abstractions associated with sets often help to clarify and simplify algorithm design.
A set is a collection of distinguishable members or elements. The members are typically drawn from some larger population known as the base type. Each member of a set is either a primitive element of the base type or is a set itself. There is no concept of duplication in a set. Each value from the base type is either in the set or not in the set. For example, a set named $P$ might consist of the three integers 7, 11, and 42. In this case, $P$’s members are 7, 11, and 42, and the base type is integer.
Symbols commonly used to express sets and their relationships:
- $\{1, 4\}$
- A set composed of the members 1 and 4
- $\{x \mid x \: is \: a \: positive \: integer\}$
- A set definition using a set former.
- Example: the set of all positive integers
- $x \in P$
- x is a member of set P
- $x \notin P$
- x is not a member of set P
- $\emptyset$
- The null or empty set
- $\mid P \mid$
- Cardinality: size of set P or number of members for set P
- $P \subseteq Q, \: Q \supseteq P $
- Set P is included in set Q, set P is a subset of set Q, set Q is a superset of set P
- $P \cup Q$
- Set Union: all elements appearing in P OR Q
- $P \cap Q$
- Set Intersection: all elements appearing in P AND Q
- $P \: — Q$
- Set difference: all elements of set P NOT in set Q
A sequence is a collection of elements with an order, and which may contain duplicate-valued elements. A sequence is also sometimes called a tuple or a vector. In a sequence, there is a 0th element, a 1st element, 2nd element, and so on. I indicate a sequence by using angle brackets <> to enclose its elements. For example, <3, 4 4, 5,> is a sequence. Note that sequence <3, 4 5, 4,> is distinct from sequence <3, 4 4, 5,>, and both are distinct from sequence <3, 5 4,>.3,>3,>3,>3,>
A relation R over set S is a set of ordered pairs from S. As an example of a relation, if S is {a, b, c}, then
is a relation, and
is a different relation.
If tuple ⟨x, y⟩ is in relation R, we may use the infix notation xRy. We often use relations such as the less than operator (<) on the natural numbers, which includes ordered pairs such as ⟨1, 3⟩ and ⟨2, 23⟩, but not ⟨3, 2⟩ or ⟨2, 2⟩. Rather than writing the relationship in terms of ordered pairs, we typically use an infix notation for such relations, writing 1 < 3.
Define the properties of relations as follows, with R a binary relation over set S.
- R is reflexive if aRa for all $a \in S$.
- R is symmetric if whenever aRb, then bRa, for all $a, b \in S$.
- R is antisymmetric if whenever aRb and bRa, then a = b, for all $a, b \in S$.
- R is transitive if whenever aRb and bRc, then aRc, for all $a, b, c \in S$.
A binary relation is called a partial order if it is antisymmetric and transitive. The set on which the partial order is defined is called a partially ordered set or a poset. Elements x and y of a set are comparable under a given relation if either xRy or yRx. If every pair of distinct elements in a partial order are comparable, then the order is called a total order or linear order.
Logarithms
A logarithm of base $b$ for value $y$ is the power to which $b$ is raised to get $y$. Normally,this is written as $\log_b y = x$. Thus, if $\log_b y = x$. then $b^x = y$,and $b^{\log_b y} = y$.
Logarithms have the following properties, for any positive values of m, n, and r, and any positive integers a and b.
The first two properties state that the logarithm of two numbers multiplied (or divided) can be found by adding (or subtracting) the logarithms of the two numbers.
Summations and Recurrences
Most programs contain loop constructs. When analyzing running time costs for programs with loops, we need to add up the costs for each time the loop is executed. This is an example of a summation. Summations are simply the sum of costs for some function applied to a range of parameter values. Summations are typically written with the following “Sigma” notation:
This notation indicates that we are summing the value of f(i) over some range of (integer) values. The parameter to the expression and its initial value are indicated below the $\sum$ symbol. Here, the notation $i = 1$ indicates that the parameter is $i$ and that it begins with the value 1. At the top of the $\sum$ symbol is the expression $n$. This indicates the maximum value for the parameter $i$. Thus, this notation means to sum the values of $f (i)$ as $i$ ranges across the integers from 1 through $n$. This can also be written $f(a)+f(2)+ … + f(n)$. Within a sentence, Sigma notation is typeset as $\sum_{i = 1}^{n}f(i)$.
Given a summation, you often wish to replace it with an algebraic equation with the same value as the summation. This is known as a closed-form solution, and the process of replacing the summation with its closed-form solution is known as solving the summation. For example, the summation $\sum_{i = 1}^{n} 1$ is simply the expression “$1$” summed $n$ times (remember that $i$ ranges from $1$ to $n$). Because the sum of $n$ 1s is $n$, the closed-form solution is $n$. The following is a list of useful summations, along with their closed-form solutions.
The sum of reciprocals from 1 to n, called the Harmonic Series.
Arrays and Linked list
List
We all have an intuitive understanding of what we mean by a “list.” Our first step is to define precisely what is meant so that this intuitive understanding can eventually be converted into a concrete data structure and its operations. The most important concept related to lists is that of position. In other words, we perceive that there is a first element in the list, a second element, and so on. We should view a list as embodying the mathematical concepts of a sequence, as defined above.
We define a list to be a finite, ordered sequence of data items known as elements. “Ordered” in this definition means that each element has a position in the list. (We will not use “ordered” in this context to mean that the list elements are sorted by value.) Each list element has a data type. In the simple list implemen- tations discussed in this chapter, all elements of the list have the same data type, although there is no conceptual objection to lists whose elements have differing data types if the application requires it. The operations defined as part of the list ADT do not depend on the elemental data type. For example, the list ADT can be used for lists of integers, lists of characters, lists of payroll records, even lists of lists.
A list is said to be empty when it contains no elements. The number of ele- ments currently stored is called the length of the list. The beginning of the list is called the head, the end of the list is called the tail. There might or might not be some relationship between the value of an element and its position in the list. For example, sorted lists have their elements positioned in ascending order of value, while unsorted lists have no particular relationship between element values and positions. This section will consider only unsorted lists.
When presenting the contents of a list, we use the same notation as was in- troduced for sequences above. To be consistent with Java array indexing, the first position on the list is denoted as $0$. Thus, if there are $n$ elements in the list, they are given positions $0$ through $n − 1$ as $⟨a_0, a_1, …, a_{n−1}⟩$. The subscript indicates an element’s position within the list. Using this notation, the empty list would appear as ⟨⟩.
Before selecting a list implementation, a program designer should first consider what basic operations the implementation must support. Our common intuition about lists tells us that a list should be able to grow and shrink in size as we insert and remove elements. We should be able to insert and remove elements from any- where in the list. We should be able to gain access to any element’s value, either to read it or to change it. We must be able to create and clear (or reinitialize) lists. It is also convenient to access the next or previous element from the “current” one.
An example of a list of numbers
Ordered collection of elements
List is an ADT; list operations are:
- insert an entry (on a certain position)
- delete an entry
- access data by position
- search
- move around the list - concatenate two lists
- sort
- …
Different representations of lists
Depending on what operations are needed for our application, we choose from different data structures (some implement certain operations faster than the others):
- Arrays.
- Linked lists.
- Dynamic arrays.
- Unrolled linked lists.
A list can be iterated through as shown in the following code fragment.
1 | /** |
Array-Based List
Following code fragment shows the array-based list implementation:
1 | /** |
listArray
must be allocated at some fixed size, the size of the array must be known when the list object is created. Note that an optional parameter is declared for the AList
constructor. With this parameter, the user can indicate the maximum numberofelementspermittedinthelist. Ifnoparameterisgiven,thenittakesthe value defaultSize
, which is assumed to be a suitably defined constant value.
Because each list can have a differently sized array, each list must remember its maximum permitted size. Data member maxSize
serves this purpose. At any given time the list actually holds some number of elements that can be less than the maximum allowed by the array. This value is stored in listSize
. Data member curr
stores the current position. Because listArray
, maxSize
, listSize
, and curr
are all declared to be private
, they may only be accessed by methods of Class AList
.
Class AList
stores the list elements in the first listSize
contiguous array positions. Array positions correspond to list positions. In other words, the element at position i in the list is stored at array cell i. The head of the list is always at position 0. This makes random access to any element in the list quite easy. Given some position in the list, the value of the element in that position can be accessed directly. Thus, access to any element using the moveToPos
method followed by the getValue
method takes $\Theta(1)$ time.
Because the array-based list implementation is defined to store list elements in contiguous cells of the array, the insert
, append
, and remove
methods must maintain this property. Inserting or removing elements at the tail of the list is easy, so the append
operation takes $\Theta(1)$ time. But if we wish to insert an element at the head of the list, all elements currently in the list must shift one position toward the tail to make room, as illustrated below. This process takes $\Theta(n)$ time if there are $n$ elements already in the list. If we wish to insert at position $i$ within a list of $n$ elements, then $n − i$ elements must shift toward the tail. Removing an element from the head of the list is similar in that all remaining elements must shift toward the head by one position to fill in the gap. To remove the element at position $i$, $n − i − 1$ elements must shift toward the head. In the average case, insertion or removal requires moving half of the elements, which is $\Theta(n)$.
Most of the other member functions for Class AList
simply access the current list element or move the current position. Such operations all require $\Theta(n)$ time. Aside from insert
and remove
, the only other operations that might require more than constant time are the constructor, the destructor, and clear
. These three member functions each make use of the system free-store operation new
.
Linked list
The second traditional approach to implementing lists makes use of pointers and is usually called a linked list. The linked list uses dynamic memory allocation, that is, it allocates memory for new list elements as needed.
A linked list is made up of a series of objects, called the nodes of the list. Because a list node is a distinct object (as opposed to simply a cell in an array), it is good practice to make a separate list node class. An additional benefit to creating a list node class is that it can be reused by the linked implementations for the stack and queue data structures presented later in this chapter. The following code shows the implementation for list nodes, called the Link
class. Objects in the Link
class contain an element
field to store the element value, and a next
field to store a pointer to the next node on the list. The list built from such nodes is called a singly linked list, or a one-way list, because each list node has a single pointer to the next node on the list.
1 | /** |
The Link
class is quite simple. There are two forms for its constructor, one with an initial element value and one without. Member functions allow the link user to get or set the element
and link
fields.
The following figure shows a graphical depiction for a linked list storing four integers. The value stored in a pointer variable is indicated by an arrow “pointing” to some- thing. Java uses the special symbol null for a pointer value that points nowhere, such as for the last list node’s next field. A null pointer is indicated graphically by a diagonal slash through a pointer variable’s box. The vertical line between the nodes labeled 23 and 12 in figure(a) indicates the current position (immediately to the right of this line).
The list’s first node is accessed from a pointer named head
. To speed access to the end of the list, and to allow the append
method to be performed in constant time, a pointer named tail
is also kept to the last link of the list. The position of the current element is indicated by another pointer, named curr
. Finally, because there is no simple way to compute the length of the list simply from these three pointers, the list length must be stored explicitly, and updated by every operation that modifies the list size. The value cnt stores the length of the list.
The following code shows the definition for the linked list class, named LList
:
1 | /** |
Comparison of List Implementations
Operation | Array-based List | Linked List |
---|---|---|
access data by position | constant | linear |
search | linear | linear |
insert an entry at the beginning | linear | constant |
insert an entry at the end | linear | linear* |
insert an entry (on a certain position) | linear | linear |
delete first entry | linear | constant |
delete entry i | linear | linear |
concatenate two lists | linear | linear* |