Sparse Matrix Data Structures

There are about two popular known data structures for handling Sparse Matrices each with its own benefits. I also propose a third way that could be used.

Description

The two popular ways are -

Hash-based Storage Here the sparse matrix is stored as a hash-table with a map of key (rowcol index) , values (matrix element). Only non-zero elements are stored.

Compressed Row/Column Storage Here, the non-zero matrix elements are stored in terms of three arrays (values,col_index,row_ptr) for CRS. The row_ptr vector stores the locations in the val vector that starts a row.

The third possible format is -

Pre-Solve Graph Store the Matrix as a Graph (with the adjacency list implementation) but in a modified sense. Suppose there are m constraints (or rows) and n variables (or columns), then construct a undirected weight Bipartite Graph on m+n vertices with the partite sets - set of constraints, set of variables. So for example, if the ith row constraint was - 2x_1 + 3x_2 = 4 , then the vertex ci would be connected to vertex v1 by an edge of weight 2 and to vertex v2 by an edge of weight 3.

Why does this make sense? The redundancies in LP have simple impact on the data. RowColumn Singletons would mean a rowcolumn vertex that has a length one list. Note that we are constructing a undirected graph, so we can traverse elements in both dimensions symmetrically. Almost all the pre-solve routines involve finding or establishing row/column singletons. Thus this takes care of that nicely. But, we would also have to do double work in fixing variables. Modification of a non-zero element at A_{i,j} means changes to corresponding elements in ith row list and jth column list.

Comparison

Data Structures :

Task Hash-based CRSCSC Pre-solve Graph
Data for Matrix Creation Matrix dimensions Matrix dimensions + nnz Matrix Dimensions
Insertion O(1) O(1) with strict ordering O(1)
Access Expected O(1) O(log(nnz)) O(log(nnz)
Arbitray Modification Expected O(1) O(log(nnz)) O(log(nnz))
Sequential Modification Expected O(1) O(1) O(1)
Memory O(y(2textrm{sizeof(int)}+textrm{sizeof(float)})) with 1<y<1.5  O(textrm{sizeof(int)}+textrm{sizeof(float)}) O(2(textrm{sizeof(int)}+textrm{sizeof(float)})
Enumerate nnz O(textrm{TableSize}) with no ordering O(nnz) with ordering O(nnz)
Linear algebra Support (BLAS level 2,3) No Yes Yes (less efficient than CRS)
Has it been implemented in Julia? No Yes No
Future usability ? Yes Yes Probably Not
Reliable implementation in other language ? Yes Yes No
Efficient Conversion to CRSCSC? Yes - Yes

The Pre-solve Graph data structure captures more information about the optimization problem than the other two formats as it is tailor made. But it takes slightly more memory and might not be usable for other use-cases of sparse matrices. The hash-based format has the main advantage that it has expected O(1) time access to elements. “Expected” because in rare cases it can degrade to O(n) if the hashing is imbalanced. Using this, We can run our algorithms as if the underlying data structure were a matrix. The Compressed RowColumn formats while good for space and matrix multiplication and other linear algebra operations, The main fundamental requirement in our routines is the access of elements and traversal of them in both row & column order. Either CSRCSC would mean the other order of traversal is very slow and unless every routine goes sequentially in row/column order, we do not have O(1) access to the elements. Another point to be noted is that Hash-based Matrices do not require beforehand the amount of non-zeros in the Matrix. Thus it is easier to initialize.

Hash-based matrices have been implemented in reliable libraries in languages other than Julia and I recommend that it be implemented as a part of this project in the JuliaSparse repository. It can be implemented in the same structure and form as in - alglib or in colt

A lot of time has actually gone in deliberating over the pros and cons of the different data structures. What are the time complexities? Which operations are more frequent and should hence take precedence? How is this a certain routine typically implemented in Hash-based as compared to Pre-solve Graph ? What methods and functionality should the data structure have ?

In my next post would detail the implementations and requirements from the hash-based format. It shall be along the same lines as the two libraries I have quoted. I don't claim to be an expert but I think its implementation could have other uses apart from the theme of this GSOC project. Also more,importantly, it is as-good as the Pre-solve Graph in expected performance for almost every sub-routine. Why? Pre-solve Graph is basically an Adjacency list on a graph that is twice (roughly) the size of the conventional Graph associated with the matrix A, thus every operation is done twice. It is better than the Compressed format but worse than the hash-based format by a constant time. But this could make a difference. Also, the ability of O(1) arbitray access can not be understated.

Thus I propose, that we first take in the data from the user in Hash-based format and carry out our pre-solve routine and convert it to CSC(or CSR) format so that it can be used by the solvers.