Want to use metis 5.0 with Matlab? Try the new metismex!

The underlying C api for Metis changed considerably in the new version 5.0 that was released within the last year.  I had previously updated the metismex interface to work with an alpha release of metis 5.0 and posted it on github.  However, the final release had a ton of new changes to the API.

So if you need to use metis to partition graphs, methods, or to compute nested dissection orders with it, go checkout my github repo for metis mex.  It’s still a little rough on how to use it, but I hope those who are motived can figure it out with the help of the readme.

Posted in Uncategorized | Leave a comment

Video of my Ph.D. Defense on Random Alpha PageRank

I’ve been meaning to post this video for a few years.  Here was my Ph.D. Thesis defense.  

http://www.youtube.com/watch?v=2CSaDB-sLR4

The PageRank model helps evaluate the relative importance of nodes
in a large graph, such as the graph of links on the world
wide web. An important piece of the PageRank model is the teleportation
parameter alpha. We explore the interaction between alpha and
PageRank through the lens of sensitivity analysis. Writing the PageRank vector as a function of alpha allows us to take a derivative, which is a simple sensitivity measure.
As an alternative approach, we apply techniques from the field of uncertainty quantification. Regarding alpha as a random variable produces a new PageRank
model in which each PageRank value is a random variable. We
explore the standard deviation of these variables to get another
measure of PageRank sensitivity. One interpretation of this new
model shows that it corrects a small mistake in the original
PageRank formulation.

Both of the above techniques require solving multiple PageRank
problems, and thus a robust PageRank solver is needed. We
discuss an inner-outer iteration for this purpose. The method is
low-memory, simple to implement, and has excellent performance
for a range of teleportation parameters.

We show empirical results with these techniques on graphs with
over 2 billion edges.

Hope someone finds it useful!  It’s about PageRank and our particular modification we developed.

Slides: Models and Algorithms for PageRank Sensitivity slides

Thesis: Models and Algorithms for PageRank Sensitivity 

Posted in Uncategorized | Leave a comment

Making an equation \small er: A common latex problem

Here’s a problem that I run into … not infrequently … an equation is just a little bit too big and you’d like to shrink it.  The obvious thing to try is:

Here's the line before the equation.  I'm making it long to 
illustrate a problem with this approach. Ideally, it should be a
few lines long.
\[ \footnotesize f(x) = g(x) \]

… and that doesn’t work.  The equation doesn’t actually become smaller.

Getting a bit more aggressive, you can try:

Here's the line before the equation.  I'm making it long to 
illustrate a problem with this approach. Ideally, it should be a
few lines long.
{\footnotesize \[ f(x) = g(x) \] }

… which works, but the text above the paragraph gets “squeezed” in an unnatural way.  This happens because \footnotesize anywhere in the current paragraph adjusts the baseline for the entire paragraph.  That is not what we want.

Here’s what you really want:

Here's the line before the equation.  I'm making it long to 
illustrate a problem with this approach. Ideally, it should be a
few lines long.\\
\parbox[t]{\linewidth}{\footnotesize \[ f(x) = g(x) \]}}

That’ll keep the formatting of the previous paragraph.  I still think there is a better solution, so I’m going to post this as a question on the tex stack exchange to see what people come up with.

Posted in Uncategorized | Leave a comment

compiling mex files with mingw64

Compiling mex files on 64-windows has always been a huge pain.  It used to require having a real copy of Visual Studio.  A year or so ago, there was a platform sdk that finally included a 64-bit compiler.

Now, the mingw64 project is sufficiently advanced to let us compile mex files.  This is all nicely packaged up in the cygwin repository too.

Don’t worry, the mingw compilers in the cygwin repository generate fully redistributable native windows code without any cygwin dependencies (a small caveat: I’m not sure about whether or not you need a libstdc++.dll file for g++ codes… comments?) 

  1. Get setup.exe from cygwin
  2. Search for mingw and mark all the mingw modules for installation
  3. Search for mingw64 and mark all the mingw64 modules for installation (yes, this is way more than strictly necessary, but its the easiest way to go)
  4. Install those modules and all dependencies to the default location: C:\cygwin
Now, in matlab, run
>> edit(fullfile(prefdir,’mexopts.bat’))
Create the file if it doesn’t exist, and paste the contents of the following gist into it.
That should let you compile and run 64-bit mex files
I’m still working on a way of building Fortran mexfiles.  You can use the gfortran compiler in windows, but I haven’t gotten things working yet.
If anyone that is knowledgeable about the following areas, I’d love to hear from you:
  • Using mingw64 compilers with gnumex.  Is this still necessary?  The above seems to work pretty nicely!
  • Why doesn’t -static-libstdc++ work under the cygwin version of x86_64-w64-mingw32-gcc-4.5.3 ?
  • How the heck can we get a single mexopts.bat file that’ll work with fortran codes too?
  • Can we get matlab to include a default config file for this compiler setup and support it?

Special thanks to Oren Livne and the dynare team for helping to figure this stuff out!  (The above mexopts.bat file is basically what they had with a few small changes that Oren and I discovered.)

Posted in Uncategorized | Leave a comment

bisquik – fast random graph sampling

Generating a random instance of a graph with a prescribed degree sequence is a surprisingly tricky problem.  Such random objects are necessary to assess the statistical significance of various network analysis measures.  They can be used to answer the question: is property X present in many graphs with this degree distribution, or just the graph I’m looking at?

The Erdős-Gallai conditions provide a check of whether a connected graph with a degree sequence exists and also construct such a graph.  But these graphs are not random.

The standard algorithm for generating a random instance is a Markov Chain Monte Carlo procedure that starts with a graph with the prescribed degree sequence — possibly the output of the Erdős-Gallai consruction — and then performs a series of edge swaps.  That is, we pick two edges in the graph randomly, and swap their endpoints.  To be concrete, let (u,v) and (r,s) be the two edges.  The swap replaces (u,v) with $(u,r)$ and (v,s) if the resulting graph is a simple graph (no loops).  If we let this procedure run “long enough” then it will produce a random graph with the prescribed degree distribution.

The weakness of the algorithm is that “long enough” isn’t always well understood.  Most of the theory for related algorithms requires something like O(n^7) steps for a random sample, where n is the number of vertices. In practice, many people often run it for as little as O(|E|) steps.

Bayati, Kim, and Saberi set out to devise a better approach, and they produced one:

Bayati, Kim, and Saberi.  A sequential algorithm for sampling random graph.  Algorithmica 58(4):860-910, 2010. doi:10.1007/s00453-009-9340-1

This algorithm has a slightly complicated implementation.  I’ve provided my implementation online in a github repository:

https://github.com/dgleich/bisquik

The name bisquik is a play on Bayati/Kim/Saberi.  But it’s also a play on Bisquick the baking ingredient.  The idea is that bisquik is the ingredient you need to make graph analysis good.    I’m currently working on a python wrapper for this algorithm to make it useful for packages like igraph and networkx as well.

So give it a try and let me know if you run into any problems.

I’ve tested it under Linux, OSX, and Cygwin.  (The makefile requires a small edit on OSX, remove the -std=c++0x flag from the CFLAGS.)

I’m currently working on a quick paper on some of the implementation decisions.

 

Posted in Uncategorized | Leave a comment

FlowImprove in MatlabBGL: Fast partition improvement!

One of the reasons that I originally wrote MatlabBGL was to be able to quickly implement and test algorithms that utilize graph algorithms as a subroutine.  A great example of this is the FlowImprove routine from Andersen and Lang.  (Needless disclosure, I know both of the authors and have an extremely high level of respect for their work.)

The authors’ own descriptions of the methods are a 2008 paper in SODA and a talk at Stanford’s MMDS conference:

At a high-level, this routine takes a partition of a graph into two pieces, and returns a new partition of the graph with an improved quotient cut score.  As a reminder (or an introduction!) the quotient cut score of a set is

\text{QCut}(S) = \frac{\text{cut}(S)}{\min(w(S),w(\bar{S}))}

where \text{cut}(S) measures the number of edges leaving the set S, and w(S), w(\bar{S}) is the sum of weights of vertices in S and “not S” (\bar{S}), respectively.  A common choice for the weight of a vertex is the uniform weight, or the degree of the vertex.  In the second case, the QCut score corresponds to the Normalized Cut objective — another frequently used partitioning score.

The algorithm works by solving a sequence of max-flow/min-cut problems.  At each iteration, the weights of the edges changes.  The algorithm works towards improving the partition by adjusting the weights.  They show that the algorithm will do at most n^2 iterations for a graph with n vertices, but usually find that only a few iterations are required.  It’s quite likely there is some improved theory that could explain why it works so fast in practice.

This makes it a perfect algorithm to implement in MatlabBGL, it really only requires 5 or 6 max-flow calls!  The only real difficulty in the implementation, which is entirely analogous to the pseudocode in the paper, is rounding the floating point weights in the flow procedure to integer weights in order to use the push-relabel algorithm in MatlabBGL (and which is what the authors use in the paper).   Because of this change, we introduce the “scale” in the code below to do this rounding.

The full implementation

This implementation is implemented as a custom code in MatlabBGL: https://github.com/dgleich/matlab-bgl/blob/master/custom/flow_improve.m

function [S,q] = flow_improve(G,A,p,maxiter)
% FLOW_IMPROVE Improve an existing clustering using max-flow
%   G must be a symmetric graph
%   A is an indicator over the vertices of G
%   p a non-negative integer vertex weight vector (default ones(n,1))
%   maxiter the maximum number of times to run (default 5)

%% History
%  2008-10-20: Added partition list input.

n = size(G,1);

if ~exist('maxiter','var') || isempty(maxiter), maxiter=5; end
if ~exist('p','var'), p='cond'; end

if ~islogical(A), lA = false(n,1); lA(A)=1; A=lA; end % convert A to logical

% compute p
if isempty(p), p = ones(n,1); end
if ischar(p) && strfind(p,'cond'), p = full(sum(G,2)); end

[ei ej ev] = find(G);
L = diag(sum(G,2))-G; % compute the laplacian
from_s = find(A); pA = p(A);
to_t = find(~A); pB = p(~A);

infval = 2^29;

piA = sum(pA);
piB = sum(pB);

iter= 1;
S= A;
% compute alpha = Q(A)
v= zeros(n,1);
v(S) = 1;
v(~S) = -1;
cv = v'*(L*v)/4;
bal = min(piA,piB);
alpha = cv/min(piA,piB);
q = alpha;
fprintf('Initial cut : cut= %7i bal= %7i quot= %8.5f\n', ...
    cv, bal, q);
q_old = q; S_old = S;
while iter<=maxiter
    % compute G_A(alpha)
    scale = ceil(1/alpha);
    from_s_v = alpha*pA*scale;
    to_t_v = alpha*pB*piA/piB*scale;
    GA = sparse([ei; (n+1)*ones(length(from_s),1); to_t], ...
                [ej; from_s; (n+2)*ones(length(to_t),1)], ...
                [ev*scale; ceil(from_s_v); ceil(to_t_v)], n+2, n+2);

    [flowval,cut] = push_relabel_max_flow(GA,n+1,n+2);
    S = cut(1:n)==1;
    DAofS = sum(p(S&A)) - sum(p(S&~A))*piA/piB;
    if DAofS > 0,
        v= zeros(n,1);
        v(S) = 1;
        v(~S) = -1;
        cv = v'*(L*v)/4;
        QAofS = cv/DAofS;
    else
        QAofS = infval;
    end
    bal = min(sum(p(S)), sum(p(~S)));
    q = cv/bal;
    fprintf('   iter %3i : cut= %7i bal= %7i q= %8.5f quot(A)= %8.5f DA= %7f\n', ...
        iter, cv, bal, q, QAofS, DAofS);
    if QAofS>=alpha, S = S_old; q = q_old; break; end
    alpha = QAofS;
    q_old = q; S_old = S;
end

Performance

How well does it do?  Well, we can reproduce some of the examples from the paper:

Here’s a vertical bisection of a US road network with 126,000 vertices

The score of this cut is 148/57507 (quotient score = 0.00257).  In 5 seconds, the flow_improve routine yields a new partition which has a cut score of 42/62244 (quotient score = 0.00067).  This new cut tracks the Mississippi river and the Indiana border.  This was an example that Kevin Lang showed me years ago, and it’s fun to be able to easily reproduce it.

So get MatlabBGL and flow_improve_example.m a try to reproduce this yourself.  (Note, you’ll need Tim Davis’ UFget.m function)

Posted in Uncategorized | Leave a comment

Updated Erdos-Renyi demo

All that work last week on picking a directory went towards updating my Erdos-Renyi random graph demo.  The results of that are now online:

http://www.cs.purdue.edu/homes/dgleich/demos/erdos_renyi/

And the repository at github:

https://github.com/dgleich/erdosrenyi-demo

If you find it useful, let me know!

Posted in Uncategorized | Leave a comment

Finding a directory for research demos

I’ve had a few online Matlab demos over my career as a graduate student. One that’s rather fun is to see the evolution of the giant component in an Erdos-Reyni random graph as the connection probability is increased beyond a phase-transition point.

After a few frantic emails from a course TA trying to get the demo to work for a class demonstration, I figured I should probably update this demo. I also hope to do more of these, and so I thought I’d try and puzzle out a way of doing them efficiently.

When I began doing this, the first question hit me:

Where should I put the files?

That is, this demo will involve

  • text for an online post/web-page
  • Matlab code or some other source code
  • figure files that are produced by the Matlab code.

So where should these files go?

Given where I currently store files, there are three logical places.

  • publications directory This directory is where I store all articles that I’m working on along with all of the supporting figures. However, usually the codes for these are stored elsewhere (in research/publications). This directory is hosted in Dropbox so I can always edit these files.
  • website directoryThis directory holds all the information for my website. Since
    the demos will go on the website, wouldn’t this be a logical place?
    This file is also on Dropbox.
  • research directoryThis directory is where I put small little research codes, as well
    as all the code for ongoing research projects. The separation between
    the two is a bit loose. For instance, a small research code will
    go into research/2011/09-05-new-test if I did a little work on
    something called new-test starting on 2011-09-05. If these
    small experiments become something more concrete, they get moved
    into a new git repository under research/publications. For space reasons, this directory is not on Dropbox. I also have a Dropbox/research directory for really small codes that I work on while not at my main computers.

After writing this description, the solution became obvious:
store these in Dropbox/research/demos This directory will
hold all of the code and documentation for the website.

Posted in Uncategorized | Leave a comment

Matrix analysis inspired computer names

I always have trouble coming up with computer names.  Here are some I’ve been thinking about recently:

Concepts

  • resolvent
  • recurrent good for an office?
  • transient good for a ec2 vm?
  • unitary
  • singular
  • deficient good for an old machine?

People

  • kyfan
  • schatten
  • lowner
  • gerschgorin
  • hadamard
These all have a matrix analysis theme.
Posted in Computers | 1 Comment

KDD2011

I returned from KDD2011 a few days ago.  Here are some of the talks I found interesting, and some quick points about them.

  1. Learning Relationships Defined by Linear Combinations of Constrained Random Walks, William Cohen, Carnegie Melon University. Invited talk at Mining and Learning with Graphs workshop.

    My interest The main piece of the talk was characterizing which are the most important types of paths in a large network of linked data.  Think of taking a database and linking all the entities via there relationships.  Which paths through relationships are the best?  The authors address this question using a PageRank heuristic.

  2. Sparsification of influence networks.  Michael Mathioudakis, Francesco Bonchi, Carlos Castillo, Aristides Gionis, Antti Ukkonen.  My interest Most links in social networks do not correspond to the true influence paths.  These authors study the problem of sparsifing an input graph with known information propagation paths such that it only has \( k \) links.
  3. Model Order Selection for Boolean Matrix Factorization.  Pauli Miettinen, Jilles Vreeken

    My interest Rather than factorizing a set of relational data — stated as a matrix — over the real numbers.  These authors study the problem over the booleans.  They use a minimum description length principle to decide the “rank” of the factorization.

  4. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.  Rainer Gemulla*, Peter Haas, Erik Nijkamp, Yannis Sismanis

    My interest These authors develop a block-cyclic stochastic gradient iteration to factor a matrix.  It looks cute.  I’d love to see it compared against some of the large scale distributed optimization routines developed at the national labs (e.g. Sandia/Argonne)

  5. Collective Graph Identification.  Galileo Namata, Stanley Kok, Lise Getoor

    My interest Given a graph, are the nodes and edges of the graph the real nodes and edges of the underlying problem you purport to study?  Most often not, these authors argue.  See their paper for a first step on how to “identify” the true graph from a set of nodes and edges.  This involves: entity resolution, link prediction, and node labeling.  I’m not sure if these approaches will scale, but it’s interesting work.

Posted in Uncategorized | Leave a comment