Dev Notes

Class interface

  • fit
    • X: features or distances matrix (each line is the features of a sample, or each column is the distance to the other samples)

    • Y: labels array (1D array where each element is the label of the sample)

    • learning: "default", "iterative", "agglomerative"

    • metric: "euclidian", "log_euclidian", "chi_square", "manhattan", "canberra", "squared_chord", "squared_chi_square", "bray_curtis"

    • precomputed_distance: True or False

    • Input Contract:
      • Number of lines of X must be equal to the number of elements of Y
      • Learning must be "default"
      • Metric must "euclidian"
      • If precomputed_distance is True, X must be square, and filled with the distances
      • If Y is not supplied, its the unsupervised training (no labels = clustering)
  • predict
    • X: features or distances matrix (each line is the features of a sample, or each column the distance to the other samples)

    • Input Contract:
      • X shape must match parameters of fit: if precomputed distance, the number of lines must match the number of nodes on the graph; if not, the number of columns must match the number of features
    • Output Contract:
      • Returns an array containing the labels for the number of samples on input X

Graph Representation

The graph is represented by two members: (1) a list of tuples V, where each tuple is a vertex and contains the indices of the adjacent vertices; and (2) a distance matrix f, that may be stored as a sparse matrix.

Example:

V =

[(1, 3),
 (0, 2),
 (1, 3),
 (0, 2)]

f = [[0, 10, 0, 5],
     [10, 0, 3, 0],
     [0,  3, 0, 4],
     [5,  0, 4, 0]]

This representation allows to efficiently store a complete graph, represented by a special case where V = [].

The generation of the list of edges and the associated value is simple:

from itertools import product
import numpy

# the number of vertices
nodes = f.shape[0]

# a list with all non repeating edges
edges = list(combinations(range(nodes), 2))

# an array of the edges
edges_ar = numpy.array(edges)
rows = edges_ar[:,0]
cols = edges_ar[:,1]

# extract the weight of each edge
weigths = f[rows, cols]

# a list of edges with its associated weight
graph = zip(edges, weigths)

# remove zeroed edges
graph = filter(lambda x: x[1] != 0, graph)

# now we can sort it based on the weigth
from operator import itemgetter
graph_sorted = sorted(graph, key=itemgetter(1))

For the example above, graph is:

[((0, 1), 10), ((0, 3), 5), ((1, 2), 3), ((2, 3), 4)]

And graph_sorted is:

[((1, 2), 3), ((2, 3), 4), ((0, 3), 5), ((0, 1), 10)]

MST Algorithms

  • Boruvka

    • Adequado para implementação paralela
    • Não usa estruturas complexas
  • Prim

    • O kernel é o mesmo do Dijkstra, mas a fila de prioridade contém as arestas ao invés dos vértices
    • A implementação mais eficiente utiliza um head de Fibonacci, que não é nativo do Python
  • Kruskal

    • Algoritmo mais simples, com estruturas de dados simples (arestas ordenadas e union-find)
    • Mesma complexidade do alg. de Prim
  • Chazelle

    • PDF
    • A complexidade deste algoritmo é O(m α(m,n)), que pra motivos práticos acaba sendo linear em m, que é o número de arestas do grafo
  • Paralelismo