Tree Edit Distance Enlive Version

In my last post, I presented a code-dump that computed a restricted version of the tree edit distance algorithm. I was able to achieve a decent speed-up using enlive. Here’s a code-dump:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
(ns tree-edit-distance.enlive-core
  "A version of RTDM that operates on a tree produced by
   enlive."
  (:require [clj-http.client :as client]
            [net.cgrand.enlive-html :as html]))

(defn init
  "Perform the correct initialization"
  [m n c1 c2 del-cost ins-cost]
  (let [M (make-array Integer/TYPE (inc m) (inc n))]
    (do
      (doseq [i (range (inc m))
              j (range (inc n))]
        (aset M i j (int
                     (+ (* del-cost c1)
                        (* ins-cost c2)))))
      M)))

(defn load-tree
  "Fetches a link's content and builds an enlive-tree with it"
  [a-link]
  (->> a-link
       client/get
       :body
       java.io.StringReader.
       html/html-resource
       (filter (fn [x] (:tag x))) ; pick out the tree and not the docstring
       first))

(defn tree-children
  [a-tree]
  (->> a-tree :content (filter map?)))

(defn num-children
  [a-tree]
  (-> a-tree tree-children count))

(defn tree-descendants
  [a-tree]
  (if (-> a-tree tree-children seq)
    (+ (num-children a-tree)
       (apply + (map tree-descendants (tree-children a-tree))))
    0))

(declare tree-edit-distance)

(defn invert-cost
  [tree1 tree2 del-cost ins-cost sub-cost]
  (let [t1-desc (tree-descendants tree1)
        t2-desc (tree-descendants tree2)]
    (- (+ (* del-cost t1-desc)
          (* ins-cost t2-desc))
       (tree-edit-distance tree1 tree2 del-cost ins-cost sub-cost))))

(defn tree-edit-distance
  [tree1 tree2 del-cost ins-cost sub-cost]
  (let [m (num-children tree1)
        n (num-children tree2)

        t1-children (tree-children tree1)
        t2-children (tree-children tree2)

        t1-desc (tree-descendants tree1)
        t2-desc (tree-descendants tree2)

        M (init m n t1-desc t2-desc del-cost ins-cost)]

    (doseq [i (range m)
            j (range n)] 
      (let [c-i (nth t1-children i)
            c-j (nth t2-children j)

            c-i-desc (tree-descendants c-i)
            c-j-desc (tree-descendants c-j)

            del (aget M i (inc j))
            ins (aget M (inc i) j)

            sub-i (- (aget M i j)
                     del-cost
                     ins-cost)

            sub (if (= c-i c-j)
                  (- sub-i
                     (* ins-cost c-j-desc)
                     (* del-cost c-i-desc))
                  (cond
                   (or (not (-> c-i :content (filter map?)))
                       (not (-> c-j :content (filter map?))))
                   (+ sub-i sub-cost)

                   (or (= (-> c-i :tag) (-> c-j :tag)))
                   (- sub-i (invert-cost c-i c-j del-cost ins-cost sub-cost))

                   :else
                   (+ sub-i sub-cost)))]
        (aset M (inc i) (inc j) (int (min del ins sub)))))
    (aget M m n)))

(defn tree-edit-distance-link
  "load trees and return their edit distance"
  [link1 link2]
  (let [t1 (load-tree link1)
        t2 (load-tree link2)]
    (- 1
       (/ (tree-edit-distance t1 t2 1 1 1)
          (+ (tree-descendants t1)
             (tree-descendants t2))))))

This is now a convenient implementation that you can use. The full source code is in this GitHub repo.


Per Intellectum, Vis
(c) Shriphani Palakodety 2013-2016