pattern_clustering.boost.pattern_clustering_with_preprocess

pattern_clustering_with_preprocess(lines: list, map_name_dfa: Optional[dict] = None, densities: Optional[list] = None, max_dist: float = 0.6, use_async: bool = True, make_mg: Optional[callable] = None) → list

Computes the pattern clustering of the input lines, first grouping together the lines whose PatternAutomaton instances match.

As a consequence, lines with matching PatternAutomaton instances always fall in the same cluster, which accelerates the computation. This may occasionally produce surprising clusters, especially when unrelated lines happen to conform to the same PatternAutomaton.
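The grouping step described above can be illustrated with a minimal sketch. It is not the library's implementation: `signature()` is a hypothetical stand-in for a PatternAutomaton (it only collapses digit runs and letter runs into `N` and `W`), and `naive_cluster()` is a placeholder for the actual clustering routine. The sketch only shows why lines sharing a signature necessarily receive the same cluster identifier, and why the clustering itself runs on fewer items.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a PatternAutomaton: digit runs become "N",
# letter runs become "W", every other character is kept as-is.
TOKEN = re.compile(r"\d+|[A-Za-z]+")

def signature(line):
    return TOKEN.sub(lambda m: "N" if m.group()[0].isdigit() else "W", line)

def cluster_with_preprocess(lines, cluster_unique):
    # Group the line indices by signature.
    groups = defaultdict(list)
    for index, line in enumerate(lines):
        groups[signature(line)].append(index)

    # Cluster only the distinct signatures (fewer items than lines).
    unique = list(groups)
    unique_clusters = cluster_unique(unique)

    # Propagate each signature's cluster identifier back to its lines.
    result = [0] * len(lines)
    for sig, cluster_id in zip(unique, unique_clusters):
        for index in groups[sig]:
            result[index] = cluster_id
    return result

def naive_cluster(items):
    # Placeholder clustering: every distinct item gets its own cluster.
    return list(range(len(items)))

lines = ["error 42 at 10:30", "10.0.0.1 up", "warn 7 at 11:45"]
clusters = cluster_with_preprocess(lines, naive_cluster)
```

In this toy run, the first and third lines share a signature and therefore end up in the same cluster even though one is an error and the other a warning, which is the kind of surprising grouping mentioned above.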

Parameters
  • lines – A list(str) gathering the input lines.

  • map_name_dfa – A dict{str : Automaton} mapping each pattern name to the corresponding Automaton.

  • densities – A density vector. See make_densities().

  • max_dist – The maximum distance between an element of a cluster and the cluster representative. As distances are normalized, this value should be between 0.0 and 1.0.

  • use_async – Pass True to run computations using async calls. This accelerates computations.

  • make_mg – A MultiGrepFunctor instance.

Returns

A list(int) mapping each line index to its corresponding cluster identifier.
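Since the result is indexed like the input lines, it can be inverted to list the lines of each cluster. The sketch below assumes nothing about the library beyond that return shape; `lines_by_cluster()` is a hypothetical helper and the `clusters` values are illustrative, not actual output.

```python
from collections import defaultdict

def lines_by_cluster(lines, clusters):
    # clusters[i] is the cluster identifier assigned to lines[i].
    grouped = defaultdict(list)
    for line, cluster_id in zip(lines, clusters):
        grouped[cluster_id].append(line)
    return dict(grouped)

lines = ["a 1", "b 2", "10.0.0.1"]
clusters = [0, 0, 1]  # illustrative values, not real output
grouped = lines_by_cluster(lines, clusters)
```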