Statistical Textual Document Analysis
Our current research is enabling the analysis of large sets of documents. We start with a declarative, high-level specification of a hierarchical Bayesian model. We then generate optimized parallel C code by exploiting model structure at code generation time.
To achieve this, we must:
- Parallelize the EM algorithm for the Latent Dirichlet Allocation (LDA) model
- Account for the input being a sparse word-document matrix
- Keep in mind that "word" and "document" can be interpreted loosely
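As a minimal sketch of the sparse word-document input, the C fragment below stores counts in compressed sparse row (CSR) form, one row per document. The type SparseDocWord, its field names, and the function doc_lengths are hypothetical names chosen for illustration; they are not taken from the project's generated code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR layout: one row per document, one stored entry
   for each distinct word that occurs in that document. */
typedef struct {
    size_t n_docs;          /* number of documents (rows) */
    size_t n_words;         /* vocabulary size (columns) */
    const size_t *row_ptr;  /* length n_docs + 1; row d spans [row_ptr[d], row_ptr[d+1]) */
    const size_t *word_id;  /* column index of each stored entry */
    const double *count;    /* occurrence count of each stored entry */
} SparseDocWord;

/* Write the total token count of each document into out[0..n_docs-1]. */
void doc_lengths(const SparseDocWord *m, double *out)
{
    assert(m != NULL && out != NULL);
    /* Each iteration reads only its own row and writes only out[d],
       so the outer loop has no cross-document dependencies. With
       OpenMP it could be annotated with "#pragma omp parallel for". */
    for (size_t d = 0; d < m->n_docs; ++d) {
        double total = 0.0;
        for (size_t k = m->row_ptr[d]; k < m->row_ptr[d + 1]; ++k) {
            total += m->count[k];
        }
        out[d] = total;
    }
}
```

Only nonzero counts are stored, so the work is proportional to the number of distinct (document, word) pairs rather than to the full matrix size. Per-document loops of this shape are one natural place to look for parallelism, though an actual EM implementation also has to combine per-word statistics across documents.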
We hope to generalize to related hierarchical models for a wide variety of data types. We take a compiler-oriented approach: the model structure is assumed to be known statically at code generation time, so the generator can specialize the parallel C code it emits. The generated C code can easily be cross-compiled.
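To illustrate what statically known structure can mean in practice, here is a small sketch in which the number of topics is a compile-time constant. The constant NUM_TOPICS and the function normalize_topics are hypothetical; the assumption is that a generator would emit such a constant from the model specification, and this is not output from the actual system.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: assumed to be emitted by the code generator from the
   model specification, so the C compiler sees a fixed trip count. */
enum { NUM_TOPICS = 4 };

/* Normalize one document's topic weights in place so they sum to 1. */
void normalize_topics(double w[NUM_TOPICS])
{
    assert(w != NULL);
    double total = 0.0;
    for (int k = 0; k < NUM_TOPICS; ++k) {
        total += w[k];
    }
    if (total <= 0.0) {
        return;  /* leave an all-zero vector untouched */
    }
    for (int k = 0; k < NUM_TOPICS; ++k) {
        w[k] /= total;
    }
}
```

Because the loop bounds are constants, a C compiler is free to unroll or vectorize these loops, which is harder when sizes are only known at run time.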
