U.S. Department of Energy
Center for Adaptive Supercomputing - Multithreaded Architectures

Statistical Textual Document Analysis

Our current research is enabling the analysis of large sets of documents. We start with a declarative, high-level specification of a hierarchical Bayesian model. We then generate optimized parallel C code by exploiting model structure at code generation time.

To achieve this, we must:

  • Parallelize the EM (expectation-maximization) algorithm for the Latent Dirichlet Allocation (LDA) model.
  • Account for the input being a sparse word-document matrix.
  • Remember that "word" and "document" can be interpreted loosely.
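
As a rough illustration of the points above, the sketch below stores the word-document matrix in a compressed sparse row layout and runs a simplified E-step over it, one document at a time. It is only a sketch under stated assumptions: the `SparseDocs` struct, the `e_step` function, and the fixed `NUM_TOPICS` are hypothetical names chosen for illustration, and the update is a PLSA-style stand-in rather than the variational update used for Latent Dirichlet Allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CSR-style layout: one row per document, storing only
   the words that actually occur in that document. */
typedef struct {
    size_t num_docs;
    size_t num_words;
    const size_t *row_start; /* num_docs + 1 offsets into word_id/count */
    const size_t *word_id;   /* word index of each nonzero entry */
    const double *count;     /* occurrences of that word in the document */
} SparseDocs;

enum { NUM_TOPICS = 2 }; /* assumed fixed for this sketch */

/* Simplified E-step (PLSA-style stand-in, not the variational LDA update).
   theta:    num_docs x NUM_TOPICS, per-document topic weights
   beta:     NUM_TOPICS x num_words, per-topic word weights
   expected: num_docs x NUM_TOPICS, output expected topic counts */
void e_step(const SparseDocs *m, const double *theta,
            const double *beta, double *expected)
{
    assert(m != NULL && theta != NULL && beta != NULL && expected != NULL);

    /* Each document writes only its own output row, so the outer loop
       can run in parallel when OpenMP is enabled. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long d = 0; d < (long)m->num_docs; d++) {
        const double *th = theta + (size_t)d * NUM_TOPICS;
        double *out = expected + (size_t)d * NUM_TOPICS;

        for (int k = 0; k < NUM_TOPICS; k++)
            out[k] = 0.0;

        /* Visit only the nonzero entries of this document. */
        for (size_t i = m->row_start[d]; i < m->row_start[d + 1]; i++) {
            size_t w = m->word_id[i];
            double r[NUM_TOPICS];
            double norm = 0.0;

            for (int k = 0; k < NUM_TOPICS; k++) {
                r[k] = th[k] * beta[(size_t)k * m->num_words + w];
                norm += r[k];
            }
            if (norm <= 0.0)
                continue; /* no topic explains this word; skip it */

            /* Split the word count across topics by responsibility. */
            for (int k = 0; k < NUM_TOPICS; k++)
                out[k] += m->count[i] * r[k] / norm;
        }
    }
}
```

Because rows are generic, a "document" here could be any container of discrete items and a "word" any item type. The work per document is proportional to its number of nonzero entries, not to the vocabulary size.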

We hope to generalize this approach to related hierarchical models for a wide variety of data types. We take a compiler-oriented approach: the code generator assumes that the model structure is known statically at code generation time and uses it to turn the high-level model specification into parallel C code. The generated C code can easily be cross-compiled.
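
To make the idea of exploiting statically known structure concrete, here is a toy generator that emits a C routine with the topic count baked in as a literal loop bound. This is not the project's code generator: `ModelSpec`, `emit_normalize`, and the emitted function name are hypothetical, and a real specification would describe far more than a name and a topic count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model description: only what this toy generator needs. */
typedef struct {
    const char *name; /* prefix for the generated function */
    int num_topics;   /* known at code generation time */
} ModelSpec;

/* Write C source for a normalization routine specialized to the model.
   Returns 0 on success, -1 if the buffer is too small or formatting fails. */
int emit_normalize(const ModelSpec *spec, char *buf, size_t buf_size)
{
    assert(spec != NULL && buf != NULL);

    int n = snprintf(buf, buf_size,
        "void %s_normalize(double *v)\n"
        "{\n"
        "    double sum = 0.0;\n"
        "    for (int k = 0; k < %d; k++) sum += v[k];\n"
        "    if (sum > 0.0)\n"
        "        for (int k = 0; k < %d; k++) v[k] /= sum;\n"
        "}\n",
        spec->name, spec->num_topics, spec->num_topics);

    return (n >= 0 && (size_t)n < buf_size) ? 0 : -1;
}
```

Since the loop bounds in the emitted source are constants, the downstream C compiler is free to unroll or vectorize them. The emitted text is plain C, so it can be handed to whichever compiler targets the machine of interest.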
