LDAModel

class pyspark.mllib.clustering.LDAModel(java_model: py4j.java_gateway.JavaObject)

A clustering model derived from the LDA method.

Latent Dirichlet Allocation (LDA) is a topic model designed for text documents.

Terminology:

“word” = “term”: an element of the vocabulary
“token”: an instance of a term appearing in a document
“topic”: a multinomial distribution over words representing some concept
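The term/token distinction can be made concrete with a small sketch in plain Python. The vocabulary and the tokenized document below are made up for illustration, and `count_vector` is a hypothetical helper, not part of the PySpark API. The resulting list of counts is the kind of per-document term-count vector that could be wrapped with `Vectors.dense` and paired with a document ID before training.

```python
from collections import Counter

# Hypothetical vocabulary: each entry is a term, identified by its index.
vocabulary = ["spark", "topic", "model"]


def count_vector(tokens):
    """Turn a document's tokens into per-term counts, ordered by term index.

    Tokens that are not in the vocabulary are ignored.
    """
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


# Four tokens, but only two distinct terms.
doc_tokens = ["spark", "topic", "spark", "spark"]
doc_vector = count_vector(doc_tokens)
print(doc_vector)  # [3.0, 1.0, 0.0]
```

Here the document contains four tokens, while the vocabulary holds three terms, one of which never appears in the document.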

New in version 1.5.0.

Notes

See the original LDA paper (journal version) [1]

[1] Blei, D. et al. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (2003): 993-1022. https://www.jmlr.org/papers/v3/blei03a

Examples

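>>> from numpy import array
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> from pyspark.mllib.clustering import LDA, LDAModel
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> # sc is assumed to be an existing SparkContext
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)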
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]

>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5,  0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)

>>> import os, tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass

Methods

`call`(name, *a)	Call a method of java_model.
`describeTopics`([maxTermsPerTopic])	Return the topics described by weighted terms.
`load`(sc, path)	Load the LDAModel from disk.
`save`(sc, path)	Save this model to the given path.
`topicsMatrix`()	Inferred topics, where each topic is represented by a distribution over terms.
`vocabSize`()	Vocabulary size (number of terms in the vocabulary).

Methods Documentation

call(name: str, *a: Any) → Any

Call a method of java_model.

describeTopics(maxTermsPerTopic: Optional[int] = None) → List[Tuple[List[int], List[float]]]

Return the topics described by weighted terms.

New in version 1.6.0.

Warning

If vocabSize and k are large, this can return a large object!

Parameters

maxTermsPerTopic (int, optional): Maximum number of terms to collect for each topic. Default: vocabulary size.

Returns

list: Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic’s terms are sorted in order of decreasing weight.
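As a minimal sketch of how this return value might be consumed, the snippet below pairs each term index with a word from a vocabulary list. The `described` value is hand-written sample data in the documented shape (not real model output), and both `label_topics` and `vocabulary` are hypothetical names introduced for illustration.

```python
from typing import List, Tuple


def label_topics(
    described: List[Tuple[List[int], List[float]]],
    vocabulary: List[str],
) -> List[List[Tuple[str, float]]]:
    """Replace term indices with vocabulary words, keeping the weight order."""
    return [
        [(vocabulary[index], weight) for index, weight in zip(indices, weights)]
        for indices, weights in described
    ]


# Sample data shaped like the result of describeTopics():
# one (term indices, term weights) pair per topic, weights in decreasing order.
described = [([1, 0], [0.6, 0.4]), ([0, 1], [0.7, 0.3])]
vocabulary = ["spark", "topic"]  # hypothetical: index i holds the word for term i

labeled = label_topics(described, vocabulary)
for topic_id, terms in enumerate(labeled):
    print(topic_id, terms)
```

With a trained model, `described` would instead come from `model.describeTopics()`, and the vocabulary would be whatever index-to-word mapping was used when the term-count vectors were built.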

classmethod load(sc: pyspark.context.SparkContext, path: str) → pyspark.mllib.clustering.LDAModel

Load the LDAModel from disk.

New in version 1.5.0.

Parameters

sc (pyspark.SparkContext)
path (str): Path to where the model is stored.

save(sc: pyspark.context.SparkContext, path: str) → None

Save this model to the given path.

New in version 1.3.0.

topicsMatrix() → numpy.ndarray

Inferred topics, where each topic is represented by a distribution over terms.

New in version 1.5.0.
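The following sketch shows one way the returned array might be inspected with NumPy. It assumes the matrix has one row per term and one column per topic, and that the entries are not necessarily normalized, so each column is rescaled to sum to one first. The `topics` array is made-up data standing in for `model.topicsMatrix()`.

```python
import numpy as np

# Made-up stand-in for model.topicsMatrix():
# 3 terms (rows) x 2 topics (columns), assumed layout.
topics = np.array([
    [2.0, 5.0],
    [7.0, 1.0],
    [1.0, 4.0],
])

vocab_size, k = topics.shape

# Rescale each column so that every topic sums to 1 over the terms.
normalized = topics / topics.sum(axis=0, keepdims=True)

# Index of the highest-weight term in each topic.
top_term_per_topic = normalized.argmax(axis=0)
print(vocab_size, k, top_term_per_topic.tolist())  # 3 2 [1, 0]
```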

vocabSize() → int

Vocabulary size (number of terms in the vocabulary).

New in version 1.5.0.
