Advanced Analytics with Spark: Chapter 11, Section 2: Loading Data with Thunder
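More generally, the two core data types in Thunder, Series and Images, both inherit from the Data class, which wraps a Python RDD object and exposes part of the RDD API. The Data class models an RDD of key-value pairs, where the key is some kind of semantic identifier (for example, a tuple of coordinates in space) and the value is a NumPy array of actual data. For an Images object, the key could be a time point, and the value is the image at that time point formatted as a NumPy array. For a Series object, the key might be an n-dimensional tuple with the coordinates of the corresponding voxel, and the value is a one-dimensional NumPy array holding the time series of measurements at that voxel. All arrays in a Series must have the same dimensions. The following summarizes some useful parts of the object API.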
[mw_shl_code=python,true]class Data:
    property dtype:
        # The dtype of the numpy array in this RDD's value slot
    # lots of RDD methods, like first(), count(), cache(), etc.
    # methods for aggregating across arrays, like mean(),
    # variance(), etc., that keep the dtype constant

class Series(Data):
    property dims:
        # lazily computes Dimension object with information about
        # the spatial dimensions encoded in the keys of this RDD
    property index:
        # a set of indices into each array, in the style of a
        # Pandas Series object
    # lots of methods to process all of the 1D arrays in parallel
    # across the cluster, like normalize(), detrend(), select(),
    # and apply(), that keep the dtype constant
    # methods for parallel aggregations, like seriesMax(),
    # seriesStdev(), etc., that change the dtype
    def pack():
        # collects the data at the client and repacks from the
        # sparse representation in the RDD to a dense
        # representation as a NumPy array with shape corresponding
        # to dims

class Images(Data):
    property dims:
        # the Dimension object corresponding to the NumPy shape
        # parameter of each value array
    property nimages:
        # number of images in RDD; lazily executes an RDD count
        # operation
    # multiple methods for aggregating across images or processing
    # them in parallel, like maxProjection(), subsample(),
    # subtract(), and apply()
    def toSeries():
        # reorganize data as a Series object[/mw_shl_code]
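The key-value model and the idea behind pack() can be illustrated without Spark. The following is only a minimal NumPy sketch: `pack_series` is a hypothetical helper, not Thunder's implementation, and placing the time axis last is a choice made for this example rather than a statement about Thunder's layout.

```python
import numpy as np

def pack_series(records, dims):
    """Repack (coords, 1-D array) records into a dense array.

    Hypothetical helper for illustration; the result has shape dims + (T,),
    where T is the length of each series.
    """
    records = list(records)
    n_time = len(records[0][1])
    dense = np.zeros(tuple(dims) + (n_time,), dtype=records[0][1].dtype)
    for coords, values in records:
        # coords is a tuple such as (x, y); it indexes the spatial axes
        dense[coords] = values
    return dense

# A toy 2x2 "volume" with 3 time points per voxel
records = [((x, y), np.array([x, y, x + y], dtype=float))
           for x in range(2) for y in range(2)]
dense = pack_series(records, dims=(2, 2))
```

Each record carries its own coordinates, so the order in which records arrive does not matter; this is what lets the sparse, distributed representation be reassembled into a dense array at the client.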
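First we load the data. We see images with the same dimensions as before, but with 240 time points instead of 20. To get the best clustering, we must normalize our features.
Let's plot a few series to see what they look like. Thunder lets us take a random subset of the RDD and keep only the elements that meet a given criterion, by default a minimum standard deviation. To choose a good value for the threshold, we first compute the standard deviation of each series and plot a histogram of a 10% sample of the values.
With this in mind, we choose a threshold of 0.1 to look at the most "active" series.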
[mw_shl_code=python,true]plt.plot(normalizedRDD.subset(50, thresh=0.1, stat='std').T)[/mw_shl_code]
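The threshold selection described above can be sketched in plain NumPy. This is an illustration on synthetic stand-in data, not Thunder code: the array `series`, the random scales, and the bin count are all assumptions made for the example.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in data: 1000 series with 240 time points each,
# with per-series noise scales between 0.01 and 0.3
scales = rng.uniform(0.01, 0.3, size=(1000, 1))
series = rng.randn(1000, 240) * scales

# Standard deviation of each series
stds = series.std(axis=1)

# Histogram of a 10% sample of the standard deviations
sample_idx = rng.choice(len(stds), size=len(stds) // 10, replace=False)
counts, edges = np.histogram(stds[sample_idx], bins=20)

# Keep only the "active" series, then draw up to 50 of them at random
thresh = 0.1
active = series[stds > thresh]
pick = rng.choice(len(active), size=min(50, len(active)), replace=False)
subset = active[pick]
```

Inspecting `counts` and `edges` (or plotting them) shows where the bulk of the standard deviations lies, which is what motivates a particular cutoff.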
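Now that we have a feel for the data, let's finally cluster the voxels into various patterns of behavior. Thunder implements a scikit-learn-style API for working with RDDs. In some cases Thunder contains its own implementations (for example, the matrix factorization code). In this case, Thunder's KMeans abstraction calls the MLlib Python API. We will run k-means for multiple values of k.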
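[mw_shl_code=python,true]from thunder import KMeans

ks = [5, 10, 15, 20, 30, 50, 100, 200]
models = []
for k in ks:
    # The loop body is missing from the source; presumably one model is
    # fit per value of k, along these lines (an assumption):
    models.append(KMeans(k=k).fit(normalizedRDD))[/mw_shl_code]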
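[mw_shl_code=python,true]def model_error_1(model):
    def series_error(series):
        # Euclidean distance from a series to its assigned cluster center
        cluster_id = model.predict(series)
        center = model.centers[cluster_id]
        diff = center - series
        return diff.dot(diff) ** 0.5

    # The remainder of this expression is missing from the source; it
    # presumably applies series_error to every series and sums the results.
    return (normalizedRDD

def model_error_2(model):
    # Inverse of the summed similarity of the series to their clusters
    return 1. / model.similarity(normalizedRDD).sum()[/mw_shl_code]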
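The first error metric, the sum of Euclidean distances from each series to its cluster center, can be checked on a small example in plain NumPy. This is a sketch under the assumption that each point is assigned to its nearest center; `total_distance_error` is a hypothetical name and not part of Thunder or MLlib.

```python
import numpy as np

def total_distance_error(data, centers):
    """Sum over all points of the Euclidean distance to the nearest center.

    data has shape (n, d) and centers has shape (k, d).
    """
    # Pairwise distances between every point and every center, shape (n, k)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    # Distance from each point to its nearest center, summed over points
    return dists.min(axis=1).sum()

data = np.array([[0., 0.], [0., 2.], [10., 0.]])
centers = np.array([[0., 1.], [10., 0.]])
err = total_distance_error(data, centers)
```

Here the first two points each lie at distance 1 from the first center and the third point coincides with the second center, so the total is 2. Adding centers can only keep this quantity the same or lower it, which is why the metric is compared across several values of k rather than minimized outright.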
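[mw_shl_code=python,true]import numpy as np
import matplotlib.pyplot as plt

errors_1 = np.asarray([model_error_1(m) for m in models])
errors_2 = np.asarray([model_error_2(m) for m in models])
# Normalize each curve by its sum so both metrics share one axis
plt.plot(
    ks, errors_1 / errors_1.sum(), 'k-o',
    ks, errors_2 / errors_2.sum(), 'b:v')[/mw_shl_code]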
[mw_shl_code=python,true]model20 = models[3]  # ks[3] == 20
plt.plot(model20.centers.T)[/mw_shl_code]
It is also easy to plot the image itself, with each voxel colored according to its assigned cluster.
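[mw_shl_code=python,true]import seaborn as sns
from matplotlib.colors import ListedColormap

# Predict a cluster for every voxel and repack the result as a dense array
by_cluster = model20.predict(normalizedRDD).pack()
# Note: cmap_cat is defined here but not used below; the image is drawn in gray
cmap_cat = ListedColormap(sns.color_palette("hls", 10), name='from_list')
plt.imshow(by_cluster[:, :, 0], interpolation='nearest',
           aspect='equal', cmap='gray')[/mw_shl_code]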
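Coloring voxels by cluster amounts to a table lookup from cluster id to color. The following is a minimal NumPy sketch of that idea, independent of matplotlib's colormap machinery; the 4-color `palette` and the tiny `labels` image are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical RGB palette with 4 categorical colors
palette = np.array([[1.0, 0.0, 0.0],   # red
                    [0.0, 1.0, 0.0],   # green
                    [0.0, 0.0, 1.0],   # blue
                    [1.0, 1.0, 0.0]])  # yellow

# A 2x3 "image" of cluster ids, one per voxel
labels = np.array([[0, 1, 2],
                   [3, 4, 5]])

# Look up one RGB triple per voxel; ids beyond the palette wrap around
rgb = palette[labels % len(palette)]
```

The result is an array of shape (2, 3, 3) that an image viewer can display directly. Note that when there are more clusters than palette entries, distinct clusters end up sharing a color, so the palette size should be chosen with the number of clusters in mind.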