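This post was last edited by feilong on 2018-6-15 09:38
Guiding questions
1. How can the output of SVD be verified?
2. What does V represent in SVD?
3. How is the document set obtained?
Previous post: Advanced Analytics with Spark, Chapter 6, Section 6: Singular Value Decomposition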
http://www.aboutyun.com/forum.php?mod=viewthread&tid=24634&extra=
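SVD outputs a pile of numbers. How can we inspect them to confirm that they relate to anything useful? The V matrix represents each concept through the terms that are important to it. As described earlier, V has one column per concept and one row per term. The value at each position can be interpreted as the relevance of that term to that concept. This means the most relevant terms for each of the top concepts can be found with something like the following: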
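[mw_shl_code=scala,true]import scala.collection.mutable.ArrayBuffer
// numConcepts, numTerms, and termIds (term index -> term) are assumed
// to be defined already, along with the svd result.
val v = svd.V
val topTerms = new ArrayBuffer[Seq[(String, Double)]]()
// toArray flattens V in column-major order: one concept column after another.
val arr = v.toArray
for (i <- 0 until numConcepts) {
  val offs = i * v.numRows  // start of column i
  val termWeights = arr.slice(offs, offs + v.numRows).zipWithIndex
  val sorted = termWeights.sortBy(-_._1)  // largest weight first
  topTerms += sorted.take(numTerms).map{
    case (score, id) => (termIds(id), score)
  }
}
topTerms[/mw_shl_code]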
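To make the column-major indexing concrete, here is a minimal sketch that runs without Spark. It assumes, as the code above does, that the flat array holds one concept column after another. The toy weights, the `termIds` map, and the names `TopTermsSketch` and `topTermsPerConcept` are invented for illustration only.

```scala
object TopTermsSketch {
  // Toy stand-in for svd.V: 3 terms (rows) x 2 concepts (columns),
  // flattened in column-major order.
  val numRows = 3
  val numConcepts = 2
  val arr: Array[Double] = Array(
    0.9, 0.1, 0.4,  // concept 0: weights for terms 0, 1, 2
    0.2, 0.8, 0.5)  // concept 1: weights for terms 0, 1, 2

  // Hypothetical term dictionary (term index -> term).
  val termIds: Map[Int, String] = Map(0 -> "album", 1 -> "river", 2 -> "cover")

  def topTermsPerConcept(numTerms: Int): Seq[Seq[(String, Double)]] =
    (0 until numConcepts).map { i =>
      val offs = i * numRows                  // start of column i
      arr.slice(offs, offs + numRows)
        .zipWithIndex                         // (weight, termIndex)
        .sortBy(-_._1)                        // largest weight first
        .take(numTerms)
        .map { case (score, id) => (termIds(id), score) }
        .toList
    }
}
```

Calling `TopTermsSketch.topTermsPerConcept(2)` returns, for each concept, its two highest-weighted terms paired with their weights.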
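Note that V is a matrix held in memory in the driver, so this computation runs in a non-distributed fashion. The documents most relevant to each of the top concepts can be found in a similar way using U, but the code looks a little different because U is stored as a distributed matrix.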
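[mw_shl_code=scala,true]import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

def topDocsInTopConcepts(
    svd: SingularValueDecomposition[RowMatrix, Matrix],
    numConcepts: Int, numDocs: Int, docIds: Map[Long, String])
  : Seq[Seq[(String, Double)]] = {
  val u = svd.U
  val topDocs = new ArrayBuffer[Seq[(String, Double)]]()
  for (i <- 0 until numConcepts) {
    // Weight of every document in concept i, paired with the row's unique ID.
    // docIds is assumed to use the same zipWithUniqueId numbering.
    val docWeights = u.rows.map(_.toArray(i)).zipWithUniqueId
    // top() is a Spark action, so this launches one job per concept.
    topDocs += docWeights.top(numDocs).map{
      case (score, id) => (docIds(id), score)
    }
  }
  topDocs
}[/mw_shl_code]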
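In U each row is a document and entry i of a row is that document's weight in concept i, so the lookup is row-oriented rather than column-oriented. The sketch below mimics this locally, under stated assumptions: a plain `Seq[Array[Double]]` stands in for `u.rows`, `zipWithIndex` stands in for `zipWithUniqueId`, and the weights, the `docIds` map, and the names `TopDocsSketch` and `topDocsPerConcept` are made up for illustration.

```scala
object TopDocsSketch {
  // Local stand-in for u.rows: one array per document, one entry per concept.
  val rows: Seq[Array[Double]] = Seq(
    Array(0.7, 0.1),  // document 0
    Array(0.2, 0.9),  // document 1
    Array(0.5, 0.3))  // document 2

  // Hypothetical document ID -> title map.
  val docIds: Map[Long, String] =
    Map(0L -> "Doc A", 1L -> "Doc B", 2L -> "Doc C")

  def topDocsPerConcept(numConcepts: Int, numDocs: Int): Seq[Seq[(String, Double)]] =
    (0 until numConcepts).map { i =>
      rows.map(row => row(i))   // weight of every document in concept i
        .zipWithIndex           // (weight, documentIndex)
        .sortBy(-_._1)          // largest weight first
        .take(numDocs)
        .map { case (score, id) => (docIds(id.toLong), score) }
        .toList
    }
}
```

On a real RDD, `top(numDocs)` plays the role of the sort-and-take step here: the pairs are `(score, id)` tuples, and tuples are ordered by their first element first, so the highest-scoring documents are the ones returned.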
Let's inspect the first few concepts:
[mw_shl_code=scala,true]val topConceptTerms = topTermsInTopConcepts(svd, 4, 10, termIds)
val topConceptDocs = topDocsInTopConcepts(svd, 4, 10, docIds)
for ((terms, docs) <- topConceptTerms.zip(topConceptDocs)) {
  println("Concept terms: " + terms.map(_._1).mkString(", "))
  println("Concept docs: " + docs.map(_._1).mkString(", "))
  println()
}

Concept terms: summary, licensing, fur, logo, album, cover, rationale, gif, use, fair
Concept docs: File:Gladys-in-grammarland-cover-1897.png, File:Gladys-in-grammarland-cover-2010.png,

Concept terms: disambiguation, william, james, john, iran, australis, township, charles, robert, river
Concept docs: G. australis (disambiguation), F. australis (disambiguation), U. australis (disambiguation),

Concept terms: licensing, disambiguation, australis, maritima, rawal, upington, tallulah, chf, satyanarayana,
Concept docs: File:Rethymno.jpg, File:Ladycarolinelamb.jpg, File:KeyAirlines.jpg, File:NavyCivValor.

Concept terms: licensing, summarysource, summaryauthor, wikipedia, summarypicture, summaryfrom, summaryself,
Concept docs: File:Rethymno.jpg, File:Wristlock4.jpg, File:Meseanlol.jpg, File:Sarles.gif, File:SuzlonWinMills.

Concept terms: establishment, norway, country, england, spain, florida, chile, colorado, australia,
Concept docs: Category:1794 establishments in Norway, Category:1838 establishments in Norway, Category[/mw_shl_code]
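The documents in the first concept all appear to be image files, and its terms appear to relate to image attributes and licensing. The second concept appears to consist of disambiguation pages. It looks as though this dump is not restricted to raw Wikipedia articles and is also cluttered with administrative pages and discussion pages. Inspecting the output of intermediate stages helps catch this kind of problem early. Fortunately, Cloud9 appears to provide some functionality for filtering these pages out. An updated version of the wikiXmlToPlainText method from above looks like this: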
[mw_shl_code=scala,true]def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
  ...
  // Skip empty pages, non-article pages, redirects, and disambiguation pages.
  if (page.isEmpty || !page.isArticle || page.isRedirect ||
      page.getTitle.contains("(disambiguation)")) {
    None
  } else {
    Some((page.getTitle, page.getContent))
  }
}[/mw_shl_code]
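Because the method returns an `Option`, pages to be skipped map to `None` and can be dropped in one pass when the method is applied with `flatMap`. The sketch below illustrates that pattern without Cloud9. The `Page` case class is a hypothetical stand-in whose fields mirror the calls in the method above; it is not Cloud9's API, and the sample pages and the names `PageFilterSketch` and `toPlainText` are invented.

```scala
object PageFilterSketch {
  // Hypothetical stand-in for the parsed page object used above.
  case class Page(title: String, content: String,
                  isEmpty: Boolean, isArticle: Boolean, isRedirect: Boolean)

  // None for pages to skip, Some((title, content)) for pages to keep.
  def toPlainText(page: Page): Option[(String, String)] =
    if (page.isEmpty || !page.isArticle || page.isRedirect ||
        page.title.contains("(disambiguation)")) None
    else Some((page.title, page.content))

  val pages: Seq[Page] = Seq(
    Page("Chelonia (genus)", "text",
      isEmpty = false, isArticle = true, isRedirect = false),
    Page("G. australis (disambiguation)", "text",
      isEmpty = false, isArticle = true, isRedirect = false),
    Page("File:Rethymno.jpg", "text",
      isEmpty = false, isArticle = false, isRedirect = false))

  // flatMap keeps only the pages for which toPlainText returned Some.
  val kept: Seq[(String, String)] = pages.flatMap(p => toPlainText(p).toList)
}
```

Here only the genus article survives: the disambiguation page is rejected by the title check and the file page by the `isArticle` check.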
Rerunning the pipeline on the filtered document set yields a more reasonable result:
[mw_shl_code=scala,true]Concept terms: disambiguation, highway, school, airport, high, refer, number, squadron, list, may,
Concept docs: Tri-State Highway (disambiguation), Ocean-to-Ocean Highway (disambiguation), Highway

Concept terms: disambiguation, nihilistic, recklessness, sullen, annealing, negativity, initialization,
Concept docs: Nihilistic (disambiguation), Recklessness (disambiguation), Manjack (disambiguation),

Concept terms: department, commune, communes, insee, france, see, also, southwestern, oise, marne,
Concept docs: Communes in France, Saint-Mard, Meurthe-et-Moselle, Saint-Firmin, Meurthe-et-Moselle,

Concept terms: genus, species, moth, family, lepidoptera, beetle, bulbophyllum, snail, database, natural,
Concept docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini, Cribrilinidae, Tahla (genus),

Concept terms: province, district, municipality, census, rural, iran, romanize, population, infobox,
Concept docs: New York State Senate elections, 2012, New York State Senate elections, 2008, New York

Concept terms: genus, species, district, moth, family, province, iran, rural, romanize, census, village,
Concept docs: Chelonia (genus), Palea (genus), Argiope (genus), Sphingini, Tahla (genus), Cribrilinidae,

Concept terms: protein, football, league, encode, gene, play, team, bear, season, player, club, reading,
Concept docs: Protein FAM186B, ARL6IP1, HIP1R, SGIP1, MTMR3, Gem-associated protein 6, Gem-associated[/mw_shl_code]
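The first two concepts are still dominated by disambiguation pages, but the rest appear to correspond to meaningful categories. The third appears to be about localities in France, and the fourth and sixth about animal and insect taxonomy. The fifth concerns elections, municipalities, and government. The articles in the seventh are about proteins, although some of its terms also refer to football, perhaps through an overlap between fitness and performance-enhancing drugs? Although unexpected terms show up in every concept, all of the concepts exhibit some thematic coherence.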