Generate the data with the following statements:
n = 1000000
ids = rand(0..9, n)
words = rand(`apple`pig`flash`cat`dog`tomato`elevator`building, n)
data = table(ids as id, words as word)
db = database("dfs://testdb", VALUE, 0..9)  // create the database that `db` refers to (ids take values 0..9)
db.createPartitionedTable(data, "words", "id").append!(data)
How can I use the mr function to count the occurrences of each word?
The example in the documentation is too advanced and hard to follow, so I'd rather start from a simple requirement.
Does the following code fit the design philosophy of mr?
def mapWords(partitionTable){  // the argument is one partition of the table
    wordCount = select word, count(*) from partitionTable group by word  // each partition node runs the group-by once
    return wordCount
}
def reduceWords(wordsCount1, wordsCount2){  // the arguments are the results of two map calls
    combineTable = wordsCount1.append!(wordsCount2)  // concatenate the counts from two partitions
    wordCountCombine = select word, sum(count) as count from combineTable group by word  // merge the two map results
    return wordCountCombine
}
def finalWords(result){  // the argument is the result of the last reduce
    return result  // no extra processing
}
def wordCount(ds){  // wrap map, reduce, and final into one function; just pass in the data sources for all partitions
    return mr(ds, mapWords, reduceWords, finalWords)
}
wordsTable = loadTable("dfs://testdb", "words")
wordCount(sqlDS(<select * from wordsTable>))
Also: must the reduce function be a binary operation? This seems to differ from MapReduce in Hadoop.
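For context, the pairwise-combine pattern that the code above follows can be sketched in Python (the partition data below is made up for illustration): map each partition to a per-word count, then fold a binary combiner over the map results.

```python
from collections import Counter
from functools import reduce

def map_words(partition):
    # count the words within one partition (analogous to mapWords above)
    return Counter(partition)

def reduce_words(count1, count2):
    # binary combiner: merge two partial counts into one (analogous to reduceWords)
    return count1 + count2  # Counter addition sums counts per key

def final_words(result):
    # no extra processing on the last reduce result (analogous to finalWords)
    return result

def word_count(partitions):
    # fold the binary combiner over the per-partition map results
    return final_words(reduce(reduce_words, (map_words(p) for p in partitions)))

partitions = [["apple", "pig", "apple"], ["cat", "apple"], ["pig"]]
print(word_count(partitions))  # Counter({'apple': 3, 'pig': 2, 'cat': 1})
```

Note that a Hadoop reducer receives all values for one key at once, whereas a binary combiner like this is folded pairwise over intermediate results; for an associative, commutative aggregation such as counting, the two produce the same answer.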