Kylin 4.0.1 Deployment Guide

Deployment environment

hadoop 3.0.0-cdh6.3.2

hive 3.1.2

kylin 4.0.1

spark 3.1.1

I. Preparation

1. Download apache-kylin-4.0.1-bin-spark3.tar.gz and extract it to a local directory, then download and extract spark-3.1.1-bin-hadoop2.7.tgz and place it under the Kylin directory.

2. Rename the extracted Kylin and Spark directories:

mv apache-kylin-4.0.1-bin-spark3 kylin-4.0.1-spark3
mv spark-3.1.1-bin-hadoop2.7 spark 

3. Create a database in MySQL to store Kylin's metadata:

create database kylin4;
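The configuration later in this guide connects to this database as root. If you would rather use a dedicated account, here is a minimal sketch; the kylin4user name and password are placeholders, not part of the original setup, so remember to adjust kylin.metadata.url accordingly:

mysql -uroot -p -e "
  CREATE USER IF NOT EXISTS 'kylin4user'@'%' IDENTIFIED BY 'change_me';
  GRANT ALL PRIVILEGES ON kylin4.* TO 'kylin4user'@'%';
  FLUSH PRIVILEGES;"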

II. Configuration

Configure environment variables

vim /etc/profile
export KYLIN_HOME=/export/servers/kylin-4.0.1-spark3
source /etc/profile
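A quick sanity check that the variable took effect in the current shell:

echo $KYLIN_HOME                # should print /export/servers/kylin-4.0.1-spark3
ls $KYLIN_HOME/bin/kylin.sh     # the startup script should be found at this path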

Configure Spark

1. Copy the MySQL JDBC driver into Spark's jars directory:

cp /opt/mysql-connector-java-8.0.16.jar $KYLIN_HOME/spark/jars/

2. Create a directory on HDFS for Spark's jars and upload everything under Spark's jars directory to it. Remember this path, as it is used in the configuration below:

hdfs dfs -mkdir -p /user/spark/spark3.1.1_jars
hdfs dfs -put $KYLIN_HOME/spark/jars/* /user/spark/spark3.1.1_jars/
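To confirm the upload succeeded, compare the jar counts on HDFS and locally; the two numbers should match:

hdfs dfs -ls /user/spark/spark3.1.1_jars/ | grep -c '\.jar$'
ls $KYLIN_HOME/spark/jars/ | grep -c '\.jar$'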

3. Place the four config files core-site.xml, hdfs-site.xml, hive-site.xml, and yarn-site.xml into Spark's conf directory, either via symlinks (recommended) or by copying:

ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/core-site.xml $KYLIN_HOME/spark/conf/
ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/hdfs-site.xml $KYLIN_HOME/spark/conf/
ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/hive-site.xml $KYLIN_HOME/spark/conf/
ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/yarn-site.xml $KYLIN_HOME/spark/conf/
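You can check that the links resolve correctly before moving on:

ls -l $KYLIN_HOME/spark/conf/*.xml   # each entry should point into the CDH hive conf directory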

4. Copy spark-defaults.conf.template to spark-defaults.conf, and spark-env.sh.template to spark-env.sh:

cd $KYLIN_HOME/spark/conf
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh

5. Edit the spark-defaults.conf file:

vim $KYLIN_HOME/spark/conf/spark-defaults.conf

Here is my Spark configuration:

spark.master yarn
spark.yarn.jars hdfs://node01:8020/user/spark/spark3.1.1_jars/*

spark.eventLog.enabled true
spark.eventLog.dir hdfs://node01:8020/spark/logs/

spark.yarn.queue default
spark.driver.maxResultSize 15g
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 120s
spark.dynamicAllocation.initialExecutors 1
spark.dynamicAllocation.maxExecutors 25
spark.dynamicAllocation.minExecutors 1
spark.eventLog.compress false

spark.executor.instances 4
spark.driver.memory 5g
spark.executor.cores 4
spark.executor.memory 10g
spark.executor.memoryOverhead 4g
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 2g
spark.yarn.am.memory 5g
spark.vcore.boost.ratio 5

spark.files.ignoreCorruptFiles true
spark.hadoopRDD.ignoreEmptySplits true
spark.hadoop.fs.file.impl.disable.cache true
spark.hadoop.fs.hdfs.impl.disable.cache true
spark.hadoop.hadoop.proxyuser.hive.groups *
spark.hadoop.hadoop.proxyuser.hive.hosts *
spark.hadoop.hadoop.security.authentication KERBEROS
spark.hadoop.hive.exec.orc.split.strategy BI
spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads 4
spark.hadoop.mapreduce.input.fileinputformat.split.minsize 268435456
spark.kryoserializer.buffer.max 512m

spark.ui.showConsoleProgress true
spark.sql.authorization.enable false
spark.sql.innerjoin.addNullFilter true
spark.thriftserver.proxy.enabled true
spark.acls.enable false
spark.sql.parquet.filterPushdown true
spark.hadoop.datanucleus.schema.autoCreateAll true
spark.sql.hive.convertMetastoreCtas false
spark.sql.optimizer.bloomFilterPruning.enable false

spark.shuffle.file.buffer 128k
spark.shuffle.io.maxRetries 30
spark.shuffle.io.preferDirectBufs false
spark.shuffle.io.retryWait 60s
spark.shuffle.registration.timeout 5000
spark.shuffle.registration.maxAttempts 10
spark.shuffle.service.enabled true
spark.shuffle.useOldFetchProtocol true

spark.sql.hive.metastore.version 3.1.2
spark.sql.hive.metastore.jars /opt/cloudera/parcels/CDH/lib/hive/apache-hive-3.1.2-bin/lib/*

spark.kerberos.keytab  /opt/node01.keytab
spark.kerberos.principal hive/node01@NBDP.COM

spark.security.credentials.hbase.enabled false

In the configuration above, spark.yarn.jars is the HDFS directory created earlier to hold Spark's jars, spark.sql.hive.metastore.version is your local Hive version, and spark.sql.hive.metastore.jars points to Hive's jars. Check carefully whether the Hive jars directory contains the MySQL driver; if not, copy one there as well, or you will run into all sorts of strange errors.
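A quick way to check for the driver before building anything, using the paths from the configuration above (adjust to your own layout):

# look for a MySQL driver among the Hive jars referenced by spark.sql.hive.metastore.jars
ls /opt/cloudera/parcels/CDH/lib/hive/apache-hive-3.1.2-bin/lib/ | grep -i mysql
# if nothing is printed, copy the driver in
cp /opt/mysql-connector-java-8.0.16.jar /opt/cloudera/parcels/CDH/lib/hive/apache-hive-3.1.2-bin/lib/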

6. Edit spark-env.sh to set Spark's internal environment variables:

export JAVA_HOME=/usr/java/jdk1.8.0_191
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
export SPARK_HOME=/export/servers/kylin-4.0.1-spark3/spark
export YARN_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive/apache-hive-3.1.2-bin

7. After the steps above, Spark should be fully configured. Run the spark-sql script under Spark's bin directory to check whether Hive data can be queried successfully:

$KYLIN_HOME/spark/bin/spark-sql  # start the Spark SQL CLI
show databases; # if all Hive databases are listed without errors, Spark is configured correctly
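The same check can be run non-interactively, which is convenient for scripting:

$KYLIN_HOME/spark/bin/spark-sql -e "show databases;"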

Configure Kylin

1. Add the relevant environment variables at the top of the kylin.sh script:

vim $KYLIN_HOME/bin/kylin.sh

export SPARK_HOME=$KYLIN_HOME/spark
export KYLIN_CONF=$KYLIN_HOME/conf
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive/apache-hive-3.1.2-bin

Note: comment out the call to prepare-hadoop-dependency.sh in kylin.sh. That script replaces some of the Hadoop-related jars under the Spark directory with jars from the server's Hadoop environment, which made Spark initialization fail for me (possibly my Hadoop version is not fully compatible with this Spark version, so swapping out Spark's bundled Hadoop jars causes problems).

vim $KYLIN_HOME/bin/kylin.sh

# comment out the following line in kylin.sh
${KYLIN_HOME}/bin/prepare-hadoop-dependency.sh
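If you prefer not to edit the file by hand, a sed one-liner like the following should work; this is a sketch, so verify the exact line in your kylin.sh first:

sed -i 's@^\${KYLIN_HOME}/bin/prepare-hadoop-dependency.sh@#&@' $KYLIN_HOME/bin/kylin.sh
grep prepare-hadoop-dependency $KYLIN_HOME/bin/kylin.sh   # the call should now start with '#'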

2. Place the same four config files (core-site.xml, hdfs-site.xml, hive-site.xml, yarn-site.xml) into Kylin's conf directory, again via symlinks (recommended) or by copying:

ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/core-site.xml $KYLIN_HOME/conf/
ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/hdfs-site.xml $KYLIN_HOME/conf/
ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/hive-site.xml $KYLIN_HOME/conf/
ln -s /opt/cloudera/parcels/CDH/lib/hive/conf/yarn-site.xml $KYLIN_HOME/conf/

3. Configure kylin.properties:

vim $KYLIN_HOME/conf/kylin.properties

Here is my configuration file:

kylin.metadata.url=kylin_metadata@jdbc,url=jdbc:mysql://node01:3306/kylin4,username=root,password=123456,maxActive=10,maxIdle=10
kylin.env.zookeeper-base-path=/kylin4
kylin.env.zookeeper-is-local=false
kylin.env.zookeeper-connect-string=bigdata01:2181,bigdata02:2181,bigdata16:2181
kylin.server.mode=all
kylin.env.hadoop-conf-dir=/etc/hadoop/conf
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.executor.cores=6
kylin.engine.spark-conf.spark.executor.memory=10G
kylin.engine.spark-conf.spark.executor.instances=10
kylin.engine.spark-conf.spark.executor.memoryOverhead=3G
kylin.engine.spark-conf.spark.driver.memory=8G
kylin.engine.spark-conf.spark.driver.memoryOverhead=3G
kylin.engine.spark-conf.spark.debug.maxToStringFields=1000
kylin.query.auto-sparder-context-enabled=true
kylin.query.sparder-context.app-name=kylin_query
kylin.query.spark-conf.spark.master=yarn
kylin.query.spark-conf.spark.yarn.queue=kylin

kylin.query.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.query.spark-conf.spark.sql.hive.metastore.jars=file:///data1/tools/hive_3.1.2_jars/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/lib/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop-hdfs/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop-yarn/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop-mapreduce/*
kylin.engine.spark-conf.spark.sql.hive.metastore.version=3.1.2
kylin.engine.spark-conf.spark.sql.hive.metastore.jars=file:///data1/tools/hive_3.1.2_jars/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/lib/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop-hdfs/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop-yarn/*:file:///opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop-mapreduce/*

kylin.query.spark-conf.spark.driver.cores=6
kylin.query.spark-conf.spark.driver.memory=5G
kylin.query.spark-conf.spark.driver.memoryOverhead=3G
kylin.query.spark-conf.spark.executor.cores=10
kylin.query.spark-conf.spark.executor.instances=20
kylin.query.spark-conf.spark.executor.memory=20G
kylin.query.spark-conf.spark.executor.memoryOverhead=3G
kylin.query.spark-conf.spark.debug.maxToStringFields=1000
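Before starting Kylin, it is worth confirming that the metadata database and the ZooKeeper quorum configured above are reachable; a small sketch (the ruok probe only answers if ZooKeeper's four-letter-word commands are enabled):

mysql -h node01 -P 3306 -uroot -p -e "use kylin4;"   # should return without errors
echo ruok | nc bigdata01 2181                        # should answer 'imok'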

4. Since my cluster environment is CDH 6.3.2, a few more steps are needed for Kylin 4 compatibility:

# if any of the jars below are missing, they can be downloaded from the internet

mkdir $KYLIN_HOME/ext
# put the MySQL driver jar into this directory
cp /opt/mysql-connector-java-8.0.16.jar $KYLIN_HOME/ext/

mkdir -p $KYLIN_HOME/bin/hadoop3_jars/cdh6
# add the following three jars into this directory
cp /opt/commons-configuration-1.10.jar $KYLIN_HOME/bin/hadoop3_jars/cdh6/
cp /opt/hive-exec-1.21.2.3.1.0.0-78.jar $KYLIN_HOME/bin/hadoop3_jars/cdh6/
cp /opt/stax2-api-3.1.4.jar $KYLIN_HOME/bin/hadoop3_jars/cdh6/
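A quick listing to confirm everything landed where Kylin expects it:

ls $KYLIN_HOME/ext/                     # should contain the MySQL driver
ls $KYLIN_HOME/bin/hadoop3_jars/cdh6/   # should contain the three jars above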

III. Start Kylin

1. Start Kylin by running kylin.sh in the bin folder under the Kylin directory:

$KYLIN_HOME/bin/kylin.sh start

2. Check the log to confirm Kylin started properly:

tail -1000f $KYLIN_HOME/logs/kylin.log
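Once the log looks healthy, you can also probe the web port directly:

curl -sI http://bigdata01:7070/kylin | head -1   # an HTTP 200 or 302 response means the web UI is up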

3. Once startup succeeds, open http://bigdata01:7070/kylin in a browser to reach the Kylin login page. Log in with the username ADMIN and the password KYLIN.

Pitfalls I ran into

1. A tableNotFound exception is thrown during Kylin model builds

In my case, the cluster had three nodes serving Hive metastore metadata and one of them was faulty. After I changed the Hive metastore node addresses in the hive-site.xml under Kylin's conf directory from three nodes down to the one healthy node, the exception never appeared again.
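To check whether you are hitting the same issue, inspect hive.metastore.uris in the hive-site.xml that Kylin reads; the thrift address in the comment below is a placeholder for your own healthy metastore node:

grep -A1 'hive.metastore.uris' $KYLIN_HOME/conf/hive-site.xml
# if several thrift URIs are listed, keep only the healthy one, e.g.
#   <value>thrift://node01:9083</value>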