
Tips for Installing Jupyter, conda Packages, and Spark on an Aliyun Server

Published: 2016-10-31 11:21:54  Source: linux website  Author: openthings
When installing Jupyter Notebook via Anaconda on an Aliyun server, many third-party packages fail to download; even important libraries such as Spark and TensorFlow cannot be installed. This article presents a workaround that uses conda's options to install from a local channel. The same method can also be used to build a local mirror of a conda package repository.
 
1. The Problem and the Approach
When installing Jupyter Notebook via Anaconda on an Aliyun server, many third-party packages fail to download.
Why? Because many of them are hosted on Amazon S3/AWS, which is hard to reach from the server.
As a result, even essential libraries such as Spark and TensorFlow cannot be installed, which is hard to put up with.
The usual workaround is a proxy or a VPN, but on an Aliyun server that is generally not an option.
Instead, we can download the packages on a local machine, upload them to the server, point conda at a local channel, and then install from there.
The rest of this article walks through the method and the conda options needed to make the installation succeed; the same approach can also be used to build a local mirror of a conda package repository.
 
2. Making conda Install from a Local Channel
First, here is how to make conda install packages from a local channel.
Run conda install -h to see the following:
usage: conda install [-h] [--revision REVISION] [-y] [--dry-run] [-f] [--file FILE] [--no-deps] [-m] [--use-index-cache]
[--use-local] [--offline] [--no-pin] [-c CHANNEL] [--override-channels] [-n ENVIRONMENT | -p PATH] [-q]
[--copy] [--alt-hint] [--update-dependencies] [--no-update-dependencies] [--show-channel-urls]
[--no-show-channel-urls] [--json]
[package_spec [package_spec ...]]
Installs a list of packages into a specified conda environment.
This command accepts a list of package specifications (e.g, bitarray=0.8)
and installs a set of packages consistent with those specifications and
compatible with the underlying environment. If full compatibility cannot
be assured, an error is reported and the environment is not changed.
Conda attempts to install the newest versions of the requested packages. To
accomplish this, it may update some packages that are already installed, or
install additional packages. To prevent existing packages from updating,
use the --no-update-deps option. This may force conda to install older
versions of the requested packages, and it does not prevent additional
dependency packages from being installed.
If you wish to skip dependency checking altogether, use the '--force'
option. This may result in an environment with incompatible packages, so
this option must be used with great caution.
conda can also be called with a list of explicit conda package filenames
(e.g. ./lxml-3.2.0-py27_0.tar.bz2). Using conda in this mode implies the
--force option, and should likewise be used with great caution. Explicit
filenames and package specifications cannot be mixed in a single command.
Options:
positional arguments:
package_spec          Packages to install into the conda environment.
optional arguments:
-h, --help            Show this help message and exit.
--revision REVISION   Revert to the specified REVISION.
-y, --yes             Do not ask for confirmation.
--dry-run             Only display what would have been done.
-f, --force           Force install (even when package already installed), implies --no-deps.
--file FILE           Read package versions from the given file. Repeated file specifications can be passed (e.g. --file=file1
--file=file2).
--no-deps             Do not install dependencies.
-m, --mkdir           Create the environment directory if necessary.
--use-index-cache     Use cache of channel index files.
--use-local           Use locally built packages.
--offline             Offline mode, don't connect to the Internet.
--no-pin              Ignore pinned file.
-c CHANNEL, --channel CHANNEL
Additional channel to search for packages. These are URLs searched in the order they are given (including
file:// for local directories). Then, the defaults or channels from .condarc are searched (unless
--override-channels is given). You can use 'defaults' to get the default packages for conda, and 'system'
to get the system packages, which also takes .condarc into account. You can also use any name and the
.condarc channel_alias value will be prepended. The default channel_alias is http://conda.anaconda.org/.
--override-channels   Do not search default or .condarc channels. Requires --channel.
-n ENVIRONMENT, --name ENVIRONMENT
Name of environment (in /root/anaconda3/envs).
-p PATH, --prefix PATH
Full path to environment prefix (default: /root/anaconda3/envs/GISpark).
-q, --quiet           Do not display progress bar.
--copy                Install all packages using copies instead of hard- or soft-linking.
--alt-hint            Use an alternate algorithm to generate an unsatisfiability hint.
--update-dependencies, --update-deps
Update dependencies (default: True).
--no-update-dependencies, --no-update-deps
Don't update dependencies (default: False).
--show-channel-urls   Show channel urls (default: None).
--no-show-channel-urls
Don't show channel urls.
--json                Report all output as json. Suitable for using conda programmatically.
Examples:
conda install -n myenv scipy
Pointing conda install at a local channel is simple:
conda install -c file://path packages
However, running this directly fails. You first need to install conda-build and build an index for the channel:
source deactivate envs-xxx      # drop back to the root environment
conda install conda-build
conda index channel/linux-64    # the packages must sit in the platform subdirectory; on 64-bit Ubuntu that is linux-64
# put the downloaded *.bz2 packages into the linux-64 directory you created; the name is fixed (64-bit Windows is win-64, macOS is osx-64)
source activate envs-xxx        # switch back into your own environment
Then run conda install as described above.
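Putting it all together, the end-to-end sequence looks roughly like this. This is a minimal sketch, assuming the channel lives at /root/GISpark/channel and the environment is named GISpark (the same path and name used later in this article); ~/downloads is just a placeholder for wherever the *.tar.bz2 files were saved:
source deactivate                               # leave the target environment
conda install conda-build                       # provides the conda index command
mkdir -p /root/GISpark/channel/linux-64         # platform subdirectory for 64-bit Linux
cp ~/downloads/*.tar.bz2 /root/GISpark/channel/linux-64/
conda index /root/GISpark/channel/linux-64      # writes repodata.json for the channel
source activate GISpark                         # switch back into the target environment
conda install -c file:///root/GISpark/channel py4j spark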
 
3. Concrete Steps for Installing Spark
Download the py4j and spark packages:
wget -c https://anaconda.org/anaconda-cluster/py4j/0.8.2.1/download/linux-64/py4j-0.8.2.1-py35_0.tar.bz2
wget -c https://anaconda.org/anaconda-cluster/spark/1.6.0/download/linux-64/spark-1.6.0-py35_1.tar.bz2
Upload them to the Aliyun server:
scp py4j-0.8.2.1-py35_0.tar.bz2  root@<server-IP>:/root/GISpark/channel/linux-64
scp spark-1.6.0-py35_1.tar.bz2  root@<server-IP>:/root/GISpark/channel/linux-64
Log in to the server over ssh and install:
# rebuild the package index
cd /root/GISpark/
conda index channel/linux-64
conda install -n GISpark -c file://channel py4j
conda install -n GISpark -c file://channel spark
If you are using Python 3, just set the following environment variable in .profile or a similar environment file:
export PYSPARK_PYTHON=python3
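To confirm that the local install worked before going any further, a quick check along these lines should do (a sketch; GISpark is the environment name used above):
source activate GISpark
conda list | grep -Ei 'py4j|spark'            # both packages should be listed with the expected versions
python -c "import py4j, pyspark; print(pyspark.__file__)"   # imports should resolve to the environment's site-packages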
 
4. A Problem Running Spark on the Aliyun Server
Creating a SparkConf from the notebook then failed with the following exception:
Exception                                 Traceback (most recent call last)
<ipython-input-1-96a91924b452> in <module>()
2 from pyspark import SparkConf, SparkContext
----> 4 conf = (SparkConf()
5          .setMaster("local")
6          .setAppName("MyApp")
/root/anaconda3/envs/GISpark/lib/python3.5/site-packages/pyspark/conf.py in __init__(self, loadDefaults, _jvm, _jconf)
102         else:
103             from pyspark.context import SparkContext
--> 104             SparkContext._ensure_initialized()
105             _jvm = _jvm or SparkContext._jvm
106             self._jconf = _jvm.SparkConf(loadDefaults)
/root/anaconda3/envs/GISpark/lib/python3.5/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway)
243         with SparkContext._lock:
244             if not SparkContext._gateway:
--> 245                 SparkContext._gateway = gateway or launch_gateway()
246                 SparkContext._jvm = SparkContext._gateway.jvm
247 
/root/anaconda3/envs/GISpark/lib/python3.5/site-packages/pyspark/java_gateway.py in launch_gateway()
92                 callback_socket.close()
93         if gateway_port is None:
---> 94             raise Exception("Java gateway process exited before sending the driver its port number")
95 
96         # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
 
5. Fixing the Insufficient-Memory Problem on Aliyun
At first I assumed the problem was blocked ports on Aliyun, but further analysis showed the real cause was insufficient memory.
Following the steps in the article below solved it completely:
http://www.linuxdiyf.com/linux/19786.html
It turned out that swap (virtual memory) is not enabled by default on Aliyun instances.
Once swap was enabled, Spark ran normally.
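The linked article covers the details; one common way to add swap on a VM like this is roughly the following (a sketch, assuming a 2 GB swap file at /swapfile; adjust the size to your instance):
dd if=/dev/zero of=/swapfile bs=1M count=2048    # create a 2 GB file to use as swap
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab  # make it persistent across reboots
free -m                                          # verify that swap is now available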
 
Permanent link to this article: http://www.linuxdiyf.com/linux/25585.html