When installing Jupyter Notebook through Anaconda on an Aliyun (Alibaba Cloud) server, many third-party packages fail to download; important libraries such as Spark and TensorFlow cannot be installed at all. This post shows a workaround: installing from a local channel by adjusting conda's parameters. The same method can also be used to build a local mirror of a conda package repository.
1. The problem and the approach
When you install Jupyter Notebook through Anaconda on an Aliyun server, many third-party packages cannot be downloaded.
Why? Because a lot of them are hosted on Amazon S3 / AWS, and, well, you know how well that works from here.
But not being able to install essential libraries like Spark and TensorFlow is simply not acceptable.
The usual workaround would be a proxy or a VPN, but on an Aliyun server that is generally not an option.
So instead, we can download the packages on a local machine, upload them to the server, point conda at a local channel, and install from there.
This post walks through that method, installing successfully by adjusting conda's parameters. The same approach can also be used to build a local mirror of a conda package repository.
2. Installing from a local conda channel
First, here is how to point conda install at a local package source.
Run conda install -h and you will see the following:
usage: conda install [-h] [--revision REVISION] [-y] [--dry-run] [-f] [--file FILE] [--no-deps] [-m] [--use-index-cache]
[--use-local] [--offline] [--no-pin] [-c CHANNEL] [--override-channels] [-n ENVIRONMENT | -p PATH] [-q]
[--copy] [--alt-hint] [--update-dependencies] [--no-update-dependencies] [--show-channel-urls]
[--no-show-channel-urls] [--json]
[package_spec [package_spec ...]]
Installs a list of packages into a specified conda environment.
This command accepts a list of package specifications (e.g, bitarray=0.8)
and installs a set of packages consistent with those specifications and
compatible with the underlying environment. If full compatibility cannot
be assured, an error is reported and the environment is not changed.
Conda attempts to install the newest versions of the requested packages. To
accomplish this, it may update some packages that are already installed, or
install additional packages. To prevent existing packages from updating,
use the --no-update-deps option. This may force conda to install older
versions of the requested packages, and it does not prevent additional
dependency packages from being installed.
If you wish to skip dependency checking altogether, use the '--force'
option. This may result in an environment with incompatible packages, so
this option must be used with great caution.
conda can also be called with a list of explicit conda package filenames
(e.g. ./lxml-3.2.0-py27_0.tar.bz2). Using conda in this mode implies the
--force option, and should likewise be used with great caution. Explicit
filenames and package specifications cannot be mixed in a single command.
Options:
positional arguments:
package_spec Packages to install into the conda environment.
optional arguments:
-h, --help Show this help message and exit.
--revision REVISION Revert to the specified REVISION.
-y, --yes Do not ask for confirmation.
--dry-run Only display what would have been done.
-f, --force Force install (even when package already installed), implies --no-deps.
--file FILE Read package versions from the given file. Repeated file specifications can be passed (e.g. --file=file1
--file=file2).
--no-deps Do not install dependencies.
-m, --mkdir Create the environment directory if necessary.
--use-index-cache Use cache of channel index files.
--use-local Use locally built packages.
--offline Offline mode, don't connect to the Internet.
--no-pin Ignore pinned file.
-c CHANNEL, --channel CHANNEL
Additional channel to search for packages. These are URLs searched in the order they are given (including
file:// for local directories). Then, the defaults or channels from .condarc are searched (unless
--override-channels is given). You can use 'defaults' to get the default packages for conda, and 'system'
to get the system packages, which also takes .condarc into account. You can also use any name and the
.condarc channel_alias value will be prepended. The default channel_alias is http://conda.anaconda.org/.
--override-channels Do not search default or .condarc channels. Requires --channel.
-n ENVIRONMENT, --name ENVIRONMENT
Name of environment (in /root/anaconda3/envs).
-p PATH, --prefix PATH
Full path to environment prefix (default: /root/anaconda3/envs/GISpark).
-q, --quiet Do not display progress bar.
--copy Install all packages using copies instead of hard- or soft-linking.
--alt-hint Use an alternate algorithm to generate an unsatisfiability hint.
--update-dependencies, --update-deps
Update dependencies (default: True).
--no-update-dependencies, --no-update-deps
Don't update dependencies (default: False).
--show-channel-urls Show channel urls (default: None).
--no-show-channel-urls
Don't show channel urls.
--json Report all output as json. Suitable for using conda programmatically.
Examples:
conda install -n myenv scipy
Pointing conda install at a local channel is simple:
conda install -c file://path packages
Running this directly fails, though: you first need to install conda-build and build an index for the packages.
source deactivate envs-xxx    # go back to the root environment
conda install conda-build
conda index channel/linux-64  # packages must live under the platform subdirectory; on 64-bit Ubuntu that is linux-64
# put the downloaded *.bz2 packages into the linux-64 directory you created; the directory name is fixed (win-64 on Windows, osx-64 on macOS)
source activate envs-xxx      # re-enter your own environment
Then run conda install as described above.
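If you want conda to pick up the local channel without passing -c every time, for example to use it as the local mirror mentioned at the start, you can register it permanently. A minimal sketch, assuming the channel lives at /root/GISpark/channel as in the next section:
# register the local channel in .condarc so later installs can omit -c
conda config --add channels file:///root/GISpark/channel
# afterwards a plain install will also search the local channel
conda install -n GISpark py4j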
3. Step-by-step Spark installation
Download the py4j and spark packages:
wget -c https://anaconda.org/anaconda-cluster/py4j/0.8.2.1/download/linux-64/py4j-0.8.2.1-py35_0.tar.bz2
wget -c https://anaconda.org/anaconda-cluster/spark/1.6.0/download/linux-64/spark-1.6.0-py35_1.tar.bz2
Upload them to the Aliyun server:
scp py4j-0.8.2.1-py35_0.tar.bz2 root@<server-IP>:/root/GISpark/channel/linux-64
scp spark-1.6.0-py35_1.tar.bz2 root@<server-IP>:/root/GISpark/channel/linux-64
Then ssh into the server and install:
# rebuild the package index
cd /root/GISpark/
conda index channel/linux-64
conda install -n GISpark -c file://channel py4j
conda install -n GISpark -c file://channel spark
If you use Python 3, just set this environment variable in .profile or a similar startup file:
export PYSPARK_PYTHON=python3
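A quick way to check that the packages actually landed in the environment (a sketch, assuming the environment is named GISpark as above):
# activate the environment and try importing the two freshly installed packages
source activate GISpark
python -c "import py4j, pyspark; print('pyspark import OK')"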
4. A problem when running Spark on the Aliyun server
After the installation, creating a SparkConf/SparkContext from the notebook failed with the following traceback:
Exception Traceback (most recent call last)
<ipython-input-1-96a91924b452> in <module>()
2 from pyspark import SparkConf, SparkContext
3
----> 4 conf = (SparkConf()
5 .setMaster("local")
6 .setAppName("MyApp")
/root/anaconda3/envs/GISpark/lib/python3.5/site-packages/pyspark/conf.py in __init__(self, loadDefaults, _jvm, _jconf)
102 else:
103 from pyspark.context import SparkContext
--> 104 SparkContext._ensure_initialized()
105 _jvm = _jvm or SparkContext._jvm
106 self._jconf = _jvm.SparkConf(loadDefaults)
/root/anaconda3/envs/GISpark/lib/python3.5/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway)
243 with SparkContext._lock:
244 if not SparkContext._gateway:
--> 245 SparkContext._gateway = gateway or launch_gateway()
246 SparkContext._jvm = SparkContext._gateway.jvm
247
/root/anaconda3/envs/GISpark/lib/python3.5/site-packages/pyspark/java_gateway.py in launch_gateway()
92 callback_socket.close()
93 if gateway_port is None:
---> 94 raise Exception("Java gateway process exited before sending the driver its port number")
95
96 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
5. Fixing the out-of-memory problem on Aliyun
At first I suspected a blocked port on Aliyun, but further analysis showed the real cause was insufficient memory.
Following the link below solved the problem completely:
http://www.linuxdiyf.com/linux/19786.html
It turned out that swap (virtual memory) is not enabled by default on Aliyun instances.
Once swap was enabled, Spark ran normally.
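For reference, enabling swap on such a server usually comes down to a few commands like the following (a sketch with example values, not taken from the linked article; adjust the size and path to your machine):
# create and enable a 2 GB swap file (example size)
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# make it persistent across reboots and verify
echo '/swapfile none swap sw 0 0' >> /etc/fstab
free -m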