I. Using memcacheq
1. A set command writes a new message into the named queue, i.e. it inserts a new record into BerkeleyDB.
2. A get command removes one message from the named queue, i.e. it deletes a record from BerkeleyDB; once the queue is empty, a further get returns nothing.
3. You can inspect the server with stats and stats queue, as in this telnet session:

hadoop@hadoopslave1:~/bigdata/hbase/bin$ telnet 172.16.201.170 11212
Trying 172.16.201.170...
Connected to 172.16.201.170.
Escape character is '^]'.
stats
STAT pid 31183
STAT uptime 605
STAT time 1372580856
STAT version 0.2.0
STAT pointer_size 64
STAT rusage_user 0.076004
STAT rusage_system 0.892055
STAT curr_connections 6
STAT total_connections 9
STAT connection_structures 7
STAT get_cmds 4
STAT get_hits 1
STAT set_cmds 1
STAT set_hits 1
STAT bytes_read 415
STAT bytes_written 2870
STAT threads 4
END
stats queue
STAT q4 1/1
END
set q 0 0 5
hhhhh
STORED
set qq 0 0 5
qqqqq
STORED
stats
STAT pid 31183
STAT uptime 661
STAT time 1372580912
STAT version 0.2.0
STAT pointer_size 64
STAT rusage_user 0.080005
STAT rusage_system 0.908056
STAT curr_connections 6
STAT total_connections 9
STAT connection_structures 7
STAT get_cmds 4
STAT get_hits 1
STAT set_cmds 3
STAT set_hits 3
STAT bytes_read 478
STAT bytes_written 3273
STAT threads 4
END
stats queue
STAT q 1/0
STAT qq 1/0
STAT q4 1/1
END
get qq
VALUE qq 0 5
qqqqq
END
stats queue
STAT q 1/0
STAT qq 1/1
STAT q4 1/1
END
get q
VALUE q 0 5
hhhhh
END
stats queue
STAT q 1/1
STAT qq 1/1
STAT q4 1/1
END
get q
END
get qq
END

(In the stats queue output, "q 1/1" means one message has been written to queue q and one has been read.)

II. Installing memcached on Ubuntu

1. Install libevent:

wget http://www.monkey.org/~provos/libevent-2.0.13-stable.tar.gz
tar xzvf libevent-2.0.13-stable.tar.gz
cd libevent-2.0.13-stable
./configure
make
sudo make install

2. Install memcached:

wget http://memcached.googlecode.com/files/memcached-1.4.7.tar.gz
tar xvzf memcached-1.4.7.tar.gz
cd memcached-1.4.7
./configure --prefix=/usr/local/memcached/
make
make install

Sometimes a symlink is needed so memcached can find libevent:

ln -s /usr/local/lib/libevent-2.0.so.5 /lib/libevent-2.0.so.5

Starting memcached:

./memcached -d -m 2048 -l 10.0.0.40 -p 11211

-d  run as a daemon

Memory options:
-m  memory to use, in MB
-M  when memory is exhausted, return an error instead of evicting items via LRU
-n  minimum space allocated per item (key + suffix + value + the 32-byte item structure); default 48 bytes
-f  chunk-size growth factor; default 1.25
-L  use large memory pages, which can reduce memory waste and improve performance

Connection options:
-l  address to listen on
-p  TCP port; default 11211
-U  UDP port; default 11211
-c  maximum number of simultaneous connections
-t  number of worker threads; default 4 (memcached uses non-blocking I/O, so extra threads add little)
-P  file in which to save the memcached PID
-C  disable the CAS commands; this drops the per-item version counter and reduces overhead
-R  maximum number of requests served per connection per event; default 20
-h  show help

Connecting to a memcached server:

telnet localhost 11211
echo stats | nc localhost 11211                 (from a shell)
watch "echo stats | nc 192.168.1.123 11200"     (live status)

Sample stats output, annotated:

STAT pid 22021
STAT uptime 78
STAT time 1366042693
STAT version 1.4.7
STAT libevent 2.0.13-stable
STAT pointer_size 32
STAT rusage_user 0.000000
STAT rusage_system 0.000000
STAT curr_connections 5          -- currently open connections (watch whether this grows too large)
STAT total_connections 6         -- total connections accepted since the server started
STAT connection_structures 6
STAT cmd_get 0                   -- total retrieval requests
STAT cmd_set 0                   -- total storage requests
STAT cmd_flush 0
STAT get_hits 0                  -- successful retrievals (hit rate = get_hits / cmd_get)
STAT get_misses 0                -- failed retrievals
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 7                -- total bytes the server has read from the network
STAT bytes_written 0             -- total bytes the server has written to the network
STAT limit_maxbytes 2147483648
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 4
STAT conn_yields 0
STAT bytes 0
STAT curr_items 0
STAT total_items 0
STAT evictions 0
STAT reclaimed 0

Storage commands have the form:

<command> <key> <flags> <exptime> <bytes> [version]

<flags>   an arbitrary 16-bit unsigned integer (written in decimal) that is stored alongside the data and returned when the item is retrieved. Clients can use it as a bit field for client-specific information; it is opaque to the server.
<exptime> expiration time; 0 means never expire (a value larger than 30 days, i.e. 60*60*24*30 seconds, is interpreted as an absolute Unix timestamp)
<bytes>   length of the value in bytes
[version] the version number, used only by cas
<command name> is "set", "add" or
"replace" set 意思是 “储存此数据” add 意思是 “储存此数据,只在服务器*未*保留此键值的数据时” (key不存在是保存) replace意思是 “储存此数据,只在服务器*曾*保留此键值的数据时” (key存在时replace) "set" means "store this data". "add" means "store this data, but only if the server *doesn't* already hold data for this key". "replace" means "store this data, but only if the server *does* already hold data for this key". set 无论如何都进行存储 add 只有key不存在时进行添加 repalce 只有数据存在时进行替换 cas操作,means:check and set 只有版本号相匹配是才能存取,否则返回EXISTS 目的:多客户端并发修改同一条记录的问题,防止使用改变了的key/valuee对 status slabs 区块数据统计 stats slabs STAT 1:chunk_size 80 STAT 1:chunks_per_page 13107 STAT 1:total_pages 1 STAT 1:total_chunks 13107 STAT 1:used_chunks 2 STAT 1:free_chunks 1 STAT 1:free_chunks_end 13104 STAT 1:mem_requested 111 STAT 1:get_hits 4 STAT 1:cmd_set 6 STAT 1:delete_hits 0 STAT 1:incr_hits 0 STAT 1:decr_hits 0 STAT 1:cas_hits 0 STAT 1:cas_badval 0 STAT active_slabs 1 STAT total_malloced 1048560 stats settings 设置查看 stats settings STAT maxbytes 2147483648 STAT maxconns 1024 STAT tcpport 11211 STAT udpport 11211 STAT inter 127.0.0.1 STAT verbosity 0 STAT oldest 0 STAT evictions on STAT domain_socket NULL STAT umask 700 STAT growth_factor 1.25 STAT chunk_size 48 STAT num_threads 4 STAT num_threads_per_udp 4 STAT stat_key_prefix : STAT detail_enabled no STAT reqs_per_event 20 STAT cas_enabled yes STAT tcp_backlog 1024 STAT binding_protocol auto-negotiate STAT auth_enabled_sasl no STAT item_size_max 1048576 stats items 数据项统计 stats items STAT items:1:number 2 STAT items:1:age 13982 STAT items:1:evicted 0 STAT items:1:evicted_nonzero 0 STAT items:1:evicted_time 0 STAT items:1:outofmemory 0 STAT items:1:tailrepairs 0 STAT items:1:reclaimed 0 END stats sizes 对象数据统计 stats sizes STAT 64 2 END ///////////// 注意点 1,必须长度适合才可以存取<bytes> 2,set 存取 3,add只能添加不存在的key 4,replace只能操作存在的key 5, gets版本书+1 6,cas check and set 多客户端并发 版本号匹配才可以存取 set mc 12 0 9 memcached STORED get mc VALUE mc 12 9 memcached END add mc 12 0 9 hadooperr NOT_STORED get mc VALUE mc 12 9 memcached END repalace mc 12 0 9 ERROR get mc VALUE mc 12 9 memcached END replace mc 12 0 9 memcachef STORED get mc VALUE mc 12 9 memcachef END replace kkk 0 0 5 mmmmm NOT_STORED gets mc VALUE mc 12 9 5 memcachef END get mc VALUE mc 12 9 memcachef END gets mc VALUE mc 12 9 5 memcachef END replace mc 12 0 9 memcacheg STORED get mc VALUE mc 12 9 memcacheg END gets mc VALUE mc 12 9 6 memcacheg END cas mc 21 0 9 6 gets mc STORED gets mc VALUE mc 21 9 7 gets mc END ---只有版本号匹配才可以存取,否则EXISTS cas mc 12 0 9 6 memcaches EXISTS gets mc VALUE mc 21 9 7 gets mc END
HttpClient is an Apache project: http://hc.apache.org/
The documentation is fairly complete: http://hc.apache.org/httpcomponents-client-ga/tutorial/html/
Rather than repeating it here, this post summarizes a few issues encountered while building a demo:
1. Use a connection pool. Although HTTP is a connectionless protocol, it runs over TCP, so underneath the client still has to establish a connection to the server. A program that fetches many pages from the same site should use a connection pool; otherwise every fetch sets up and tears down a connection, which is inefficient, and a moment's carelessness in releasing resources can get your connections refused (many sites reject large numbers of connections from a single IP to guard against DoS attacks). A connection-pool example:
SchemeRegistry schemeRegistry = new SchemeRegistry();
schemeRegistry.register(new Scheme("http", 80, PlainSocketFactory.getSocketFactory()));
schemeRegistry.register(new Scheme("https", 443, SSLSocketFactory.getSocketFactory()));

PoolingClientConnectionManager cm = new PoolingClientConnectionManager(schemeRegistry);
cm.setMaxTotal(200);             // maximum connections in the whole pool
cm.setDefaultMaxPerRoute(2);     // default maximum connections per route

// Raise the limit for specific sites we crawl heavily
HttpHost googleResearch = new HttpHost("research.google.com", 80);
HttpHost wikipediaEn = new HttpHost("en.wikipedia.org", 80);
cm.setMaxPerRoute(new HttpRoute(googleResearch), 30);
cm.setMaxPerRoute(new HttpRoute(wikipediaEn), 50);
SchemeRegistry registers the default port for each protocol. PoolingClientConnectionManager is the connection pool: setMaxTotal sets the maximum total number of connections, setDefaultMaxPerRoute sets the default number of connections per route (http://hc.apache.org/httpcomponents-client-ga/tutorial/html/connmgmt.html#d5e467), and setMaxPerRoute sets the maximum for a specific site individually.
Getting an HTTP client backed by the pool is also convenient:
DefaultHttpClient client = new DefaultHttpClient(cm);
2. Set HttpClient parameters.
HttpClient needs suitable parameters to work well. The defaults can cope with light crawling, but finding a good set of parameters often improves results in specific situations. A parameter-setting example:
DefaultHttpClient client = new DefaultHttpClient(cm);
Integer socketTimeout = 10000;
Integer connectionTimeout = 10000;
final int retryTime = 3;
client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, socketTimeout);
client.getParams().setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, connectionTimeout);
client.getParams().setParameter(CoreConnectionPNames.TCP_NODELAY, false);
client.getParams().setParameter(CoreConnectionPNames.SOCKET_BUFFER_SIZE, 1024 * 1024);

HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler() {
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        if (executionCount >= retryTime) {
            // Do not retry if over max retry count
            return false;
        }
        if (exception instanceof InterruptedIOException) {
            // Timeout
            return false;
        }
        if (exception instanceof UnknownHostException) {
            // Unknown host
            return false;
        }
        if (exception instanceof ConnectException) {
            // Connection refused
            return false;
        }
        if (exception instanceof SSLException) {
            // SSL handshake exception
            return false;
        }
        HttpRequest request = (HttpRequest) context.getAttribute(ExecutionContext.HTTP_REQUEST);
        boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
        // Retry only if the request is considered idempotent
        return idempotent;
    }
};
client.setHttpRequestRetryHandler(myRetryHandler);
Issuing a request with common headers set, and transparently decompressing gzip-encoded responses:

HttpResponse response = null;
HttpGet get = new HttpGet(url);
get.addHeader("Accept", "text/html");
get.addHeader("Accept-Charset", "utf-8");
get.addHeader("Accept-Encoding", "gzip");
get.addHeader("Accept-Language", "en-US,en");
get.addHeader("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22");
response = client.execute(get);
HttpEntity entity = response.getEntity();
Header header = entity.getContentEncoding();
if (header != null) {
    HeaderElement[] codecs = header.getElements();
    for (int i = 0; i < codecs.length; i++) {
        if (codecs[i].getName().equalsIgnoreCase("gzip")) {
            // Wrap the entity so the gzipped body is decompressed on read
            response.setEntity(new GzipDecompressingEntity(entity));
        }
    }
}
return response;
See http://kb.cnblogs.com/page/92320/ for what each header means.
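After the response arrives, the entity must still be consumed so the underlying connection goes back to the pool; forgetting this is a common cause of pool exhaustion. A minimal sketch using EntityUtils (fetchPage is a hypothetical wrapper around the GET code above):

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.util.EntityUtils;

HttpResponse response = fetchPage(client, "http://en.wikipedia.org/");  // fetchPage: hypothetical helper
HttpEntity entity = response.getEntity();
try {
    // toString() reads the stream to the end, releasing the connection back to the pool
    String body = EntityUtils.toString(entity, "UTF-8");
    System.out.println("fetched " + body.length() + " chars");
} finally {
    // Ensure the entity is fully consumed even if parsing throws
    EntityUtils.consume(entity);
}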
HDFS: a distributed file system, highly fault-tolerant, designed to run on low-cost hardware.
HDFS architecture
A classic master/slave structure: one NameNode to many DataNodes, NameNode(1)-----(*)DataNode. The NameNode stores the metadata; the DataNodes store the actual file data.
HDFS design goals
Node failure is assumed to be the norm: any single node can fail without affecting use (automatic replication keeps backup copies).
A simple coherency model: a write-once, read-many access pattern is assumed.
Streaming data access.
No support for concurrent writes to a file.
No support for modifying a file once written.
Portability across heterogeneous platforms.
Reference: http://coderplay.iteye.com/blog/1067463
HDFS is a poor fit for storing small files, for random reads, and for file modification.
HDFS concepts
NameNode: stores the metadata, held in memory and persisted to disk; maintains the file -> block -> DataNode mappings.
DataNode: stores the file contents on disk; maintains the mapping from block ids to the local files holding them.
SecondaryNameNode: checkpoints the NameNode: it copies the fsimage and edit log from the NameNode to a temporary directory, merges the edit log into the fsimage to produce a new fsimage, uploads the new fsimage to the NameNode, and clears the NameNode's edit log.
The NameNode is a single point of failure. The SecondaryNameNode can be thought of as a backup of the NameNode, but it is not a hot standby: failing over to it can lose any edits made since the last checkpoint.
Blocks
The data block is the most basic unit of storage in HDFS; the default block size is 64 MB.
Benefits: a block is an abstraction independent of any single disk's size, and a file's blocks can be stored across multiple nodes.
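Block size (and the replication factor) can also be chosen per file at create time. A minimal sketch, assuming the classic FileSystem.create overload from the Hadoop 1.x-era API; the path and values are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        short replication = 3;                 // copies kept of each block
        long blockSize = 64L * 1024 * 1024;    // 64 MB, the HDFS default
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(new Path("/tmp/blockdemo.txt"),
                true, 4096, replication, blockSize);
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}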
The HDFS command-line interface
The hadoop fs command
Its subcommands resemble their Linux counterparts; a Java FileSystem sketch mapping several of them to API calls follows the list.
ls (list a directory, one level), lsr (recursive)
du (show each file and its size), dus (total size of a directory), count (number of files and total size)
hadoop fs -help mv   (help for a subcommand)
mv (move)
cp (copy)
rm (delete), rmr (recursive delete)
mkdir (create a directory)
put (copy from the local file system to HDFS)
get (copy from HDFS to the local file system)
getmerge (merge the files in an HDFS directory into a single local file)
copyFromLocal (copy from local to HDFS, same as put)
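The same operations are available programmatically. A minimal sketch against the Hadoop FileSystem API, mapping a few of the commands above to API calls; the paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo"));                           // hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                new Path("/user/demo/remote.txt"));                  // hadoop fs -put / -copyFromLocal
        for (FileStatus s : fs.listStatus(new Path("/user/demo"))) { // hadoop fs -ls
            System.out.println(s.getPath() + " " + s.getLen());
        }
        fs.copyToLocalFile(new Path("/user/demo/remote.txt"),
                new Path("/tmp/copy.txt"));                          // hadoop fs -get
        fs.delete(new Path("/user/demo"), true);                     // hadoop fs -rmr (recursive)
        fs.close();
    }
}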