[Discussion] Looking at www.csdn.net. Question 1: why not use full-text search? (technical discussion only)

Source: Internet. Published: 2015-06-27

It's odd that CSDN has no full-text search.

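Search on CSDN today is slow and not very useful. Anyone can see that it runs against the database, probably with a SQL statement along the lines of XXX Like %ABC%. As everyone knows, this kind of query gets slower and slower as the data grows, since a pattern with a leading wildcard generally cannot use an ordinary index and ends up scanning the rows. On top of that, operations such as Boolean queries cannot be done on the current system at all.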
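
To make the contrast concrete, here is a minimal Python sketch. It is not CSDN's code: the sample posts and every function name are made up for illustration. `like_scan` imitates a Like %ABC% query by examining every row on every search, while `build_index` builds a word-to-post-id map once, so that Boolean AND/OR queries become set operations.

```python
from collections import defaultdict

# Hypothetical sample posts; the ids and text are made up.
POSTS = {
    1: "how to use vector in java",
    2: "full text search with lucene",
    3: "java full text index question",
}

def like_scan(posts, needle):
    """Imitate LIKE '%needle%': every row is examined on every query."""
    return sorted(pid for pid, text in posts.items() if needle in text)

def build_index(posts):
    """Build a word -> set of post ids mapping once, ahead of any query."""
    index = defaultdict(set)
    for pid, text in posts.items():
        for word in text.split():
            index[word].add(pid)
    return index

def search_and(index, *words):
    """Boolean AND: posts that contain all of the words."""
    sets = [index.get(w, set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

def search_or(index, *words):
    """Boolean OR: posts that contain at least one of the words."""
    return sorted(set().union(*(index.get(w, set()) for w in words)))

INDEX = build_index(POSTS)
```

The scan costs time proportional to the total amount of text on every query, while the index lookup touches only the posting sets of the words asked for. Real engines add ranking, phrase queries and on-disk storage on top of this idea.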

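As I understand it, CSDN's data should be split into 3 parts: 1. Metadata: basic user management, post metadata and so on. This of course belongs in the RDB and is the core of the site. 2. Content: post bodies, news article bodies and the like. This is where CSDN's value lies. 3. Index: the full-text index, dedicated to serving user queries by running full-text search over the structured content. Most of CSDN's data is already XML, so the search results should be very good and the speed should improve a lot.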

So my question is: why doesn't CSDN use full-text search?

Please leave non-technical factors aside and consider only the technology!!!!

This thread is very interesting. I also hope csdn has not just coding but design as well.

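1. Does the full-text engine you have support Chinese keywords? This matters, because full-text search is really a word-based index: the data is indexed mainly using the entries of a prepared dictionary as keywords. Chinese full-text search has the word segmentation problem, so it needs more technology, and I believe some domestic companies actually do this better. The difference is that those companies' products are self-contained and not as productized as oracle's, which third parties can build applications on independently. For domestic products, see www.wondertek.com.cn.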
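
As an illustration of the segmentation problem, here is a small Python sketch of forward maximum matching, one simple dictionary-based technique. It is an assumption chosen for illustration, not what oracle or any of the products mentioned actually does, and the tiny dictionary is made up.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character if none matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical dictionary; a real one would hold many thousands of entries.
DICTIONARY = {"全文", "检索", "全文检索", "引擎", "中文"}
```

The words returned by `segment` would then be the keys of the index, which is why the quality of the dictionary directly affects what users can find.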

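2. Compared with an ordinary index in an rdb, a full text index is clearly faster for fuzzy search, but it makes inserts slower. That is why this kind of index is usually updated asynchronously, or even has to be maintained by hand.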
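
A minimal Python sketch of that asynchronous pattern, under the assumption of a simple in-memory store (the class and method names are hypothetical): `insert` only stores the row and queues its id, and the slower index update happens later in `refresh_index`.

```python
from collections import defaultdict, deque

class AsyncIndexedStore:
    """Rows are stored immediately; the full-text index is updated later."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)
        self.pending = deque()  # ids inserted but not yet indexed

    def insert(self, post_id, text):
        """Fast path: store the row and queue it for indexing."""
        self.rows[post_id] = text
        self.pending.append(post_id)

    def refresh_index(self):
        """Slow path, run periodically or by hand: index everything queued."""
        count = 0
        while self.pending:
            post_id = self.pending.popleft()
            for word in self.rows[post_id].split():
                self.index[word].add(post_id)
            count += 1
        return count

    def search(self, word):
        return sorted(self.index.get(word, set()))
```

The trade-off is visible in use: a post inserted a moment ago is not searchable until the next refresh.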

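3. The crux of your question (it took me a while, but I did read it all) seems to be that you want full-text indexes on several fields, which in an ordinary database makes maintenance difficult, so you want to switch to structured XML queries. In fact there is no essential difference between several fields and one field. For a company that owns its full-text indexing technology, it only means adding some new application logic on top of the core index. I think this need (multi-field fuzzy search, I mean) is very common, and if those companies are smart they will turn it into a new product feature.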

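My suggestion: building applications on a traditional RDB plus a full-text index is a fairly good approach. Using XML raises a question: will you keep the performance a relational database has on conventional data?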

Apache has a project called Lucene, a Java-based full-text search toolkit.

"probably a SQL statement along the lines of XXX Like %ABC%"
how do you know?
where can you see it's not using any fulltext engine?

fulltext engine is fast, but requires extra management and maintenance. if people don't want to do that, this might be the reason.

by the way, if you already have a fulltext search engine, just use it. it'll be a waste of money if you buy it but not use it.
:)

just tried the search. do you mean it's only searching subject?
guess that's true.
But, my friend, a good search engine may require substantial work. You'll have to support advanced search like google does. And you'll have to filter out too-common words like "is", "and" etc.
you'd better show a preview of the searched text (for example, if I search "vector", you'd better display the sentence containing "vector" in the search result) and highlight the searched text.
blah blah blah.
Don't know if DB2 fulltext search engine can do so. But seems Microsoft Search does not.
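
Two of the features listed above, stop-word filtering and a highlighted hit preview, can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the `**` marker and the function names are assumptions, and the sentence splitting is deliberately naive.

```python
import re

# Hypothetical, tiny stop-word list.
STOP_WORDS = {"is", "and", "the", "a", "of"}

def query_terms(query):
    """Lower-case the query and drop the too-common words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def hit_context(text, term, marker="**"):
    """Return the first sentence containing term, with the term highlighted,
    or None when the term does not occur."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if pattern.search(sentence):
            return pattern.sub(lambda m: marker + m.group(0) + marker, sentence)
    return None
```

In a web page the marker would typically be replaced by an HTML tag around the matched text.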

On the other hand, like "xxx" is much simpler to implement.

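First, many thanks to leonzhao (灯泡) for that material, and I also admire 灯泡's dedication to digging into the technology!
1. I'm not very confident yet about the performance of full-text search. Is that kind of search really noticeably faster than a SQL query!?
2. I looked at the source. Many of the elements in it are well suited to search, but there is a problem: once a post is found, the search cannot return the post's URL, an address like http://www.csdn.net/expert/topic/744/744146.xml?temp=.6563532 (the XML source has no such element). So how would you list links to all the matching posts!? (Of course, the full-text engine may already solve this.)
3. Are there free full-text search engines, or do they cost money!? (This may be a non-technical question, but it could be a very important one!)
4. All of the above aside, I also think full-text search should be used!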
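
On point 2, one common answer is that the index stores a document key (for example the post id) next to each word, and the link is derived from that key rather than read from the XML. A minimal Python sketch, assuming the directory in the example URL is the post id divided by 1000; that rule is only inferred from the single URL quoted above and may well be wrong.

```python
def topic_url(topic_id):
    """Build a post URL from its id. The topic_id // 1000 directory is an
    assumption inferred from the one example URL in the thread."""
    return "http://www.csdn.net/expert/topic/%d/%d.xml" % (topic_id // 1000, topic_id)

def search_urls(index, word):
    """index maps word -> set of post ids; the ids act as document keys."""
    return [topic_url(tid) for tid in sorted(index.get(word, set()))]

# Hypothetical index content for illustration.
SAMPLE_INDEX = {"xml": {744146, 744200}}
```

Whatever the real URL scheme is, the point stands: the search result only needs to carry the key, and the web tier turns keys into links.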

Yeah! Damn M$! I went and looked at several sites with full-text search. It's good technology, so why not use it!?

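The server can't take the load.

We did build this once but never released it. On the database server, a query joining two tables simply could not run any more; with the full-text search service removed, it was fine again.

Database server:
sql server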

MS has Microsoft Search as the search engine. It can be integrated with SQL Server. (not sure about other rdbms though).
the fulltext index is stored outside of the rdbms. (but maybe cannot be on another independent machine) But of course, can be viewed as part of it.

fulltext index is certainly maintained asynchronously.

Till sql2k, no support for hit context, no support for xml-specific search (maybe I'm wrong, just cannot find it in Books Online). It supports predicates such as CONTAINS, CONTAINSTABLE, FREETEXT, FREETEXTTABLE.

leonzhao, if you store your fulltext engine on another physical machine, how do you do queries like "where id=xxx and contains(xmltext, 'xml')"?

even if it's technically feasible, would it generate unnecessary network traffic if you want to return the hit context or even the whole hit document? (does the fulltext server have to return the document to the db server before returning?)

I would suggest using the same machine with multiple disk controllers.

"SQL Server, a database that is said to be really terrible"? :)
many ms bashers like to say things like that. but it is at least the third best rdb product, closely following oracle and db2. It still has a long way to go compared to Oracle. But, compared to Oracle, sql server is like just born yesterday. what do you expect?

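csdn's largest tables
user table        238547
post table        692451
post reply table  4735193

Of course, the post table and the forum table are split by time.

After I built a full-text index on the post table, the view that joins the forum reply table with the forum user table (needed to get each user's current reputation score and level) could no longer return any data. Once the full-text index was dropped, it worked again. That is why csdn never turned on full-text search.

The database we use is sql server 2000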

don't understand either. you may go to google to search relevant issues. maybe it's a bug or misuse of something.

"not considering the availability of their own system"?
don't know what you mean. But the several systems I've seen using sql server do have that availability. sql is cheap, easy to develop, easy to maintain. only performance may be a drawback. but not too bad though.

"the query is still sent to the RDB Server"
yes, then the rdb is responsible for returning the hit context?
no matter whether the context is returned the first time or in a subsequent query, how do you avoid the network traffic?
in an arch like web server - rdb+fulltext server
the only traffic is between the web server and that one server.
but in your arch, web server - rdb server - fulltext server
how do you avoid the network traffic between fulltext and rdb?
the same text would have to be transferred from fulltext to rdb, then from rdb to client.
unless the fulltext server can return data directly to the web server.
of course, you may argue that the traffic is not a big deal.

"generally only CPU performance affects search performance"
don't think so. if things were that easy, then buying several super cpus would be the easiest solution.
normally, all db apps depend on the hard disk and the algorithm. if you could load all data into memory, many algorithms would be useless. the whole db theory would have to be rewritten. :)
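
On the network traffic question raised in this post, one possible way to keep it small (an assumption for illustration, not a description of leonzhao's architecture or of any product) is to have both servers return only ids, intersect them, and fetch full text for just the page being displayed. A Python sketch with stand-in data:

```python
def combined_query(rdb_ids, fulltext_ids, fetch_body, page_size=10):
    """Intersect the ids from both servers, then fetch bodies for one page only.
    Returns the total hit count and the (id, body) pairs of the first page."""
    hits = sorted(set(rdb_ids) & set(fulltext_ids))
    page = hits[:page_size]
    return len(hits), [(doc_id, fetch_body(doc_id)) for doc_id in page]

# Stand-ins for the two servers (hypothetical data).
rdb_ids = range(0, 100, 2)        # ids matching the relational predicate
fulltext_ids = range(0, 100, 3)   # ids matching the full-text predicate
fetched = []                      # records which bodies actually crossed the network

def fake_fetch(doc_id):
    fetched.append(doc_id)
    return "body of %d" % doc_id

total, page = combined_query(rdb_ids, fulltext_ids, fake_fetch, page_size=5)
```

Only ids travel between the servers for the full result set; document text moves once, and only for the rows the user will actually see. Whether that is good enough depends on how large the id lists get.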

by the way, what was the problem you got with sql7? i've been using it for many years. Although it does have problems, normally you can work around them. can't imagine what could make you think it is "too terrible".

8 million is not too much yah. We are having a system with almost 20 million records in a table, it works well. normally we don't have any trouble with it.
maybe you've got lots of experience in db apps, but some theory would really help. :)
many rdb algorithms, indexes, blah blah blah are designed around the fact that hard disk speed is much, much slower than memory. why else would a B+ or B* tree on disk be different from a normal B tree in memory?
I would say: CPU certainly affects performance, but nowhere near as visibly as the hard disk.
in short, rdb is not a cpu intensive app, HARD DISK is almost always the bottleneck.
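
A rough piece of arithmetic behind that argument, as a Python sketch. It treats the 4735193 figure quoted earlier for the reply table as a row count and assumes about 200 keys per index page; both are assumptions for illustration. The number of levels is roughly the number of pages touched per lookup, and each uncached page can mean a disk read.

```python
def tree_levels(rows, fanout):
    """Smallest number of levels such that fanout ** levels >= rows."""
    levels = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels

REPLY_ROWS = 4735193   # figure quoted in the thread for the reply table
WIDE_FANOUT = 200      # assumed keys per disk page
BINARY_FANOUT = 2      # a plain binary tree, for comparison

wide = tree_levels(REPLY_ROWS, WIDE_FANOUT)
narrow = tree_levels(REPLY_ROWS, BINARY_FANOUT)
```

With these assumptions `wide` is 3 and `narrow` is 23: a page-sized node keeps a lookup to a handful of disk reads, where a narrow in-memory style tree would need many more, which is the sense in which the structures are designed around the disk rather than the cpu.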

your comment on fulltext may be right. not sure.
would you give more info on DSG? "the hit results returned are incorrect", what do you mean?

by the way, I interviewed with another company, they had terabytes of data in sql server, and I did not hear any complaint from the tech leader there. :)

sql server is based on nt, nt can never compete with unix on stability.
if you want 0 downtime for a month, better use unix+oracle.

good to know dsg, thank you! (what does the acronym stand for?)

if you have a 2G db, with 4G memory, you'll see the performance boost for sure. no swapping at all. (if you pin the whole db into memory). Of course, memory can never hold the full db in practice.
:)

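I support building analysis and queries on XML documents rather than on large fields in an RDB.
Many RDBs now support full-text indexes on fields, but I think
full-text search built on XML documents has the brighter future.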

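From my past experience with Oracle's full-text search: because the DB has to
update the index whenever data is added or updated, for large batch loads you
end up having to turn the full-text engine off, which is a hassle.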

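Another benefit of storing content as XML is that it will be easier to provide an interface for a KB later.
As the amount of content grows geometrically, keyword-based full-text search is bound to be replaced by
knowledge search based on semantic analysis. Our overseas parent company's KB product line is already
widely deployed in government, the judiciary and large corporations, and we have done a lot of Chinese localization work,
but it will take time for this to catch on here. Let's wait for the sea breeze to arrive.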

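As for hardware and software efficiency, I think no amount of reasoning or guessing beats building a simple prototype
and testing it at the data volume of the real future deployment.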

Good luck, 灯泡

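A question for the experts:
store each post's title, author and similar information in the database,
store the post content as xml files in a directory on disk,
and then use win2k's Index Server for the full-text index.
Is this approach feasible?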
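
The split described in this question can be sketched in Python as follows. It says nothing about Index Server itself: the in-memory dictionaries stand in for the database and for the directory of xml files, and the `<post>`/`<body>` element names, sample rows and function names are all assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Metadata as it might sit in the database (hypothetical rows).
METADATA = {
    1: {"title": "Why no full-text search?", "author": "someone"},
    2: {"title": "Vector question", "author": "another"},
}

# Post content as XML; in the proposal these would be files on disk.
XML_FILES = {
    1: "<post><body>full text search for csdn</body></post>",
    2: "<post><body>how does vector grow</body></post>",
}

def build_index(xml_files):
    """Parse each document and index the words of its body element."""
    index = defaultdict(set)
    for post_id, xml_text in xml_files.items():
        body = ET.fromstring(xml_text).findtext("body", default="")
        for word in body.lower().split():
            index[word].add(post_id)
    return index

def search(index, metadata, word):
    """Look the word up in the index, then join the ids back to the metadata."""
    ids = sorted(index.get(word.lower(), set()))
    return [(pid, metadata[pid]["title"]) for pid in ids]

CONTENT_INDEX = build_index(XML_FILES)
```

The post id is the only thing the three parts share, so the index can live outside the database as long as that key stays consistent.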
