(6): Analyzing Nutch Crawl Results
This tutorial shows how to use Nutch's readdb, readlinkdb, and readseg commands to analyze the data Nutch has collected.
1 readdb
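readdb reads or exports Nutch's crawl database (crawldb) and is most often used to check the state of the database. Run it without arguments to see its usage: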
bin/nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
    <crawldb>    directory name where crawldb is located
    -stats [-sort]    print overall statistics to System.out
        [-sort]    list status sorted by host
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir>
        [-format csv]    dump in Csv format
        [-format normal]    dump in standard format (default option)
        [-format crawldb]    dump as CrawlDB
        [-regex <expr>]    filter records with expression
        [-retry <num>]    minimum retry count
        [-status <status>]    filter records by CrawlDatum status
    -url <url>    print information on <url> to System.out
    -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir>
        [<min>]    skip records with scores below this value.
                   This can significantly improve performance.
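Here crawldb is the database that stores URL information; for details see http://www.sanesee.com/article/step-by-step-nutch-crawl-by-step (Nutch 1.10 Getting Started Tutorial (5): Step-by-Step Crawling). -stats prints summary statistics, -dump exports the contents of the database, and -url prints the information stored for a given URL. To view the database status:
bin/nutch readdb data/crawldb -stats
The resulting statistics look like this: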
Statistics for CrawlDb: data/crawldb
TOTAL urls: 59
retry 0: 59
min score: 0.001
avg score: 0.049677964
max score: 1.124
status 1 (db_unfetched): 34
status 2 (db_fetched): 25
CrawlDb statistics: done
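TOTAL urls is the total number of URLs in the database. Each retry N line counts the URLs whose retry count is N (here all 59 URLs have a retry count of 0). min score, avg score, and max score are the lowest, average, and highest URL scores. status 1 (db_unfetched) is the number of URLs not yet fetched, and status 2 (db_fetched) is the number already fetched.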
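When these numbers need to be checked repeatedly (for example after every crawl round), it can help to turn the -stats text into a data structure. The following is a minimal sketch that assumes the output has the line layout shown above; `parse_crawldb_stats` and `STATS_OUTPUT` are hypothetical names used only for illustration, not part of Nutch.

```python
import re

# Sample text copied from the `readdb -stats` output shown above.
STATS_OUTPUT = """\
Statistics for CrawlDb: data/crawldb
TOTAL urls: 59
retry 0: 59
min score: 0.001
avg score: 0.049677964
max score: 1.124
status 1 (db_unfetched): 34
status 2 (db_fetched): 25
CrawlDb statistics: done
"""


def parse_crawldb_stats(text):
    """Hypothetical helper: collect the counters from `readdb -stats` text."""
    stats = {"total": None, "retry": {}, "score": {}, "status": {}}
    for line in text.splitlines():
        if m := re.search(r"TOTAL urls:\s*(\d+)", line):
            stats["total"] = int(m.group(1))
        elif m := re.search(r"retry (\d+):\s*(\d+)", line):
            # Maps retry count -> number of URLs with that retry count.
            stats["retry"][int(m.group(1))] = int(m.group(2))
        elif m := re.search(r"(min|avg|max) score:\s*([-\d.eE]+)", line):
            stats["score"][m.group(1)] = float(m.group(2))
        elif m := re.search(r"status (\d+) \((\w+)\):\s*(\d+)", line):
            # Maps status name (e.g. db_fetched) -> number of URLs.
            stats["status"][m.group(2)] = int(m.group(3))
    return stats


stats = parse_crawldb_stats(STATS_OUTPUT)
fetched = stats["status"].get("db_fetched", 0)
print(f"fetched {fetched} of {stats['total']} URLs")
```

In practice the text would come from the command's captured standard output rather than from an embedded string; lines that do not match any pattern are simply ignored.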
To export the contents of crawldb:
bin/nutch readdb data/crawldb -dump crawldb_dump
This writes the data to the crawldb_dump directory. To view the exported data:
cat crawldb_dump/*
The exported records look similar to the following:
http://www.sanesee.com/psy/pdp Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Aug 14 12:47:10 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.082285136
Signature: e567e99a1d008ae29266a7ef9ea43414
Metadata:
_pst_=success(1), lastModified=0
_rs_=205
Content-Type=text/html
This shows clearly how crawldb stores each URL: its status, fetch time, retry interval, score, signature, and metadata.
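To filter or sort the dumped records, the text dump can be read back into dictionaries. This is a rough sketch that assumes each record starts with a "<url> Version: N" line followed by "Key: value" lines, as in the sample above; `parse_crawldb_dump` and `SAMPLE_DUMP` are hypothetical names, and the metadata lines are deliberately skipped.

```python
import re

# Sample record copied from the crawldb dump shown above.
SAMPLE_DUMP = """\
http://www.sanesee.com/psy/pdp Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Aug 14 12:47:10 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.082285136
Signature: e567e99a1d008ae29266a7ef9ea43414
Metadata:
_pst_=success(1), lastModified=0
_rs_=205
Content-Type=text/html
"""

# Assumption: a record begins with "<url><whitespace>Version: <number>".
RECORD_START = re.compile(r"^(\S+)\s+Version:\s*(\d+)\s*$")


def parse_crawldb_dump(text):
    """Hypothetical helper: return one dict per dumped crawldb record."""
    records = []
    current = None
    for line in text.splitlines():
        start = RECORD_START.match(line)
        if start:
            current = {"url": start.group(1), "Version": start.group(2)}
            records.append(current)
        elif current is not None and ": " in line:
            # "Key: value" field lines; key=value metadata lines are skipped.
            key, value = line.split(": ", 1)
            current[key.strip()] = value.strip()
    return records


records = parse_crawldb_dump(SAMPLE_DUMP)
for record in records:
    print(record["url"], record["Status"], float(record["Score"]))
```

The usage text above also lists a `-format csv` option for -dump; if that format suits the task, a CSV reader is likely simpler than parsing the default layout.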
2 readlinkdb
readlinkdb exports all URLs in the link database together with their anchor text. View its usage:
bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> (-dump <out_dir> [-regex <regex>]) | -url <url>
    -dump <out_dir>    dump whole link db to a text file in <out_dir>
        -regex <regex>    restrict to url's matching expression
    -url <url>    print information about <url> to System.out
The -dump and -url options work the same way as in readdb. To export the data:
bin/nutch readlinkdb data/linkdb -dump linkdb_dump
This writes the data to the linkdb_dump directory. To view the exported data:
cat linkdb_dump/*
The exported records look similar to the following:
http://archive.apache.org/dist/nutch/ Inlinks:
fromUrl: http://www.sanesee.com/article/step-by-step-nutch-introduction anchor: http://archive.apache.org/dist/nutch/
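That is, for each URL the linkdb records the source URLs that link to it (its inlinks), along with the anchor text of each link.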
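The inlink records can be grouped by target URL with a short script, which makes it easy to count how many pages link to each URL. This is a sketch under the assumption that the dump follows the two-line layout shown above ("<url> Inlinks:" followed by "fromUrl: ... anchor: ..." lines); `parse_linkdb_dump` and `SAMPLE_LINKDB_DUMP` are hypothetical names.

```python
import re
from collections import defaultdict

# Sample record copied from the linkdb dump shown above.
SAMPLE_LINKDB_DUMP = """\
http://archive.apache.org/dist/nutch/ Inlinks:
fromUrl: http://www.sanesee.com/article/step-by-step-nutch-introduction anchor: http://archive.apache.org/dist/nutch/
"""

# Assumed line shapes, based only on the sample output above.
TARGET_LINE = re.compile(r"^(\S+)\s+Inlinks:\s*$")
INLINK_LINE = re.compile(r"^\s*fromUrl:\s*(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(text):
    """Hypothetical helper: map target URL -> list of (source URL, anchor)."""
    inlinks = defaultdict(list)
    target = None
    for line in text.splitlines():
        if m := TARGET_LINE.match(line):
            target = m.group(1)
        elif target and (m := INLINK_LINE.match(line)):
            inlinks[target].append((m.group(1), m.group(2).strip()))
    return dict(inlinks)


inlinks = parse_linkdb_dump(SAMPLE_LINKDB_DUMP)
for target, sources in inlinks.items():
    print(f"{target} has {len(sources)} inlink(s)")
    for from_url, anchor in sources:
        print(f"  from {from_url} (anchor: {anchor})")
```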
3 readseg
readseg views or exports the data stored in a segment. View its usage:
bin/nutch readseg
Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]
* General options:
    -nocontent    ignore content directory
    -nofetch    ignore crawl_fetch directory
    -nogenerate    ignore crawl_generate directory
    -noparse    ignore crawl_parse directory
    -noparsedata    ignore parse_data directory
    -noparsetext    ignore parse_text directory
* SegmentReader -dump <segment_dir> <output> [general options]
  Dumps content of a <segment_dir> as a text file to <output>.
    <segment_dir>    name of the segment directory.
    <output>    name of the (non-existent) output directory.
* SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
  List a synopsis of segments in specified directories, or all segments in
  a directory <segments>, and print it on System.out
    <segment_dir1> ...    list of segment directories to process
    -dir <segments>    directory that contains multiple segments
* SegmentReader -get <segment_dir> <keyValue> [general options]
  Get a specified record from a segment, and print it on System.out.
    <segment_dir>    name of the segment directory.
    <keyValue>    value of the key (url).
    Note: put double-quotes around strings with spaces.
To export the segment data:
bin/nutch readseg -dump data/segments/20150715124521 segment_dump
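This writes the data to the segment_dump directory. To view the exported data:
cat segment_dump/*
The dump contains very detailed information about each fetched page.
This concludes the tutorial's coverage of the most important Nutch commands; readers can explore the remaining commands on their own.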