Reposted from: http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241
By Kashyap Santoki, July 29, 2010
In today's information-saturated world, the huge growth of geographically distributed data necessitates a system that facilitates fast parsing for the retrieval of meaningful results. A searchable index for distributed data would go a long way toward speeding the process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create an index on data residing in HDFS, and how to search those indexes. The development environment consists of Java 1.6, Eclipse 3.4.2, Lucene 2.4.0, and Hadoop 0.19.1 running on Microsoft Windows XP SP3.
For tackling this task, I've turned to Hadoop. The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and most importantly, high-throughput access to application data.
Creating an Index on a Local File System
The first step is to create an index on the data stored in a local file system. Start by creating a project in Eclipse, creating a class in it, then adding all the required JAR files to the project. Take this example of data found in the web server log file
of an application:
2010-04-21 02:24:01 GET /blank 200 120
This data maps to the following fields:
- 2010-04-21 -- Date field
- 02:24:01 -- Time field
- GET -- Method field (GET or POST) -- we will denote it as "cs-method"
- /blank -- Requested URL field -- we will denote it as "cs-uri"
- 200 -- Status-code for the request -- we will denote it as "sc-status"
- 120 -- Time-taken field (time required to complete request)
Our sample data file, "Test.txt", is located in "E:\DataFile" and contains the following:
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:01 GET /US/registrationFrame 200 605
2010-04-21 02:24:02 GET /US/kids/boys 200 785
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /US/search 200 658
2010-04-21 02:24:05 POST /US/shop 200 768
2010-04-21 02:24:05 GET /blank 200 347
We want to create an index for the data in this "Test.txt" file and save the index to the local file system. The following Java code does this (see the comments for details on what each part of the code does):
// Creating an IndexWriter object and specifying the path where the
// index files are to be stored.
IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);
// Creating a BufferedReader object and specifying the path of the file
// whose data is to be indexed.
BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));
String row = null;
// Reading each line present in the file.
while ((row = reader.readLine()) != null)
{
    // Getting each field of a row into an array; the field delimiter is a space.
    String[] arow = row.split(" ");
    // For each row, creating a document and adding data to the document
    // with the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Adding the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
reader.close();
Once the Java code is executed, index files will be created and stored at the location "E://DataFile/IndexFiles."
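As a quick sanity check (not part of the original listing, but using the same Lucene 2.4 API), you can open the index and print its document count; with the sample file above it should report 11 documents:
// Opening the index we just wrote and printing the document count.
IndexReader indexReader = IndexReader.open("E://DataFile/IndexFiles");
System.out.println("Documents in index = " + indexReader.numDocs());
indexReader.close();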
Searching the Local Index Files
Now we can search for data in the index files that we just created. Search is done on the "field" data: you can use any of the query semantics supported by the Lucene search engine, and you can search one particular field or a combination of fields (a combined-field sketch follows the output below). The following Java code searches the index:
// Creating a Searcher object and specifying the path where the index files are stored.
Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
Analyzer analyzer = new StandardAnalyzer();
// Printing the total number of documents (entries) present in the index.
System.out.println("Total Documents = " + searcher.maxDoc());
// Creating the QueryParser object and specifying the field name on
// which the search is to be done.
QueryParser parser = new QueryParser("cs-uri", analyzer);
// Creating the Query object and specifying the text to search for.
Query query = parser.parse("/blank");
// Performing the search on the index.
Hits hits = searcher.search(query);
// Printing the number of documents (entries) that match the search query.
System.out.println("Number of matching documents = " + hits.length());
// Printing the documents (rows of the file) that matched the search criteria.
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}
In this example, the search is done on the field cs-uri, and the text searched for inside the cs-uri field is /blank. So when the search code is run, all the documents (rows) whose cs-uri field contains /blank are shown in the output. The output is as follows:
Total Documents = 11
Number of matching documents = 7
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /blank 200 347
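As mentioned above, fields can also be combined. Here is a minimal sketch (not from the original article) that uses Lucene's BooleanQuery to find all GET requests that returned status 200. Note that TermQuery bypasses the analyzer, and StandardAnalyzer lowercases tokens at index time, so the method term must be given as "get":
// Combining two fields: cs-method must be GET and sc-status must be 200.
BooleanQuery combined = new BooleanQuery();
combined.add(new TermQuery(new Term("cs-method", "get")), BooleanClause.Occur.MUST);
combined.add(new TermQuery(new Term("sc-status", "200")), BooleanClause.Occur.MUST);
Hits combinedHits = searcher.search(combined);
System.out.println("GET requests with status 200 = " + combinedHits.length());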
Memory-Based Indexing on HDFS
Now consider a case where the data is located in a distributed file system like Hadoop DFS. The preceding code will not work for directly creating an index on distributed data, so we would have to complete a few extra steps first: copy the data from HDFS to a local file system, create an index on the local copy, and finally store the index files back to HDFS. The same steps would be required for searches. This approach is time-consuming and suboptimal (a rough sketch of it follows), so instead, let's index and search our data using the memory of the HDFS node where the data resides.
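For contrast, the copy-based approach just described would look roughly like this (a sketch under assumed paths, using the standard Hadoop FileSystem copy methods):
// 1. Copy the data file from HDFS to the local file system.
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
hdfs.copyToLocalFile(new Path("/DataFile/Test.txt"), new Path("E://DataFile/Test.txt"));
// 2. Index the local copy with the code from the previous section.
// 3. Copy the resulting index files back to HDFS.
hdfs.copyFromLocalFile(new Path("E://DataFile/IndexFiles"), new Path("/IndexFiles"));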
Assume that the data file "Test.txt" used earlier now resides on HDFS, at "/DataFile/Test.txt" under the working directory. Create another folder called "/IndexFiles" inside the HDFS working directory, where the generated index files will be stored.
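If the "/IndexFiles" folder does not exist yet, it can be created programmatically; a minimal sketch (my own, not from the article) using the Hadoop FileSystem API:
// Creating the /IndexFiles folder under the HDFS working directory.
FileSystem dfs = FileSystem.get(new Configuration());
dfs.mkdirs(new Path(dfs.getWorkingDirectory() + "/IndexFiles/"));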
The following Java code creates index files in memory for files stored on HDFS:
// Path where the index files will be stored.
String Index_DIR = "/IndexFiles/";
// Path where the data file is stored.
String File_DIR = "/DataFile/Test.txt";
// Creating a FileSystem object, to be able to work with HDFS.
Configuration config = new Configuration();
FileSystem dfs = FileSystem.get(config);
// Creating a RAMDirectory (memory) object, to be able to create the index in memory.
RAMDirectory rdir = new RAMDirectory();
// Creating an IndexWriter object for the RAM directory.
IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);
// Creating an FSDataInputStream object for reading the data from the
// "Test.txt" file residing on HDFS, and wrapping it in a BufferedReader
// for line-by-line reading.
FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
BufferedReader reader = new BufferedReader(new InputStreamReader(filereader));
String row = null;
// Reading each line present in the file.
while ((row = reader.readLine()) != null)
{
    // Getting each field of a row into an array; the field delimiter is a space.
    String[] arow = row.split(" ");
    // For each row, creating a document and adding data to the document
    // with the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Adding the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
reader.close();
Thus, for the "Test.txt" file residing on HDFS, we now have index files created in memory. To store the index files in the HDFS folder:
// Getting the names of the files present in memory into an array.
String[] fileList = rdir.list();
// Reading index files from memory and storing them to HDFS.
for (int i = 0; i < fileList.length; i++)
{
    IndexInput indxfile = rdir.openInput(fileList[i].trim());
    long len = indxfile.length();
    int len1 = (int) len;
    // Reading data from the file into a byte array.
    byte[] bytarr = new byte[len1];
    indxfile.readBytes(bytarr, 0, len1);
    // Creating a file in the HDFS directory with the same name as the
    // index file.
    Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
    dfs.createNewFile(src);
    // Writing data from the byte array to the file in HDFS.
    FSDataOutputStream fs = dfs.create(src, true);
    fs.write(bytarr);
    fs.close();
}
dfs.closeAll();
Now we have the necessary index files created and stored in the HDFS directory for the "Test.txt" data file.
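To confirm that the transfer worked, you can list the files now present under "/IndexFiles" (a quick check of my own, not part of the original listing):
// Listing the index files that now exist in the HDFS directory.
FileStatus[] files = dfs.listStatus(new Path(dfs.getWorkingDirectory() + Index_DIR));
for (int i = 0; i < files.length; i++)
{
    System.out.println(files[i].getPath().getName() + " (" + files[i].getLen() + " bytes)");
}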
Memory-Based Searching on HDFS
We can now search the indexes stored in HDFS. First, we must make the HDFS index files available in memory for searching. Here is the code for this process:
// Creating a FileSystem object, to be able to work with HDFS.
Configuration config = new Configuration();
FileSystem dfs = FileSystem.get(config);
// Creating a RAMDirectory (memory) object, to hold the index in memory.
RAMDirectory rdir = new RAMDirectory();
// Getting the list of index files present in the HDFS directory into an array.
Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
String[] filelst = fsdir.list();
FSDataInputStream filereader = null;
for (int i = 0; i < filelst.length; i++)
{
    // Reading data from an index file in the HDFS directory into the filereader object.
    filereader = dfs.open(new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]));
    int size = filereader.available();
    // Reading data from the file into a byte array.
    byte[] bytarr = new byte[size];
    filereader.read(bytarr, 0, size);
    // Creating a file in the RAM directory with the same name as the
    // index file present in the HDFS directory.
    IndexOutput indxout = rdir.createOutput(filelst[i]);
    // Writing data from the byte array to the file in the RAM directory.
    indxout.writeBytes(bytarr, bytarr.length);
    indxout.flush();
    indxout.close();
}
filereader.close();
Now we have all the required index files present in the RAM directory (memory), so we can perform a search on them directly. The search code is similar to that used for the local file system; the only change is that the Searcher object is now created from the RAM directory object (rdir) instead of a local file system directory path.
Searcher searcher = new IndexSearcher(rdir);
Analyzer analyzer = new StandardAnalyzer();
System.out.println("Total Documents = " + searcher.maxDoc());
QueryParser parser = new QueryParser("time", analyzer);
Query query = parser.parse("02\\:24\\:04");
Hits hits = searcher.search(query);
System.out.println("Number of matching documents = " + hits.length());
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}
For the following output, the search is done on the field "time" and the text searched for inside the "time" field is "02:24:04" (written as "02\\:24\\:04" in the query because colons must be escaped in Lucene's query syntax). So when the code is run, all the documents (rows) whose "time" field contains "02:24:04" are shown in the output:
Total Documents = 11
Number of matching documents = 4
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
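Instead of escaping the colons by hand, you can let Lucene do it: QueryParser provides a static escape() method for exactly this purpose (a small alternative sketch):
// Escaping the query text programmatically before parsing it.
Query query = parser.parse(QueryParser.escape("02:24:04"));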
Conclusion
Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, accessing the data you really want to find amid mountains of data you don't care about
gets a little bit easier.