Reposted from: http://www.drdobbs.com/parallel/indexing-and-searching-on-a-hadoop-distr/226300241
By Kashyap Santoki, July 29, 2010
In today's information-saturated world, the huge growth of geographically distributed data necessitates a system that facilitates fast parsing for the retrieval of meaningful results. A searchable index for distributed data would go a long way toward speeding the process. In this article, I demonstrate how to use Lucene and Java for basic data indexing and searching, how to use a RAM directory for indexing and searching, how to create an index on data residing in HDFS, and how to search those indexes. The development environment consists of Java 1.6, Eclipse 3.4.2, Lucene 2.4.0, and Hadoop 0.19.1 running on Microsoft Windows XP SP3.
For tackling this task, I've turned to Hadoop. The Apache Hadoop Project develops open-source software for reliable, scalable, distributed computing, and the Hadoop Distributed File System (HDFS) is designed for storing and sharing files across wide area networks. HDFS is built to run on commodity hardware and provides fault tolerance, resource management, and most importantly, high-throughput access to application data.
Creating an Index on a Local File System
The first step is to create an index on the data stored in a local file system. Start by creating a project in Eclipse, creating a class in it, then adding all the required JAR files to the project. Take this example of data found in the web server log file
of an application:
2010-04-21 02:24:01 GET /blank 200 120
This data maps to the following fields:
- 2010-04-21 -- Date field
- 02:24:01 -- Time field
- GET -- Method field (GET or POST) -- we will denote it as "cs-method"
- /blank -- Requested URL field -- we will denote it as "cs-uri"
- 200 -- Status-code for the request -- we will denote it as "sc-status"
- 120 -- Time-taken field (time required to complete request)
Our sample data file, "Test.txt", is located in "E:\DataFile" and contains the following:
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:01 GET /US/registrationFrame 200 605
2010-04-21 02:24:02 GET /US/kids/boys 200 785
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /US/search 200 658
2010-04-21 02:24:05 POST /US/shop 200 768
2010-04-21 02:24:05 GET /blank 200 347
We want to create an index for the data in this "Test.txt" file and save the index to the local file system. The following Java code does this (see the comments for details on what each part of the code does):
// Creating an IndexWriter object and specifying the path where the
// index files are to be stored.
IndexWriter indexWriter = new IndexWriter("E://DataFile/IndexFiles", new StandardAnalyzer(), true);
// Creating a BufferedReader object and specifying the path of the file
// whose data is to be indexed.
BufferedReader reader = new BufferedReader(new FileReader("E://DataFile/Test.txt"));
String row = null;
// Reading each line present in the file.
while ((row = reader.readLine()) != null)
{
    // Getting each field of a row into an array; the field delimiter is a space.
    String[] arow = row.split(" ");
    // For each row, creating a document and adding data to the document
    // with the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Adding the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
reader.close();
Once the Java code is executed, index files will be created and stored at the location "E://DataFile/IndexFiles."
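As a quick sanity check (not part of the original listing, but using the same Lucene 2.4 API), you can open the index and print its document count; with the sample file above it should report 11 documents:
// Opening the index we just wrote and printing the document count.
IndexReader indexReader = IndexReader.open("E://DataFile/IndexFiles");
System.out.println("Documents in index = " + indexReader.numDocs());
indexReader.close();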
Searching the Local Index Files
Now we can search for data in the index files that we just created. Search is done on the "field" data: you can use any of the query semantics supported by the Lucene search engine, and you can search one particular field or a combination of fields (a combined-field sketch follows the output below). The following Java code searches the index:
// Creating a Searcher object and specifying the path where the index files are stored.
Searcher searcher = new IndexSearcher("E://DataFile/IndexFiles");
Analyzer analyzer = new StandardAnalyzer();
// Printing the total number of documents (entries) present in the index.
System.out.println("Total Documents = " + searcher.maxDoc());
// Creating the QueryParser object and specifying the field name on
// which the search is to be done.
QueryParser parser = new QueryParser("cs-uri", analyzer);
// Creating the Query object and specifying the text to search for.
Query query = parser.parse("/blank");
// Performing the search on the index.
Hits hits = searcher.search(query);
// Printing the number of documents (entries) that match the search query.
System.out.println("Number of matching documents = " + hits.length());
// Printing the documents (rows of the file) that matched the search criteria.
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}
In this example, the search is done on the field cs-uri, and the text searched for inside the cs-uri field is /blank. So when the search code is run, all the documents (rows) whose cs-uri field contains /blank are shown in the output. The output is as follows:
Total Documents = 11
Number of matching documents = 7
2010-04-21 02:24:01 GET /blank 200 120
2010-04-21 02:24:02 POST /blank 304 56
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
2010-04-21 02:24:05 GET /blank 200 347
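As mentioned above, fields can also be combined. Here is a minimal sketch (not from the original article) that uses Lucene's BooleanQuery to find all GET requests that returned status 200. Note that TermQuery bypasses the analyzer, and StandardAnalyzer lowercases tokens at index time, so the method term must be given as "get":
// Combining two fields: cs-method must be GET and sc-status must be 200.
BooleanQuery combined = new BooleanQuery();
combined.add(new TermQuery(new Term("cs-method", "get")), BooleanClause.Occur.MUST);
combined.add(new TermQuery(new Term("sc-status", "200")), BooleanClause.Occur.MUST);
Hits combinedHits = searcher.search(combined);
System.out.println("GET requests with status 200 = " + combinedHits.length());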
Memory-Based Indexing on HDFS
Now consider a case where the data is located in a distributed file system like Hadoop DFS. The preceding code will not work for directly creating an index on distributed data, so we would have to complete a few extra steps first: copy the data from HDFS to a local file system, create an index on the local copy, and finally store the index files back to HDFS. The same steps would be required for searches. This approach is time-consuming and suboptimal (a rough sketch of it follows), so instead, let's index and search our data using the memory of the HDFS node where the data resides.
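For contrast, the copy-based approach just described would look roughly like this (a sketch under assumed paths, using the standard Hadoop FileSystem copy methods):
// 1. Copy the data file from HDFS to the local file system.
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
hdfs.copyToLocalFile(new Path("/DataFile/Test.txt"), new Path("E://DataFile/Test.txt"));
// 2. Index the local copy with the code from the previous section.
// 3. Copy the resulting index files back to HDFS.
hdfs.copyFromLocalFile(new Path("E://DataFile/IndexFiles"), new Path("/IndexFiles"));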
Assume that the data file "Test.txt" used earlier now resides on HDFS, at "/DataFile/Test.txt" under the working directory. Create another folder called "/IndexFiles" inside the HDFS working directory, where the generated index files will be stored.
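If the "/IndexFiles" folder does not exist yet, it can be created programmatically; a minimal sketch (my own, not from the article) using the Hadoop FileSystem API:
// Creating the /IndexFiles folder under the HDFS working directory.
FileSystem dfs = FileSystem.get(new Configuration());
dfs.mkdirs(new Path(dfs.getWorkingDirectory() + "/IndexFiles/"));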
The following Java code creates index files in memory for files stored on HDFS:
// Path where the index files will be stored.
String Index_DIR = "/IndexFiles/";
// Path where the data file is stored.
String File_DIR = "/DataFile/Test.txt";
// Creating a FileSystem object, to be able to work with HDFS.
Configuration config = new Configuration();
FileSystem dfs = FileSystem.get(config);
// Creating a RAMDirectory (memory) object, to be able to create the index in memory.
RAMDirectory rdir = new RAMDirectory();
// Creating an IndexWriter object for the RAM directory.
IndexWriter indexWriter = new IndexWriter(rdir, new StandardAnalyzer(), true);
// Creating an FSDataInputStream object for reading the data from the
// "Test.txt" file residing on HDFS, and wrapping it in a BufferedReader
// for line-by-line reading.
FSDataInputStream filereader = dfs.open(new Path(dfs.getWorkingDirectory() + File_DIR));
BufferedReader reader = new BufferedReader(new InputStreamReader(filereader));
String row = null;
// Reading each line present in the file.
while ((row = reader.readLine()) != null)
{
    // Getting each field of a row into an array; the field delimiter is a space.
    String[] arow = row.split(" ");
    // For each row, creating a document and adding data to the document
    // with the associated fields.
    org.apache.lucene.document.Document document = new org.apache.lucene.document.Document();
    document.add(new Field("date", arow[0], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time", arow[1], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-method", arow[2], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("cs-uri", arow[3], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("sc-status", arow[4], Field.Store.YES, Field.Index.ANALYZED));
    document.add(new Field("time-taken", arow[5], Field.Store.YES, Field.Index.ANALYZED));
    // Adding the document to the index.
    indexWriter.addDocument(document);
}
indexWriter.optimize();
indexWriter.close();
reader.close();
Thus, for the "Test.txt" file residing on HDFS, we now have index files created in memory. To store the index files in the HDFS folder:
// Getting the names of the files present in memory into an array.
String[] fileList = rdir.list();
// Reading index files from memory and storing them to HDFS.
for (int i = 0; i < fileList.length; i++)
{
    IndexInput indxfile = rdir.openInput(fileList[i].trim());
    long len = indxfile.length();
    int len1 = (int) len;
    // Reading data from the file into a byte array.
    byte[] bytarr = new byte[len1];
    indxfile.readBytes(bytarr, 0, len1);
    // Creating a file in the HDFS directory with the same name as the
    // index file.
    Path src = new Path(dfs.getWorkingDirectory() + Index_DIR + fileList[i].trim());
    dfs.createNewFile(src);
    // Writing data from the byte array to the file in HDFS.
    FSDataOutputStream fs = dfs.create(src, true);
    fs.write(bytarr);
    fs.close();
}
dfs.closeAll();
Now we have the necessary index files created and stored in the HDFS directory for the "Test.txt" data file.
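To confirm that the transfer worked, you can list the files now present under "/IndexFiles" (a quick check of my own, not part of the original listing):
// Listing the index files that now exist in the HDFS directory.
FileStatus[] files = dfs.listStatus(new Path(dfs.getWorkingDirectory() + Index_DIR));
for (int i = 0; i < files.length; i++)
{
    System.out.println(files[i].getPath().getName() + " (" + files[i].getLen() + " bytes)");
}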
Memory-Based Searching on HDFS
We can now search the indexes stored in HDFS. First, we must make the HDFS index files available in memory for searching. Here is the code for this process:
// Creating a FileSystem object, to be able to work with HDFS.
Configuration config = new Configuration();
FileSystem dfs = FileSystem.get(config);
// Creating a RAMDirectory (memory) object, to hold the index in memory.
RAMDirectory rdir = new RAMDirectory();
// Getting the list of index files present in the HDFS directory into an array.
Path pth = new Path(dfs.getWorkingDirectory() + Index_DIR);
FileSystemDirectory fsdir = new FileSystemDirectory(dfs, pth, false, config);
String[] filelst = fsdir.list();
FSDataInputStream filereader = null;
for (int i = 0; i < filelst.length; i++)
{
    // Reading data from an index file in the HDFS directory into the filereader object.
    filereader = dfs.open(new Path(dfs.getWorkingDirectory() + Index_DIR + filelst[i]));
    int size = filereader.available();
    // Reading data from the file into a byte array.
    byte[] bytarr = new byte[size];
    filereader.read(bytarr, 0, size);
    // Creating a file in the RAM directory with the same name as the
    // index file present in the HDFS directory.
    IndexOutput indxout = rdir.createOutput(filelst[i]);
    // Writing data from the byte array to the file in the RAM directory.
    indxout.writeBytes(bytarr, bytarr.length);
    indxout.flush();
    indxout.close();
}
filereader.close();
Now we have all the required index files present in the RAM directory (memory), so we can perform a search on them directly. The search code is similar to that used for the local file system; the only change is that the Searcher object is now created from the RAM directory object (rdir) instead of a local file system directory path.
Searcher searcher = new IndexSearcher(rdir);
Analyzer analyzer = new StandardAnalyzer();
System.out.println("Total Documents = " + searcher.maxDoc());
QueryParser parser = new QueryParser("time", analyzer);
Query query = parser.parse("02\\:24\\:04");
Hits hits = searcher.search(query);
System.out.println("Number of matching documents = " + hits.length());
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("date") + " " + doc.get("time") + " " +
        doc.get("cs-method") + " " + doc.get("cs-uri") + " " + doc.get("sc-status") + " " + doc.get("time-taken"));
}
For the following output, the search is done on the field "time" and the text searched for inside the "time" field is "02:24:04" (written as "02\\:24\\:04" in the query because colons must be escaped in Lucene's query syntax). So when the code is run, all the documents (rows) whose "time" field contains "02:24:04" are shown in the output:
Total Documents = 11
Number of matching documents = 4
2010-04-21 02:24:04 GET /blank 304 233
2010-04-21 02:24:04 GET /blank 500 567
2010-04-21 02:24:04 GET /blank 200 897
2010-04-21 02:24:04 POST /blank 200 567
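Instead of escaping the colons by hand, you can let Lucene do it: QueryParser provides a static escape() method for exactly this purpose (a small alternative sketch):
// Escaping the query text programmatically before parsing it.
Query query = parser.parse(QueryParser.escape("02:24:04"));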
Conclusion
Distributed file systems like HDFS are a powerful tool for storing and accessing the vast amounts of data available to us today. With memory-based indexing and searching, accessing the data you really want to find amid mountains of data you don't care about
gets a little bit easier.