The Mystery of the Missing Document

I had a great start to 2024. I decided to use Lucene for the full-text search capability of the fishing site I was designing. With the latest version of lucene-core and the work I had done in 2019, I was able to index a dummy document into the Lucene data repository. Then I ran into a problem when searching for the document: for certain text I entered, no results were returned.

It baffled me for a few nights. I was working only an hour or two per night, so I didn't waste much time, but it still took me a while to figure out what was causing the issue. When I searched with certain words, the document was returned. With other words, it was not. To index the document, I had created multiple fields, most of which were of the type StringField. One field was of the type TextField. One field was of the type LongField, used for storing integer values, specifically the date associated with the document. There was one more field that stored string values that would not be indexed. The problem was related to the fields of the type StringField, and it took me a while to realize that.

Once I realized what was happening, the first thing I tried was listing all the tokens of the document. That failed. Eventually, I was able to find the tokens in one of the files of the Lucene data repository. The file contained strings of text that had not been broken into individual words before being indexed. Each whole line of text was indexed as one token. This was happening for the fields of the type StringField.

The next day, I checked the documentation on the field type definitions. It turned out that the StringField type does not break the text apart into tokens for indexing. The whole string is treated as one token, which is called a Term. So, to get the fields properly tokenized for indexing, I needed to change them to the type TextField instead of StringField. I did a quick experiment with this change, and everything worked as expected.
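To make the difference concrete, here is a small plain-Java sketch of the two behaviors. It is only a toy model and does not use Lucene at all: the class name FieldIndexingSketch, its helper methods, and the sample title are made up for illustration, and the simple split-and-lowercase step is an assumption standing in for whatever analyzer a real TextField would use.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: this is NOT Lucene code. All names here are hypothetical.
final class FieldIndexingSketch {

    // StringField-like behavior: the whole value is kept as exactly one term.
    static Set<String> indexAsSingleTerm(String value) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(value);
        return terms;
    }

    // TextField-like behavior (simplified assumption): lowercase the value
    // and split it on anything that is not a letter or a digit.
    static Set<String> indexAsTokens(String value) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // A term lookup succeeds only when the exact term was indexed.
    static boolean matches(Set<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        String title = "Fly Fishing Basics"; // hypothetical sample value

        Set<String> single = indexAsSingleTerm(title);
        Set<String> tokens = indexAsTokens(title);

        System.out.println(matches(single, "fishing"));            // false
        System.out.println(matches(single, "Fly Fishing Basics")); // true
        System.out.println(matches(tokens, "fishing"));            // true
    }
}
```

In this model, a search for a single word finds nothing in the single-term index unless the query equals the entire stored string, which mirrors what I saw with my StringField fields.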

To summarize, the StringField type stores string values that are not broken apart for indexing. Only single words or phrases that I want to keep as one piece should go into fields of the type StringField. If I want a string to be broken apart into words for indexing, I need to use the TextField type to store it. I made the mistake because I didn't read the documentation thoroughly. Next time I start learning something new, I should read the documentation first and understand it before trying things out.
