Do Data Scientists Dream of Electric Sheep?

Yes!

Well, I do, or I certainly did last night. Lying in bed in the middle of the night, wondering how to do things, is when you recap the day's work; sometimes you think of the solution you were searching for just before you drop off. Yesterday I was deciding how to gather together all the bits of unstructured data used in the application I am working on.

Living in the foothills of the Pennines, you get used to seeing sheep scattered across the hills. On significant occasions they are gathered together. Those of you who have kept, or are familiar with, sheep will appreciate that gathering them is made simpler by their innate tendency to stick together. If they break out of an enclosure by breaching the perimeter fence, they will always use the same exit. Bullocks, by contrast, go through the boundaries as if at random and leave many parts to be patched up; sheep stick together and follow each other.

Now, if I could get my data to be a bit more sheep-like, it would make the part of the application I am currently designing much easier. In other words, if the data stuck together in a flock, I would be able to move and organize it much more easily. If the data were gregarious like Ovis aries, I could shepherd it into the most useful places.

If the question I need to ask is: get me all captured data that deals with ‘structure’, ‘shared attributes’, ‘counting’ close to ‘attributes’, ‘different sources’, ‘flocking’, ‘characteristics’, ‘flocking’ very close to ‘characteristics’, ‘segments’, ‘text’, then bringing back answers that behave as a flock could be very useful.

The answer may be to search through the varied segments of data, looking for the structural and textual attributes that these segments share; by counting these attributes you could assign flocking characteristics.

Let me illustrate this by example. The application has captured data from different sources. Suppose we do a full text search on terms from one segment across all of the segments and then count the number of common terms. (Try Googling “The answer may be to search through the various segments of data looking for the structural and textual shared attributes by counting these attributes you could assign flocking characteristics. Let me illustrate this by example. The application has captured data from different sources.” to get a feel for this).
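To make the counting step concrete, here is a minimal sketch in Python. It is not the application's code: the segment texts, the `terms` tokenizer and the `shared_term_counts` helper are all made up for illustration, and a real full text search engine would do this far more cleverly.

```python
import re

def terms(text):
    """Return the set of lower-cased word tokens in a segment."""
    return set(re.findall(r"[a-z']+", text.lower()))

def shared_term_counts(query_id, segments):
    """Count the terms each other segment has in common with the query segment."""
    query_terms = terms(segments[query_id])
    return {
        seg_id: len(query_terms & terms(text))
        for seg_id, text in segments.items()
        if seg_id != query_id
    }

# Hypothetical captured segments, standing in for data from different sources.
segments = {
    "a": "sheep flock together on the hills",
    "b": "the flock of sheep follows the same exit",
    "c": "bullocks break fences at random",
}

counts = shared_term_counts("a", segments)
print(counts)
```

Segments that share many terms with the one you started from are the candidates for the same flock; segments that share none are the bullocks.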

The result set is indexed, leaving out stop words and the like, and the proximity between words is examined. Sorting the result set by occurrences and proximity of words starts to join bits of data to the flock.
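A rough sketch of that indexing and sorting might look like the following. The stop word list, the function names and the way proximity is measured (the smallest gap between two different search terms, counted in words after stop words are dropped) are all my own assumptions for the sake of the example.

```python
import re

# A tiny, illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "by", "is"}

def index_terms(text):
    """Map each non-stop word to the positions where it occurs."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    index = {}
    for pos, word in enumerate(words):
        index.setdefault(word, []).append(pos)
    return index

def score(index, query_terms):
    """Return (occurrences, closest gap between two different query terms)."""
    hits = sorted((pos, term) for term in query_terms for pos in index.get(term, []))
    gaps = [b[0] - a[0] for a, b in zip(hits, hits[1:]) if a[1] != b[1]]
    return len(hits), (min(gaps) if gaps else float("inf"))

def rank(segments, query_terms):
    """Sort matching segments: most occurrences first, then tightest proximity."""
    scored = []
    for seg_id, text in segments.items():
        occurrences, closest = score(index_terms(text), set(query_terms))
        if occurrences:
            scored.append((seg_id, occurrences, closest))
    return sorted(scored, key=lambda s: (-s[1], s[2]))

# Hypothetical segments and search terms.
segments = {
    "a": "flocking characteristics of the sheep",
    "b": "characteristics of data and much later some flocking",
    "c": "bullocks at random",
}
ranked = rank(segments, ["flocking", "characteristics"])
print(ranked)
```

Both "a" and "b" contain the two search terms, but in "a" they sit next to each other, so "a" is sorted ahead of "b"; "c" has no hits and never joins the result set.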

The next stage is to mark the segments as members of a flock and add attributes to describe the flock; in other words, add metadata to the segments. Now when we move a segment into a particular area, the rest of the data segments flock and follow it. We have electronic sheep!
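As a sketch of how the marking and following could work, assuming nothing about the real application: the `Segment` and `Flock` classes, the `location` field and the attribute names below are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Segment:
    seg_id: str
    text: str
    location: str = "unsorted"      # the area the segment currently lives in
    flock: Optional[str] = None     # metadata: name of the flock it belongs to

@dataclass
class Flock:
    name: str
    attributes: dict = field(default_factory=dict)  # metadata describing the flock
    members: list = field(default_factory=list)

    def add(self, segment):
        """Mark a segment as a member of this flock."""
        segment.flock = self.name
        self.members.append(segment)

    def move(self, segment, destination):
        """Move one member; the rest of the flock follows it."""
        if segment not in self.members:
            raise ValueError(f"{segment.seg_id} is not in flock {self.name}")
        for member in self.members:
            member.location = destination

# Hypothetical segments: two sheep and one bullock.
a = Segment("a", "flocking characteristics of the sheep")
b = Segment("b", "characteristics of data and much later some flocking")
c = Segment("c", "bullocks at random")

flock = Flock("sheep", attributes={"shared_terms": ["flocking", "characteristics"]})
flock.add(a)
flock.add(b)

flock.move(a, "archive")
print(a.location, b.location, c.location)
```

Moving "a" takes "b" along with it, while "c", which was never marked as part of the flock, stays where it was.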

 

I will comment on this further, and in a future post I will expand on why Philip K. Dick, Ridley Scott and Jean Baudrillard in particular have some appeal for Data Scientists.
