
Need guidance on crawling and processing data from message boards.

We need to crawl a number of message boards and perform sentiment analysis and topic/idea extraction on the threads and replies. Specifically, we need to crawl 300 to 400 message boards and look for posts related to approximately 1,000 different companies.

Since we are new to this type of data collection, we are not sure of the best way to proceed. Should we crawl the message boards, scrape all posts made since the last crawl, and then run queries against the results for each company? Or should we run separate crawls of the message boards for each company, using the company name as a keyword?

Also, is there a way to set up a crawler that can automatically crawl all of the boards for posts made since the last crawl, or would we need to customize a crawler for each board?

Incidentally, we will not inappropriately scrape or use any board's data. We will display only a small portion of the content (one sentence, if any at all) and link each entry to the original on the board from which it came. We will also link any extracted ideas to the original posts/threads.

Thank you for any suggestions.
  • There are a variety of ways you can go about this, but I would recommend the following:

    1. Group crawls from multiple (10-100) message boards into a single crawl.

    2. Build a custom 80app (http://wiki.80legs.com/80apps) that will follow the appropriate links to posts on each message board. Within that 80app, also use a list of keywords that represent the companies you want to track. Whenever you find that keyword, return 100 or more characters surrounding it to get a slice of the post relevant to your needs.

    3. Have the 80app also track various date formats on pages crawled (since presumably message boards may use different date representations). If you see a date that is too old, have the 80app stop crawling at that page.
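
    Steps 2 and 3 above can be sketched roughly as follows. This is plain Python showing only the matching and date-cutoff logic, not actual 80app code (80apps have their own API, which is not shown here). The keyword list, the date formats, and all function names are hypothetical placeholders; each board will probably need its own date formats added.

```python
import re
from datetime import datetime

# Hypothetical company keywords; a real list would hold ~1000 names.
KEYWORDS = ["Acme Corp", "Globex"]

# Assumed date layouts; extend this list per board as needed.
DATE_FORMATS = ["%Y-%m-%d %H:%M", "%d %b %Y, %H:%M", "%B %d, %Y"]


def extract_snippets(text, keywords, radius=100):
    """Return (keyword, snippet) pairs, with up to `radius`
    characters of context on each side of every match."""
    results = []
    for kw in keywords:
        # Lookarounds keep "Globex" from matching inside "Globexx".
        pattern = re.compile(
            r"(?<!\w)" + re.escape(kw) + r"(?!\w)", re.IGNORECASE
        )
        for m in pattern.finditer(text):
            start = max(0, m.start() - radius)
            end = min(len(text), m.end() + radius)
            results.append((kw, text[start:end].strip()))
    return results


def parse_post_date(raw):
    """Try each known format in turn; return None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


def is_too_old(raw_date, last_crawl):
    """True only when the date parses and predates the last crawl.
    Unparseable dates count as 'not too old' so posts are not
    silently skipped."""
    parsed = parse_post_date(raw_date)
    return parsed is not None and parsed < last_crawl
```

    A crawler would call extract_snippets on each post body and stop following links on a board once is_too_old returns True for a post's date. Treating unparseable dates as "not too old" errs on the side of crawling too much rather than missing posts.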

  • Thanks for the response. One thing I am still not sure of is how to crawl all posts made on the selected boards over a given period of time (such as the last 24 hours). I guess it would be possible to create an app that submits each company as a search term and then crawls the results.

    Maybe the best idea is to create an account and play with the crawler a little to see what is possible.

    Thanks again.

  • Hi Finn,

    I would like to share my knowledge and experience with you.

    We collect data from online communities such as discussion boards, blogs, and social networks.

    We use deep linguistic parsing, statistical natural language processing, and machine learning to analyse your content and extract semantic metadata: information about people, places, companies, topics, languages, and more.

    A short overview of the platform:

    A single unified platform for all content types (consolidate to reduce development and maintenance costs)
    Flexible system which can support any new content type
    High automation (cut configuration costs)
    Real-time coverage, or as close to it as possible, for each content type

    Some of our challenges:

    Supporting many different types of content
    Automatically “understanding” millions of sites with different structures
    Over 6000 message boards
    Over 90 million blogs
    Handling large quantities of data
    Over 2 million new messages per day
    Over 1 million new blog posts per day
    Supporting data in different languages
    I have attached a sample for your perusal. Please review it and give me your valuable feedback.

    We would be pleased to share more samples if you need them.

    Thanks,
    Suresh