Email Ripper
Hi friends, after a long time, AamzillaA did something in C yesterday. I was
browsing through various forums when I saw a post promising free online
Brilliant Tutorials GATE CS material. I got interested in this particular
post because more than 700 people had responded to it with their email ids.
I wondered how tedious it would have been for the person who posted it to
"copy-paste" all 700+ emails by hand. That is when I got the idea for
"Email-Ripper", a program to automatically extract email ids from orkut forums.
I am posting the source code here on this blog. If anyone has made something
like this, please get in touch with me with any ideas to improve upon what I
have done...
Email-Ripper works as follows.
1. Download all pages containing email ids and save them as 1.mhtml,
2.mhtml and so on (up to a maximum of 100 files).
2. Run Email Ripper in the same directory.
3. You get an output file named emails.txt containing all the emails
found in these files !!!
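The numbered file names in step 1 follow a simple pattern, so they can be built on the fly instead of being typed out one by one. Here is a minimal sketch, assuming C99 `snprintf` is available; `make_file_name` and `MAX_FILES` are hypothetical names used only for illustration and are not part of Email Ripper itself.

```c
#include <stdio.h>

#define MAX_FILES 100  /* assumed limit, matching 1.mhtml .. 100.mhtml */

/* Writes "<index>.mhtml" into name.
   Returns 0 on success, -1 if the index is outside 1..MAX_FILES
   or the destination buffer is too small. */
int make_file_name(int index, char *name, size_t size)
{
    int written;

    if (index < 1 || index > MAX_FILES)
        return -1;

    written = snprintf(name, size, "%d.mhtml", index);
    return (written < 0 || (size_t)written >= size) ? -1 : 0;
}
```

Calling `make_file_name(7, name, sizeof name)` in a loop from 1 to 100 would give the same names as a hand-written table, and the range check keeps the index from running past the last file.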
This was the first time I did something like this, and finding an algorithm
to extract just the email ids was a little challenging. That is because
people type a lot of things along with an email, which makes extracting it
difficult. Some will type abc@abc.com.....PLZZZ , abc at abc dot com, abc [at]
abc [dot] com , (abc) at (abc) .com etc etc...
Filtering the junk away from the relevant part of the email id was
indeed difficult, and I still have not found a nice method to remove
all the unwanted things a person types along with an email id
when posting it to forums like orkut to request
materials.
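One of the obfuscated forms above, abc [at] abc [dot] com, can be turned back into a normal address before scanning. The sketch below handles only the bracketed `[at]` and `[dot]` tokens; the plain-word forms ("abc at abc dot com") are left alone on purpose, since " at " also appears in ordinary sentences. `replace_token` and `normalize_email_text` are hypothetical helper names, not functions from Email Ripper.

```c
#include <string.h>

/* Replaces every occurrence of the token `from` in s with the single
   character `to`, and swallows spaces directly before and after the
   token. Done in place: the result is never longer than the input. */
static void replace_token(char *s, const char *from, char to)
{
    size_t len = strlen(from);
    char *read = s;
    char *write = s;

    while (*read != '\0') {
        if (strncmp(read, from, len) == 0) {
            while (write > s && write[-1] == ' ')
                write--;            /* drop spaces before the token */
            *write++ = to;
            read += len;
            while (*read == ' ')
                read++;             /* drop spaces after the token */
        } else {
            *write++ = *read++;
        }
    }
    *write = '\0';
}

/* Rewrites "abc [at] abc [dot] com" as "abc@abc.com" in place. */
void normalize_email_text(char *s)
{
    replace_token(s, "[at]", '@');
    replace_token(s, "[dot]", '.');
}
```

Text that already holds a normal address passes through unchanged, so this could run on every line before looking for @.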
A lot of improvements are needed before Email Ripper produces 100%
accurate results. Still, here is my version; I hope it will be
useful to someone. If you have any ideas to make it better, kindly send
me a mail.
Well, my algorithm to find an email id works something like this:
1. Start reading characters from a file, write them into an array and
check whether each one is @.
2. If it is @, step back until you find the beginning of the email id. The
boundary could be a space, comma, semicolon ... anything except a letter,
a digit, . or _
3. Repeat the same thing for the characters after the @ symbol.
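The three steps above can be sketched on an in-memory string as follows. This is not the code from this post, only a compact illustration of the same idea. It assumes that "character" in step 2 means a letter or digit, and it adds one extra rule of its own: the forward scan stops at a dot that is not followed by a letter or digit, so trailing junk like ".....PLZZZ" is cut off. `is_email_char` and `find_email` are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Characters treated as part of an email id: letters, digits, '.' and '_'. */
static int is_email_char(char c)
{
    return isalnum((unsigned char)c) || c == '.' || c == '_';
}

/* Copies the first email id found in text into out (at most size-1 chars).
   Returns 1 if an id was found, 0 otherwise. */
int find_email(const char *text, char *out, size_t size)
{
    const char *at = text;

    while ((at = strchr(at, '@')) != NULL) {    /* step 1: look for '@' */
        const char *start = at;
        const char *end = at + 1;
        size_t len;

        /* step 2: walk back from '@' to the beginning of the id */
        while (start > text && is_email_char(start[-1]))
            start--;

        /* step 3: walk forward, stopping at a dot that is not followed
           by a letter or digit (assumed rule, trims "com.....") */
        while (is_email_char(*end) &&
               !(*end == '.' && !isalnum((unsigned char)end[1])))
            end++;

        len = (size_t)(end - start);
        if (start < at && end > at + 1 && len < size) {
            memcpy(out, start, len);
            out[len] = '\0';
            return 1;
        }
        at++;   /* nothing usable around this '@', keep searching */
    }
    return 0;
}
```

With the input "mail: abc@abc.com.....PLZZZ" this sketch copies "abc@abc.com" into `out`. It still accepts things a real address would not allow, such as leading dots, so it is a starting point rather than a full validator.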
Well, I know my code is not beautiful.. there could be 101 ways to
optimize it. I am still a newbie in programming, so any suggestions to
improve my programming style are most welcome. Moreover, if you
can understand my code, then please contact me, and I will give you a
kit-kat as a treat :-) . I wonder whether even I will understand what
I have written after a few months.
Can someone suggest some good, short books for learning good practices
in software engineering ?
So here comes the source code, guys. Download it, use it and give me
the credit if it is credit-worthy
:-)
I am planning to use this "Email Ripper" to find more friends from
all over the world, and also to invite more people to have a look at
this blog.
That's it friends, stay tuned for more stuff regarding GATE-CS from
me. Will I ever be able to solve all previous GATE questions and post
them on this blog ? Well, for a long time I searched the net for
solutions to GATE problems, but without much luck.. So I just
thought, why don't I stop searching for solutions to GATE papers and
solve them myself instead ... It would benefit not only me but also
anyone who is as eager as I am to get free solutions to previous GATE
papers. So that is how this blog was born...
Oops.. too much rambling today, right ?
Ok, me signing off... just have a look at the code..
:-)
AamzillaA
// Email Ripper
// Copyright 2007, AamzillaA.
// 12-11-2007
// This program finds all email ids in any file
// and prints all of them in emails.txt
// this is the automatic version which assumes
// input file names as 1.mhtml to 100.mhtml
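#include <stdio.h>  // printf, scanf, fopen, getc, putc
#include <string.h> // strcmp, strcpy
#include <conio.h>  // clrscr, getch (Turbo C style console functions)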
#define BUFFER_SIZE 10000
#define NUMBER_OF_FILES 100
void main()
{
FILE *fptr,*out_file_ptr;
char ch,buffer[BUFFER_SIZE],email[40],file_name[30];
int i=0,j=0,k=0,l=0,count=0,automate_flag=1,file_name_index=0;
int email_flag=0,print_email_flag=0,at_marker=0,null_marker;
char default_file_names[NUMBER_OF_FILES][10] =
{"1.mhtml","2.mhtml",
"3.mhtml","4.mhtml","5.mhtml","6.mhtml","7.mhtml","8.mhtml","9.mhtml","10.mhtml",
"11.mhtml","12.mhtml","13.mhtml","14.mhtml","15.mhtml","16.mhtml","17.mhtml","18.mhtml","19.mhtml","20.mhtml",
"21.mhtml","22.mhtml","23.mhtml","24.mhtml","25.mhtml","26.mhtml","27.mhtml","28.mhtml","29.mhtml","30.mhtml",
"31.mhtml","32.mhtml","33.mhtml","34.mhtml","35.mhtml","36.mhtml","37.mhtml","38.mhtml","39.mhtml","40.mhtml",
"41.mhtml","42.mhtml","43.mhtml","44.mhtml","45.mhtml","46.mhtml","47.mhtml","48.mhtml","49.mhtml","50.mhtml",
"51.mhtml","52.mhtml","53.mhtml","54.mhtml","55.mhtml","56.mhtml","57.mhtml","58.mhtml","59.mhtml","60.mhtml",
"61.mhtml","62.mhtml","63.mhtml","64.mhtml","65.mhtml","66.mhtml","67.mhtml","68.mhtml","69.mhtml","70.mhtml",
"71.mhtml","72.mhtml",
"73.mhtml","74.mhtml","75.mhtml","76.mhtml","77.mhtml","78.mhtml","79.mhtml","80.mhtml",
"81.mhtml","82.mhtml",
"83.mhtml","84.mhtml","85.mhtml","86.mhtml","87.mhtml","88.mhtml","89.mhtml","90.mhtml",
"91.mhtml","92.mhtml",
"93.mhtml","94.mhtml","95.mhtml","96.mhtml","97.mhtml","98.mhtml","99.mhtml","100.mhtml",
};
//fptr=fopen("7.mhtml","rb");
clrscr();
printf("\n Email Ripper \n ");
printf("\n Email Ripper finds all emails in any file." );
printf("\n Email Ripper prints them all in emails.txt");
printf("\n\n\n Designer: AamzillaA");
printf("\n\n\n * Enter file name as 'automate' to automate entering file names\n\n ");
printf("\n In automate mode input file names are assumed to be... \n ");
printf("\n 1.mhtml,2.mhtml ... 100.mhtml\n");
printf("\n You can extract email ids from a maximum of 100 files in automate mode \n\n\n ");
printf("\n\n Enter file name : ");
scanf("%s",file_name);
automate_flag = strcmp(file_name,"automate");
//printf("\n Automate flag = %d ",automate_flag);
file_name_index=0;
//getch();
clrscr();
start_automation:
// Checking whether to automate or not...
if(automate_flag==0)
{
if(file_name_index >= NUMBER_OF_FILES) // check before indexing to stay inside the table
{
goto end_of_email_ripper;
}
strcpy(file_name,default_file_names[file_name_index]);
file_name_index++;
}
printf("\n Email ids in file %s \n" ,file_name);
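fptr=fopen(file_name,"rb");
if(fptr==NULL) // a missing or unreadable file must not be passed to getc
{
printf("\n Could not open file %s \n",file_name);
if(automate_flag==0)
{
goto start_automation; // try the next numbered file
}
goto end_of_email_ripper;
}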
out_file_ptr=fopen("emails.txt","ab"); // open file in append binary mode
buffer[i]=getc(fptr);
while( buffer[i] != EOF)
{
//printf("%c",buffer[i]);
//check 1
if( buffer[i]=='@' ) // first see whether @ has arrived
{
at_marker=i;
i++;
email_flag=1;
print_email_flag=0;
buffer[i]='\0';
// printf("\n\n %s \n\n",buffer);
i=0; // preparing to over write the buffer
// getch();
}
//check 2
//checking whether end of email has reached after seeing @
if( ( buffer[i] ==' ' || buffer[i]=='<' || buffer[i]== '>' ||
buffer[i] == '(') && (email_flag==1) )
{
email_flag=0;
buffer[i]='\0';
//copying part after @ to email
if(email[null_marker]=='\0')
{
k=null_marker-1; // removing null and extra @ from end of string email.
}
i=0;
while(buffer[i] != '\0')
{
email[k]=buffer[i];
i++;
k++;
}
email[k]='\0'; // copying null to email from buffer
i=0; // initializing buffer index to zero
k=0; // initializing string email...
//printing extension of email id
// printf(" \n Extension is %s \n",buffer);
// printing the complete email id
count++;
printf("\n ( %d ) %s ",count,email);
// writing email to emails.txt
l=0;
while(email[l] != '\0')
{
putc(email[l],out_file_ptr);
l++;
}
putc(',',out_file_ptr);
//appending , to separate email ids
// getch();
}
//check 3
if(i==BUFFER_SIZE-2) // code to prevent buffer overflow, array out of bound error
{
i=0;
buffer[BUFFER_SIZE-1]='\0';
}
if(email_flag==1 && print_email_flag ==0 ) //rewinding
{
// printf("\n at marker :%d",at_marker);
j=at_marker;
//rewinding
while( j>0 && buffer[j] != ' ' ) //rewinding backwards from @ till a space or the start of the buffer is seen
{
if(buffer[j]=='>')
{
break; // stop rewinding when u see a space or >
}
if(buffer[j]==';')
{
break;
}
if(buffer[j]=='<')
{
break;
}
if(buffer[j]==':')
{
break;
}
// printf("\n j: %d",j);
j--; // rewinding till u see a space if not seen a > already
}
//getch();
//copying first part till @ to string email.
while(buffer[j] != '@') //copying rewinded string to email
{
email[k]=buffer[j];
j++;
k++;
// printf(" \n j= %d k= %d ",j,k);
}
email[k]=buffer[j];
k++;
email[k]='\0';
// AamzillaA
null_marker=k;
k=0;
email[0]=' '; // to remove space ,>,< and other unwanted characters
//printf("\n\n Email id is %s ",email);
print_email_flag=1;
//getch();
} // end of rewinding if
i++;
buffer[i]=getc(fptr);
}// goto start of while and read next character ...
fclose(fptr);
fclose(out_file_ptr);
if(automate_flag==0)
{
printf("\n End of file %s \n",file_name);
//getch();
goto start_automation;
}
end_of_email_ripper:
printf(" \n*********************************************************\n" );
printf("\n Total number of email ids extracted to 'emails.txt' : %d \n",count);
printf("\n Thanks for using Email Ripper :-) \n");
printf("\n\n AamzillaA at gmail dot com \n");
printf(" \n*********************************************************\n" );
getch();
}
// Hope u enjoyed this :-)
