How can SAS read messy data, convert a string to a date, and then extract fields from within a string?
I have one million lines of text data like the following:
x1:小明,x2:1988-03-17 12:00:47,x3:男,x4:公务员,x5:喜欢打篮球,x6:1:14,25:10,67:01,6:23,x7:25岁
x1:小红,x2:1982-01-17 11:04:24,x3:女,x4:程序员,x5:喜欢乒乓球球,x6:16:14,25:10,67:01,6:23,x7:26岁
x1:小朱**x2:1990-01-17 05:07:11****x3:男,x4:运动员,x5:喜欢游泳/*/-x6:10:10,05:18,77:06,6:23,x7:23岁
x1:小梁**&x2:1978-09-17 05:07:11***x3:男,x4:会计员,x5:喜欢跑步,x6:11:14,07:18,47:09,6:23,x7:35岁
…………
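I have a few questions:
1. To read the data above, I first planned to use the comma as the delimiter, but stray characters such as * and & mean that will not work: for 小朱, X1 and X2 would run together. What I want is to read the string or number between each Xi and X(i+1), keeping the stray characters but dropping the commas.
2. After reading the data, I want to convert X2 to a datetime, for example 1988/03/17 12:00:47 in the X2 column for 小明.
3. After reading, the X6 field needs to be split further, keeping only the part between each colon and the following comma. For 小明, column X6a would be 14, X6b would be 10, X6c would be 01, and X6d would be 23.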
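4. Finally, for the logistic regression: of the one million rows, about 930,000 have y=0 and only about 70,000 have the response y=1, a ratio of roughly 13:1. Should I sample from the y=0 rows so that the gap between 1 and 0 is smaller, and what would the best ratio be?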
Thanks in advance, everyone!
1 answer
/* Save the data in a text file named baidu_pro.txt, read each whole line into a
   single character column, then split that column with string functions.
   The libref clean is assumed to be assigned already. */
data clean.baidu_pro;
  infile "f:\clean\baidu_pro.txt" truncover;
  length ori_string $200;
  input ori_string $char200.;
run;

data clean.baidu_pro2;
  set clean.baidu_pro;
  length name $20 dt_txt $30 sex $6 job hobby $40 x6 $60 age $10
         x6a x6b x6c x6d $4 piece $200;
  array fld[7] $ name dt_txt sex job hobby x6 age;
  array x6p[4] $ x6a x6b x6c x6d;
  array n[8] _temporary_;
  /* n[i] is the position of the marker "xi:" and n[8] marks the end of the line */
  do i = 1 to 7;
    n[i] = find(ori_string, cats("x", i, ":"));
  end;
  n[8] = length(ori_string) + 1;
  do i = 1 to 7;
    /* leave the field blank if its marker or the next one is missing or out of order */
    if n[i] > 0 and n[i+1] > n[i] + 3 then do;
      piece = substr(ori_string, n[i] + 3, n[i+1] - n[i] - 3);
      /* only ASCII junk is removed, so the Chinese text is left intact */
      if i = 2 then fld[i] = strip(compress(piece, "*&/,"));  /* keep - : and the blank */
      else if i = 6 then fld[i] = compress(piece, "*&/- ");    /* keep commas and colons */
      else fld[i] = compress(piece, "*&/-,: ");
    end;
  end;
  /* x2 as a numeric SAS datetime; ?? leaves it missing if the text cannot be parsed */
  date_time = input(dt_txt, ?? ymddttm19.);
  format date_time datetime20.;
  /* x7: keep the digits only, so "25岁" becomes 25 */
  age = compress(age, , "kd");
  /* x6: the k-th comma-separated pair is "a:b"; keep the part after the colon */
  do k = 1 to 4;
    x6p[k] = scan(scan(x6, k, ","), 2, ":");
  end;
  keep name date_time sex job hobby x6a x6b x6c x6d age;
run;
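For question 2, once X2 is stored as a numeric SAS datetime, the 1988/03/17 12:00:47 layout is only a display choice. Below is a minimal sketch, assuming a user-defined picture format is acceptable. The format name dt_slash is made up for illustration, and the sample value is taken from the first data line.

```sas
proc format;
  /* dt_slash is a hypothetical name; single quotes keep the % directives
     away from the macro processor */
  picture dt_slash other = '%Y/%0m/%0d %0H:%0M:%0S' (datatype = datetime);
run;

data _null_;
  /* parse a sample X2 value into a numeric datetime, then write it to the log
     with the picture format */
  dt = input("1988-03-17 12:00:47", ymddttm19.);
  put dt= dt_slash.;
run;
```

The same format can be attached to any numeric datetime variable with a FORMAT statement. That changes how the value prints, not what is stored.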
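Question 4 is not covered by the code above. There is probably no single best ratio, and at roughly 13:1 fitting the model on all rows may well be workable. If downsampling is still wanted, the following is one possible sketch. The table work.model_data, the 0/1 variable y, the seed, and the keep rate 0.25 are all assumptions chosen for illustration.

```sas
data work.balanced;
  set work.model_data;                        /* hypothetical input table */
  if _n_ = 1 then call streaminit(20240101);  /* fixed seed so the sample is reproducible */
  /* keep every y=1 row, and each y=0 row with probability 0.25 (illustrative rate) */
  if y = 1 or rand("uniform") < 0.25;
run;
```

Under the usual logistic model, keeping all events and a fraction r of the non-events leaves the slope estimates consistent but raises the fitted intercept by log(1/r). With r = 0.25 that is log(4), about 1.386, which should be subtracted from the intercept before the scores are read as probabilities.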